In 2014, I took part in a tournament for the Good Judgment Project. The goal of this project was to figure out whether certain individuals have habits of mind that make them better than others at predicting future events. About once a week, I would go onto their website and be presented with a series of questions about potential future events. I’d assign a probability to each event, but could skip as many questions as I wanted if I felt I couldn’t offer a reasonable prediction. I have essentially no recollection of which events I was forecasting, but I remember that I was truly awful at it. Even though I had fun, my purely vibes-based strategy didn’t even put me in the top half of the leaderboard.
> Notably, if you test twenty different aggregation techniques, you would expect one of them to come in under the 0.05 p-value that usually designates statistically significant differences between groups due to random noise alone — even if there was no actual difference.
Nit: p-values quantify *sampling* error. If you take the best of 20 techniques on a single sample, it's hard to measure whether the result is over-fit without additional data. You need a better understanding of how independent the techniques are, or -- more likely -- a separate set of evaluation data to evaluate whether your model is over-tuned. [And in the linked paper (https://goodjudgment.io/docs/Goldstein-et-al-2015.pdf) the MPDDA result for "GJP Best Method" has p<0.001]
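To make the multiple-comparisons worry concrete, here's a quick Monte Carlo sketch (a hypothetical illustration, not anything from the paper): under the null, each technique's p-value is uniform on [0, 1], so the *best* of 20 independent techniques clears p < 0.05 about 1 - 0.95^20 ≈ 64% of the time.

```python
import random

def best_of_k_significant(k=20, trials=20_000, alpha=0.05):
    """Fraction of experiments in which the best of k null techniques
    looks 'significant', i.e. its p-value falls below alpha."""
    hits = 0
    for _ in range(trials):
        # Under the null, each technique's p-value is Uniform(0, 1).
        best_p = min(random.random() for _ in range(k))
        hits += best_p < alpha
    return hits / trials

random.seed(42)
print(best_of_k_significant())  # close to 1 - 0.95**20, i.e. ~0.64
```

So "one of twenty clears p < 0.05" is actually an understatement of how easy it is for the winner of a model-selection contest to look significant by chance.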
Interesting article, but I'm stuck on the 'Big if true' paragraph. I see someone else already pointed out the maths error, but I think the main point of the paragraph is also flawed.
Yes, if the event happens in the final month then your superforecaster will be penalised relative to the non-superforecaster. But how is this different from any other case where a low-probability event happens?
If we've reached the final month and the event hasn't yet happened, then 95% of the time it will continue to not happen, and the superforecaster will score much better than the non-superforecaster. Five percent of the time it will happen, and the superforecaster's score will suffer -- but that's the nature of prediction under uncertainty. Even the best possible predictor will sometimes be 'wrong', and their skill will only be clearly evident when the results of many predictions are aggregated.
I think you're suggesting this example is different, but I don't see how. The superforecaster has a higher expected score, whether we're looking at the year as a whole or any slice of it that begins with the result still undecided.
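To put a number on "higher expected score", here's a small sketch using the Brier score (lower is better). The 20% figure for the miscalibrated forecaster is just an assumed illustration, not a number from the article:

```python
def expected_brier(forecast, true_p):
    # Expected Brier score (lower is better) when the event truly
    # occurs with probability true_p and we forecast `forecast`.
    return true_p * (1 - forecast) ** 2 + (1 - true_p) * forecast ** 2

# Final month, true probability 5%:
calibrated = expected_brier(0.05, 0.05)      # ~0.0475
miscalibrated = expected_brier(0.20, 0.05)   # ~0.0700 (assumed 20% forecast)
```

The calibrated 5% forecast wins in expectation, even though in the 5% of worlds where the event does happen its single realized score looks worse.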
I don't think your methodology for investigating your 2nd hypothesis is answering your question.
Imagine you had a perfect forecaster (not an oracle who knows the future, but a sub-oracle who perfectly synthesizes current information to forecast with true probabilities). Let's say that a bunch of questions have true probability of 60%. For questions where the crowd's estimates were too low, the crowd will "beat" the perfect forecaster 40% of the time (and so will most subgroups of the crowd). For questions where the crowd's estimates were too high they're going to "beat" the perfect forecaster 60% of the time.
I think your methodology would say that the perfect forecaster did better or worse at these questions based on the results, even though in this case the forecasts are by assumption equally perfect.
Looking at whether a particular coin toss came up heads or tails doesn't tell you very much about the quality of any given forecast. Imperfect forecasters are often rewarded for their errors over small samples; that doesn't make their forecasts better than the perfect forecaster's, they just got lucky.
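A quick simulation makes this concrete (the 50% crowd estimate is an assumed example of "too low"): with a true probability of 60%, a crowd forecasting 50% gets the better single-outcome Brier score on roughly 40% of questions, exactly as described, despite being the worse forecast.

```python
import random

def crowd_win_rate(true_p=0.6, crowd_p=0.5, trials=50_000):
    # How often a miscalibrated crowd scores a better (lower) Brier
    # score than a perfect forecaster on a single question's outcome.
    wins = 0
    for _ in range(trials):
        outcome = 1 if random.random() < true_p else 0
        wins += (crowd_p - outcome) ** 2 < (true_p - outcome) ** 2
    return wins / trials

random.seed(0)
print(crowd_win_rate())  # ~0.40: the crowd "wins" whenever the event fails to occur
```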
Great investigation, thanks for sharing!
This was very interesting. Thank you.
If the monthly probability of an event is 5%, then the annual probability is 1-(1-0.05)^12 = 46%
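As a sanity check on the arithmetic (assuming twelve independent months with a constant 5% hazard):

```python
monthly_p = 0.05
annual_p = 1 - (1 - monthly_p) ** 12  # complement of "no event in any month"
print(round(annual_p, 3))  # 0.46
```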