> Notably, if you test twenty different aggregation techniques, you would expect one of them to come in under the 0.05 p-value that usually designates statistically significant differences between groups due to random noise alone — even if there was no actual difference.
Nit: p-values quantify *sampling* error. If you take the best of 20 techniques on a single sample, it's hard to measure whether the result is over-fit without additional data. You need a better understanding of how independent the techniques are, or -- more likely -- a separate held-out set of data to check whether your model is over-tuned. [And in the linked paper (https://goodjudgment.io/docs/Goldstein-et-al-2015.pdf) the MPDDA result for "GJP Best Method" has p<0.001]
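To put a rough number on that independence point, here's a toy simulation (my own sketch, not anything from the paper): it asks how often the best of 20 do-nothing techniques clears p < 0.05, depending on how correlated the techniques are.

```python
# Toy multiple-comparisons simulation -- all numbers are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_questions, n_techniques, n_trials = 100, 20, 2000

def min_p(shared_frac):
    # Per-question score differences vs. a baseline are pure noise, with
    # `shared_frac` of their variance shared across the 20 techniques.
    shared = rng.standard_normal((n_questions, 1))
    unique = rng.standard_normal((n_questions, n_techniques))
    diffs = np.sqrt(shared_frac) * shared + np.sqrt(1 - shared_frac) * unique
    return min(stats.ttest_1samp(diffs[:, j], 0).pvalue for j in range(n_techniques))

for shared_frac in (0.0, 0.9):
    hits = sum(min_p(shared_frac) < 0.05 for _ in range(n_trials)) / n_trials
    print(f"shared variance {shared_frac:.0%}: best of 20 reaches p<0.05 in {hits:.0%} of trials")
# Independent techniques land near 1 - 0.95**20, about 64%; heavily
# correlated ones considerably less often -- hence the independence caveat.
```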
Interesting article, but I'm stuck on the 'Big if true' paragraph. I see someone else already pointed out the maths error, but I think the main point of the paragraph is also flawed.
Yes, if the event happens in the final month then your superforecaster will be penalised relative to the non-superforecaster. But how is this different from any other case where a low-probability event happens?
If we've reached the final month and the event hasn't yet happened, then 95% of the time it will continue to not happen, and the superforecaster will score much better than the non-superforecaster. Five percent of the time it will happen, and the superforecaster's score will suffer -- but that's the nature of prediction under uncertainty. Even the best possible predictor will sometimes be 'wrong', and their skill will only be clearly evident when the results of many predictions are aggregated.
I think you're suggesting this example is different, but I don't see how. The superforecaster has the better expected score, whether we're looking at the year as a whole or any slice of it that begins with the result still undecided.
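To make the expected-score point concrete, here's the arithmetic for the final-month slice. The 5% is the true monthly probability from the example; the flat 25% I give the non-superforecaster is purely my own illustrative assumption.

```python
# Expected Brier scores for the final month (toy numbers).
def brier(p, outcome):
    return (p - outcome) ** 2  # squared error of a probability forecast

p_true, forecasts = 0.05, {"superforecaster": 0.05, "non-superforecaster": 0.25}

for name, p in forecasts.items():
    expected = (1 - p_true) * brier(p, 0) + p_true * brier(p, 1)
    print(f"{name:>19}: no-event {brier(p, 0):.4f}, event {brier(p, 1):.4f}, "
          f"expected {expected:.4f}")
# The superforecaster loses badly in the 5% branch (0.9025 vs 0.5625) yet
# still has the better (lower) expected Brier: 0.0475 vs 0.0875.
```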
The way that I've thought about this (and I feel now like I didn't explain myself well in that paragraph!) is that calibration captures how well forecasters predict true probabilities, but that assessing discrimination is a much trickier problem. GJP, for example, used mean daily Brier score, which would be potentially deceptive in any case where the probability changes over time. Since it's rare to be able to determine the actual probability of an event even retrospectively, I just don't know how to assess discrimination and feel like it's an open question whether it's meaningful to do so. At the same time, discrimination is what makes some forecasts more useful than others and so it's an important parameter. That's why I used permutation to try to understand the relative value of relying on teams of superforecasters versus other teams.
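Roughly the shape of what I mean by a permutation comparison -- this is just a generic sketch on synthetic per-question Brier scores, not my actual analysis:

```python
# Generic permutation test on per-question Brier scores (synthetic data).
import numpy as np

rng = np.random.default_rng(1)
supers = rng.beta(2, 8, size=150) * 0.6         # fake per-question Brier scores
others = rng.beta(2, 8, size=150) * 0.6 + 0.03  # slightly worse on average

observed = others.mean() - supers.mean()
pooled = np.concatenate([supers, others])

n_perm, count = 10_000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    a, b = pooled[:len(supers)], pooled[len(supers):]
    if b.mean() - a.mean() >= observed:
        count += 1
print(f"observed gap {observed:.3f}, permutation p ~ {count / n_perm:.4f}")
# Shuffling the team labels asks: how often would a gap this large show up
# if team membership had nothing to do with per-question scores?
```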
I don't think the methodology you use to investigate your second hypothesis actually answers your question.
Imagine you had a perfect forecaster (not an oracle who knows the future, but a sub-oracle who perfectly synthesizes current information to forecast with true probabilities). Let's say that a bunch of questions have a true probability of 60%. For questions where the crowd's estimates were too low, the crowd will "beat" the perfect forecaster 40% of the time (and so will most subgroups of the crowd). For questions where the crowd's estimates were too high, they're going to "beat" the perfect forecaster 60% of the time.
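A quick toy simulation of that setup, with made-up crowd forecasts of 45% and 75% standing in for "too low" and "too high":

```python
# Crowd vs. a "perfect" forecaster when the true probability is 60%.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
outcomes = rng.random(n) < 0.60      # events with true probability 60%
perfect = 0.60                       # sub-oracle forecasts the true probability

for crowd, label in ((0.45, "crowd too low"), (0.75, "crowd too high")):
    crowd_wins = ((crowd - outcomes) ** 2 < (perfect - outcomes) ** 2).mean()
    print(f"{label} ({crowd:.0%}): crowd beats perfect on {crowd_wins:.0%} of questions")
# Too low: the crowd "wins" whenever the event doesn't happen (~40% of questions).
# Too high: the crowd "wins" whenever it does (~60%).
```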
I think your methodology would say that the perfect forecaster did better or worse at these questions based on the results, even though in this case the forecasts are by assumption equally perfect.
Looking at whether a particular coin toss came up heads or tails doesn't tell you very much about the quality of any given forecast. Imperfect forecasters are often rewarded for their errors over small samples; that doesn't make their forecasts better than the perfect forecaster's -- they just got lucky.
Jon, if I'm understanding you correctly, it seems like you're saying that only calibration should matter for assessing performance, since calibration indicates alignment of forecasts with true probabilities. Is that right?
The idea that only calibration matters for forecasting is a position that I considered, because it's just so hard to know how to score accuracy. Two forecasters, though, can be equally well calibrated but have different levels of certainty about their forecasts -- one would have more forecasts near the 50% mark, while the other would have more near 0% and 100%. The more certain forecaster is more useful to decision makers, even if it's hard to know how to quantify that benefit. That's why I tried to develop a pragmatic approach to determining how often superforecasters outperform others.
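A toy illustration of that difference, with numbers invented for the example:

```python
# Two forecasters, both well calibrated, one far more certain than the other.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
true_p = rng.choice([0.2, 0.8], size=n)   # half the questions are 20%, half 80%
outcomes = rng.random(n) < true_p

hedged = np.full(n, 0.5)   # calibrated overall: events happen ~50% of the time
sharp = true_p             # equally calibrated, but much more certain

for name, f in (("hedged", hedged), ("sharp", sharp)):
    print(f"{name}: mean Brier {((f - outcomes) ** 2).mean():.3f}")
print(f"event rate when sharp says 80%: {outcomes[true_p == 0.8].mean():.2f}")
# Both pass a calibration check, but the sharper forecaster scores ~0.16
# versus ~0.25 -- that gap is the extra certainty a decision maker can use.
```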
My hope is that, if nothing else, the analysis I did gives us a clue about what kinds of questions might allow superforecasters to have the biggest advantage over non-superforecasters (and vice versa!). The scenarios in which they have an advantage can help us prioritize questions that they'd be best at answering.
No, calibration isn't everything, since as you point out it doesn't capture discrimination.
Mostly I guess I'm just saying that the only way to judge single forecasts is versus the true probability, not the realized outcome. Usually even after the event occurs we don't know the true probability (especially if you don't want to defer to superforecaster judgment), so there's not much we can say about the quality of single forecasts.
We do have decent ways to score aggregates of questions, though (Brier scores, log scores, etc.). Yes, they're noisy, but with enough questions they become pretty meaningful. And it's possible to quantify how noisy they are (e.g. you can answer questions like, "if A's forecasts are perfect, what's the probability that B's forecasts will score better anyway?"), so we don't have to give up on them just because of the noise.
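For instance, here's a rough Monte Carlo sketch of that last question; every number in it is invented for illustration.

```python
# If A forecasts the true probabilities and B is always 10 points too high,
# how often does B still post the better total Brier score?
import numpy as np

rng = np.random.default_rng(4)
n_questions, n_trials, bias = 50, 5000, 0.10

b_wins = 0
for _ in range(n_trials):
    true_p = rng.uniform(0.05, 0.95, n_questions)
    outcomes = rng.random(n_questions) < true_p
    a = true_p                                # perfect forecaster
    b = np.clip(true_p + bias, 0.01, 0.99)    # systematically too high
    if ((b - outcomes) ** 2).mean() < ((a - outcomes) ** 2).mean():
        b_wins += 1
print(f"B beats a perfect A on {b_wins / n_trials:.0%} of {n_questions}-question sets")
# With 50 questions B still wins a meaningful share of the time; with many
# more questions that share shrinks, which is the sense in which the scores
# become meaningful -- and the noise itself is quantifiable.
```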
If you did your same analysis but the superforecasters' forecasts were replaced by a perfect forecaster, wouldn't you reach the same conclusions? That the crowd did better than perfect on some questions and worse on others? It seems to me like the questions you're saying the supers did poorly on are just ones where the event occurred and the crowd probabilities were higher than the supers', or where the event didn't occur and the crowd probabilities were lower than the supers'. But in isolation I don't think that tells us anything about whether the crowd is more or less accurate than the supers in those cases -- there will be questions with those outcomes whether the supers are perfect or imperfect.
Great investigation, thanks for sharing!
This was very interesting. Thank you.
If the monthly probability of an event is 5%, then the annual probability is 1 - (1 - 0.05)^12 ≈ 46%.
Correct! Thanks for catching my mistake.