9 Comments
Feb 21, 2023 · Liked by Nathaniel Hendrix

Great investigation, thanks for sharing!

> Notably, if you test twenty different aggregation techniques, you would expect one of them to come in under the 0.05 p-value that usually designates statistically significant differences between groups due to random noise alone — even if there was no actual difference.

Nit: p-values quantify *sampling* error. If you take the best of 20 techniques on a single sample, it's hard to tell whether the result is over-fit without additional data. You need a better understanding of how independent the techniques are, or -- more likely -- a separate set of held-out evaluation data to check whether your model is over-tuned. [And in the linked paper (https://goodjudgment.io/docs/Goldstein-et-al-2015.pdf) the MPDDA result for "GJP Best Method" has p<0.001]
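
A quick toy illustration of the over-fitting point (the 20 techniques, 100 questions, and noise level are made-up assumptions, not anything from the article or the paper):

```python
# Toy simulation of the "best of 20 techniques" worry. All 20 "techniques"
# here are equally good by construction (true probability plus independent
# noise), so any apparent winner on one sample is over-fit to that sample.
import numpy as np

rng = np.random.default_rng(0)
n_questions, n_techniques = 100, 20

true_p = rng.uniform(0.2, 0.8, size=n_questions)
outcomes_a = rng.binomial(1, true_p)   # the sample you tune on
outcomes_b = rng.binomial(1, true_p)   # a held-out sample

forecasts = np.clip(
    true_p + rng.normal(0, 0.05, size=(n_techniques, n_questions)), 0.01, 0.99
)

brier_a = ((forecasts - outcomes_a) ** 2).mean(axis=1)
brier_b = ((forecasts - outcomes_b) ** 2).mean(axis=1)

best = brier_a.argmin()  # "best" technique chosen on sample A
print(f"chosen technique's Brier on A: {brier_a[best]:.4f}")
print(f"chosen technique's Brier on held-out B: {brier_b[best]:.4f}")
print(f"median technique on B: {np.median(brier_b):.4f}")
```

In runs of this sketch the "winner" on sample A tends to drift back toward the pack on the held-out sample, which is the over-tuning worry.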

Feb 13, 2023 (edited)

Interesting article, but I'm stuck on the 'Big if true' paragraph. I see someone else already pointed out the maths error, but I think the main point of the paragraph is also flawed.

Yes, if the event happens in the final month then your superforecaster will be penalised relative to the non-superforecaster. But how is this different from any other case where a low-probability event happens?

If we've reached the final month and the event hasn't yet happened, then 95% of the time it will continue to not happen, and the superforecaster will score much better than the non-superforecaster. Five percent of the time it will happen, and the superforecaster's score will suffer -- but that's the nature of prediction under uncertainty. Even the best possible predictor will sometimes be 'wrong', and their skill will only be clearly evident when the results of many predictions are aggregated.

I think you're suggesting this example is different, but I don't see how. The superforecaster has a better expected score, whether we're looking at the year as a whole or any slice of it that begins with the result still undecided.
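
A back-of-the-envelope check, assuming a 5% true chance the event happens in the final month, the superforecaster forecasting 5%, and (purely as an illustrative assumption) the non-superforecaster forecasting 46%:

```python
# Expected Brier scores for the final month, assuming a 5% true chance the
# event happens in that month. The superforecaster says 5%; the 46% for the
# non-superforecaster is just an illustrative assumption.
p_true = 0.05
forecasts = {"superforecaster": 0.05, "non-superforecaster": 0.46}

for name, f in forecasts.items():
    expected_brier = p_true * (f - 1) ** 2 + (1 - p_true) * f ** 2
    print(f"{name}: expected Brier = {expected_brier:.4f} (lower is better)")

# If the event does happen, the superforecaster scores 0.9025 vs 0.2916 and
# looks much worse -- but in expectation they score 0.0475 vs 0.2156.
```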


This was very interesting. Thank you.


If the monthly probability of an event is 5%, then (assuming the months are independent) the annual probability is 1 - (1 - 0.05)^12 ≈ 46%.
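
A one-line check of that figure, under the same independence assumption:

```python
# Quick check: monthly probability 5%, assuming independent months.
p_month = 0.05
p_year = 1 - (1 - p_month) ** 12
print(f"{p_year:.4f}")  # 0.4596, i.e. about 46%
```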

Feb 9, 2023 (edited)

I don't think your methodology for investigating your 2nd hypothesis answers your question.

Imagine you had a perfect forecaster (not an oracle who knows the future, but a sub-oracle who perfectly synthesizes current information to forecast with the true probabilities). Let's say a bunch of questions have a true probability of 60%. For questions where the crowd's estimates were too low, the crowd will "beat" the perfect forecaster 40% of the time (and so will most subgroups of the crowd). For questions where the crowd's estimates were too high, they're going to "beat" the perfect forecaster 60% of the time.

I think your methodology would say the perfect forecaster did better or worse on these questions based on the outcomes, even though the perfect forecaster's forecasts are, by assumption, equally good on both sets of questions.

Looking at whether a particular coin toss came up heads or tails doesn't tell you very much about the quality of any given forecast. Imperfect forecasters are often rewarded for their errors over small samples; that doesn't make their forecasts better than the perfect forecaster's, they just got lucky.
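
Here's a toy version of that thought experiment: questions with a true probability of 60%, a calibrated "perfect" forecaster at 60%, and a crowd that misses by 20 points in one direction or the other (the 20-point error is my assumption, just to make the numbers concrete).

```python
# Simulate per-question Brier "wins" for a crowd that is too low (40%) or
# too high (80%) against a perfect forecaster at the true 60%.
import numpy as np

rng = np.random.default_rng(1)
outcomes = rng.binomial(1, 0.6, size=100_000)

def brier(f, y):
    return (f - y) ** 2

perfect = brier(0.6, outcomes)
crowd_low = brier(0.4, outcomes)    # crowd too low
crowd_high = brier(0.8, outcomes)   # crowd too high

print("crowd-too-low beats the perfect forecaster on",
      f"{(crowd_low < perfect).mean():.2f} of questions")    # ~0.40
print("crowd-too-high beats the perfect forecaster on",
      f"{(crowd_high < perfect).mean():.2f} of questions")    # ~0.60
print("mean Brier -- perfect:", f"{perfect.mean():.3f}",
      "| low crowd:", f"{crowd_low.mean():.3f}",
      "| high crowd:", f"{crowd_high.mean():.3f}")             # 0.24 vs 0.28 vs 0.28
```

The perfect forecaster still has the best mean Brier score, but the per-question "wins" split roughly 40/60 exactly as described, which is why outcome-by-outcome comparisons don't say much about forecast quality.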
