In 2014, I took part in a tournament for the Good Judgment Project. The goal of this project was to figure out whether certain individuals have habits of mind that make them better at predicting future events than others. About once a week, I would go onto their website and be presented with a series of questions about potential future events. I’d assign a probability to each event, but could skip as many questions as I wanted if I felt like I couldn’t offer a reasonable prediction. I have essentially no recollection of what events I was forecasting, but I remember that I was truly awful at it. Even though I had fun, my purely vibes-based strategy didn’t even put me in the top half of the leader board.
This was the fourth season of the Good Judgment Project, which had grown out of Philip Tetlock’s experiments in forecasting that began in 1984. His work got a major boost in 2011 with funding from IARPA, which sponsored the tournament I participated in and provided most of the material for his 2015 book, Superforecasting. These days, Good Judgment has seemingly evolved into a kind of oracle-for-hire consulting firm that sells the services of people it has deemed superforecasters.
Superforecasters are claimed to make up roughly 1.5% of the population and to hold a consistent advantage over others in predicting the future. This advantage comes from avoiding the biases that cloud others’ predictive abilities while also constructing better models of causality with more accurate base rates. One example of the difference in approach between superforecasters and non-superforecasters comes from a study Tetlock describes in Superforecasting. Participants in the Good Judgment Project had been asked to predict whether the Assad regime in Syria would fall. However, the time horizon of the prediction was randomized: some participants were asked to predict the probability of its fall over the next three months, others over the next six. Non-superforecasters predicted essentially the same probability over both horizons (40% versus 41% for the shorter and longer horizons, respectively). Superforecasters’ predictions, by contrast, scaled with the length of the window — they gave a 15% probability over three months and 21% over six — thus avoiding the cognitive error of scope insensitivity.
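As an illustration of what scope-sensitive forecasting looks like arithmetically, here is a minimal sketch using my own toy numbers (not Tetlock’s data): a forecaster who has a constant monthly hazard in mind should give different answers for different horizons.

```python
# Toy illustration of scope sensitivity: a forecaster who believes the regime
# has a constant 5% chance of falling in any given month (an assumed number)
# should scale their answer to the question's time horizon.

def prob_within(months: int, monthly_prob: float = 0.05) -> float:
    """Probability the event happens at least once within `months` months."""
    return 1 - (1 - monthly_prob) ** months

print(f"3-month horizon: {prob_within(3):.0%}")   # ~14%
print(f"6-month horizon: {prob_within(6):.0%}")   # ~26%
# A scope-insensitive forecaster gives roughly the same number for both horizons.
```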
Superforecasters can also skillfully sail between the shoals of over- and under-reaction to new information. They update their predictions in significantly smaller increments than non-superforecasters: each of a superforecaster’s updates differs from their previous prediction by just 3.5%, versus 5.9% for non-superforecasters. Superforecasters also tweak their forecasts more often, suggesting that they can detect finer shifts in the probabilities of these events.
The evidence about superforecasters’ superior cognitive habits has been translated into impressive claims about superior predictive accuracy compared both to laypeople and to domain experts. Superforecasters identified during the first year of the Good Judgment Project were almost twice as accurate as non-superforecasters during its second year. They were also 30% more accurate at predicting geopolitical events than intelligence analysts who had access to classified information. Even when laypeople had the help of advanced statistical tools and modeling aids, superforecasters were at least 20% more accurate.
Big if true, as they say! But I have my suspicions. For one thing, it’s really hard to know how to assess the accuracy of multiple predictions over time for a given event. Imagine a question in a year-long forecasting tournament asking competitors to predict a completely stochastic event that has a monthly probability of 5%, or an annual probability of 46% (i.e., 1 - (1 - 0.05)^12 — thanks for the correction to my math, Twin!). Both the superforecaster and the non-superforecaster accurately predict a 46% probability at the beginning of the year. But while the superforecaster diligently decreases their prediction to match the probability over the remaining months, the benighted non-superforecaster more or less forgets about the question and never updates their prediction. In the final month of the year — when the superforecaster gives the event a 5% probability of happening and the non-superforecaster hasn’t adjusted their original 46% prediction at all — the event happens. Using any common scoring metric, the superforecaster would be penalized, even though their prediction was correct. This challenge makes it hard for me to take seriously the estimates of how much more accurate superforecasters are than non-superforecasters.
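To make the penalty concrete, here is a toy scoring of that final month under two common scoring rules. This is my own illustration of the hypothetical above, not anything from the tournament data.

```python
import math

def brier(p: float, outcome: int) -> float:
    """Brier score for a single binary forecast; lower is better."""
    return (p - outcome) ** 2

def log_loss(p: float, outcome: int) -> float:
    """Log loss for a single binary forecast; lower is better."""
    return -(outcome * math.log(p) + (1 - outcome) * math.log(1 - p))

# The event happens (outcome = 1) in the final month. The superforecaster has
# correctly decayed their forecast to the 5% monthly rate; the non-superforecaster
# is still sitting on the original 46% annual figure.
for label, p in [("superforecaster", 0.05), ("non-superforecaster", 0.46)]:
    print(f"{label:>20}: Brier {brier(p, 1):.3f}, log loss {log_loss(p, 1):.3f}")
# superforecaster:     Brier ~0.90, log loss ~3.00
# non-superforecaster: Brier ~0.29, log loss ~0.78
```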
Also, the claim about improving on intelligence analysts’ predictions by 30% is only true when you use a specific algorithm to weight individual superforecasters’ predictions. This algorithm was one of twenty different aggregation methods that the researchers at Good Judgment tried. They settled on that particular algorithm after all the results were in and without adjusting their analysis for multiple comparisons, which means that there’s little reason to believe that the same aggregation method would be optimal for future forecasts. Notably, if you test twenty different aggregation techniques, you would expect one of them to come in under the 0.05 p-value that usually designates statistically significant differences between groups due to random noise alone — even if there was no actual difference. Without this particular aggregation method, the two groups have similar accuracy. That is not a trivial feat in and of itself, given the intelligence analysts’ advantage of access to classified documents, but it also raises the question of whether the superforecasters would add anything novel to future geopolitical forecasts.
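As a rough illustration of the multiple-comparisons issue, here is a quick simulation I put together (it assumes the twenty tests are independent, which they likely aren’t in practice) of how often at least one of twenty genuinely useless aggregation methods would clear p < 0.05 by chance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_methods, alpha = 100_000, 20, 0.05

# Under the null hypothesis, each test's p-value is uniform on [0, 1].
p_values = rng.uniform(size=(n_sims, n_methods))
at_least_one_hit = (p_values.min(axis=1) < alpha).mean()

print(f"Simulated:  {at_least_one_hit:.2f}")                 # ~0.64
print(f"Analytic:   {1 - (1 - alpha) ** n_methods:.2f}")     # 1 - 0.95^20 ~ 0.64
```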
At least to me, then, we’re in a situation where a) it’s hard to tell how much better superforecasters are than non-superforecasters, and b) it seems like superforecasters may have little if any advantage over experts in specific domains. There’s enough evidence to convince me that superforecasters probably do have habits of mind that should lead to better predictions. But the meaningful benefits of having even slightly better predictions about the future also make me think it’s worthwhile to pause and consider the kinds of questions and situations in which superforecasters can make the biggest impact.
I’ve come up with two ideas to test that might help us understand whether and in what cases superforecasters are useful. I present these in the spirit of thinking in public and without any investment in their being right or wrong — I’d be happy to have one or both of my hypotheses disproven.
My first idea is that superforecasters are just better than others at knowing which questions are most amenable to forecasting. If that were true, we’d expect superforecasters to be pickier about which questions they answer. The corollary is that forecasting talent itself is evenly distributed, but that superforecasters have a better meta-forecasting ability that lets them know what they’d be good at predicting.
I also want to explore the idea that superforecasters are not necessarily more accurate than non-superforecasters at the question level but are more reliable across all questions. If this hunch is correct, we’d expect superforecasters and aggregates of superforecasters to sometimes underperform on individual questions but to overperform in the context of whole tournaments.
What follows is my attempt to test these ideas using data from the 2022 Astral Codex Ten prediction contest. This is a great initial testing ground, first because the data are easily available, but also because the contest involved one-time predictions, which lets me gloss over the challenges of evaluating repeated predictions. Scott Alexander, the author of Astral Codex Ten (ACX), also included some descriptions from top scorers of their forecasting processes.
The ACX prediction contest was relatively small as forecasting tournaments go: 12 superforecasters participated along with 496 non-superforecasters. Compare this to around 7,500 participants in the original Good Judgment Project, and over 14,000 who have taken part in the Good Judgment Open tournaments that followed the fourth IARPA-sponsored season of the Good Judgment Project. Contestants were also asked not to spend more than 5 minutes on any given question. These limitations mean that my explorations are not conclusive, but they could point toward work on other datasets.
Do superforecasters just pick better questions to respond to?
From my experience in the Good Judgment Project, I knew that most forecasting tournaments don’t require that you make a prediction for every question. This is handled differently in different scoring systems, but in the ACX contest, skipped questions defaulted to Scott’s answer. You therefore had a kind of floor for your accuracy metric and an incentive to skip a question if you didn’t think you could generate a better guess than Scott did. Interestingly, the third-place forecaster said that he skipped a lot of questions based on what he saw as his weaknesses.
We also know that some questions are easier to forecast than others, even if we can’t precisely characterize what makes them hard to forecast. For example, forecasters seriously underestimated the speed of development for COVID antivirals and the protective benefit of COVID vaccines. And intuitively, we know it’s easy to make errors about events that can happen through a variety of different causal paths or otherwise have a lot of moving parts.
To determine whether this hypothesis held in the ACX contest, I calculated the proportion of superforecasters and non-superforecasters who answered each question. (My replication code is here along with versions of the data where I removed blank rows for easier processing.) The results pretty decisively destroy my hypothesis. Superforecasters answered in higher proportions than non-superforecasters in all but 9 of the 70 unique questions. Ten of the questions were answered by all of the superforecasters, and the difference was especially notable in questions that were seemingly considered the hardest by non-superforecasters.
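For what it’s worth, the comparison itself only takes a few lines. Here is a rough sketch of the calculation (the linked replication code is authoritative; the file name, the column names `question`, `group`, and `prediction`, and the group labels are assumptions on my part):

```python
import pandas as pd

# Sketch only: assumes a long-format table with one row per forecaster-question
# pair, where `prediction` is left blank when the forecaster skipped the question.
df = pd.read_csv("acx_2022_predictions.csv")  # assumed file name

answered = (
    df.assign(answered=df["prediction"].notna())
      .groupby(["question", "group"])["answered"]
      .mean()               # proportion of each group answering each question
      .unstack("group")     # one column per group (labels assumed below)
)

higher = (answered["superforecaster"] > answered["non-superforecaster"]).sum()
print(f"Superforecasters answered at a higher rate on {higher} of {len(answered)} questions")
```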
As I thought about how this analysis might go, I had imagined having to generate binomial confidence intervals to determine whether the findings were actually meaningful. But this is enough of a slam dunk for superforecasters that I won’t even bother.
Do superforecasters have an advantage at the question level or just the tournament level?
Most analyses I’ve seen that make general claims about superforecaster performance aggregate across questions. This is useful for making inferences about the habits of mind that theoretically would make someone a superforecaster, but it doesn’t say much about their usefulness in most real world applications. In most circumstances, the ideal would be to have an oracle that can reliably answer specific questions with better-than-average accuracy — even if it’s only a little better — rather than one that has very high variability in accuracy. The latter would be spot on for some questions, but could be worse than chance for others. They might be good for, say, picking a stock portfolio, but if I’m only interested in the answers to one or two very important questions, I’m not sure what I’d do with their forecasts.
This is why it’s important to look at superforecasters’ question-level performance. Doing so not only lets us figure out whether superforecasters are useful for supporting decisions based on any given question, but it could also help us understand whether there are certain types of questions that they are generally better equipped to handle.
To estimate superforecasters’ question-level performance in the ACX prediction contest, I’ve compared the log-loss score of the mean of their predictions on each question to the log-loss score of 10,000 randomly selected groups of non-superforecasters. (The details of log-loss score aren’t particularly important here, other than to know that a lower score indicates better predictive accuracy.) There were 12 superforecasters included in the contest, and so each randomly selected group of non-superforecasters was also composed of 12 individuals. Just like the superforecasters’ predictions were averaged for each question, I averaged each of the 10,000 non-superforecaster groups’ predictions prior to scoring.
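Concretely, the per-question procedure can be sketched like this. This is my reconstruction of the steps described above, with hypothetical names: `sf_preds` and `non_sf_preds` are arrays of the individual predictions for a single question, and `outcome` is 0 or 1.

```python
import numpy as np

rng = np.random.default_rng(42)

def log_loss(p: float, outcome: int, eps: float = 1e-6) -> float:
    """Log loss of one aggregated forecast for one question; lower is better."""
    p = min(max(p, eps), 1 - eps)  # guard against log(0)
    return -(outcome * np.log(p) + (1 - outcome) * np.log(1 - p))

def share_of_groups_beating_sf(sf_preds, non_sf_preds, outcome, n_draws=10_000, k=12):
    """For one question: fraction of random k-person non-superforecaster
    aggregates whose log loss beats the superforecaster aggregate's."""
    sf_score = log_loss(np.mean(sf_preds), outcome)
    group_scores = np.array([
        log_loss(np.mean(rng.choice(non_sf_preds, size=k, replace=False)), outcome)
        for _ in range(n_draws)
    ])
    return float((group_scores < sf_score).mean())
```

Repeating this for every resolved question gives the per-question comparisons discussed below.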
Note that 3 of the 70 questions hadn’t resolved or resolved ambiguously, and these were not included in my analyses. Note also that one of the main goals of the original ACX analysis was to compare superforecaster performance to prediction markets, so it was restricted to the 61 questions with comparable markets. Because my analysis excludes only 3 questions, the tournament-level log-loss scores here do not match the scores reported in the ACX analysis.
Below is the same data presented in a different way. This is a density plot of the probability that an aggregate of 12 randomly selected non-superforecasters will be more accurate than the aggregate of superforecasters on a given question. In total, decision makers would be better off trusting a randomly selected group of 12 non-superforecasters rather than the 12 superforecasters in 39% of the forecasting tasks included in the ACX prediction contest.
There are a few specific questions worth investigating in more detail because the superforecasters’ predictions were outliers. These questions might give us hints about the types of questions that superforecasters perform best and worst at. First, superforecasters outperformed all 10,000 randomly selected groups of non-superforecasters on question 26, which asked for the probability that “some new [COVID] variant not currently known is greater than 25% of cases.” Superforecasters also outperformed all but 0.07% of the non-superforecaster groups on question 64, which concerned the probability that Lula would be elected president of Brazil. It’s worth noting that both of these questions resolved positively.
On the other end of the spectrum, superforecasters underperformed all but 0.21% of non-superforecaster groups on question 4: “PredictIt thinks Donald Trump is the most likely 2024 GOP nominee.” The superforecasters performed nearly as poorly on question 20, which asked for the probability that there would be “fewer than 10k daily average official cases in US in December 2022.” Again, it’s probably notable that both of these questions resolved negatively.
This discrepancy in performance between questions that resolved positively and those that resolved negatively makes me wonder whether we can see the same divergence in a calibration plot. And, indeed, we do. Below is a plot where, for each group, I took the observed frequency of events within each 5%-wide range of forecasted probability, then smoothed those frequencies using loess. The dashed line represents perfect calibration, which would mean that, for example, 20% of events assigned a 20% probability actually occur.
The only meaningful difference between the two groups’ calibration appears at the upper extreme of estimated probability. This suggests that the superforecaster advantage in this contest can be identified as avoiding overconfidence about likely events.
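For reference, here is a minimal sketch of the binning-and-smoothing step behind the calibration curve. It assumes `forecasts` and `outcomes` are aligned pandas Series for one group of forecasters, and it uses statsmodels’ lowess as a stand-in for loess.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def calibration_curve(forecasts: pd.Series, outcomes: pd.Series, width: float = 0.05):
    """Observed event frequency within each `width`-wide bin of forecasted
    probability, smoothed with statsmodels' lowess (a loess-style smoother)."""
    bin_mid = (np.floor(forecasts / width) * width + width / 2).clip(upper=1.0)
    observed = outcomes.groupby(bin_mid).mean()
    # Returns an array of (bin midpoint, smoothed observed frequency) pairs.
    return sm.nonparametric.lowess(observed.to_numpy(), observed.index.to_numpy(), frac=0.6)
```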
Next, we’ll look at the distribution of tournament-level scores for the superforecaster team versus the randomly-selected teams of 12 non-superforecasters. In the original analysis posted on ACX, the aggregate of superforecasters’ predictions beat 97% of individual forecasters, but individually they ranked between the 34th and 99th percentile. The superforecaster aggregate decisively beat the simple aggregate of all non-superforecasters, which finished in the 84th percentile.
The chart below shows the density of the non-superforecaster groups’ scores across the whole tournament compared to the superforecaster group’s score. In total, only 1.7% of non-superforecaster groups beat the superforecasters’ overall score. Looking at the distribution below, we see that this is far better than chance alone.
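For completeness, the tournament-level comparison behind this chart is just the per-question scores summed within each group. A minimal sketch, continuing the hypothetical names from the per-question step:

```python
import numpy as np

def share_of_groups_beating_team(team_scores: np.ndarray, group_scores: np.ndarray) -> float:
    """Fraction of sampled non-superforecaster groups whose total log loss across
    the whole tournament is lower (better) than the superforecaster team's.

    team_scores:  per-question log losses for the superforecaster aggregate, shape (n_questions,)
    group_scores: per-question log losses for each sampled group, shape (n_groups, n_questions)
    """
    return float((group_scores.sum(axis=1) < team_scores.sum()).mean())
```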
So… are they useful or not?
Based only on this limited dataset, I’d say that superforecasters are most useful in the limited circumstance of calibrating expectations around an event that you a priori think is likely to happen. Superforecasters were far better at identifying small gradations of high probability, whereas non-superforecasters seemed unable to meaningfully distinguish between probabilities above about 75%.
This gives us an idea of the kind of questions that we might want to ask superforecasters, but it doesn’t really capture the variability in performance across questions. In 39% of questions, decision makers would get more reliable answers from a randomly selected group of non-superforecasters. Knowing this creates even more uncertainty for decision makers: it’s as if they get a forecast from superforecasters and then flip a coin to determine whether they should believe it or not. Limiting the questions you ask superforecasters to high-probability scenarios might bias that coin a bit, though, and make it easier to act on their predictions.
Their performance was good enough at the tournament level that it’s hard to imagine any systematic way of picking a better group. At the same time, their question-level performance is only useful in a limited set of circumstances, and there are many cases where it might be better to prioritize low variance in performance over tournament-level accuracy.
My interest here has not been to deflate enthusiasm around superforecasters. I think it’d generally be a good thing for the world if we had more reliable information about the near future. It would allow us to better allocate resources, save money, and in some cases save lives too.
My concern instead is that the idea of forecasting in general and especially superforecasting might give us false ideas about how predictable the future really is. We saw the risk of this stance in the early days of COVID, where models like the ones from the Institute for Health Metrics and Evaluation (mis)guided policy decisions while radically underestimating the duration and impact of the pandemic.
In these circumstances, people tend to say that model building and forecasting are better than nothing. I don’t think that’s true. If we make decisions based on their predictions, we’re preparing for a specific scenario at the cost of building resilience to a wide variety of scenarios. This is what scholars of deep uncertainty advise: when we can’t even start to build one reasonable mental model of an event, we should stop trying to settle on a single model and instead focus on building a range of maximally different models. Then we can make decisions based not on what will optimize outcomes given a specific set of events but rather on what will minimize regret no matter which way events play out. An example of this is how we plan to address the problems created by climate change. Climate is a complex system, and it’s not entirely clear how it might change over the coming centuries. Rather than optimizing our adaptations for the scenario that looks most likely right now, it would make more sense to ensure that we have the capacity to adapt to the broadest possible range of plausible circumstances.
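To make the regret-minimizing logic concrete, here is a toy sketch (all numbers invented) of choosing among adaptation strategies by minimax regret rather than by betting on the single most likely scenario.

```python
import numpy as np

# Toy example of minimax regret: rows are adaptation strategies, columns are
# climate scenarios we can't assign meaningful probabilities to. Payoffs are made up.
payoffs = np.array([
    [9, 2, 1],   # optimize for the currently most likely scenario
    [6, 5, 4],   # build broad adaptive capacity
    [3, 4, 6],   # optimize for a worst-case scenario
])

# Regret = how much worse each strategy does than the best strategy in that scenario.
regret = payoffs.max(axis=0) - payoffs
worst_case_regret = regret.max(axis=1)

print("Worst-case regret per strategy:", worst_case_regret)   # [5 3 6]
print("Minimax-regret choice:", worst_case_regret.argmin())   # the broad-capacity strategy
```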
Superforecasting isn’t an unmixed good across all scenarios. The process of consulting superforecasters should start by understanding whether the problem you’re facing is best tackled through optimization or through bolstering resilience — in other words, maximizing utility or minimizing regret. Getting that wrong can mean sacrificing adaptability and resilience. You also need to consider how much high variability in performance across questions will matter: are there one or two key questions that are more important than any other, or will you be acting on the gestalt generated across questions? Only once you’ve decided that the problem is amenable to optimization and that high variance in accuracy across questions is tolerable should you submit the questions to forecasting, where they might best be framed in terms of highly likely outcomes.
Great investigation, thanks for sharing!
> Notably, if you test twenty different aggregation techniques, you would expect one of them to come in under the 0.05 p-value that usually designates statistically significant differences between groups due to random noise alone — even if there was no actual difference.
Nit: p-values quantify *sampling* error. If you take the best of 20 techniques on a single sample, it's hard to measure whether the result is over-fit without additional data. You need a better understanding of how independent the techniques are, or -- more likely -- a separate set of evaluation data to evaluate whether your model is over-tuned. [And in the linked paper (https://goodjudgment.io/docs/Goldstein-et-al-2015.pdf) the MPDDA result for "GJP Best Method" has p<0.001]
Interesting article, but I'm stuck on the 'Big if true' paragraph. I see someone else already pointed out the maths error, but I think the main point of the paragraph is also flawed.
Yes, if the event happens in the final month then your superforecaster will be penalised relative to the non-superforecaster. But how is this different from any other case where a low-probability event happens?
If we've reached the final month and the event hasn't yet happened, then 95% of the time it will continue to not happen, and the superforecaster will score much better than the non-superforecaster. Five percent of the time it will happen, and the superforecaster's score will suffer -- but that's the nature of prediction under uncertainty. Even the best possible predictor will sometimes be 'wrong', and their skill will only be clearly evident when the results of many predictions are aggregated.
I think you're suggesting this example is different, but I don't see how. The superforecaster has a higher expected score, whether we're looking at the year as a whole or any slice of it that begins with the result still undecided.
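A quick toy check of the expected scores in that final month, using the post's 5% monthly probability and Brier scores, makes this concrete:

```python
def brier(p: float, outcome: int) -> float:
    return (p - outcome) ** 2

TRUE_P = 0.05  # the post's assumed monthly probability for the final month

for label, forecast in [("superforecaster (5%)", 0.05), ("non-superforecaster (46%)", 0.46)]:
    expected = TRUE_P * brier(forecast, 1) + (1 - TRUE_P) * brier(forecast, 0)
    if_event = brier(forecast, 1)
    print(f"{label}: expected Brier {expected:.3f}, Brier if the event happens {if_event:.3f}")
# superforecaster:     expected ~0.048, but ~0.90 in the 5% of worlds where the event happens
# non-superforecaster: expected ~0.216, and ~0.29 when it happens
```

The superforecaster looks much worse in the unlucky 5% of cases, but their expected score is far better, which is the point.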