Assessing and Improving Expert Forecasts

What follows is a guest post by Kelsey Woerner, a soon-to-graduate senior at Dartmouth College double-majoring in government and psychology. She completed the research described below as part of her senior honors thesis in Dartmouth’s Department of Government under the guidance of her advisor (and my colleague), Ben Valentino. Her thesis is a terrific piece of research on judgment-based forecasting, and I’m excited to have her share her findings here.

The title of this blog alludes to the unfortunate state of expert political forecasting today.  The only research that has systematically assessed the accuracy of expert political judgment leads us to the sorry conclusion that experts often perform little better than chance. We might as well be “dart-throwing chimps.”

Lots of people have lamented the state of forecasting, but no research to date has asked whether it’s possible to improve the accuracy of our predictions and systematically compared strategies for doing so. That’s what I aimed to do in the project I describe in this post. The findings offer a beacon of hope on the bleak horizon of the forecasting frontier. It turns out that two simple strategies improve accuracy by 10 to 15 percent.

In an original month-long, four-wave panel experiment, I gave forecasters different types of information and watched to see how their accuracy changed as a result. In particular, I looked at the effects of providing forecasters with information about the historical frequencies (base rates) of the phenomena they are predicting and simple feedback about their performance in previous waves. For each of four weeks, participants made probabilistic forecasts in four domains relating to domestic and international politics and economics: the Dow Jones Industrial Average stock index, the national unemployment rate, Obama’s presidential job approval ratings, and the price of crude oil. I randomly assigned 308 participants to one of three groups. The base rate group received information about how frequently changes of various magnitude in these variables occurred in the previous year; the performance feedback group received information about how far off their predictions were the previous week; the control group received no extra information. I also recruited an “expert” subgroup of people with backgrounds in finance and economics in order to look at the effects of expertise on accuracy of Dow predictions, and I distributed these 72 experts evenly among the three groups.
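For readers curious what the base rate treatment looked like in practice, here is a rough sketch in Python of how one can tabulate base rates from a year of weekly changes. The numbers and magnitude buckets below are invented for illustration; they are not the actual data or cutoffs used in the experiment:

```python
from collections import Counter

# Hypothetical weekly percent changes in the Dow over the previous year
# (illustrative numbers only, not the data participants saw).
weekly_changes = [0.4, -1.2, 2.1, 0.8, -0.3, 3.5, -2.7, 0.1, 1.9, -0.6,
                  0.2, -4.1, 1.1, 0.7, -1.8, 2.9, 0.5, -0.9, 1.4, -0.2]

def bucket(change):
    """Assign a change to a magnitude bucket (cutoffs are invented)."""
    magnitude = abs(change)
    if magnitude < 1:
        return "under 1%"
    elif magnitude < 3:
        return "1-3%"
    return "3% or more"

# Count how often changes of each magnitude occurred, then convert
# the counts to historical frequencies (base rates).
counts = Counter(bucket(c) for c in weekly_changes)
total = len(weekly_changes)
base_rates = {k: counts[k] / total for k in counts}

for label, rate in sorted(base_rates.items()):
    print(f"{label}: {rate:.0%}")
```

Frequencies like these–how often changes of each magnitude actually occurred in the past–are the kind of information the base rate group received before forecasting.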

The results are very encouraging. Both strategies significantly improved forecasting accuracy. On average, participants who received base rate or performance feedback information were 10 to 15 percent more accurate than those who did not. At the end of this post, I’ve included graphs and tables of these results, as well as some extra explanation.

A few other key findings might also be of interest to readers.  Forecasters who received either type of information were able to “learn,” or improve their own accuracy, over the course of the month. In addition, the experiment tells us about how forecasters used the information they received. Base-rate information tended to moderate participants’ forecasts, while feedback produced more aggressive forecasts. Experts predicted with slightly better accuracy than nonexperts in their domain of expertise (the Dow), and they appeared to use treatment information more carefully and effectively than nonexperts.
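One simple way to picture that “moderating” effect is as shrinkage of a raw probability toward the base rate. The sketch below is only an illustrative model of the behavior–the blending weight is invented–not a mechanism the experiment measured:

```python
def moderate(raw_forecast, base_rate, weight=0.5):
    """Blend a raw probability forecast with the base rate.

    weight is the trust placed in the base rate: 0 returns the raw
    forecast unchanged, 1 returns the base rate itself.
    """
    return (1 - weight) * raw_forecast + weight * base_rate

# An overconfident 90% forecast of an event with a 10% base rate
# gets pulled back toward the middle:
print(moderate(0.90, 0.10))  # 0.5
```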

Of course, these findings would be mildly interesting but not very relevant or useful if expert political forecasters were already employing these kinds of information in their forecasting. In order to know whether these strategies might help forecasters in the real world, I needed to assess whether they already use feedback and base rates, either formally or informally, when making predictions. With a combination of case studies and surveys, I looked at whether expert political analysts: 1) make falsifiable predictions that can be evaluated for accuracy and used for feedback, and 2) have a firm understanding of base rate information.

In case studies of two premier NGOs that aim to warn policy makers of political instability–International Crisis Group (ICG) and Fund for Peace (FFP)–I randomly sampled 20 analytical products from each organization, counted predictions, and scored them according to a set of falsifiability criteria. Not one of the 89 ICG predictions or 27 FFP predictions provided the three pieces of information necessary to evaluate accuracy: 1) a clearly defined event, 2) a clearly defined time frame, and 3) an assessment of the probability of the given event in the given time frame. Expert political forecasters are making predictions in their analytical reports, but these predictions aren’t falsifiable and therefore can’t generate feedback.
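To make the falsifiability criteria concrete, here is a toy check in Python. It assumes each prediction is represented as a dictionary with hypothetical field names; a prediction missing any of the three pieces of information cannot be scored for accuracy:

```python
def is_falsifiable(prediction):
    """A prediction is scorable only if it names a clearly defined
    event, a time frame, and a probability (the three criteria above)."""
    required = ("event", "time_frame", "probability")
    return all(prediction.get(field) is not None for field in required)

# A fully specified prediction passes:
print(is_falsifiable({"event": "coup in Country X",
                      "time_frame": "within 12 months",
                      "probability": 0.2}))  # True

# A vague warning with no time frame or probability does not:
print(is_falsifiable({"event": "instability may worsen"}))  # False
```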

I also used two small-scale surveys of analysts at comparable NGOs to assess whether these analysts demonstrate an understanding of the base rates for the types of events they often predict. Analysts were asked to estimate the average annual onsets of rare political events such as coups, civil wars, and episodes of mass killing, and to assess the likelihood of similar events over the course of the coming year in at-risk countries. These experts significantly over- or under-estimated base rates and consistently assigned extremely high probabilities to very unlikely outcomes.

These case studies and surveys indicate that expert political analysts today don’t use feedback and base rate information when they formulate forecasts. Combined with the experimental findings described above, this strongly suggests that analysts could significantly improve their accuracy with these straightforward and inexpensive strategies. For policy makers receiving daily briefings and analysts producing regular warning reports, the promise of improved accuracy holds considerable value.

The greatest impediment to improving predictive judgment is the reluctance of forecasters to make falsifiable predictions and systematically assess performance. This research suggests new ways to encourage forecasters to assess their own performance. It shows that tracking performance is in forecasters’ own best interest because it offers the very tangible deliverable of substantially improved forecasting accuracy.

Let’s hope for the day when one’s predictive judgment is valued because it is supported by an established track record. The unfortunate reality is that such a day is far away, or at least farther than we would like (and I hope that this prediction is proven wrong!). We don’t need to sit idly by, however, and simply wait for that day to come. The strategies explored in this project set us at the beginning of a long but promising path towards minimizing controllable forecasting error and thereby improving our predictive judgment.

Supplemental Materials

The following graphs and tables show the significant and substantive effects of the base rate and feedback treatments. I calculated accuracy scores according to two scoring rules: a quadratic score (often called the Brier score) and a linear score. Although most of the forecasting literature uses quadratic scoring because it rewards forecasters for reporting their true beliefs, I used both rules for a few reasons. First, my feedback treatment group received feedback according to a linear rule–which is easier to understand from the forecaster’s perspective–so it made sense to evaluate them by that rule as well. Second, the two rules almost always agree on the direction of a trend, but results are sometimes significant under one rule and not the other. This is evident in the following graphs.
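For readers who want the formulas, here is a minimal sketch of the two rules for a single binary event (outcome 0 = did not occur, 1 = occurred). The study’s forecasts covered ranges of outcomes, so its implementation differed in detail, but the logic is the same:

```python
def brier_score(forecast, outcome):
    """Quadratic (Brier) score: squared distance between the
    probability forecast and the outcome. 0 is perfect; lower is better."""
    return (forecast - outcome) ** 2

def linear_score(forecast, outcome):
    """Linear score: absolute distance between the probability
    forecast and the outcome. 0 is perfect; lower is better."""
    return abs(forecast - outcome)

# A 70% forecast of an event that then occurred:
print(round(brier_score(0.7, 1), 4))   # 0.09
print(round(linear_score(0.7, 1), 4))  # 0.3
```

Because the quadratic rule squares the error, it punishes large misses disproportionately–which is why it rewards reporting one’s true beliefs–while the linear rule is easier to explain to forecasters, which is why I used it for feedback.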

Some readers might also be curious to see the breakdown of these results by domain. The following graphs include this information. They illustrate clear improvements in the treatment groups in the Oil and Presidential Approval (Pres) domains, as well as the more ambiguous effects of the treatments in the Unemployment (Unem) and Dow domains. In these latter domains–the more volatile, difficult-to-predict ones–where treatment information did not appear to help, all of the groups performed similarly poorly, as indicated by their much higher (i.e., worse) accuracy scores. It is important to note that in these volatile domains the treatment effects were only weakly significant, and generally only under one of the two scoring rules. Treatment information may not have improved the accuracy of predictions in the Unem and Dow domains, but it did not hurt.
