A Chimp’s-Eye View of a Forecasting Experiment

For the past several months, I’ve been participating in a forecasting tournament as one of hundreds of “experts” making and updating predictions about dozens of topics in international political and economic affairs. This tournament is funded by IARPA, a research arm of the U.S. government’s Office of the Director of National Intelligence, and it’s really a grand experiment designed to find better ways to elicit, combine, and present probabilistic forecasts from groups of knowledgeable people.

There are several teams participating in this tournament; I happen to be part of the Good Judgment team that is headed by psychologist Phil Tetlock. Good Judgment “won” the first year of the competition, but that win came before I started participating, so alas, I can’t claim even a tiny sliver of the credit for that.

I’ve been prognosticating as part of my job for more than a decade, but almost all of the forecasting I’ve done in the past was based on statistical models designed to assess risks of a specific rare event (say, a coup attempt, or the onset of a civil war) across all countries worldwide. The Good Judgment Project is my first experience with routinely making calls about the likelihood of many different events based almost entirely on my subjective beliefs. Now that I’m a few months into this exercise, I thought I’d write something about how I’ve approached the task, because I think my experiences speak to generic difficulties in forecasting rare political and economic events.

By way of background, here’s how the forecasting process works for the Good Judgment team: I start by logging into a web site that lists a bunch of questions on an odd mix of topics, everything from the Euro-to-dollar exchange rate to the outcome of the recent election in Venezuela. I click on a question and am expected to assign a numeric probability to each of the two or more possible outcomes listed (e.g., “yes or no,” or “A, B, or C”). Those outcomes are always mutually exclusive and exhaustive, so the probabilities I assign must always sum to 1. Whenever I feel like it, I can log back in and update any of the forecasts I’ve already made. Then, when the event of interest happens (e.g., results of the Venezuelan election are announced), the question is closed, and the accuracy of my forecast for that question and for all questions closed to date is summarized with a statistic called the Brier score. I can usually see my score pretty soon after the question closes, and I can also see average scores on each question for my whole team and the cumulative scores for the top 10 percent or so of the team’s leader board.
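For readers who haven't met the Brier score, here is a minimal Python sketch of how it can be computed. It assumes the original multi-category form (the sum of squared errors across all listed outcomes, so 0 is perfect and 2 is the worst possible); the tournament's exact scoring rules, such as how updated forecasts are averaged over time, are not described in this post. The function names are made up for illustration.

```python
from typing import Sequence, Tuple


def brier_score(probs: Sequence[float], outcome_index: int) -> float:
    """Brier score for one question: the sum of squared differences between
    the forecast probabilities and the 0/1 vector marking what happened."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


def mean_brier(forecasts: Sequence[Tuple[Sequence[float], int]]) -> float:
    """Average score over closed questions; each item is (probs, outcome_index)."""
    scores = [brier_score(probs, k) for probs, k in forecasts]
    return sum(scores) / len(scores)


# Yes/no question, forecast 90% "yes" (index 0):
#   brier_score([0.9, 0.1], 0) is about 0.02 if "yes" happens
#   brier_score([0.9, 0.1], 1) is about 1.62 if "no" happens
# A coin-flip forecast, brier_score([0.5, 0.5], 0), scores 0.5 either way.
```

Because the errors are squared, a confident miss (1.62) costs far more than a hedged miss such as 60/40 on the wrong side (about 0.72).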

So, how do you perform this task? In an ideal world, my forecasting process would go as follows:

1. Read the question carefully. What does “attack,” “take control of,” or “lose office” mean here, and how will it be determined? To make the best possible forecast, I need to understand exactly what it is I’m being asked to predict.

2. Forecast. In Unicorn World, I would have an ensemble of well-designed statistical models to throw at each and every question. In the real world, I’m lucky if there’s a single statistical model that applies even indirectly. Absent a sound statistical forecast, I would try to identify the class of events to which the question belongs and then determine a base rate for that class. The “base rate” is just the historical frequency of the event’s occurrence in comparable cases—say, how often civil wars end in their first year, or how often incumbents lose elections in Africa. Where feasible, I would also check prediction markets, look for relevant polling data, seek out published predictions, and ask knowledgeable people.

In all cases, the idea is to find empirical evidence to which I can anchor my forecast, and to get a sense of how much uncertainty there is about this particular case. When there’s a reliable forecast from a statistical model or even good base-rate information, I would weight that most heavily and would only adjust far away from that prediction in cases where subject-matter experts make a compelling case about why this instance will be different. (As for what makes a case compelling, well…) If I can’t find any useful information or the information I find is all over the map, I would admit that I have no clue and do the probabilistic equivalent of a coin flip or die roll, splitting the predicted probability evenly across all of the possible outcomes.

3. Update. As Nate Silver argues in The Signal and the Noise, forecasters should adjust their predictions whenever they are presented with relevant new information. There’s no shame in reconsidering your views as new information becomes available; that’s called learning. Ideally, I would be disciplined about how I update my forecasts, using Bayes’ rule to check the human habit of leaning too hard on the freshest tidbit and to take full advantage of all the information I had before.

4. Learn. Over time, I would see areas where I was doing well or poorly and would use those outcomes to deepen confidence in, or to try to improve, my mental models. For the questions I get wrong, what did I overlook? What were the forecasters who did better than I thinking about that I wasn’t? For the questions I answer well, was I right for the right reasons, or was it possibly just dumb luck? The more this process gets repeated, the better I should be able to do, within the limits determined by the basic predictability of the phenomenon in question.
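Steps 2 and 3 of the idealized process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the numbers in the comments are invented, and the likelihoods fed to Bayes' rule (how probable the new evidence is under each outcome) are themselves subjective judgments that someone has to supply.

```python
from typing import Dict, Sequence


def base_rate(past_cases: Sequence[bool]) -> float:
    """Historical frequency of the event among comparable past cases."""
    if not past_cases:
        raise ValueError("need at least one comparable case")
    return sum(past_cases) / len(past_cases)


def uniform_forecast(labels: Sequence[str]) -> Dict[str, float]:
    """The 'I have no clue' forecast: equal probability on every outcome."""
    return {label: 1.0 / len(labels) for label in labels}


def bayes_update(prior: Dict[str, float],
                 likelihood: Dict[str, float]) -> Dict[str, float]:
    """Bayes' rule: posterior is proportional to prior times
    P(new evidence | outcome), renormalized to sum to 1."""
    unnormalized = {k: prior[k] * likelihood[k] for k in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence is impossible under every outcome")
    return {k: v / total for k, v in unnormalized.items()}


# Invented example: the event occurred in 2 of 5 comparable cases.
#   base_rate([True, False, False, False, True])  -> 0.4
# Invented example: prior of 20% "yes"; a news item judged twice as
# likely under "yes" (0.6) as under "no" (0.3) moves "yes" to about 1/3.
#   bayes_update({"yes": 0.2, "no": 0.8}, {"yes": 0.6, "no": 0.3})
```

The second example shows the discipline Bayes' rule imposes: evidence that favors "yes" two to one lifts the forecast from 20 percent to about 33 percent, not to 60 percent or more, because the prior still counts.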

To my mind, that is how I should be making forecasts. Now here are a few things I’ve noticed so far about what I’m actually doing.

1. I’m lazy. Most of the time, I see the question, make an assumption about what it means without checking the resolution criteria, and immediately formulate a gut response on a simple four-point scale: very likely, likely, unlikely, or very unlikely. I’d like to say that “I have no clue” is also on the menu of immediate responses, but my brain almost always makes an initial guess, even if it’s a topic about which I know nothing, such as the resolution of a particular dispute before the World Trade Organization. What’s worse, it’s hard to dislodge that gut response once I’ve had it, even when I think I have better anchoring information.

2. I’m motivated by the competition, but that motivation doesn’t necessarily make me more attentive. As I said earlier, in this tournament, we can see our own scores and the scores of all the top performers as we go. With such a clear yardstick, it’s hard not to get pulled into seeing other forecasters as competitors and trying to beat them. You’d think that urge would motivate me to be more attentive to the task, more rigorously following the idealized process I described above. Most of the time, though, it just means that I calibrate my answers to the oddities of the yardstick—the Brier score involves squaring your error term, so the cost of being wrong isn’t distributed evenly across the range of possible values—and that I check the updated scores soon after questions close.

3. It’s hard to distinguish likelihood from timing. Some of the questions can get pretty specific about the timing of the event of interest. For example, we might be asked something like: Will Syrian president Bashar al-Assad fall before the end of October 2012; before the end of December 2012; before April 2013; or some time thereafter? I find these questions excruciatingly hard to answer, and it took me a little while to figure out why.

After thinking through how I had approached a few examples, I realized that my brain was conflating probability with proximity. In other words, the more likely I thought the event was, the sooner I expected it to occur. That makes sense for some situations, but it doesn’t always, and a careful consideration of timing will usually require lots of additional information. For example, I might look at structural conditions in Syria and conclude that Assad can’t win the civil war he’s started, but how long it’s going to take for him to lose will depend on a host of other things with complex dynamics, like the price of oil, the flow of weapons, and the logistics of military campaigns. Interestingly, even though I’m now aware of this habit, I’m still finding it hard to break.

4. I’m eager to learn from feedback, but feedback is hard to come by. This project isn’t like weather forecasting or equities trading, where you make a prediction, see how you did, tweak your model, and try again, over and over. Most of the time, the questions are pretty idiosyncratic, so you’ll have just one or a few chances to make a certain prediction. What’s more, the answers are usually categorical, so even when you do get more than one shot, it’s hard to tell how wrong or right you were. In this kind of environment, it’s really hard to build on your experience. In the several months I’ve been participating, I think I’ve learned far more about the peculiarities of the Brier score than I have about the generative process underlying any of the phenomena I’ve been asked to predict.
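The gap between likelihood and timing described in item 3 can be made concrete with a toy calculation. The Python sketch below assumes a constant monthly hazard, meaning a fixed chance each month that the event happens if it hasn't already. That assumption, the function name, and the numbers are all illustrative; real cases like the one discussed above depend on dynamics that no single constant captures.

```python
from typing import List, Sequence


def window_probabilities(monthly_hazard: float,
                         cutoffs: Sequence[int]) -> List[float]:
    """Split probability across time windows under a constant monthly hazard.

    cutoffs are increasing month counts, e.g. [1, 3, 6]. The result has one
    entry per window plus a final entry for "some time thereafter".
    """
    if not 0.0 <= monthly_hazard <= 1.0:
        raise ValueError("hazard must be between 0 and 1")
    probs = []
    survived = 1.0  # probability the event has not happened yet
    previous = 0
    for cutoff in cutoffs:
        months = cutoff - previous
        still_waiting = survived * (1.0 - monthly_hazard) ** months
        probs.append(survived - still_waiting)
        survived = still_waiting
        previous = cutoff
    probs.append(survived)  # everything after the last cutoff
    return probs


# With a 5% monthly hazard and windows ending at months 1, 3, and 6:
#   window_probabilities(0.05, [1, 3, 6])
# gives roughly [0.05, 0.09, 0.12, 0.74].
```

Under these assumptions the event is certain to happen eventually, yet about 74 percent of the probability still falls in the "some time thereafter" bin. Being sure that something will happen says very little about which window it lands in.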

And that, for whatever it’s worth, is one dart-throwing chimp’s initial impressions of this particular experiment. As it happens, I’ve been doing pretty well so far—I’ve been at or near the top of the team’s accuracy rankings since scores for the current phase started to come in—but I still feel basically clueless most of the time. It’s like a golf tournament where good luck tosses some journeyman to the top of the leader board after the first or second round, and I keep waiting for the inevitable regression toward the mean to kick in. I’d like to think I can pull off a Tin Cup, but I know enough about statistics to expect that I almost certainly won’t.

18 Comments

by Jay Ulfelder on October 15, 2012

Posted in Forecasting

Tagged Good Judgment Project, IARPA, Philip Tetlock

https://dartthrowingchimp.wordpress.com/2012/10/15/a-chimps-eye-view-of-a-forecasting-experiment/

Jamie Pett
/ October 15, 2012

It was really fascinating to read this post at the same time as I’m reading ‘Thinking, Fast and Slow’ by Daniel Kahneman. You’ve identified a lot of the biases mentioned in the book, especially the tendency to know how to make a rational decision but to allow yourself to take the easy way out anyway (i.e., your System 2 finds it hard to overrule your System 1).

Rex Brynen
/ October 15, 2012

Great post, Jay. I had to withdraw as a team member from the project because I just couldn’t give it any significant attention, or find time to update my predictions. #beingabadbayesian

- dartthrowingchimp
  / October 15, 2012
  
  Thanks, Rex. I suspect attention plays a big role in the results. There are a lot of questions on a lot of very different topics, and it’s really hard to stay reasonably informed on all of them.
  
Richard Lum
/ October 16, 2012

Have you been watching the Wikistrat guys? I’m not really sure how they are doing their work, but they say they have a global network of analysts and that they are “crowdsourcing” strategic analysis and forecasting.

- dartthrowingchimp
  / October 16, 2012
  
  No, I haven’t. I’ll take a look. Without knowing what they’re up to, though, I will say that all crowdsourcing is not created equal. From my conversations with the pros, I gather that the procedures for eliciting and combining expert forecasts make a big difference in the accuracy of the results. So I’ll be especially interested to see how they handle those aspects of the process.
  
Oral Hazard
/ October 18, 2012

What is considered a “good Brier score” for this kind of thing?

- dartthrowingchimp
  / October 19, 2012
  
  That’s really hard to say, actually. The problem is that some questions are a lot easier than others. For example, a question that asks if Dictator X in an apparently stable country will still be in power at the end of the year is a lot easier to get right than one about the outcome of a hotly contested election. For easy questions, it’s reasonable to expect a cumulative Brier score close to 0. For really hard questions, though, just doing a bit better than chance on average might actually be pretty good. In short, what counts as a “good” score depends on the degree of forecasting difficulty, so there’s no general answer to your question.
  
  - David Mandel
    / October 30, 2012
    
    You could adjust Brier scores to take variations in forecasting difficulty into account. See, e.g., Winkler, R. (1994). Evaluating probabilities: Asymmetric scoring rules. Management Science, 40, 1395-1405. Also the technical appendix of Tetlock’s 2005 book on expert political judgment provides the adjustment formula. You could also partition Brier scores into calibration and discrimination indices and the latter (DI) could be normalized so that it is readily interpreted as the proportion of variance in the outcome accounted for by the forecasts. Formulas for partitioning the Brier score and adjusting DI can be found in Yaniv, I., Yates, J.F., & Smith, J.E.K. (1991). Measures of discrimination skill in probabilistic judgment. Psychological Bulletin, 110(3), 611-617.
  - dartthrowingchimp
    / October 31, 2012
    
    Thanks very much, David!
Kim Roosevelt
/ October 28, 2012

I’m on your team, I think, but I don’t see how to find out my or the team’s Brier score. Where is that on the trading platform? And I don’t assign probabilities; I buy and sell shares. Is there more than one Good Judgment team?

- dartthrowingchimp
  / October 28, 2012
  
  I think there’s both a prediction market and a survey-like version under the same roof, and participants in one don’t see the other.
  
David Mandel
/ October 31, 2012

It just occurred to me that some readers of this thread may be interested in attending a symposium on monitoring forecasting quality in intelligence on Nov 6 as part of the Five Eyes/Intelligence Community Centers for Academic Excellence Analytic Workshops (see http://5eyes.olemiss.edu/workshop/ and enter 5eyesreg to register for free). The symposium will include talks by Tom Wallsten and Charles Twardy (each of whom is an investigator on another ACE team). I will be talking about my field study of strategic intelligence forecasting quality, and Jason Matheny, who manages the ACE program at IARPA, and my colleague Alan Barnes will be providing commentary. The workshop is at the unclassified level and will be held in Linthicum, MD.

- dartthrowingchimp
  / November 1, 2012
  
  Looks really good, thanks. I’m going to attempt to make it to your session.
  
- Richard Lum
  / November 2, 2012
  
  David, any chance it will be taped or streamed?
  
  - David Mandel
    / November 5, 2012
    
    Sorry for the late reply. No, I don’t think it will be taped or streamed. At least that was never mentioned by the organizers.