For the past several months, I’ve been participating in a forecasting tournament as one of hundreds of “experts” making and updating predictions about dozens of topics in international political and economic affairs. This tournament is funded by IARPA, a research arm of the U.S. government’s Office of the Director of National Intelligence, and it’s really a grand experiment designed to find better ways to elicit, combine, and present probabilistic forecasts from groups of knowledgeable people.
There are several teams participating in this tournament; I happen to be part of the Good Judgment team that is headed by psychologist Phil Tetlock. Good Judgment “won” the first year of the competition, but that win came before I started participating, so alas, I can’t claim even a tiny sliver of the credit for that.
I’ve been prognosticating as part of my job for more than a decade, but almost all of the forecasting I’ve done in the past was based on statistical models designed to assess risks of a specific rare event (say, a coup attempt, or the onset of a civil war) across all countries worldwide. The Good Judgment Project is my first experience with routinely making calls about the likelihood of many different events based almost entirely on my subjective beliefs. Now that I’m a few months into this exercise, I thought I’d write something about how I’ve approached the task, because I think my experiences speak to generic difficulties in forecasting rare political and economic events.
By way of background, here’s how the forecasting process works for the Good Judgment team: I start by logging into a web site that lists a bunch of questions on an odd mix of topics, everything from the Euro-to-dollar exchange rate to the outcome of the recent election in Venezuela. I click on a question and am expected to assign a numeric probability to each of the two or more possible outcomes listed (e.g., “yes or no,” or “A, B, or C”). Those outcomes are always exhaustive, so the probabilities I assign must always sum to 1. Whenever I feel like it, I can log back in and update any of the forecasts I’ve already made. Then, when the event of interest happens (e.g., results of the Venezuelan election are announced), the question is closed, and the accuracy of my forecast for that question and for all questions closed to date is summarized with a statistic called the Brier score. I can usually see my score pretty soon after the question closes, and I can also see average scores on each question for my whole team and the cumulative scores for the top 10 percent or so of the team’s leader board.
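Since scoring drives so much of what follows, it's worth seeing how a Brier score is computed. Here's a minimal sketch of the original multi-category formulation, which sums squared errors across all possible outcomes; the tournament's exact scoring rules may differ in details:

```python
def brier_score(forecast, outcome_index):
    """Multi-category Brier score: sum of squared differences between
    the forecast probabilities and the realized outcome, encoded as 1
    for the outcome that happened and 0 for the rest. Ranges from 0
    (perfect) to 2 (total miss); lower is better."""
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(forecast)
    )

# A forecast of 70% "yes", 30% "no" on a question where "yes" happens:
print(brier_score([0.7, 0.3], 0))  # 0.18
```

Cumulative accuracy is then just the average of these scores across all closed questions.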
So how do I perform this task? In an ideal world, my forecasting process would go as follows:
1. Read the question carefully. What does “attack,” “take control of,” or “lose office” mean here, and how will it be determined? To make the best possible forecast, I need to understand exactly what it is I’m being asked to predict.
2. Forecast. In Unicorn World, I would have an ensemble of well-designed statistical models to throw at each and every question. In the real world, I’m lucky if there’s a single statistical model that applies even indirectly. Absent a sound statistical forecast, I would try to identify the class of events to which the question belongs and then determine a base rate for that class. The “base rate” is just the historical frequency of the event’s occurrence in comparable cases—say, how often civil wars end in their first year, or how often incumbents lose elections in Africa. Where feasible, I would also check prediction markets, look for relevant polling data, seek out published predictions, and ask knowledgeable people.
In all cases, the idea is to find empirical evidence to which I can anchor my forecast, and to get a sense of how much uncertainty there is about this particular case. When there’s a reliable forecast from a statistical model or even good base-rate information, I would weight that most heavily and would only adjust far away from that prediction in cases where subject-matter experts make a compelling case about why this instance will be different. (As for what makes a case compelling, well…) If I can’t find any useful information or the information I find is all over the map, I would admit that I have no clue and do the probabilistic equivalent of a coin flip or die roll, splitting the predicted probability evenly across all of the possible outcomes.
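The anchoring-and-adjusting described above can be sketched as a simple shrinkage rule. To be clear, the weight here is a made-up illustrative parameter, not anything from the tournament or from my actual process:

```python
def anchored_forecast(base_rate, case_estimate, weight_on_base=0.8):
    """Illustrative anchoring: start from the historical base rate and
    move only partway toward a case-specific estimate. The 0.8 weight
    is hypothetical; in practice it would depend on how compelling the
    case-specific evidence is."""
    return weight_on_base * base_rate + (1 - weight_on_base) * case_estimate

# Comparable cases see the event ~5% of the time; an expert argues 40%:
print(anchored_forecast(0.05, 0.40))  # 0.12

# The "no clue" fallback: split probability evenly across k outcomes.
k = 3
print([1 / k] * k)  # [0.333..., 0.333..., 0.333...]
```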
3. Update. As Nate Silver argues in The Signal and The Noise, forecasters should adjust their predictions whenever they are presented with relevant new information. There’s no shame in reconsidering your views as new information becomes available; that’s called learning. Ideally, I would be disciplined about how I update my forecasts, using Bayes’ rule to check the human habit of leaning too hard on the freshest tidbit and to take full advantage of all the information I already have.
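The disciplined version of this step is mechanical. Here's a minimal sketch of Bayes' rule in odds form, with made-up numbers purely for illustration:

```python
def bayes_update(prior, likelihood_ratio):
    """Update a probability on new evidence using the odds form of
    Bayes' rule: posterior odds = prior odds * likelihood ratio,
    where the likelihood ratio is
    P(evidence | event) / P(evidence | no event)."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Start at 20%; suppose the new evidence is three times as likely
# if the event is coming than if it isn't:
print(round(bayes_update(0.20, 3.0), 3))  # 0.429
```

The point of the formalism is the restraint it imposes: a single fresh news item rarely carries a likelihood ratio large enough to justify swinging a forecast from 20% to 80%.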
4. Learn. Over time, I would see areas where I was doing well or poorly and would use those outcomes to deepen confidence in, or to try to improve, my mental models. For the questions I get wrong, what did I overlook? What were the forecasters who did better than I thinking about that I wasn’t? For the questions I answer well, was I right for the right reasons, or was it possibly just dumb luck? The more this process gets repeated, the better I should be able to do, within the limits determined by the basic predictability of the phenomenon in question.
To my mind, that is how I should be making forecasts. Now here are a few things I’ve noticed so far about what I’m actually doing.
1. I’m lazy. Most of the time, I see the question, make an assumption about what it means without checking the resolution criteria, and immediately formulate a gut response on a simple four-point scale: very likely, likely, unlikely, or very unlikely. I’d like to say that “I have no clue” is also on the menu of immediate responses, but my brain almost always makes an initial guess, even if it’s a topic about which I know nothing—for example, the resolution of a particular dispute before the World Trade Organization. What’s worse, it’s hard to dislodge that gut response once I’ve had it, even when I think I have better anchoring information.
2. I’m motivated by the competition, but that motivation doesn’t necessarily make me more attentive. As I said earlier, in this tournament, we can see our own scores and the scores of all the top performers as we go. With such a clear yardstick, it’s hard not to get pulled into seeing other forecasters as competitors and trying to beat them. You’d think that urge would motivate me to be more attentive to the task, following the idealized process I described above more rigorously. Most of the time, though, it just means that I calibrate my answers to the oddities of the yardstick—the Brier score involves squaring your error term, so the cost of being wrong isn’t distributed evenly across the range of possible values—and that I check the updated scores soon after questions close.
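That unevenness is easy to see with a few numbers. A quick illustration using the standard two-outcome Brier score (the tournament's exact scoring may differ in details), evaluated on a question where the event did not occur:

```python
def brier_miss(p):
    """Two-outcome Brier score for a forecast of probability p on an
    event that did NOT occur; algebraically this equals 2 * p**2."""
    return (p - 0.0) ** 2 + ((1 - p) - 1.0) ** 2

for p in (0.5, 0.7, 0.9):
    print(p, brier_miss(p))
# 0.5 -> 0.5
# 0.7 -> 0.98
# 0.9 -> 1.62
```

Because the penalty grows with the square of the error, moving from 70% to 90% confidence costs more when you're wrong (0.64 extra) than moving from 50% to 70% does (0.48 extra). Confident misses are what the yardstick punishes hardest, which is exactly the kind of oddity one learns to play to.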
3. It’s hard to distinguish likelihood from timing. Some of the questions can get pretty specific about the timing of the event of interest. For example, we might be asked something like: Will Syrian president Bashar al-Assad fall before the end of October 2012; before the end of December 2012; before April 2013; or some time thereafter? I find these questions excruciatingly hard to answer, and it took me a little while to figure out why.
After thinking through how I had approached a few examples, I realized that my brain was conflating probability with proximity. In other words, the more likely I thought the event was, the sooner I expected it to occur. That makes sense for some situations, but it doesn’t always, and a careful consideration of timing will usually require lots of additional information. For example, I might look at structural conditions in Syria and conclude that Assad can’t win the civil war he’s started, but how long it’s going to take for him to lose will depend on a host of other things with complex dynamics, like the price of oil, the flow of weapons, and the logistics of military campaigns. Interestingly, even though I’m now aware of this habit, I’m still finding it hard to break.
4. I’m eager to learn from feedback, but feedback is hard to come by. This project isn’t like weather forecasting or equities trading, where you make a prediction, see how you did, tweak your model, and try again, over and over. Most of the time, the questions are pretty idiosyncratic, so you’ll have just one or a few chances to make a certain prediction. What’s more, the answers are usually categorical, so even when you do get more than one shot, it’s hard to tell how wrong or right you were. In this kind of environment, it’s really hard to build on your experience. In the several months I’ve been participating, I think I’ve learned far more about the peculiarities of the Brier score than I have about the generative process underlying any of the phenomena I’ve been asked to predict.
And that, for whatever it’s worth, is one dart-throwing chimp’s initial impressions of this particular experiment. As it happens, I’ve been doing pretty well so far—I’ve been at or near the top of the team’s accuracy rankings since scores for the current phase started to come in—but I still feel basically clueless most of the time. It’s like a golf tournament where good luck tosses some journeyman to the top of the leader board after the first or second round, and I keep waiting for the inevitable regression toward the mean to kick in. I’d like to think I can pull off a Tin Cup, but I know enough about statistics to expect that I almost certainly won’t.