Silver writes the widely read FiveThirtyEight blog for the New York Times and is the closest thing to a celebrity that statistical forecasting of politics has ever produced. Silver uses a set of statistical models to produce daily forecasts of the outcome of the upcoming presidential and Congressional elections, and I suspect that Hounshell was primarily interested in how the accuracy of those forecasts might solidify or diminish Silver’s deservedly stellar reputation.
What caught my mind’s eye in Hounshell’s tweet, though, was what it suggested about how we conventionally assess a forecast’s accuracy. The question at the head of this post seems easy enough to answer: a forecast is wrong when it says one thing is going to happen and then something else does. For example, if I predict that a flipped coin is going to come up heads but it lands on tails, my forecast was incorrect. Or, if I say Obama will win in November but Romney does, I was obviously wrong.
But here’s the thing: few forecasters worth their salt will make that kind of unambiguous prediction. Silver won’t try to call the race one way or the other; instead, he’ll estimate the probabilities of all possible outcomes. As of today, his model of the presidential contest pegs Obama’s chances of re-election at about 70 percent—not exactly a toss-up, but hardly a done deal, either. Over at Votamatic, Drew Linzer’s model gives Obama a stronger chance of re-election—better than 95 percent—but even that estimate doesn’t entirely eliminate the possibility of a Romney win.
So when is a forecast like Silver’s or Linzer’s wrong? If a meteorologist says there’s a 20 percent chance of rain, and it rains, was he or she wrong? If an analyst tells you there probably won’t be a revolution in Tunisia this year and then there is one, was that a “miss”?
The important point here is that these forecasts are probabilities, not absolutes, and we really ought to evaluate them as such. The world is inherently uncertain, and sound forecasts will reflect that uncertainty instead of pretending to eliminate it. As Linzer said in a recent blog post:
It’s not realistic to expect any model to get exactly the right answer—the world is just too noisy, and the data are too sparse and (sadly) too low quality. But we can still assess whether the errors in the model estimates were small enough to warrant confidence in that model, and make its application useful and worthwhile.
So, what kinds of errors should we look for? Statistical forecasters draw a helpful distinction between discrimination and calibration. Discrimination refers to a model’s ability to distinguish accurately between cases in different categories: heads or tails, incumbent or challenger, coup or no coup. Models designed to forecast events that can be classed this way should be able to distinguish the one from the other in some useful way. Exactly what constitutes “useful,” though, will depend on the nature of the forecasting problem. For example, if one of two outcomes occurs very rarely (say, the start of a war between two countries), it’s often going to be very hard to do better at discrimination than a naïve forecast of “It’ll never happen.” If two possible outcomes occur with similar frequency, however, then a coin flip offers a more appropriate benchmark.
For models basing forecasts of the presidential election on the aggregation of state-level results, we might ask, “In how many states did the model identify the eventual winner as the favorite?” Of course, some states are easier to call than others—it’s not much of a gamble to identify Obama as the likely winner in my home state of Maryland this year—so we’d want to pay special attention to the swing states, asking if the model does better than random chance at identifying the likely winner in those toss-up situations without screwing up the easier calls.
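To make that state-level check concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the state labels, the probabilities, the outcomes, and the 25-to-75-percent cutoff for what counts as a toss-up are assumptions made for illustration, not figures from any real model.

```python
def hit_rate(forecasts, outcomes):
    """Share of cases where the forecast favorite matched the actual winner.

    forecasts: probabilities that candidate A wins each state.
    outcomes: 1 if candidate A won the state, else 0.
    A forecast of exactly 0.5 is counted as favoring candidate B.
    """
    correct = [(p > 0.5) == (y == 1) for p, y in zip(forecasts, outcomes)]
    return sum(correct) / len(correct)


def is_toss_up(p, low=0.25, high=0.75):
    # Assumed cutoffs for a "swing" state; a different range may suit other uses.
    return low <= p <= high


# Made-up data: (state label, forecast P(A wins), 1 if A won else 0)
states = [
    ("safe_1", 0.99, 1),
    ("safe_2", 0.02, 0),
    ("safe_3", 0.97, 1),
    ("swing_1", 0.60, 1),
    ("swing_2", 0.45, 1),
    ("swing_3", 0.55, 0),
    ("swing_4", 0.35, 0),
]

overall = hit_rate([p for _, p, _ in states], [y for _, _, y in states])

swing = [(p, y) for _, p, y in states if is_toss_up(p)]
swing_rate = hit_rate([p for p, _ in swing], [y for _, y in swing])

print(f"all states:   {overall:.2f}")
print(f"swing states: {swing_rate:.2f}")
```

In this invented example the overall hit rate looks respectable only because of the easy calls; in the toss-up states the model does no better than a coin flip, which is exactly the distinction the swing-state question is meant to expose.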
Calibration is a little different. When an event-forecasting process is working well, the probabilities it produces will closely track the real-world incidence of the event of interest—that is, the frequency with which it occurs. Across all situations where a well-calibrated model tells us there’s a 20- to 30-percent chance of a rebellion occurring, we should see rebellions occur roughly one of every four times. As consumers of well-calibrated weather forecasts, we should know that a 30-percent chance of rain doesn’t mean it’s not going to rain; it means it probably won’t, but approximately three of every 10 times we see that forecast, it will. For election models that try directly to pick a winner, we can see how closely the predicted probabilities track the frequencies of the observed outcomes (see here for one example). For election models that try to identify a likely winner by forecasting vote shares (popular or Electoral College), we can see how closely the predicted shares match the observed ones.
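A calibration check of this kind can be sketched in a few lines of Python. The approach below, which sorts forecasts into equal-width probability bins and compares each bin's average forecast with the observed frequency, is one simple way to do it; the function name, the choice of ten bins, and the data are all assumptions made for illustration.

```python
from collections import defaultdict


def calibration_table(forecasts, outcomes, n_bins=10):
    """Compare mean forecast probability with observed frequency, bin by bin.

    Returns one row per non-empty bin:
    (bin_low, bin_high, count, mean_forecast, observed_frequency)
    """
    bins = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # a forecast of 1.0 goes in the top bin
        bins[idx].append((p, y))

    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(y for _, y in pairs) / len(pairs)
        rows.append((idx / n_bins, (idx + 1) / n_bins, len(pairs), mean_p, freq))
    return rows


# Made-up data: four forecasts in the 20-30 percent range, four in the 70-80 range.
forecasts = [0.22, 0.25, 0.27, 0.29, 0.71, 0.75, 0.78, 0.74]
outcomes = [0, 1, 0, 0, 1, 1, 0, 1]  # 1 = the event occurred

rows = calibration_table(forecasts, outcomes)
for low, high, n, mean_p, freq in rows:
    print(f"{low:.1f}-{high:.1f}  n={n}  mean forecast={mean_p:.2f}  observed={freq:.2f}")
```

Here the event occurs in one of the four cases forecast at 20 to 30 percent, which is what good calibration would look like. With only a handful of cases per bin, of course, a match or a mismatch could easily be luck, so a real assessment needs many more forecasts than this.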
A theme common to both yardsticks is that we can’t properly assess a forecast’s accuracy without first identifying a realistic baseline. In a world where crystal balls don’t exist, the proper goal is to be better than the competition, not oracular.
And, in many cases, that bar will be set pretty low. For most of the forms of political change I’ve tried to forecast over the past 15 years—things like coup attempts or the occurrence of mass atrocities—the main competition is occasional and usually ambiguous statements from experts. The ambiguity of these statements makes it very hard to assess their accuracy, but evidence from carefully structured studies of expert judgment suggests that they usually aren’t a whole lot more accurate than random guessing. When that, and not clairvoyance, is the state of the art, it’s not as hard to make useful forecasts as I think we conventionally suppose.
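One way to put numbers on "better than the competition" is to score a model and a baseline with the same metric and compare them. The sketch below uses the Brier score (the mean squared difference between forecast probabilities and 0/1 outcomes), which is one common choice assumed here for illustration; the data and the forecasts are invented.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error of probability forecasts; lower is better, 0 is perfect."""
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(outcomes)


def skill_vs_baseline(forecasts, baseline, outcomes):
    """1 - (model score / baseline score). Positive means the model beats the baseline."""
    base = brier_score(baseline, outcomes)
    if base == 0:
        raise ValueError("baseline is already perfect; skill is undefined")
    return 1 - brier_score(forecasts, outcomes) / base


# Made-up rare-event data: ten cases, one of which sees the event occur.
outcomes = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

never = [0.0] * 10            # the naive "it'll never happen" forecast
coin = [0.5] * 10             # a coin flip
model = [0.05] * 9 + [0.60]   # a hypothetical model's forecasts

print(f"never:     {brier_score(never, outcomes):.4f}")
print(f"coin flip: {brier_score(coin, outcomes):.4f}")
print(f"model:     {brier_score(model, outcomes):.4f}")
print(f"skill vs. never: {skill_vs_baseline(model, never, outcomes):.2f}")
```

With an event this rare, "never" scores far better than the coin flip, so it is the baseline a model actually has to beat. The hypothetical model here clears that bar only because the numbers were chosen to show the calculation, not because any real model performs this way.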
Given that reality, I think we’re often tougher on forecasters than we should be. Instead of judging forecasters according to the entirety of their track records and comparing those records to a realistic baseline, we succumb to the availability heuristic and lionize or dismiss forecasters on the basis of the last big call they made. This tendency is, I suspect, at least part of what Hounshell was thinking about when he tweeted his question about Silver.
What we need to understand, though, is that this reflex means we often get worse forecasts than we otherwise might. When forecasters’ reputations can collapse from a single wrong call, there’s not much incentive to get into the business in the first place, and once in, there’s a strong incentive to be as ambiguous as possible as a hedge against a career-defining error. Those strategies might make professional sense, but they don’t lead to more useful information.