Common Predictive Analytics Screw-Ups, Social Science Edition

Computerworld ran a great piece last week called “12 Predictive Analytics Screw-Ups” that catalogs some mistakes commonly made in statistical forecasting projects. Unfortunately, the language and examples are all from “industry”—what we used to call the private sector, I guess—so social scientists might read it and struggle to see the relevance to their work. To make that relevance clearer, I thought I’d give social-science-specific examples of the blunders that looked most familiar to me.

1. Begin without the end in mind.

I read this one as an admonition to avoid forecasting something just because you can, even if it’s not clear what those forecasts are useful for. More generally, though, I think it can also be read as a warning against fishing expeditions. If you poke around enough in a large data set on interstate wars, you’re probably going to find some variables that really boost your R-squared or Area under the ROC Curve, but the models you get from that kind of dredging will often perform a lot worse when you try to use them to forecast in real time.
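
To make the dredging problem concrete, here is a minimal simulation sketch in Python. Everything in it is made up for illustration: the outcome and all 200 "predictors" are pure random noise, and the function names (`pearson`, `dredge_once`, `dredge_many`) are hypothetical. The idea is simply to pick whichever noise variable fits the first half of the data best and then see how that same variable does on the second half.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def dredge_once(rng, n_train=50, n_test=50, n_predictors=200):
    """Pick the noise predictor that best fits the training half, then
    score that same predictor on the held-out half."""
    n = n_train + n_test
    y = [rng.gauss(0, 1) for _ in range(n)]
    best_in, best_out = 0.0, 0.0
    for _ in range(n_predictors):
        x = [rng.gauss(0, 1) for _ in range(n)]  # pure noise, unrelated to y
        r_in = abs(pearson(x[:n_train], y[:n_train]))
        if r_in > best_in:
            best_in = r_in
            best_out = abs(pearson(x[n_train:], y[n_train:]))
    return best_in, best_out


def dredge_many(trials=20, seed=1):
    """Average the in-sample and out-of-sample |r| over repeated trials."""
    rng = random.Random(seed)
    results = [dredge_once(rng) for _ in range(trials)]
    mean_in = sum(r[0] for r in results) / trials
    mean_out = sum(r[1] for r in results) / trials
    return mean_in, mean_out


if __name__ == "__main__":
    mean_in, mean_out = dredge_many()
    print(f"best in-sample |r|: {mean_in:.2f}; same predictor out of sample: {mean_out:.2f}")
```

Because the "winning" variable was selected for its in-sample fit, its apparent strength mostly reflects the search itself, and the held-out correlation should fall back toward what chance alone would produce.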

2. Define the project around a foundation your data can’t support.

It’s really important to think early and often about how your forecasting process will work in real time. The data you have in hand when generating a forecast will often be incomplete and noisier than the nice, tidy matrix you had when you estimated the model(s) you’re applying, and it’s a good idea to plan around that fact when you’re doing the estimating.

Here’s the sort of thing I have in mind: Let’s say you discover that a country’s score on CIRI’s physical integrity index at the end of one year is a useful predictor of its risk of civil war onset during the next year. Awesome…until you remember that CIRI isn’t updated until late in the calendar year. Now what? You can lag it further, but that’s liable to weaken the predictive signal if the variable is dynamic and recent changes are informative. Alternatively, you can keep the one-year lag and try to update by hand, but then you risk adding even more noise to an already-noisy set of inputs. There’s a reason you use the data made by people who’ve spent a lot of time working on the topic and the coding procedures.

Unfortunately, there’s no elegant escape from this dilemma. The only general rules I can see are 1) to try to anticipate these problems and address them in the design phase and, where possible, 2) to check the impact of these choices on the accuracy of your forecasts and revise when the evidence suggests something else would work better.
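
One way to build that anticipation into the design phase is to make the estimation code see only what would have been published on the forecast date. The sketch below assumes a hypothetical release calendar and made-up index scores (these are not real CIRI values), and `latest_available` is an illustrative helper, not part of any library.

```python
from datetime import date


def latest_available(values_by_year, release_dates, forecast_date):
    """Return (year, value) for the most recent observation that had
    actually been published by forecast_date, or None if there is none."""
    usable = [
        year for year in values_by_year
        if year in release_dates and release_dates[year] <= forecast_date
    ]
    if not usable:
        return None
    year = max(usable)
    return year, values_by_year[year]


# Hypothetical index scores for one country, keyed by the year they describe.
scores = {2009: 4, 2010: 5, 2011: 3}
# Hypothetical release calendar: each year's score appears late the next year.
released = {
    2009: date(2010, 11, 15),
    2010: date(2011, 11, 15),
    2011: date(2012, 11, 15),
}

# A forecast issued on 1 January 2012 cannot see the 2011 score yet,
# so the most recent usable observation describes 2010.
print(latest_available(scores, released, date(2012, 1, 1)))
```

Estimating the model with the same availability rule you will face in real time gives a more honest read on how much signal survives the extra lag.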

3. Don’t proceed until your data is the best it can be.

4. When reviewing data quality, don’t bother to take out the garbage.

6. Don’t just proceed but rush the process because you know your data is perfect.

These three screw-ups underscore the tension between the need to avoid garbage in, garbage out (GIGO) modeling on the one hand and the need to be opportunistic on the other.

On many topics of interest to social scientists, we either have no data or the data we do have are sparse or lousy or both (see here for more on this point). Under these circumstances, you need to find ways to make the most of the information you’ve got, but you don’t want to pretend that you can spin straw into gold.

Again, there’s no elegant escape from these trade-offs. That said, I think it’s generally true that there’s almost always a significant payoff to be had from getting familiar with the data you’re using, and from doing what you can to make those data cleaner or more complete without setting yourself up for failure at the forecasting stage (e.g., your multiple-imputation algorithm might expand your historical sample, but it won’t give you the latest values of the predictors it was used to infill).

5. Use data from the future to predict the future.

So you’ve estimated a logistic regression model and, using cross-validation, discovered that it works really well out of sample. Then you look at the estimated coefficients and discover that one variable really seems to be driving that result. Then you look closer and discover that this variable is actually a consequence of the dependent variable. Doh!

I had this happen once when I was trying to develop a model that could be used to forecast transitions to and from democracy (see here for the results). At the exploratory stage, I found that a variable which counts the number of democracy episodes a country has experienced was a really powerful predictor of transitions to democracy. Then I remembered that this counter—which I’d coded—ticks up in the year that a transition occurs, so of course higher values were associated with a higher probability of transition. In this case, the problem was easily solved by lagging the predictor, but the problem and its solution won’t always be that obvious. Again, knowing your data should go a long way toward protecting you against this error.
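
Here is a small sketch of that counter problem and the lag that fixes it. The transition series is hypothetical, and `episode_counter` and `lag` are illustrative names rather than the code I actually used.

```python
def episode_counter(transitions):
    """Running count of democracy episodes, incremented in the same year
    that a transition occurs (the leaky version)."""
    count, out = 0, []
    for t in transitions:
        count += t
        out.append(count)
    return out


def lag(values, k=1):
    """Shift a series back k periods so year t only sees year t-k."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [None] * k + values[:-k]


# Hypothetical country: 1 marks a year with a transition to democracy.
transitions = [0, 0, 1, 0, 0, 0, 1, 0]
same_year = episode_counter(transitions)  # ticks up in the transition year
lagged = lag(same_year)                   # known before the year begins

for year, (y, leaky, safe) in enumerate(zip(transitions, same_year, lagged)):
    print(year, y, leaky, safe)
```

In the transition years the same-year counter jumps along with the outcome, while the lagged version still shows the value that was known beforehand. In a real country-year panel the lag has to be applied within each country so that one country's history never spills into the next.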

8. Ignore the subject-matter experts when building your model.

For me, a forecasting tournament we conducted as part of the work of the Political Instability Task Force (PITF) really drove this point home (see here). We got better results when we restricted our vision to a smaller set of variables selected by researchers who’d been immersed in the material for a long time than we did when we applied those same methods to a much larger set of variables that we happened to have available to us.

This is probably less likely to be a problem for academics, who are more likely to try to forecast on topics they know and care about, than it is for “data scientists” and “hackers” who are often asked to throw the methods they know at problems on all sorts of unfamiliar topics. Still, even when you’re covering territory that seems familiar, it’s always a good idea to brush up on the relevant literature and ask around as you get started. A single variable often makes a significant difference in the predictive power of a forecasting algorithm.

9. Just assume that the keepers of the data will be fully on board and cooperative.

Data sets that are licensed and are therefore either too expensive to keep buying or impossible to include in replication files. Boutique data sets that cover really important topics but were created once and never updated. Data that are embargoed while someone waits to publish from them. Data sets whose creators start acting differently when they hear that their data are useful for forecasting.

These are all problems I’ve run into, and they can effectively kill an applied forecasting project. Better to clear them up early or set aside the relevant data sets than to paint yourself into this kind of corner, which can be very frustrating.

10. If you build it, they will come; don’t worry about how to serve it up.

I still don’t feel like I have a great handle on how to convey probabilistic forecasts to non-statistical audiences, or which parts of those forecasts are most useful to which kinds of audiences. This is a really hard problem that has a huge effect on the impact of the work, and in my experience, having forecasts that are demonstrably accurate doesn’t automatically knock these barriers down (just ask Nate Silver).

The two larger lessons I take from my struggles with this problem are 1) to incorporate thinking about how the forecasts will be conveyed into the research design and 2) to consider presenting the forecasts in different ways to different audiences.

Regarding the first, the idea is to avoid methods that your intended audience won’t understand or at least tolerate. For example, if your audience is going to want information about the relative influence of various predictors on the forecasts in specific cases, you’re going to want to avoid “black box” algorithms that make it hard or impossible to recover that information.

Regarding the second, my point is not to assume that you know the single “right” way to communicate your forecasts. In fact, I think it’s a good idea to be experimental if you can. Try presenting the forecasts a few different ways (maps or dot plots, point estimates or distributions, cross-sectional comparisons or time series), see which ones resonate with which audiences, and tailor your publications and presentations accordingly.

11. If the results look obvious, throw out the model.

Even if it’s not generating a lot of counter-intuitive results, a reasonably accurate forecasting model can still be really valuable in a couple of ways. First, it’s a great baseline for further research. Second, when a model like that occasionally does serve up a counter-intuitive result, that forecast will often reward a closer review. Closer review of the cases that do land far from the regression line may hold some great leads on variables your initial model overlooked.

This often comes up in my work on forecasting rare events like coups and mass killings. Most years, most of the countries that my forecasts identify as riskiest are pretty obvious. It doesn’t take a model to tell me that Sweden probably won’t have a coup this year but Mali or Sudan might, so people often respond to the forecasts by saying, “I already knew all that.” When they slow down and give the forecasts a closer look, though, they’ll usually find at least a few cases on either tail of the distribution that don’t match their priors. To my mind, these handfuls of surprises are really the point of the exercise. The conversations that start in response to these counter-intuitive results are the reason we use statistical models instead of just asking people what they think.

If I had to sum up all of these lessons into a single suggestion, it would be to learn by doing. Applied forecasting is a very different problem from hypothesis testing and even from data mining. You have to live the process a few times to really appreciate its difficulties, and those difficulties can vary widely across different forecasting problems. Ideally, you’ll pick a problem, work it, and generate forecasts in real time for a while so you get feedback not just on your accuracy, but also on how sustainable your process is. To avoid hindsight bias, make the forecasts public as you produce them, or at least share them with some colleagues as you go.

Some Suggested Readings for Political Forecasters

A few people have recently asked me to recommend readings on political forecasting for people who aren’t already immersed in the subject. Since the question keeps coming up, I thought I’d answer with a blog post. Here, in no particular order, are books (and one article) I’d suggest to anyone interested in the subject.

Thinking, Fast and Slow, by Daniel Kahneman. A really engaging read on how we think, with special attention to cognitive biases and heuristics. I think forecasters should read it in hopes of finding ways to mitigate the effects of these biases on their own work, and of getting better at spotting them in the thinking of others.

Numbers Rule Your World, by Kaiser Fung. Even if you aren’t going to use statistical models to forecast, it helps to think statistically, and Fung’s book is the most engaging treatment of that topic that I’ve read so far.

The Signal and the Noise, by Nate Silver. A guided tour of how forecasters in a variety of fields do their work, with some useful general lessons on the value of updating and being an omnivorous consumer of relevant information.

The Theory that Would Not Die, by Sharon Bertsch McGrayne. A history of Bayesian statistics in the real world, including successful applications to some really hard prediction problems, like the risk of accidents with atomic bombs and nuclear power plants.

The Black Swan, by Nassim Nicholas Taleb. If you can get past the derisive tone (and I’ll admit, I initially found that hard to do), this book does a great job explaining why we should be humble about our ability to anticipate rare events in complex systems, and how forgetting that fact can hurt us badly.

Expert Political Judgment: How Good Is It? How Can We Know?, by Philip Tetlock. The definitive study to date on the limits of expertise in political forecasting and the cognitive styles that help some experts do a bit better than others.

Counterfactual Thought Experiments in World Politics, edited by Philip Tetlock and Aaron Belkin. The introductory chapter is the crucial one. It’s ostensibly about the importance of careful counterfactual reasoning to learning from history, but it applies just as well to thinking about plausible futures, an important skill for forecasting.

The Foundation Trilogy, by Isaac Asimov. A great fictional exploration of the Modernist notion of social control through predictive science. These books were written half a century ago, and it’s been more than 25 years since I read them, but they’re probably more relevant than ever, what with all the talk of Big Data and the Quantified Self and such.

“The Perils of Policy by P-Value: Predicting Civil Conflicts,” by Michael Ward, Brian Greenhill, and Kristin Bakke. This one’s really for practicing social scientists, but still. The point is that the statistical models we typically construct for hypothesis testing often won’t be very useful for forecasting, so proceed with caution when switching between tasks. (The fact that they often aren’t very good for hypothesis testing, either, is another matter. On that and many other things, see Phil Schrodt’s “Seven Deadly Sins of Contemporary Quantitative Political Analysis.”)

I’m sure I’ve missed a lot of good stuff and would love to hear more suggestions from readers.

And just to be absolutely clear: I don’t make any money if you click through to those books or buy them or anything like that. The closest thing I have to a material interest in this list is a set of ongoing professional collaborations with three of the authors listed here: Phil Tetlock, Phil Schrodt, and Mike Ward.

Forecasting Politics Is Still Hard to Do (Well)

Last November, after the U.S. elections, I wrote a thing for Foreign Policy about persistent constraints on the accuracy of statistical forecasts of politics. The editors called it “Why the World Can’t Have a Nate Silver,” and the point was that much of what people who follow international affairs care about is still a lot harder to forecast accurately than American presidential elections.

One of the examples I cited in that piece was Silver’s poor performance on the U.K.’s 2010 parliamentary elections. Just two years before his forecasts became a conversation piece in American politics, the guy the Economist called “the finest soothsayer this side of Nostradamus” missed pretty badly in what is arguably another of the most information-rich election environments in the world.

A couple of recent election-forecasting efforts only reinforce the point that, the Internet and polling and “math” notwithstanding, this is still hard to do.

The first example comes from political scientist Chris Hanretty, who applied a statistical model to opinion polls to forecast the outcome of Italy’s parliamentary elections. Hanretty’s algorithm indicated that a coalition of center-left parties was virtually certain to win a majority and form the next government, but that’s not what happened. After the dust had settled, Hanretty sifted through the rubble and concluded that “the predictions I made were off because the polls were off.”

Had the exit polls given us reliable information, I could have made an instant prediction that would have been proved right. As it was, the exit polls were wrong, and badly so. This, to me, suggests that the polling industry has made a collective mistake.

The second recent example comes from doctoral candidate Ken Opalo, who used polling as grist for a statistical mill to forecast the outcome of Kenya’s presidential election. Ken’s forecast indicated that Uhuru Kenyatta would get the most votes but would fall short of the 50-percent-plus-one-vote required to win in the first round, making a run-off “almost inevitable.” In fact, Kenyatta cleared the 50-percent threshold in the first try, making him Kenya’s new president-elect. Once again, noisy polling data was apparently to blame. As Ken noted in a blog post before the results were finalized,

Mr. Kenyatta significantly outperformed the national polls leading to the election. I estimated that the national polls over-estimated Odinga’s support by about 3 percentage points. It appears that I may have underestimated their overestimation. I am also beginning to think that their regional weighting was worse than I thought.

As I see it, both of these forecasts were, as Nate Silver puts it in his book, wrong for the right reasons. Both Hanretty and Opalo built models that used the best and most relevant information available to them in a thoughtful way, and neither forecast was wildly off the mark. Instead, it just so happened that modest errors in the forecasts interacted with each country’s electoral rules to produce categorical outcomes that were quite different from the ones the forecasts had led us to expect.
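
A quick back-of-the-envelope sketch shows how a modest error turns into a categorical miss when a threshold is involved. The numbers here are hypothetical and are not taken from Hanretty’s or Opalo’s models: I’m assuming a forecast first-round vote share of 48 percent with normally distributed error and a standard deviation of 3 points, and `prob_above` is an illustrative helper.

```python
from math import erf, sqrt


def prob_above(threshold, mean, sd):
    """P(X > threshold) for X ~ Normal(mean, sd)."""
    z = (threshold - mean) / sd
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))


# Hypothetical forecast: expected first-round share of 48%, with a
# standard deviation of 3 points to reflect noisy polling.
p_clear = prob_above(50.0, mean=48.0, sd=3.0)
print(f"chance of clearing 50%: {p_clear:.2f}")
```

Under those assumptions the candidate clears the threshold about one time in four, so a forecast of “run-off likely” can be perfectly reasonable and still end up on the wrong side of the line.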

But that’s the rub, isn’t it? Even in the European Union in the Internet age, it’s still hard to predict the outcome of national elections. We’re getting smarter about how to model these things, and our computers can now process more of the models we can imagine, but polling data are still noisy and electoral systems complex.

And that’s elections, where polling data nicely mimic the data-generating process that underlies the events we’re trying to forecast. We don’t have polls telling us what share of the population plans to turn out for anti-government demonstrations or join a rebel group or carry out a coup—and even if we did, we probably wouldn’t trust them. Absent these micro-level data, we turn to proxy measures and indicators of structural opportunities and constraints, but every step away from the choices we’re trying to forecast adds more noise to the result. Agent-based computational models represent a promising alternative, but when it comes to macro-political phenomena like revolutions and state collapses, these systems are still in their infancy.

Don’t get me wrong. I’m thrilled to see more people using statistical models to try to forecast important events in international politics, and I would eagerly pit the forecasts from models like Hanretty’s and Opalo’s against the subjective judgments of individual experts any day. I just think it’s important to avoid prematurely declaring the arrival of a revolution in forecasting political events, to keep reminding ourselves how hard this problem still is. As if the (in)accuracy of our forecasts would let us have it any other way.

A Quick Post Mortem on Oscars Forecasting

I was intrigued to see that statistical forecasts of the Academy Awards from PredictWise and FiveThirtyEight performed pretty well this year. Neither nailed it, but they both used sound processes to generate probabilistic estimates that turned out to be fairly accurate.

In the six categories both sites covered, PredictWise assigned very high probabilities to the eventual winner in four: Picture, Actor, Actress, and Supporting Actress. PredictWise didn’t miss by much in one more—Supporting Actor, where winner Christoph Waltz ran a close second to Tommy Lee Jones (40% to 44%). Its biggest miss came in the Best Director category, where PredictWise’s final forecast favored Steven Spielberg (76%) over winner Ang Lee (22%).

At FiveThirtyEight, Nate Silver and co. also gave the best odds to the same four of six eventual winners, but they were a little less confident than PredictWise about a couple of them. FiveThirtyEight also had a bigger miss in the Best Supporting Actor category, putting winner Christoph Waltz neck and neck with Philip Seymour Hoffman and both of them a ways behind Jones. FiveThirtyEight landed closer to the mark than PredictWise in the Best Director category, however, putting Lee just a hair’s breadth behind Spielberg (0.56 to 0.58 on its index).

If this were a showdown, I’d give the edge to PredictWise for three reasons. First, my eyeballing of the results tells me that PredictWise’s forecasts were slightly better calibrated. Both put four of the six winners in front and didn’t miss by much on one more, but PredictWise was more confident in the four they both got “right.” Second, PredictWise expressed its forecasts as probabilities, while FiveThirtyEight used some kind of unitless index that I found harder to understand. Last but not least, PredictWise also gets bonus points for forecasting all 24 of the categories presented on Sunday night, and against that larger list it went an impressive 19 for 24.

It’s also worth noting the two forecasters used different methods. Silver and co. based their index on lists of awards that were given out before the Oscars, treating those results like the pre-election polls they used to accurately forecast the last couple of U.S. general elections. Meanwhile, PredictWise used an algorithm to combine forecasts from a few different prediction markets, which themselves combine the judgments of thousands of traders. PredictWise’s use of prediction markets gave it the added advantage of making its forecasts dynamic; as the prediction markets moved in the weeks before the awards ceremony, its forecasts updated in real time. We don’t have enough data to say yet, but it may also be that prediction markets are better predictors than the other award results, and that’s why PredictWise did a smidgen better.

If I’m looking to handicap the Oscars next year and both of these guys are still in the game, I would probably convert Silver’s index to a probability scale and then average the forecasts from the two of them. That approach wouldn’t have improved on the four-of-six record they each managed this year, but the results would have been better calibrated than either one alone, and that bodes well for future iterations. Again and again, we’re seeing that model averaging just works, so whenever the opportunity presents itself, do it.
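
As a rough sketch of what that might look like, the Python below uses only the Best Director figures quoted above and restricts attention to the two front-runners, since I don’t have the other nominees’ numbers in front of me. Rescaling each source so its two values sum to one is my own simplifying assumption, not how either site builds its forecasts, and `normalize` and `average` are illustrative helpers.

```python
def normalize(scores):
    """Rescale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}


def average(*forecasts):
    """Unweighted mean of several forecasts over the same candidates."""
    names = forecasts[0].keys()
    return {n: sum(f[n] for f in forecasts) / len(forecasts) for n in names}


# Best Director figures quoted above, restricted to the two front-runners.
predictwise = normalize({"Spielberg": 0.76, "Lee": 0.22})
fivethirtyeight = normalize({"Spielberg": 0.58, "Lee": 0.56})

combined = average(predictwise, fivethirtyeight)
for name, p in combined.items():
    print(f"{name}: {p:.2f}")
```

On these two-way numbers the blend comes out around 64 percent for Spielberg and 36 percent for Lee. That is still a miss on the favorite, but a less confident one than PredictWise alone. A fuller version would cover every nominee and might weight the sources by their track records.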

UPDATE: Later on Monday, Harry Enten did a broader version of this scan for the Guardian’s Film Blog and reached a similar conclusion:

A more important point to take away is that there was at least one statistical predictor got it right in all six major categories. That suggests that a key fact about political forecasting holds for the Oscars: averaging of the averages works. You get a better idea looking at multiple models, even if they themselves include multiple factors, than just looking at one.

Two Forecasting Lessons from a Crazy Football Season

My younger son is a huge fan of the Baltimore Ravens, and his enthusiasm over the past several years has converted me, so we had a lot of fun (and gut-busting anxiety) watching the Super Bowl on Sunday.

As a dad and fan, my favorite part of the night was the Baltimore win. As a forecaster, though, my favorite discovery of the night was a web site called Advanced NFL Stats, one of a budding set of quant projects applied to the game of football. Among other things, Advanced NFL Stats produces charts of the probability that either team will win every pro game in progress, including the Super Bowl. These charts are apparently based on a massive compilation of stats from games past, and they are updated in real time. As we watched the game, I could periodically refresh the page on my mobile phone and give us a fairly reliable, up-to-the-minute forecast of the game’s outcome. Since the Super Bowl confetti has settled, I’ve spent some time poking through archived charts of the Ravens’ playoff run, and that exercise got me thinking about two lessons for forecasters.

1. Improbable doesn’t mean impossible.

To get to the Super Bowl, the Ravens had to beat the Denver Broncos in the divisional round of the playoffs. Trailing by seven with 3:12 left in that game, the Ravens turned the ball over to Denver on downs at the Broncos’ 31-yard line. To win from there, the Ravens would need a turnover or quick stop; then a touchdown; then either a successful two-point conversion or a first score in overtime.

As the chart below shows, the odds of all of those things coming together were awfully slim. At that point—just before “Regulation” on the chart’s bottom axis—Advanced NFL Stats’ live win-probability graph gave the Ravens roughly a 1% chance of winning. Put another way, if the game could be run 100 times from that position, we would only expect to see Baltimore win once.


Well, guess what happened? The one-in-a-hundred event, that’s what. Baltimore got the quick stop they needed, Denver punted, Joe Flacco launched a 70-yard bomb down the right sideline to Jacoby Jones for a touchdown, the Ravens pushed the game into overtime, and two minutes into the second extra period at Mile High Stadium, Justin Tucker booted a 47-yard field goal to carry Baltimore back to the AFC Championship.

For Ravens fans, that outcome was a %@$# miracle. For forecasters, it was a great reminder that even highly unlikely events happen sometimes. When Nate Silver’s model indicates on the eve of the 2012 election that President Obama has a 91% chance of winning, it isn’t saying that Obama is going to win. It’s saying he’s probably going to win, and the Ravens-Broncos game reminds us that there’s an important difference. Conversely, when a statistical model of rare events like coups or mass killings identifies certain countries as more susceptible than others, it isn’t necessarily suggesting that those highest-risk cases are definitely going to suffer those calamities. When dealing with events as rare as those, even the most vulnerable cases will escape most years without a crisis.

The larger point here is one that’s been made many times but still deserves repeating: no single probabilistic forecast is plainly right or wrong. A sound forecasting process will reliably distinguish the more likely from the less likely, but it won’t attempt to tell us exactly what’s going to happen in every case. Instead, the more accurate the forecasts, the more closely the frequency of real-world outcomes or events will track the predicted probabilities assigned to them. If a meteorologist’s model is really good, we should end up getting wet roughly half of the times she tells us there’s a 50% chance of rain. And almost every time the live win-probability graph gives a football team a 99% chance of winning, they will go on to win that game, but, as my son will happily point out, not every time.
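
That idea of outcomes tracking predicted probabilities is what forecasters call calibration, and it’s easy to check once you have a pile of forecasts and outcomes. The sketch below uses simulated data from an idealized forecaster whose stated probabilities are exactly right. The data are made up, and `calibration_table` is an illustrative helper rather than anyone’s production code.

```python
import random


def calibration_table(forecasts, outcomes, n_bins=10):
    """Group forecasts into equal-width probability bins and return, per
    non-empty bin, (mean forecast, observed frequency, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(forecasts, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, y))
    table = []
    for cell in bins:
        if cell:
            mean_p = sum(p for p, _ in cell) / len(cell)
            freq = sum(y for _, y in cell) / len(cell)
            table.append((mean_p, freq, len(cell)))
    return table


# Simulated, perfectly calibrated forecaster: each event occurs with
# exactly the probability that was forecast.
rng = random.Random(0)
forecasts = [rng.choice([0.01, 0.5, 0.99]) for _ in range(30000)]
outcomes = [1 if rng.random() < p else 0 for p in forecasts]

for mean_p, freq, n in calibration_table(forecasts, outcomes):
    print(f"forecast {mean_p:.2f}  observed {freq:.2f}  n={n}")
```

With real forecasts, the rows where the observed frequency drifts away from the mean forecast show where the process is over- or under-confident. Note that the 1% row is not empty of events; it just has very few of them.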

2. The “obvious” indicators aren’t always the most powerful predictors.

Take a look at the Advanced NFL Stats chart below, from Sunday’s Super Bowl. See that sharp dip on the right, close to the end? Something really interesting happened there: late in the game, Baltimore led on score (34-29) but trailed San Francisco in its estimated probability of winning (about 45%).


How could that be? Consideration of the likely outcomes of the next two possessions makes it clearer. At the time, San Francisco had a first-and-goal situation from Baltimore’s seven yard line. Teams with four shots at the end zone from seven yards out usually score touchdowns, and teams that get the ball deep in their own territory with a two- or three-point deficit and less than two minutes to play usually lose. In that moment, the live forecast confirmed the dread that Ravens fans were feeling in our guts: even though San Francisco was still trailing, the game had probably slipped away from Baltimore.

I think there’s a useful lesson for forecasters in that peculiar situation: the most direct indicators don’t tell the whole story. In football, the team with a late-game lead is usually going to win, but Advanced NFL Stats’ data set and algorithm have uncovered at least one situation where that’s not the case.

This lesson also applies to efforts to forecast political processes, like violent conflict and regime collapse. With the former, we tend to think of low-level violence as the best predictor of future civil wars, but that’s not always true. It’s surely a valuable piece of information, but there are other sources of positive and negative feedback that might rein in incipient violence in some cases and produce sudden eruptions in others. Ditto for dramatic changes in political regimes. Eritrea, for example, recently had some sort of mutiny and North Korea did not, but that doesn’t necessarily mean the former is closer to breaking down than the latter. There may be features of the Eritrean regime that will allow it to weather those challenges and aspects of the North Korean regime that predispose it to more abrupt collapse.

In short, we shouldn’t ignore the seemingly obvious signals, but we should be careful to put them in their proper context, and the results will sometimes be counter-intuitive.

Oh, and…THIS:


Dr. Bayes, or How I Learned to Stop Worrying and Love Updating

Over the past week, I spent a chunk of every morning cycling through the desert around Tucson, Arizona. On Friday, while riding toward my in-laws’ place in the mountains west of town, I heard the roar of a jet overhead. My younger son’s really into flying right now, and Tucson’s home to a bunch of fighter jets, so I reflexively glanced up toward the noise, hoping to spot something that would interest him.

On that first glance, all I saw was an empty patch of deep blue sky. Without effort, my brain immediately retrieved a lesson from middle-school physics, reminding me that the relative speeds of light and sound meant any fast-moving plane would appear ahead of its roar. But which way? Before I glanced up again, I drew on prior knowledge of local patterns to guess that it would almost certainly be to my left, traveling east, and not so far ahead of the sound because it would be flying low as it approached either the airport or the Air Force Base. Moments after my initial glance, I looked up a second time and immediately spotted the plane where I’d now expected to find it. When I did, I wasn’t surprised to see that it was a commercial jet, not a military one, because most of the air traffic in the area is civilian.

This is Bayesian thinking, and it turns out that we do it all the time. The essence of Bayesian inference is updating. We humans intuitively form and hold beliefs (estimates) about all kinds of things. Those beliefs are often erroneous, but it turns out that we can make them better by revising (updating) them whenever we encounter new information that pertains to them. Updating is really just a form of learning, but Bayes’ theorem gives us a way to structure that learning that turns out to be very powerful. As cognitive scientists Tom Griffiths and Joshua Tenenbaum summarize in a nice 2006 paper called “Statistics and the Bayesian Mind,”

The mathematics of Bayesian belief is set out in the box. The degree to which one should believe in a particular hypothesis h after seeing data d is determined by two factors: the degree to which one believed in it before seeing d, as reflected by the prior probability P(h), and how well it predicts the data d, as reflected in the likelihood, P(d|h).
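
The box mentioned in that passage isn’t reproduced here. For reference, the standard statement of Bayes’ theorem in the same notation, where the sum in the denominator runs over all the hypotheses h' under consideration, is:

```latex
P(h \mid d) = \frac{P(d \mid h)\, P(h)}{\sum_{h'} P(d \mid h')\, P(h')}
```

The left-hand side is the posterior: how much to believe h after seeing d. The numerator multiplies the two factors the quote names, the likelihood and the prior, and the denominator just rescales so that the posteriors across hypotheses sum to one.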

This might sound like a lot of work or just too arcane to bother, but Griffiths and Tenenbaum argue that we often think that way intuitively. Their paper gives several examples, including predictions about the next result in a series of coin flips and the common tendency to infer causality from clusters that actually arise at random.

The same process appears in my airplane-spotting story. My initial glance is akin to the base rates that are often used as the starting point for Bayesian inference: to see something you hear, look where the sound is coming from. When that prediction failed, I went through three rounds of updating before I looked up again: one based on general knowledge about the relative speeds of light and sound, and then a second (direction) and third (commercial vs. military) based on prior observations of local air traffic. My final “prediction” turned out to be right because those local patterns are strong, but even with all that objective information, there was still some uncertainty. Who knows, there could have been an emergency, or a rogue pilot, or an alien invasion…

I’m writing about this because I think it’s interesting, but I also have ulterior motives. A big part of my professional life involves using statistical models to forecast rare political events, and I am deeply frustrated by frequent encounters with people who dismiss statistical forecasts out of hand (see here and here for previous posts on the subject). It’s probably unrealistic of me to think so, but I am hopeful that recognition of the intuitive nature and power of Bayesian updating might make it easier for skeptics to make use of my statistical forecasts and others like them.

I’m a firm believer in the forecasting power of statistical models, so I usually treat a statistical forecast as my initial belief (or prior, in Bayesian jargon) and then only revise that forecast as new information arrives. That strategy is based on another prior, namely, the body of evidence amassed by Phil Tetlock and others that the predictive judgments of individual experts often aren’t very reliable, and that statistical models usually produce more accurate forecasts.

From personal experience I gather that most people, including many analysts and policymakers, don’t share that belief about the power of statistical models for forecasting. Even so, I would like to think those skeptics might still see how Bayes’ rule would allow them to make judicious use of statistical forecasts, even if they trust their own or other experts’ judgments more. After all, to ignore a statistical forecast is equivalent to holding the extreme view that the forecast contains absolutely no useful information. In The Theory That Would Not Die, an entertaining lay history of Bayes’ rule, Sharon Bertsch McGrayne quotes Larry Stone, a statistician who used the theorem to help find a nuclear submarine that went missing in 1968, as saying that, “Discarding one of the pieces of information is in effect making the subjective judgment that its weight is zero and the other weight is one.”

So instead of rejecting the statistical forecast out of hand, why not update in response to it? When the statistical forecast closely accords with your prior belief, it will strengthen your confidence in that judgment, and rightly so. When the statistical forecast diverges from your prior belief, Bayes’ theorem offers a structured but simple way to arrive at a new estimate. Experience shows that this deliberate updating will produce more accurate forecasts than the willful myopia involved in ignoring the new information the statistical model has provided. And, as a kind of bonus, the deliberation involved in estimating the conditional probabilities Bayes’ theorem requires may help to clarify your thinking about the underlying processes involved and the sensitivity of your forecasts to certain assumptions.
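
To make that updating step concrete, here is a minimal sketch for a yes/no event. Every number in it is hypothetical: the 10 percent prior, and the assumed rates at which a model raises a "high risk" flag when the event does and does not follow. The function name `bayes_update` is mine, not part of any library.

```python
def bayes_update(prior, p_signal_if_event, p_signal_if_no_event):
    """Return P(event | signal) for a binary event via Bayes' theorem."""
    numerator = p_signal_if_event * prior
    evidence = numerator + p_signal_if_no_event * (1.0 - prior)
    return numerator / evidence


# Hypothetical inputs: you start out thinking the event has a 10% chance.
prior = 0.10
# Assumed track record of the model's "high risk" flag (illustrative only):
# it appears before 80% of events and before 20% of non-events.
posterior = bayes_update(prior, p_signal_if_event=0.80, p_signal_if_no_event=0.20)

# A flag that is equally likely either way carries no information,
# so the belief should not move at all.
unmoved = bayes_update(prior, p_signal_if_event=0.50, p_signal_if_no_event=0.50)

print(f"prior {prior:.2f} -> posterior {posterior:.2f}")
print(f"uninformative signal leaves belief at {unmoved:.2f}")
```

The second call is the formal version of ignoring the forecast: the belief stays put only if you assume the model's signal is exactly as likely when the event is coming as when it is not.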

PS. For some nice worked examples of Bayesian updating, see Appendix B of The Theory That Would Not Die or Chapter 8 of Nate Silver’s book, The Signal and the Noise. And thanks to Paul Meinshausen for pointing out the paper by Griffiths and Tenenbaum, and to Jay Yonamine for recommending The Theory That Would Not Die.

It’s Not Just The Math

This week, statistics-driven political forecasting won a big slab of public vindication after the U.S. election predictions of an array of number-crunching analysts turned out to be remarkably accurate. As John Sides said over at the Monkey Cage, “2012 was the Moneyball election.” The accuracy of these forecasts, some of them made many months before Election Day,

…shows us that we can use systematic data—economic data, polling data—to separate momentum from no-mentum, to dispense with the gaseous emanations of pundits’ “guts,” and ultimately to forecast the winner.  The means and methods of political science, social science, and statistics, including polls, are not perfect, and Nate Silver is not our “algorithmic overlord” (a point I don’t think he would disagree with). But 2012 has showed how useful and necessary these tools are for understanding how politics and elections work.

Now I’ve got a short piece up at Foreign Policy explaining why I think statistical forecasts of world politics aren’t at the same level and probably won’t be very soon. I hope you’ll read the whole thing over there, but the short version is: it’s the data. If U.S. electoral politics is a data hothouse, most of international politics is a data desert. Statistical models make very powerful forecasting tools, but they can’t run on thin air, and the density and quality of the data available for political forecasting drops off precipitously as you move away from U.S. elections.

Seriously: you don’t have to travel far in the data landscape to start running into trouble. In a piece posted yesterday, Stephen Tall asks rhetorically why there isn’t a British Nate Silver and then explains that it’s because “we [in the U.K.] don’t have the necessary quality of polls.” And that’s the U.K., for crying out loud. Now imagine how things look in, say, Ghana or Sierra Leone, both of which are holding their own national elections this month.

Of course, difficult does not mean impossible. I’m a bit worried, actually, that some readers of that Foreign Policy piece will hear me saying that most political forecasting is still stuck in the Dark Ages, when that’s really not what I meant. I think we actually do pretty well with statistical forecasting on many interesting problems in spite of the dearth of data, as evidenced by the predictive efforts of colleagues like Mike Ward and Phil Schrodt and some of the work I’ve posted here on things like coups and popular uprisings.

I’m also optimistic that the global spread of digital connectivity and associated developments in information-processing hardware and software are going to help fill some of those data gaps in ways that will substantially improve our ability to forecast many political events. I haven’t seen any big successes along those lines yet, but the changes in the enabling technologies are pretty radical, so it’s plausible that the gains in data quality and forecasting power will happen in big leaps, too.

Meanwhile, while we wait for those leaps to happen, there are some alternatives to statistical models that can help fill some of the gaps. Based partly on my own experiences and partly on my read of relevant evidence (see here, here, and here for a few tidbits), I’m now convinced that prediction markets and other carefully designed systems for aggregating judgments can produce solid forecasts. These tools are most useful in situations where the outcome isn’t highly predictable but relevant information is available to those who dig for it. They’re somewhat less useful for forecasting the outcomes of decision processes that are idiosyncratic and opaque, like the North Korean government or even the U.S. Supreme Court. There’s no reason to let the perfect be the enemy of the good, but we should use these tools with full awareness of their limitations as well as their strengths.

More generally, though, I remain convinced that, when trying to forecast political events around the world, there’s a complexity problem we will never overcome no matter how many terabytes of data we produce and consume, how fast our processors run, and how sophisticated our methods become. Many of the events that observers of international politics care about are what Nassim Nicholas Taleb calls “gray swans”—“rare and consequential, but somewhat predictable, particularly to those who are prepared for them and have the tools to understand them.”

These events are hard to foresee because they bubble up from a complex adaptive system that’s constantly evolving underfoot. The patterns we think we discern in one time and place can’t always be generalized to others, and the farther into the future we try to peer, the thinner those strands get stretched. Events like these “are somewhat tractable scientifically,” as Taleb puts it, but we should never expect to predict their arrival the way we can foresee the outcomes of more orderly processes like U.S. elections.

When Is a Forecast Wrong?

This topic came up a few days ago when Foreign Policy managing editor Blake Hounshell tweeted: “Fill in the blank: If Nate Silver calls this election wrong, _____________.”

Silver writes the widely read FiveThirtyEight blog for the New York Times and is the closest thing to a celebrity that statistical forecasting of politics has ever produced. Silver uses a set of statistical models to produce daily forecasts of the outcome of the upcoming presidential and Congressional elections, and I suspect that Hounshell was primarily interested in how the accuracy of those forecasts might solidify or diminish Silver’s deservedly stellar reputation.

What caught my mind’s eye in Hounshell’s tweet, though, was what it suggested about how we conventionally assess a forecast’s accuracy. The question at the head of this post seems easy enough to answer: a forecast is wrong when it says one thing is going to happen and then something else does. For example, if I predict that a flipped coin is going to come up heads but it lands on tails, my forecast was incorrect. Or, if I say Obama will win in November but Romney does, I was obviously wrong.

But here’s the thing: few forecasters worth their salt will make that kind of unambiguous prediction. Silver won’t try to call the race one way or the other; instead, he’ll estimate the probabilities of all possible outcomes. As of today, his model of the presidential contest pegs Obama’s chances of re-election at about 70 percent—not exactly a toss-up, but hardly a done deal, either. Over at Votamatic, Drew Linzer’s model gives Obama a stronger chance of re-election—better than 95 percent—but even that estimate doesn’t entirely eliminate the possibility of a Romney win.

So when is a forecast like Silver’s or Linzer’s wrong? If a meteorologist says there’s a 20 percent chance of rain, and it rains, was he or she wrong? If an analyst tells you there probably won’t be a revolution in Tunisia this year and then there is one, was that a “miss”?

The important point here is that these forecasts are probabilities, not absolutes, and we really ought to evaluate them as such. The world is inherently uncertain, and sound forecasts will reflect that uncertainty instead of pretending to eliminate it. As Linzer said in a recent blog post,

It’s not realistic to expect any model to get exactly the right answer—the world is just too noisy, and the data are too sparse and (sadly) too low quality. But we can still assess whether the errors in the model estimates were small enough to warrant confidence in that model, and make its application useful and worthwhile.

So, what kinds of errors should we look for? Statistical forecasters draw a helpful distinction between discrimination and calibration. Discrimination refers to a model’s ability to distinguish accurately between cases in different categories: heads or tails, incumbent or challenger, coup or no coup. Models designed to forecast events that can be classed this way should be able to distinguish the one from the other in some useful way. Exactly what constitutes “useful,” though, will depend on the nature of the forecasting problem. For example, if one of two outcomes occurs very rarely—say, the start of a war between two countries—it’s often going to be very hard to do better at discrimination than a naïve forecast of “It’ll never happen.” If two possible outcomes occur with similar frequency, however, then a coin flip offers a more appropriate analogy.

For models basing forecasts of the presidential election on the aggregation of state-level results, we might ask, “In how many states did the model identify the eventual winner as the favorite?” Of course, some states are easier to call than others—it’s not much of a gamble to identify Obama as the likely winner in my home state of Maryland this year—so we’d want to pay special attention to the swing states, asking if the model does better than random chance at identifying the likely winner in those toss-up situations without screwing up the easier calls.
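
One common way to put a number on discrimination is the area under the ROC curve (AUC): the chance that a randomly chosen case where the event happened received a higher forecast than a randomly chosen case where it did not. The sketch below computes it directly from that definition. The forecasts and outcomes are invented for illustration, and the function name `auc` is my own.

```python
def auc(probs, outcomes):
    """Probability that a random event case outranks a random non-event case.

    Ties count as half a win, so a forecast that never varies scores 0.5.
    """
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up forecasts for six cases, and whether the event occurred (1) or not (0).
forecasts = [0.05, 0.10, 0.20, 0.40, 0.70, 0.90]
happened = [0, 0, 1, 0, 1, 1]

model_auc = auc(forecasts, happened)
# A constant "it'll never happen" forecast gives every case the same score.
naive_auc = auc([0.0] * len(happened), happened)

print(f"model AUC: {model_auc:.2f}, constant forecast AUC: {naive_auc:.2f}")
```

On this measure a constant forecast earns 0.5 no matter how rare the event is, even though its raw share of correct calls can look impressive when events are rare. That is one reason to pick the baseline and the yardstick together.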

Calibration is a little different.  When an event-forecasting process is working well, the probabilities it produces will closely track the real-world incidence of the event of interest—that is, the frequency with which it occurs. Across all situations where a well-calibrated model tells us there’s a 20- to 30-percent chance of a rebellion occurring, we should see rebellions occur roughly one of every four times. As consumers of well-calibrated weather forecasts, we should know that a 30-percent chance of rain doesn’t mean it’s not going to rain; it means it probably won’t, but approximately three of every 10 times we see that forecast, it will. For election models that try directly to pick a winner, we can see how closely the predicted probabilities track the frequencies of the observed outcomes (see here for one example). For election models that try to identify a likely winner by forecasting vote shares (popular or Electoral College), we can see how closely the predicted shares match the observed ones.
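
A calibration check of the kind described above can be sketched in a few lines: group the forecasts into probability bins, then compare the average forecast in each bin with how often the event actually occurred. The data are invented and constructed to be perfectly calibrated; the function name `calibration_table` and the choice of five equal-width bins are my assumptions.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Compare mean forecast with observed event frequency in each bin.

    Returns rows of (bin_low, bin_high, n_cases, mean_forecast, observed_rate),
    skipping bins that contain no forecasts.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # a forecast of 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue
        mean_forecast = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        rows.append((i / n_bins, (i + 1) / n_bins, len(members),
                     mean_forecast, observed_rate))
    return rows


# Invented example: eight forecasts of 25% and four forecasts of 75%.
forecasts = [0.25] * 8 + [0.75] * 4
happened = [1, 0, 0, 0, 1, 0, 0, 0] + [1, 1, 1, 0]

rows = calibration_table(forecasts, happened)
for low, high, n, mean_forecast, observed_rate in rows:
    print(f"{low:.1f}-{high:.1f}: n={n}, forecast={mean_forecast:.2f}, "
          f"observed={observed_rate:.2f}")
```

With real forecasts the two columns will never match exactly, and bins with only a handful of cases will bounce around a lot, so the gaps need to be read with the sample sizes in mind.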

A theme common to both yardsticks is that we can’t properly assess a forecast’s accuracy without first identifying a realistic baseline. In a world where crystal balls don’t exist, the proper goal is to be better than the competition, not oracular.

And, in many cases, that bar will be set pretty low. For most of the forms of political change I’ve tried to forecast over the past 15 years—things like coup attempts or the occurrence of mass atrocities—the main competition is occasional and usually ambiguous statements from experts. The ambiguity of these statements makes it very hard to assess their accuracy, but evidence from carefully structured studies of expert judgment suggests that they usually aren’t a whole lot more accurate than random guessing. When that, and not clairvoyance, is the state of the art, it’s not as hard to make useful forecasts as I think we conventionally suppose.

Given that reality, I think we’re often tougher on forecasters than we should be. Instead of judging forecasters according to the entirety of their track records and comparing those records to a realistic baseline, we succumb to the availability heuristic and lionize or dismiss forecasters on the basis of the last big call they made. This tendency is, I suspect, at least part of what Hounshell was thinking about when he tweeted his question about Silver.

What we need to understand, though, is that this reflex means we often get worse forecasts than we otherwise might. When forecasters’ reputations can collapse from a single wrong call, there’s not much incentive to get into the business in the first place, and once in, there’s a strong incentive to be as ambiguous as possible as a hedge against a career-defining error. Those strategies might make professional sense, but they don’t lead to more useful information.

A Comment on Nate Silver’s The Signal and the Noise

I’ve just finished reading Nate Silver’s very good new book, The Signal and the Noise: Why So Many Predictions Fail—But Some Don’t. For me, the book was more of a tossed salad than a cake—a tasty assemblage of conceptually related parts that doesn’t quite cohere into a new whole. Still, I would highly recommend it to anyone interested in forecasting, a category that should include anybody with a pulse, as Silver persuasively argues. We can learn a lot just by listening to a skilled practitioner talk about his craft, and that, to me, is what The Signal and the Noise is really all about.

Instead of trying to review the whole book here, though, I wanted to pull on one particular thread running through it, because I worry about where that thread might lead some readers. That thread concerns the relative merits of statistical models and expert judgment as forecasting tools.

Silver is a professional forecaster who built his reputation on the clever application of statistical tools, but that doesn’t mean he’s a quantitative fundamentalist. To the contrary, one of the strongest messages in The Signal and the Noise is that our judgment may be poor, but we shouldn’t fetishize statistical models, either. Yes, human forecasters are inevitably biased, but so, in a sense, are the statistical models they build. For starters, those models entail a host of assumptions about the reliability and structure of the data, many of which will often be wrong. For another, there is often important information that’s hard to quantify but is useful for forecasting, and we ignore the signals from this space at our own peril. Third, forecasts are usually more accurate when they aggregate information from multiple, independent sources, and subjective forecasts from skilled experts can be a really useful leg in this stool.

Putting all of these concerns together, Silver arrives at a philosophy of forecasting that might be described as “model-assisted,” or maybe just “omnivorous.” Silver recognizes the power of statistics for finding patterns in noisy data and checking our mental models, but he also cautions strongly against putting blind faith in those tools and favors keeping human judgment in the cycle, including at the final stage, where we actually make a forecast about some situation of interest.

To illustrate the power of model-assisted forecasting, Silver describes how well this approach has worked in several areas, including baseball scouting, election forecasting, and meteorology. About the last of these, for example, he writes that “weather forecasting is one of the success stories in this book, a case of man and machine joining forces to understand and sometimes anticipate the complexities of nature.”

All of what Silver says about the pitfalls of statistical forecasting and the power of skilled human forecasters is true, but only to a point. I think Silver’s preferred approach depends on a couple of conditions that are often absent in real-world efforts to forecast complex political phenomena, where I’ve done most of my work. Because Silver doesn’t spell those conditions out, I thought I would, in an effort to discourage readers of The Signal and the Noise from concluding that statistical forecasts can always be improved by adjusting them according to our judgment.

First, the process Silver recommends assumes that the expert tweaking the statistical forecast is the modeler, or at least has a good understanding of the strengths and weaknesses of the model(s) being used. For example, he describes experienced meteorologists improving their forecasts by manually adjusting certain values to correct for a known flaw in the model. Those manual adjustments seem to make the forecasts better, but they depend on a pretty sophisticated knowledge of the underlying algorithm and the idiosyncrasies of the historical data.

Second and probably more important, the scenarios Silver approvingly describes all involve situations where the applied forecaster gets frequent and clear feedback on the accuracy of his or her predictions. This feedback allows the forecaster to look for patterns in the performance of the statistical tool and the adjustments being made to it. It’s the familiar process of trial and error, but that process only works when we can see where the errors are and see if the fixes we attempt are actually working.

Both of these conditions hold in several of the domains Silver discusses, including baseball scouting and meteorology. These are data-rich environments where forecasters often know the quirks of the data and statistical models they might use and can constantly see how they’re doing.

In the world of international politics, however, most forecasters—and, whether they realize it or not, every analyst is a forecaster—have little or no experience with statistical forecasting tools and are often skeptical of their value. As a result, discussions about the forecasts these tools produce are more likely to degenerate into a competitive, “he said, she said” dynamic than they are to achieve the synergy that Silver praises.

More important, feedback on the predictive performance of analysts in international politics is usually fuzzy or absent. Poker players get constant feedback from the changing size of their chip stacks. By contrast, people who try to forecast politics rarely do so with much specificity, and even when they do, they rarely keep track of their performance over time. What’s worse, the events we try to forecast—things like coups or revolutions—rarely occur, so there aren’t even that many opportunities to assess our performance even if we try. Most of the score-keeping is done in our own heads, but as Phil Tetlock shows, we’re usually poor judges of our own performance. We fixate on the triumphs, forget or explain away the misses, and spin the ambiguous cases as successes.

In this context, it’s not clear to me that Silver’s ideal of “model-assisted” forecasting is really attainable, at least not without more structure being imposed from the outside. For example, I could imagine a process where a third party elicits forecasts from human prognosticators and statistical models and then combines the results in a way that accounts for the strengths and blind spots of each input. This process would blend statistics and expert judgment, just not by means of a single individual as often happened in Silver’s favored examples.
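
The simplest version of that third-party process is a weighted average of the probabilities from each source. The sketch below is only that, a sketch: the forecasts and the weights are hypothetical, the function name `pool_forecasts` is mine, and in practice the weights would need to come from each source's track record rather than from anyone's gut.

```python
def pool_forecasts(forecasts, weights):
    """Weighted average of probability forecasts from several sources."""
    if len(forecasts) != len(weights):
        raise ValueError("need exactly one weight per forecast")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(p * w for p, w in zip(forecasts, weights)) / total


# Hypothetical inputs: a statistical model says 30%, an expert says 10%.
model_forecast, expert_forecast = 0.30, 0.10

# Illustrative weights only, leaning toward the model.
blended = pool_forecasts([model_forecast, expert_forecast], [0.7, 0.3])

# Weights of one and zero reproduce the habit of ignoring one source entirely.
expert_ignored = pool_forecasts([model_forecast, expert_forecast], [1.0, 0.0])

print(f"blended: {blended:.2f}, expert ignored: {expert_ignored:.2f}")
```

The point of writing the combination down is that it can then be scored like any other forecast, which is exactly the check that usually goes missing when the blending happens in someone's head.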

Meanwhile, the virtuous circle Silver describes is already built into the process of statistical modeling, at least when done well. For example, careful statistical forecasters will train their models on one sample of cases and then apply them to another sample they’ve never “seen” before. This out-of-sample validation lets modelers know if they’re onto something useful and gives them some sense of the accuracy and precision of their models before they rush out and apply them.
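
Here is a minimal sketch of that train-then-test routine on simulated data. Everything in it is an assumption made for illustration: the two risk groups, the event rates used to simulate outcomes, the 70/30 split, and the use of the Brier score (mean squared error of the probabilities) as the accuracy measure. The "model" is deliberately trivial, just the event rate observed in each group of the training sample.

```python
import random


def split_cases(cases, test_share=0.3, seed=0):
    """Shuffle the cases and hold out a share of them for testing."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


def fit_group_rates(train):
    """'Train' a very simple model: the observed event rate in each group."""
    counts, events = {}, {}
    for group, y in train:
        counts[group] = counts.get(group, 0) + 1
        events[group] = events.get(group, 0) + y
    return {g: events[g] / counts[g] for g in counts}


def brier_score(model, cases, default):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    errors = [(model.get(g, default) - y) ** 2 for g, y in cases]
    return sum(errors) / len(errors)


# Simulated data: 200 cases, a quarter of them in a hypothetical high-risk group.
rng = random.Random(42)
cases = []
for i in range(200):
    group = "high" if i % 4 == 0 else "low"
    p_true = 0.30 if group == "high" else 0.05  # assumed rates, for simulation only
    cases.append((group, 1 if rng.random() < p_true else 0))

train, test = split_cases(cases)
model = fit_group_rates(train)
base_rate = sum(y for _, y in train) / len(train)  # fallback for unseen groups

in_sample = brier_score(model, train, base_rate)
out_of_sample = brier_score(model, test, base_rate)
print(f"Brier score in sample: {in_sample:.3f}, out of sample: {out_of_sample:.3f}")
```

The out-of-sample number is the one to trust, because the model never saw those cases while it was being fit. With events this rare, the held-out sample may contain only a few of them, so the score itself will be noisy.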

I couldn’t help but wonder how much Silver’s philosophy was shaped by the social part of his experiences in baseball and elections forecasting. In both of those domains, there’s a running culture clash, or at least the perception of one, between statistical modelers and judgment-based forecasters—nerds and jocks in baseball, quants and pundits in politics. When you work in a field like that, you can get a lot of positive social feedback by saying “Everybody’s right!” I’ve sat in many meetings where someone proposed combining statistical forecasts and expert judgment without specifying how that process would work or how we would check whether the combination actually improves forecast accuracy. Almost every time, though, that proposal is met with a murmur of assent: “Of course! Experts are experts! More is more!” I get the sense that this advice will almost always be popular, but I’m not convinced that it’s always sound.

Silver is right, of course, when he argues that we can never escape subjectivity. Modelers still have to choose the data and models they use, both of which bake a host of judgments right into the pie. What we can do with models, though, is discipline our use of those data, and in so doing, more clearly compare sets of assumptions to see which are more useful. Most political forecasters don’t currently inhabit a world where they can get to know the quirks of the statistical models and adjust for them. Most don’t have statistical models or hold them at arm’s length if they do, and they don’t get to watch them perform anywhere near enough to spot and diagnose the biases. When these conditions aren’t met, we need to be very cautious about taking forecasts from a well-designed model and tweaking them because they don’t feel right.

When Forecasting Rare Events, the Value Comes from the Surprises

Forecasters of U.S. presidential elections are carrying on a healthy debate about the power and value of the models they construct. Nate Silver fired the opening salvo with a post arguing that the forecasts aren’t nearly as good as political scientists (and their publishers) claim. John Sides and Lynn Vavreck responded with reasoned defenses, and Brendan Nyhan’s earlier post on the topic deserves another look in response to Silver’s skepticism as well.

One reason it’s so hard to forecast U.S. presidential elections is that there aren’t that many examples from which to learn. American presidential elections only happen 25 times each century, and the country’s only been around for a couple of those. As if that weren’t enough trouble, it’s hard to imagine that the forces shaping the outcomes of those contests aren’t changing over time. Just 25 election cycles ago, TVs and PCs didn’t exist, and most American homes didn’t even have phones.

Those of us who try to forecast rare forms of political conflict and crisis confront a similar challenge. Right now, I’m working on a model that’s meant to help anticipate onsets of state-sponsored mass killing in countries around the world. Since World War II, there have been only 110 of these “events” worldwide, and they have become even rarer in the two decades since the collapse of the Soviet Union.

The rarity of these atrocious episodes is good news for humanity, of course, but it does make statistical forecasting more difficult. With so few events, statistical models don’t have many cases on which to train, and modelers have to think more carefully about the trade-offs involved in partitioning the data for the kind of out-of-sample cross-validations that offer the most information about the accuracy of their constructs. The same logic applies to wars within and between states, coups, democratic transitions, popular uprisings, and just about everything else I’ve ever been asked to try to forecast.

When modeling events as rare as these in a data set that covers all relevant cases, the utility of the forecasts isn’t in the point estimate of the likelihood that the event will occur. With small samples and noisy data sets, those point estimates are way too uncertain to take literally, and even the most powerful models will never generate predictions that are nearly as precise as we’d like.

Instead, a good starting point for forecasting from rare-events models is a list of all at-risk cases shown in descending order by estimated probability of event occurrence. Most of the countries at the tops and bottoms of these lists will strike their consumers as “no-brainers.” For example, most of us probably don’t need a statistical model to tell us that China is especially susceptible to the onset of civil-resistance campaigns because it’s an authoritarian regime with more than 1 billion citizens. Likewise for a list that tells us Norway is unlikely to break out in civil war this year. Both of those forecasts can be accurate without being especially useful.
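
Building that kind of list is the easy part. A minimal sketch, using placeholder country names and made-up probabilities rather than output from any real model:

```python
# Hypothetical model output: estimated probability of event onset this year.
forecasts = {
    "Country A": 0.020,
    "Country B": 0.110,
    "Country C": 0.004,
    "Country D": 0.070,
}

# Sort from highest to lowest estimated risk to get the watch list.
watch_list = sorted(forecasts.items(), key=lambda item: item[1], reverse=True)

for rank, (country, prob) in enumerate(watch_list, start=1):
    print(f"{rank}. {country}: {prob:.1%}")
```

What a reader does with the list depends on the ordering more than on the printed percentages, which is why the ranks are shown first.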

The real value of rare-events forecasts comes from the surprises–the cases for which a ranked list generated from a reasonably reliable model contradicts our prior expectations. These deviations provide us with a useful signal to revisit those expectations and, when relevant, to prepare against or even move to prevent that crisis’ occurrence.

Take the recent coup in Mali. While the conventional narrative described this country as a consolidated democracy, a watch list generated from statistical models identified it as one of the countries in the world most likely to suffer a coup attempt in 2012. Had people concerned about Mali’s political stability seen that forecast ahead of time, it might have spurred them to rethink their assumptions and perhaps prepare better for this unfortunate turn of events.

These surprises can cut the other way, too. In January, when I used a model of democratic transitions to generate forecasts for 2012, I was chagrined to see that Egypt ranked pretty far down the list. Now, with the outcome of the transition increasingly in doubt, I’m thinking that forecast wasn’t so bad after all. For concerned observers, a forecast like that could have served as a useful reminder that Egypt still isn’t on a steady glide path to democracy.

Even with well-calibrated models, these “deviations” won’t always prove prescient. A watch list that accurately identified Egypt, Morocco, and Syria as three of the countries most likely to see civil-resistance campaigns emerge in 2011 also ranked North Korea in the top 10 for that year, and nothing in that list or the underlying model could have told us in advance which would be which.

In spite of that imprecision, I think the forecasts worked pretty well. Most of the countries toward the top of the list may not have seen popular uprisings, but nearly all of the uprisings that did occur happened in top 30 countries. Analysts who were surprised to see a civil-resistance campaign erupt in Syria might not have been so surprised if they had seen those forecasts and reconsidered their mental models accordingly.
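
That way of judging a watch list can be written down as two numbers: the share of observed events that fell inside the top k of the list, and the share of top-k countries that actually saw an event. The sketch below uses an invented ten-country list and invented outcomes; the function names are my own.

```python
def share_of_events_captured(ranked_cases, events, k):
    """Share of observed events that fell among the k highest-ranked cases."""
    if not events:
        raise ValueError("need at least one observed event")
    top_k = set(ranked_cases[:k])
    return sum(1 for case in events if case in top_k) / len(events)


def hit_rate_in_top_k(ranked_cases, events, k):
    """Share of the k highest-ranked cases in which the event occurred."""
    top_k = ranked_cases[:k]
    return sum(1 for case in top_k if case in events) / len(top_k)


# Placeholder list, already sorted from highest to lowest estimated risk.
ranked = [f"Country {i}" for i in range(1, 11)]
# Invented outcomes: events occurred in the 2nd- and 4th-ranked countries.
events = {"Country 2", "Country 4"}

captured = share_of_events_captured(ranked, events, k=5)
hit_rate = hit_rate_in_top_k(ranked, events, k=5)
print(f"events captured in top 5: {captured:.0%}; top-5 hit rate: {hit_rate:.0%}")
```

In this toy case most of the top five see nothing, yet every event that occurs is in the top five. For rare events that combination is about what a useful list should be expected to look like.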

The broader point is that, when trying to forecast rare events, we shouldn’t get too hung up on the exact values of the predicted probabilities. The model we’re striving for here isn’t an actuarial table that allows us to allocate our dollars and attention as efficiently as possible. Even if policy and advocacy worked that way–and they don’t–the statistics won’t allow it.

A more useful model, I think, is the light on your car’s dashboard that tells you you’re running low on fuel. When that light comes on, you don’t know how far you can drive before you’ll run out of gas, but you do know that you’d better start worrying about refilling soon. The light directs your attention to a potential problem you probably weren’t thinking about a few moments earlier. A reasonably well-calibrated statistical model of rare political events should do the same thing for analysts and other concerned observers, whose attention usually doesn’t get redirected until the engine is already sputtering.

