Dart-Throwing Chimp

Coup Forecasts for 2014


This year, I’ll start with the forecasts, then describe the process. First, though, a couple of things to bear in mind as you look at the map and dot plot:

  1. Coup attempts rarely occur, so the predicted probabilities are all on the low side, and most are approximately zero. The fact that a country shows up in dark red on the map or ranks high on an ordered list does not mean that we should anticipate a coup occurring there. It just means that country is at relatively high risk compared to the rest of the world. Statistically speaking, the safest bet for any country in almost any year is that a coup attempt won’t occur. The point of this exercise is to try to get a better handle on where the few coup attempts we can expect to see this year are most likely to happen.
  2. These forecasts are estimates based on noisy data, so they are highly imprecise, and small differences are not terribly meaningful. The fact that one country lands a few notches higher or lower than another on an ordered list does not imply a significant difference in risk.

Okay, now the forecasts. First, the heat-map version, which sorts the world into fifths. From cross-validation in the historical data, we can expect nearly 80 percent of the countries with coup attempts this year to be somewhere in that top fifth. So, if there are four countries with coup attempts in 2014, three of them are probably in dark red on that map, and the other one is probably dark orange.
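
For readers who want to see the mechanics, here is a minimal sketch of that binning in R. It assumes a hypothetical data frame fcast with one row per country and a column p.mean holding the averaged predicted probability for 2014 (the averaging itself is sketched after the dot-plot paragraph below); the names are mine, not the posted script's.

```r
# Sketch: sort countries into fifths of predicted coup risk for the heat map.
# 'fcast' and 'p.mean' are hypothetical names, not those in the posted script.
breaks <- quantile(fcast$p.mean, probs = seq(0, 1, by = 0.2))
fcast$tier <- cut(fcast$p.mean, breaks = breaks, include.lowest = TRUE,
                  labels = c("bottom", "fourth", "third", "second", "top"))
table(fcast$tier)  # roughly one fifth of countries per tier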

Now, a dot plot of the Top 40, which is a slightly larger set than the top fifth in the heat map. Here, the gray dots show the forecasts from the two component models (see below), while the red dots are the unweighted average of those two—what I consider the single-best forecast.
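
In code, those red dots are nothing fancier than a mean of two columns. A sketch under the same assumptions as above, with p.logit and p.rf as hypothetical names for the two component forecasts:

```r
# Sketch: the combined forecast is the unweighted mean of the two component
# models' predicted probabilities; the dot plot then takes the Top 40.
fcast$p.mean <- (fcast$p.logit + fcast$p.rf) / 2
top40 <- head(fcast[order(fcast$p.mean, decreasing = TRUE), ], 40)
```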

A lot of food for thought in there, but I’m going to leave interpretation of these results to future posts and to you.

Now, on the process: As statistical forecasters are wont to do, I have tinkered with the models again this year. As I said in a blogged research note a couple of weeks ago, this year’s tinkering was driven by a combination of data practicalities and the usual sense of, “Hey, let’s see if we can do a little better this time.” Predictably, though, I also ended up doing things a little differently than I’d expected in December. Specifically:

The forecasts are an unweighted average of predicted probabilities from a logistic regression model and a Random Forest that use more or less the same inputs. Both models were trained on data covering the period 1960-2010; applied to data from 2011 to 2013 to assess their predictive performance; and then applied to the newest data to generate forecasts for 2014. (For readers who want to see the moving parts, I sketch this setup in code below the variable discussion.) Variable selection was based mostly on my prior experience working on this problem. As noted above, I did a little bit of model checking—using stratified 10-fold cross-validation—to make sure the process worked reasonably well, and to help choose among different measures of the same concept. In that cross-validation, the unweighted average got good but not great accuracy scores, with an area under the ROC curve in the low 0.80s. Here are the variables used in the models:

All of the predictors are lagged one year except for region, last colonizer, country age, post-Cold War period, and the election-year indicator. The fact that a variable appears on this list does not necessarily mean that it has a significant effect on the risk of a coup attempt. As I said earlier, I drew up a roster of variables to include based on a sense of what might matter (a.k.a. theory) and on past experience, and I did not try to do much winnowing.
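
Here is that promised sketch of the setup in R. It is a minimal illustration under stated assumptions, not the posted script: dat is a hypothetical country-year data frame with a binary outcome coup.try, infant.mortality stands in for a time-varying predictor to show the one-year lag, and the formula is a stand-in for the actual variable list.

```r
library(dplyr)          # grouped one-year lag
library(randomForest)   # Breiman-Cutler Random Forest
library(pROC)           # AUC for the out-of-sample check

# Hypothetical country-year frame 'dat' with a binary outcome 'coup.try'.
# Lag a stand-in time-varying predictor one year within each country.
dat <- dat %>%
  arrange(country, year) %>%
  group_by(country) %>%
  mutate(infant.mortality.lag1 = lag(infant.mortality)) %>%
  ungroup()

dat$coup.try <- factor(dat$coup.try)  # factor outcome => classification forest

# Stand-in formula; the posted script spells out the actual predictor list.
f <- coup.try ~ infant.mortality.lag1 + region + country.age + post.cold.war

# Train on 1960-2010, then assess predictive performance on 2011-2013.
train <- subset(dat, year <= 2010)
test  <- subset(dat, year %in% 2011:2013)

m.logit <- glm(f, data = train, family = binomial)
m.rf    <- randomForest(f, data = train, ntree = 1000, na.action = na.omit)

# Unweighted average of the two models' predicted probabilities.
p.logit <- predict(m.logit, newdata = test, type = "response")
p.rf    <- predict(m.rf, newdata = test, type = "prob")[, "1"]
p.mean  <- (p.logit + p.rf) / 2

# Out-of-sample accuracy: area under the ROC curve.
auc(roc(response = test$coup.try, predictor = p.mean))
```

Refitting the same two models on all years through 2013 and applying them to the newest data would then yield the 2014 forecasts.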

If you are interested in exploring the results in more detail or just trying to do this better, you can replicate my analysis using code I’ve put on GitHub (here). The posted script includes a Google Drive link with the requisite data. If you tinker and find something useful, I only ask that you return the favor and let me know. [N.B. As its name implies, the generation of a Random Forest is partially stochastic, so the results will vary slightly each time the process is repeated. If you run the posted script on the posted data, you can expect to see some small differences in the final estimates. I think these small differences are actually a nice representation of the forecasts’ inherent uncertainty, so I have not attempted to eliminate them by, for example, setting the random number seed within the R script.]
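
That run-to-run variation is easy to see directly. A short illustration, reusing the hypothetical objects from the sketch above:

```r
# Growing a Random Forest is partially stochastic, so two runs on the same
# data give slightly different predicted probabilities.
m.a <- randomForest(f, data = train, ntree = 1000, na.action = na.omit)
m.b <- randomForest(f, data = train, ntree = 1000, na.action = na.omit)
summary(predict(m.a, test, type = "prob")[, "1"] -
        predict(m.b, test, type = "prob")[, "1"])  # small but nonzero

# Fixing the seed before a run would make it exactly repeatable, which the
# posted script deliberately does not do.
set.seed(20140201)  # arbitrary value, shown for illustration only
m.fixed <- randomForest(f, data = train, ntree = 1000, na.action = na.omit)
```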

For the 2013 version of this post, see here. For 2012, here.

UPDATE: In response to a comment, I tried to produce another version of the heat map that more clearly differentiates the quantiles and better reflects the fact that the predicted probabilities for cases outside the top two fifths are all pretty close to zero. The result is shown below. Here, the differences in the shades of gray represent differences in the average predicted probabilities across the five tiers. You can decide whether or not it’s clearer.
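
The quantity behind that shading is simple to compute. Continuing the hypothetical example from above:

```r
# The revised map keys its gray shades to the average predicted probability
# within each fifth, rather than to tier rank alone.
tapply(fcast$p.mean, fcast$tier, mean)
```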