In Applied Forecasting, Keep It Simple

One of the lessons I think I’ve learned from the nearly 15 years I’ve spent developing statistical models to forecast rare political events is: keep it simple unless and until you’re compelled to do otherwise.

The fact that the events we want to forecast emerge from extremely complex systems doesn’t mean that the models we build to forecast them need to be extremely complex as well. In a sense, the unintelligible complexity of the causal processes relieves us from the imperative to follow that path. We know our models can’t even begin to capture the true data-generating process. So, we can and usually should think instead about looking for measures that capture relevant concepts in a coarse way and then use simple model forms to combine those measures.

A few experiences and readings have especially shaped my thinking on this issue.

  • When I worked on the Political Instability Task Force (PITF), my colleagues and I found that a logistic regression model with just four variables did a pretty good job assessing relative risks of a few forms of major political crisis in countries worldwide (see here, or ungated here). In fact, one of the four variables in that model—an indicator that four or more bordering countries have ongoing major armed conflicts—has almost no variance, so it’s effectively a three-variable model. We tried adding a lot of other things that were suggested by a lot of smart people, but none of them really improved the model’s predictive power. (There were also a lot of things we couldn’t even try because the requisite data don’t exist, but that’s a different story.)
  • Toward the end of my time with PITF, we ran a “tournament of methods” to compare the predictive power of several statistical techniques that varied in their complexity, from logistic regression to Bayesian hierarchical models with spatial measures (see here for the write-up). We found that the more complex approaches usually didn’t outperform the simpler ones, and when they did, it wasn’t by much. What mattered most for predictive accuracy was finding the inputs with the most forecasting power. Once we had those, the functional form and complexity of the model didn’t make much difference.
  • As Andreas Graefe describes (here), models that assign equal weights to all predictors often forecast at least as accurately as multiple regression models that estimate weights from historical data. “Such findings have led researchers to conclude that the weighting of variables is secondary for the accuracy of forecasts,” Graefe writes. “Once the relevant variables are included and their directional impact on the criterion is specified, the magnitudes of effects are not very important.”
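
To see why equal weights can hold their own, here is a small simulated illustration in R. The data, weights, and variable names are invented rather than drawn from any of the studies above: fit a regression with estimated weights on one half of the data, build an equal-weights index from the same predictors, and compare the two on the other half.

set.seed(1)
n <- 200
x <- matrix(rnorm(n * 4), ncol = 4)                    # four predictors on a common scale
y <- drop(x %*% c(0.8, 0.6, 0.4, 0.2) + rnorm(n, sd = 2))  # outcome with unequal true weights
train <- 1:100; test <- 101:200

fit <- lm(y[train] ~ x[train, ])                       # weights estimated from the data
pred.reg <- drop(cbind(1, x[test, ]) %*% coef(fit))
pred.eq <- rowMeans(x[test, ])                         # equal weights, signs assumed known

cor(pred.reg, y[test])   # out-of-sample accuracy with estimated weights
cor(pred.eq, y[test])    # often nearly as high with equal weights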

Of course, there will be some situations in which complexity adds value, so it’s worth exploring those ideas when we have a theoretical rationale and the coding skills, data, and time needed to pursue them. In general, though, I am convinced that we should always try simpler forms first and only abandon them if and when we discover that more complex forms significantly increase forecasting power.

Importantly, the evidence for that judgment should come from out-of-sample validation—ideally, from forecasts made about events that hadn’t yet happened. Models with more variables and more complex forms will often score better than simpler ones when applied to the data from which they were derived, but this will usually turn out to be a result of overfitting. If the more complex approach isn’t significantly better at real-time forecasting, it should probably be set aside until it is.
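
A toy example of that pattern, again using simulated data rather than any real model of political instability: add a pile of pure-noise predictors to a simple logistic regression, and the in-sample fit always improves while the out-of-sample fit usually gets worse.

set.seed(2)
n <- 400
x1 <- rnorm(n)
junk <- matrix(rnorm(n * 20), ncol = 20)         # predictors unrelated to the outcome
colnames(junk) <- paste0("junk", 1:20)
d <- data.frame(y = rbinom(n, 1, plogis(-1 + x1)), x1 = x1, junk)
train <- d[1:200, ]; test <- d[201:400, ]

simple <- glm(y ~ x1, family = binomial, data = train)
complex <- glm(y ~ ., family = binomial, data = train)

# in-sample deviance always favors the complex model...
c(simple = deviance(simple), complex = deviance(complex))
# ...but held-out log-likelihood usually favors the simple one
oos <- function(m) sum(dbinom(test$y, 1, predict(m, test, type = "response"), log = TRUE))
c(simple = oos(simple), complex = oos(complex))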

Oh, and a corollary: if you have to choose between a) building more complex models, or even just applying lots of techniques to the same data, and b) testing other theoretically relevant variables for predictive power, do (b).

The Ethics of Political Science in Practice

As citizens and as engaged intellectuals, we all have the right—indeed, an obligation—to make moral judgments and act based on those convictions. As political scientists, however, we have a unique set of potential contributions and constraints. Political scientists do not typically have anything of distinctive value to add to a chorus of moral condemnation or declarations of normative solidarity. What we do have, hopefully, is the methodological training, empirical knowledge and comparative insight to offer informed assessments about alternative courses of action on contentious issues. Our primary ethical commitment as political scientists, therefore must be to get the theory and the empirical evidence right, and to clearly communicate those findings to relevant audiences—however unpalatable or inconclusive they might be.

That’s a manifesto of sorts, nested in a great post by Marc Lynch at the Monkey Cage. Marc’s post focuses on analysis of the Middle East, but everything he writes generalizes to the whole discipline.

I’ve written a couple of posts on this theme, too:

  • “This Is Not a Drill,” on the challenges of doing what Marc proposes in the midst of fast-moving and politically charged events with weighty consequences; and
  • “Advocascience,” on the ways that researchers’ political and moral commitments shape our analyses, sometimes but not always intentionally.

Putting all of those pieces together, I’d say that I wholeheartedly agree with Marc in principle, but I also believe this is extremely difficult to do in practice. We can—and, I think, should—aspire to this posture, but we can never quite achieve it.

That applies to forecasting, too, by the way. Coincidentally, I saw this great bit this morning in the Letter from the Editors for a new special issue of The Appendix, on “futures of the past”:

Prediction is a political act. Imagined futures can be powerful tools for social change, but they can also reproduce the injustices of the present.

Concern about this possibility played a role in my decision to leave my old job, helping to produce forecasts of political instability around the world for private consumption by the U.S. government. It is also part of what attracts me to my current work on a public early-warning system for mass atrocities. By making the same forecasts available to all comers, I hope that we can mitigate that downside risk in an area where the immorality of the acts being considered is unambiguous.

As a social scientist, though, I also understand that we’ll never know for sure what good or ill effects our individual and collective efforts had. We won’t know because we can’t observe the “control” worlds we would need to confidently establish cause and effect, and we won’t know because the world we seek to understand keeps changing, sometimes even in response to our own actions. This is the paradox at the core of applied, empirical social science, and it is inescapable.

Beware the Confident Counterfactual

Did you anticipate the Syrian uprising that began in 2011? What about the Tunisian, Egyptian, and Libyan uprisings that preceded and arguably shaped it? Did you anticipate that Assad would survive the first three years of civil war there, or that Iraq’s civil war would wax again as intensely as it has in the past few days?

All of these events or outcomes were difficult forecasting problems before they occurred, and many observers have been frank about their own surprise at many of them. At the same time, many of those same observers speak with confidence about the causes of those events. The invasion of Iraq in 2003 surely is or is not the cause of the now-raging civil war in that country. The absence of direct US or NATO military intervention in Syria is or is not to blame for continuation of that country’s civil war and the mass atrocities it has brought—and, by extension, the resurgence of civil war in Iraq.

But here’s the thing: strong causal claims require some confidence about how history would have unfolded in the absence of the cause of interest, and those counterfactual histories are no easier to get right than observed history was to anticipate.

Like all of the most interesting questions, what causality means and how we might demonstrate it will forever be matters for debate—see here on Daniel Little’s blog for an overview of that debate’s recent state—but most conceptions revolve around some idea of necessity. When we say X caused Y, we usually mean that had X not occurred, Y wouldn’t have happened, either. Subtler or less stringent versions might center on salience instead of necessity and insert a “probably” into the final phrase of the previous sentence, but the core idea is the same.

In nonexperimental social science, this logic implicitly obliges us to consider the various ways history might have unfolded in response to X’ rather than X. In a sense, then, both prediction and explanation are forecasting problems. They require us to imagine states of the world we have not seen and to connect them in plausible ways to ones we have. If anything, the counterfactual predictions required for explanation are more frustrating epistemological problems than the true forecasts, because we will never get to see the outcome(s) against which we could assess the accuracy of our guesses.

As Robert Jervis pointed out in his contribution to a 1996 edited volume on counterfactual thought experiments in world politics, counterfactuals are (or should be) especially hard to construct—and thus causal claims especially hard to make—when the causal processes of interest involve systems. For Jervis,

A system exists when elements or units are interconnected so that the system has emergent properties—i.e., its characteristics and behavior cannot be inferred from the characteristics and behavior of the units taken individually—and when changes in one unit or the relationship between any two of them produce ramifying alterations in other units or relationships.

As Jervis notes,

A great deal of thinking about causation…is based on comparing two situations that are the same in all ways except one. Any differences in the outcome, whether actual or expected…can be attributed to the difference in the state of the one element…

Under many circumstances, this method is powerful and appropriate. But it runs into serious problems when we are dealing with systems because other things simply cannot be held constant: as Garrett Hardin nicely puts it, in a system, ‘we can never do merely one thing.’

Jervis sketches a few thought experiments to drive this point home. He has a nice one about the effects of external interventions on civil wars that is topical here, but I think his New York traffic example is more resonant:

In everyday thought experiments we ask what would have happened if one element in our world had been different. Living in New York, I often hear people speculate that traffic would be unbearable (as opposed to merely terrible) had Robert Moses not built his highways, bridges, and tunnels. But to try to estimate what things would have been like, we cannot merely subtract these structures from today’s Manhattan landscape. The traffic patterns, the location of businesses and residences, and the number of private automobiles that are now on the streets are in significant measure the product of Moses’s road network. Had it not been built, or had it been built differently, many other things would have been different. Traffic might now be worse, but it is also possible that it would have been better because a more efficient public transportation system would have been developed or because the city would not have grown so large and prosperous without the highways.

Substitute “invade Iraq” or “fail to invade Syria” for Moses’s bridges and tunnels, and I hope you see what I mean.

In the end, it’s much harder to get beyond banal observations about influences to strong claims about causality than our story-telling minds and the popular media that cater to them would like. Of course the invasion of Iraq in 2003 or the absence of Western military intervention in Syria have shaped the histories that followed. But what would have happened in their absence—and, by implication, what would happen now if, for example, the US re-inserted its armed forces into Iraq or attempted to topple Assad? Those questions are far tougher to answer, and we should beware of anyone who speaks with great confidence about their answers. If you’re a social scientist who isn’t comfortable making and confident in the accuracy of your predictions, you shouldn’t be comfortable making and confident in the validity of your causal claims, either.

Conflict Events, Coup Forecasts, and Data Prospecting

Last week, for an upcoming post to the interim blog of the atrocities early-warning project I direct, I got to digging around in ACLED’s conflict event data for the first time. Once I had the data processed, I started wondering if they might help improve forecasts of coup attempts, too. That train of thought led to the preliminary results I’ll describe here, and to a general reminder of the often-frustrating nature of applied statistical forecasting.

ACLED is the Armed Conflict Location & Event Data Project, a U.S. Department of Defense–funded, multi-year endeavor to capture information about instances of political violence in sub-Saharan Africa from 1997 to the present. ACLED’s coders scan an array of print and broadcast sources, identify relevant events from them, and then record those events’ date, location, and form (battle, violence against civilians, or riots/protests); the types of actors involved; whether or not territory changed hands; and the number of fatalities that occurred. Researchers can download all of the project’s data in various formats and structures from the Data page, one of the better ones I’ve seen in political science.

I came to ACLED last week because I wanted to see if violence against civilians in Somalia had waxed, waned, or held steady in recent months. Trying to answer that question with their data meant:

  • Downloading two Excel spreadsheets, Version 4 of the data for 1997-2013 and the Realtime Data file covering (so far) the first five months of this year;
  • Processing and merging those two files, which took a little work because my software had trouble reading the original spreadsheets and the labels and formats differed a bit across them; and
  • Subsetting and summarizing the data on violence against civilians in Somalia, which also took some care because there was an extra space at the end of the relevant label in some of the records.

Once I had done these things, it was easy to generalize the process to the entire data set, producing tables with monthly counts of fatalities and events by type for all African countries since 1997. And, once I had those country-month counts of conflict events, it was easy to imagine using them to try to help forecast coup attempts in the world’s most coup-prone region. Other things being equal, variations across countries and over time in the frequency of conflict events might tell us a little more about the state of politics in those countries, and therefore where and when coup attempts are more likely to happen.
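
For the curious, the gist of that processing looks something like the sketch below. This is not my actual script (that’s on GitHub, linked at the end of this post); it assumes the two spreadsheets have been saved as CSV files with made-up names, and that the columns follow ACLED-style naming (COUNTRY, EVENT_DATE, EVENT_TYPE, FATALITIES) with a day/month/year date format, both of which may differ in the real files.

# hypothetical file names; the real spreadsheets come from ACLED's Data page
v4 <- read.csv("acled_version4_1997-2013.csv", stringsAsFactors = FALSE)
rt14 <- read.csv("acled_realtime_2014.csv", stringsAsFactors = FALSE)
cols <- c("COUNTRY", "EVENT_DATE", "EVENT_TYPE", "FATALITIES")
acled <- rbind(v4[, cols], rt14[, cols])

# fix the stray trailing space found in some event-type labels
acled$EVENT_TYPE <- sub("\\s+$", "", acled$EVENT_TYPE)
# assumed date format; check against the actual files
acled$month <- format(as.Date(acled$EVENT_DATE, format = "%d/%m/%Y"), "%Y-%m")

# country-month counts of events and sums of fatalities by event type
events <- aggregate(FATALITIES ~ COUNTRY + month + EVENT_TYPE, data = acled, FUN = length)
deaths <- aggregate(FATALITIES ~ COUNTRY + month + EVENT_TYPE, data = acled, FUN = sum)
names(events)[4] <- "events"; names(deaths)[4] <- "deaths"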

Well, in this case, it turns out they don’t tell us much more. The plot below shows ROC curves and the areas under those curves for the out-of-sample predictions from a five-fold cross-validation exercise involving a few country-month models of coup attempts. The Base Model includes: national political regime type (the categorization scheme from PITF’s global instability model applied to Polity 3d, the spell-file version); time since last change in Polity score (in days, logged); infant mortality rate (relative to the annual global median, logged); and an indicator for any coup attempts in the previous 24 months (yes/no). The three other models add logged sums of counts of ACLED events by type—battles, violence against civilians, or riots/protests—in the same country over the previous three, six, or 12 months, respectively. These are all logistic regression models, and the dependent variable is a binary one indicating whether or not any coup attempts (successful or failed) occurred in that country during that month, according to Powell and Thyne.
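
Here is a stripped-down sketch of that exercise. It is not the actual code (that’s on GitHub, linked at the end of this post); it assumes a country-month data frame called dat with a binary coup indicator and predictors with the made-up names used below, and it uses the pROC package for the AUC calculation.

library(pROC)   # for roc() and auc()

set.seed(20140605)
dat$fold <- sample(rep(1:5, length.out = nrow(dat)))   # assign country-months to five folds

cv_auc <- function(f, data) {
  p <- rep(NA_real_, nrow(data))
  for (k in 1:5) {
    fit <- glm(f, family = binomial, data = data[data$fold != k, ])
    p[data$fold == k] <- predict(fit, newdata = data[data$fold == k, ], type = "response")
  }
  as.numeric(auc(roc(data$coup, p)))
}

base <- coup ~ regime.type + log1p(polity.change.days) + log(infant.mortality.rel) + coup.prior24
cv_auc(base, dat)                                         # Base Model
cv_auc(update(base, . ~ . + log1p(battles.prior6)), dat)  # add an ACLED battle count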

ROC Curves and AUC Scores from Five-Fold Cross-Validation of Coup Models Without and With ACLED Event Counts

As the chart shows, adding the conflict event counts to the base model seems to buy us a smidgen more discriminatory power, but not enough to have confidence that they would routinely lead to more accurate forecasts. Intriguingly, the crossing of the ROC curves suggests that the base model, which emphasizes structural conditions, is actually a little better at identifying the most coup-prone countries. The addition of conflict event counts to the model leads to some under-prediction of coups in that high-risk set, but the balance tips the other way in countries with less structural vulnerability. In the aggregate, though, there is virtually no difference in discriminatory power between the base model and the ones that add the conflict event counts.

There are, of course, many other ways to group and slice ACLED’s data, but the rarity of coups leads me to believe that narrower cuts or alternative operationalizations aren’t likely to produce stronger predictive signals. In Africa since 1997, there are only 36 country-months with coup attempts, according to Powell and Thyne. When the events are this rare and complex and the examples this few, there’s really not much point in going beyond the most direct measures. Under these circumstances, we’re unlikely to discover finer patterns, and if we do, we probably shouldn’t have much confidence in them. There are also other models and techniques to try, but I’m dubious for the same reasons. (FWIW, I did try Random Forests and got virtually identical accuracy.)

So those are the preliminary results from this specific exercise. (The R scripts I used are on GitHub, here.) I think those results are interesting in their own right, but the process involved in getting to them is also a great example of the often-frustrating nature of applied statistical forecasting. I spent a few hours each day for three days straight getting from the thought of exploring ACLED to the results described here. Nearly all of that time was spent processing data; only the last half-hour or so involved any modeling. As is often the case, a lot of that data-processing time was really just me staring at my monitor trying to think of another way to solve some problem I’d already tried and failed to solve.

In my experience, that kind of null result is where nearly all statistical forecasting ideas end. Even when you’re lucky enough to have the data to pursue them, few of your ideas pan out. But panning is the right metaphor, I think. Most of the work is repetitive and frustrating, but every so often you catch a nice nugget. Those nuggets tempt you to keep looking for more, and once in a great while, they can make you rich.

Ripple Effects from Thailand’s Coup

Thailand just had another coup, its first since 2006 but its twelfth since 1932. Here are a few things statistical analysis tells us about how that coup is likely to reverberate through Thailand’s economy and politics for the next few years.

1. Economic growth will probably suffer a bit more. Thailand’s economy was already struggling in 2014, thanks in part to the political instability to which the military leadership was reacting. Still, a statistical analysis I did a few years ago indicates that the coup itself will probably impose yet more drag on the economy. When we compare annual GDP growth rates from countries that suffered coups to similarly susceptible ones that didn’t, we see an average difference of about 2 percentage points in the year of the coup and another 1 percentage point the year after. (See this FiveThirtyEight post for a nice plot and discussion of those results.) Thailand might find its way to the “good” side of the distribution underlying those averages, but the central tendency suggests an additional knock on the country’s economy.

2. The risk of yet another coup will remain elevated for several years. The “coup trap” is real. Countries that have recently suffered successful or failed coup attempts are more likely to get hit again than ones that haven’t. This increase in risk seems to persist for several years, so Thailand will probably stick toward the top of the global watch list for these events until at least 2019.

3. Thailand’s risk of state-led mass killing has nearly tripled…but remains modest. The risk and occurrence of coups and the character of a country’s national political regime feature prominently in the multimodel ensemble we’re using in our atrocities early-warning project to assess risks of onsets of state-led mass killing. When I recently updated those assessments using data from year-end 2013—coming soon to a blog near you!—Thailand remained toward the bottom of the global distribution: 100th of 162 countries, with a predicted probability of just 0.3%. If I alter the inputs to that ensemble to capture the occurrence of this week’s coup and its effect on Thailand’s regime type, the predicted probability jumps to about 0.8%.

That’s a big change in relative risk, but it’s not enough of a change in absolute risk to push the country into the end of the global distribution where the vast majority of these events occur. In the latest assessments, a risk of 0.8% would have placed Thailand about 50th in the world, still essentially indistinguishable from the many other countries in that long, thin tail. Even with changes in these important risk factors and an ongoing insurgency in its southern provinces, Thailand remains in the vast bloc of countries where state-led mass killing is extremely unlikely, thanks (statistically speaking) to its relative wealth, the strength of its connection to the global economy, and the absence of certain markers of atrocities-prone regimes.
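
Mechanically, that kind of what-if assessment is just a matter of editing the inputs and re-running the prediction. Here is a toy version with a single logistic regression; the real assessments come from a multimodel ensemble, and the object and variable names below are invented for illustration.

thailand <- risk.inputs[risk.inputs$country == "Thailand", ]   # one row of current inputs
scenario <- thailand
scenario$coup.recent <- 1              # record this week's successful coup
scenario$regime.type <- "autocracy"    # hypothetical recoding of the regime variable

p <- predict(fit, newdata = rbind(thailand, scenario), type = "response")
p                # e.g., roughly 0.003 vs. 0.008, as in the assessments described above
p[2] / p[1]      # a big jump in relative risk from a small absolute base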

4. Democracy will probably be restored within the next few years… As Henk Goemans and Nikolay Marinov show in a paper published last year in the British Journal of Political Science, since the end of the Cold War, most coups have been followed within a few years by competitive elections. The pattern they observe is even stronger in countries that have at least seven years of democratic experience and have held at least two elections, as Thailand does and has. In a paper forthcoming in Foreign Policy Analysis that uses a different measure of coups, Jonathan Powell and Clayton Thyne see that same broad pattern. After the 2006 coup, it took Thailand a little over a year to get back to competitive elections for a civilian government under a new constitution. If anything, I would expect this junta to move a little faster, and I would be very surprised if the same junta was still ruling in 2016.

5. …but it could wind up right back here again after that. As implied by nos. 1 and 2 above, however, the resumption of democracy wouldn’t mean that Thailand won’t repeat the cycle again. Both statistical and game-theoretic models indicate that prospects for yet another democratic breakdown will stay relatively high as long as Thai politics remains sharply polarized. My knowledge of Thailand is shallow, but the people I read or follow who know the country much better skew pessimistic on the prospects for this polarization ending soon. From afar, I wonder if it’s ultimately a matter of generational change and suspect that Thailand will finally switch to a stable and less contentious equilibrium when today’s conservative leaders start retiring from their jobs in the military and bureaucracy and age out of street politics.

Military Coup in Thailand

This morning here but this afternoon in Thailand, the country’s military leadership sealed the deal on a coup d’etat when it announced via national television that it was taking control of government.

The declaration of martial law that came two days earlier didn’t quite qualify as a coup because it didn’t involve a seizure of power. Most academic definitions of coups involve (1) the use or threat of force (2) by political insiders, that is, people inside government or state security forces (3) to seize national political power. Some definitions also specify that the putschists’ actions must be illegal or extra-constitutional. The declaration of martial law certainly involved the use or threat of force by political insiders, but it did not entail a direct grab for power and technically was not even illegal.

Today’s announcement checks those last boxes. Frankly, I’m a bit surprised by this turn of events, but not shocked. In my statistical assessments of coup risk for 2014, Thailand ranked 10th, clearly putting it among the highest-risk countries in the world. In December, though, I judged from a distance that the country’s military leadership probably didn’t want to take ownership of this situation unless things got significantly worse:

The big question now is whether or not the military leadership will respond as desired [by anti-government forces angling for a coup]. They would be very likely to do so if they coveted power for themselves, but I think it’s pretty clear from their actions that many of them don’t. I suspect that’s partly because they saw after 2006 that seizing power didn’t really fix anything and carried all kinds of additional economic and reputational costs. If that’s right, then the military will only seize power again if the situation degenerates enough to make the costs of inaction even worse—say, into sustained fighting between rival factions, like we see in Bangladesh right now.

I guess the growing concerns about an impending civil war and economic recession were finally enough to tip military leaders’ thinking in favor of action. Here’s hoping the final forecast I offered in that December post comes true:

Whatever happens this time around, though, the good news is that within a decade or so, Thai politics will probably stabilize into a new normal in which the military no longer acts directly in politics and parts of what’s now Pheu Thai and its coalition compete against each other and the remnants of today’s conservative forces for power through the ballot box.

Galton’s Experiment Revisited

This is another cross-post from the blog of the Good Judgment Project.

One of my cousins, Steve Ulfelder, writes good mystery novels. He left a salaried writer’s job 13 years ago to freelance and make time to pen those books. In March, he posted this announcement on Facebook:

CONTEST! When I began freelancing, I decided to track the movies I saw to remind myself that this was a nice bennie you can’t have when you’re an employee (I like to see early-afternoon matinees in near-empty theaters). I don’t review them or anything; I simply keep a Word file with dates and titles.

Here’s the contest: How many movies have I watched in the theater since January 1, 2001? Type your answer as a comment. Entries close at 8pm tonight, east coast time. Closest guess gets a WOLVERINE BROS. T-shirt and a signed copy of the Conway Sax novel of your choice. The eminently trustworthy Martha Ruch Ulfelder is holding a slip of paper with the correct answer.

I read that post and thought: Now, that’s my bag. I haven’t seen Steve in a while and didn’t have a clear idea of how many movies he’s seen in the past 13 years, but I do know about Francis Galton and that ox at the county fair. Instead of just hazarding a guess of my own, I would give myself a serious shot at outwitting Steve’s Facebook crowd by averaging their guesses.

After a handful of Steve’s friends had submitted answers, I posted the average of them as a comment of my own, then updated it periodically as more guesses came in. I had to leave the house not long before the contest was set to close, so I couldn’t include the last few entrants in my final answer. Still, I had about 40 guesses in my tally at the time and was feeling pretty good about my chances of winning that t-shirt and book.

In the end, 45 entries got posted before Steve’s 8 PM deadline, and my unweighted average wasn’t even close. The histogram below shows the distribution of the crowd’s guesses and the actual answer. Most people guessed fewer than 300 movies, but a couple of extreme values on the high side pulled the average up to 346. Meanwhile, the correct answer was 607, nearly one standard deviation (286) above that mean. I hadn’t necessarily expected to win, but I was surprised to see that 12 of the 45 guesses—including the winner at 600—landed closer to the truth than the average did.

Histogram of the crowd’s guesses vs. the actual answer
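
The arithmetic is simple enough to reproduce. The vector below is invented to stand in for the 45 actual guesses, which I haven’t reproduced here, but the steps are the same ones behind the numbers above.

guesses <- c(120, 150, 175, 200, 225, 250, 260, 275, 300, 310, 350, 400, 450, 600, 1200)  # made-up entries
truth <- 607

mean(guesses)                                            # the crowd's unweighted average
sd(guesses)                                              # spread of the guesses
sum(abs(guesses - truth) < abs(mean(guesses) - truth))   # how many individuals beat the average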

I read the results of my impromptu experiment as a reminder that crowds are often smart, but they aren’t magically clairvoyant. Retellings of Galton’s experiment sometimes make it seem like even pools of poorly informed guessers will automatically produce an accurate estimate, but, apparently, that’s not true.

As I thought about how I might have done better, I got to wondering if there was something about Galton’s crowd that made it particularly powerful for his task. Maybe we should expect a bunch of county fair–goers in early twentieth-century England to be good at guessing the weight of farm animals. Still, the replication of Galton’s experiment under various conditions suggests that domain knowledge helps, but it isn’t essential. So maybe this was just an unusually hard problem. Steve has seen an average of nearly one movie in theaters each week for the past 13 years. In my experience, that’s pretty extreme, so even with the hint he dropped in his post about being a frequent moviegoer, it’s easy to see why the crowd would err on the low side. Or maybe this result was just a fluke, and if we could rerun the process with different or larger pools, the average would usually do much better.

Whatever the reason for this particular failure, though, the results of my experiment also got me thinking again about ways we might improve on the unweighted average as a method of gleaning intelligence from crowds. Unweighted averages are a reasonable strategy when we don’t have reliable information about variation in the quality of the individual guesses (see here), but that’s not always the case. For example, if Steve’s wife or kids had posted answers in this contest, it probably would have been wise to give their guesses more weight on the assumption that they knew better than acquaintances or distant relatives like me.

Figuring out smarter ways to aggregate forecasts is also an area of active experimentation for the Good Judgment Project (GJP), and the results so far are encouraging. The project’s core strategy involves discovering who the most accurate forecasters are and leaning more heavily on them. I couldn’t do this in Steve’s single-shot contest, but GJP gets to see forecasters’ track records on large numbers of questions and has been using them to great effect. In the recently-ended Season 3, GJP’s “super forecasters” were grouped into teams and encouraged to collaborate, and this approach has worked quite well. In a paper published this spring, GJP has also shown that they can do well with nonlinear aggregations derived from a simple statistical model that adjusts for systematic bias in forecasters’ judgments. Team GJP’s bias-correction model beats not only the unweighted average but also a number of widely-used and more complex nonlinear algorithms.
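
To make those two ideas concrete, here is a toy version of each: a performance-weighted average, and one simple form of nonlinear recalibration (often called extremizing) that pushes the pooled probability away from 0.5. The forecasts, weights, and exponent are invented for the example, and this is not a description of GJP’s actual model.

p <- c(0.55, 0.60, 0.70, 0.40, 0.65)   # hypothetical forecasts on one question
w <- c(2, 2, 1, 1, 1)                  # heavier weights for forecasters with better track records

pooled <- weighted.mean(p, w)          # performance-weighted average
a <- 2                                 # recalibration exponent (invented)
pooled^a / (pooled^a + (1 - pooled)^a) # "extremized" pooled forecast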

Those are just a couple of the possibilities that are already being explored, and I’m sure people will keep coming up with new and occasionally better ones. After all, there’s a lot of money to be made and bragging rights to be claimed in those margins. In the meantime, we can use Steve’s movie-counting contest to remind ourselves that crowds aren’t automatically as clairvoyant as we might hope, so we should keep thinking about ways to do better.

Early Results from a New Atrocities Early Warning System

For the past couple of years, I have been working as a consultant to the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide to help build a new early-warning system for mass atrocities around the world. Six months ago, we started running the second of our two major forecasting streams, a “wisdom of (expert) crowds” platform that aggregates probabilistic forecasts from a pool of topical and area experts on potential events of concern. (See this conference paper for more detail.)

The chart below summarizes the output from that platform on most of the questions we’ve asked so far about potential new episodes of mass killing before 2015. For our early-warning system, we define a mass killing as an episode of sustained violence in which at least 1,000 noncombatant civilians from a discrete group are intentionally killed, usually in a period of a year or less. Each line in the chart shows change over time in the daily average of the inputs from all of the participants who choose to make a forecast on that question. In other words, the line is a mathematical summary of the wisdom of our assembled crowd—now numbering nearly 100—on the risk of a mass killing beginning in each case before the end of 2014. Also:

  • Some of the lines (e.g., South Sudan, Iraq, Pakistan) start further to the right than others because we did not ask about those cases when the system launched but instead added them later, as we continue to do.
  • Two lines—Central African Republic and South Sudan—end early because we saw onsets of mass-killing episodes in those countries. The asterisks indicate the dates on which we made those declarations and therefore closed the relevant questions.
  • Most but not all of these questions ask specifically about state-led mass killings, and some focus on specific target groups (e.g., the Rohingya in Burma) or geographic regions (the North Caucasus in Russia) as indicated.

Crowd-Estimated Probabilities of Mass-Killing Onset Before 1 January 2015

I look at that chart and conclude that this process is working reasonably well so far. In the six months since we started running this system, the two countries that have seen onsets of mass killing are both ones that our forecasters promptly and consistently put on the high side of 50 percent. Nearly all of the other cases, where mass killings haven’t yet occurred this year, have stuck on the low end of the scale.

I’m also gratified to see that the system is already generating the kind of dynamic output we’d hoped it would, even with fewer than 100 forecasters in the pool. In the past several weeks, the forecasts for both Burma and Iraq have risen sharply, apparently in response to shifts in relevant policies in the former and the escalation of the civil war in the latter. Meanwhile, the forecast for Uighurs in China has risen steadily over the year as a separatist rebellion in Xinjiang Province has escalated and, with it, concerns about a harsh government response. These inflection points and trends can help identify changes in risk that warrant attention from organizations and individuals concerned about preventing or mitigating these potential atrocities.

Finally, I’m also intrigued to see that our opinion pool seems to be sorting cases into a few clusters that could be construed as distinct tiers of concern. Here’s what I have in mind:

  • Above the 50-percent threshold are the high risk cases, where forecasters assess that mass killing is likely to occur during the specified time frame. These cases won’t necessarily be surprising. Some observers had been warning about the risk of mass atrocities in CAR and South Sudan for months before those episodes began, and the plight of the Rohingya in Burma has been a focal point for many advocacy groups in the past year. Even in supposedly “obvious” cases, however, this system can help by providing a sharper estimate of that risk and giving a sense of how it is trending over time. In the case of Burma, for example, it is the separation that has happened in the last several weeks that tells the story of a switch from possible to likely and thus adds a degree of urgency to that warning.
  • A little farther down the y-axis are the moderate risk cases—ones that probably won’t suffer mass killing during the period in question but could more readily tip in that direction. In the chart above, Iraq, Sudan, Pakistan, Bangladesh, and Burundi all land in this tier, although Iraq now appears to be sliding into the high risk group.
  • Clustered toward the bottom are the low risk cases where the forecasters seem fairly confident that mass killing will not occur in the near future. In the chart above, Russia, Afghanistan, and Ethiopia are the cases that land firmly in this set. China (Uighurs) remains closer to this group than to the moderate risk tier, but it appears to be creeping upward. We are also running a question about the risk of state-led mass killing in Rwanda before 2015, and it currently lands in this tier, with a forecast of 14 percent.

The system that generates the data behind this chart is password protected, but the point of our project is to make these kinds of forecasts freely available to the global public. We are currently building the web site that will display the forecasts from this opinion pool in real time to all comers and hope to have it ready this fall.

In the meantime, if you think you have relevant knowledge or expertise—maybe you study or work on this topic, or maybe you live or work in parts of the world where risks tend to be higher—and are interested in volunteering as a forecaster, please send an email to us at ewp@ushmm.org.

Asking the Right Questions

This is a cross-post from the Good Judgment Project’s blog.

I came to the Good Judgment Project (GJP) two years ago, in Season 2, as a forecaster, excited about contributing to an important research project and curious to learn more about my skill at prediction. I did pretty well at the latter, and GJP did very well at the former. I’m also a political scientist who happened to have more time on my hands than many of my colleagues, because I work as an independent consultant and didn’t have a full plate at that point. So, in Season 3, the project hired me to work as one of its lead question writers.

Going into that role, I had anticipated that one of the main challenges would be negotiating what Phil Tetlock calls the “rigor-relevance trade-off”—finding questions that are relevant to the project’s U.S. government sponsors and can be answered as unambiguously as possible. That forecast was correct, but even armed with that information, I failed to anticipate just how hard it often is to strike this balance.

The rigor-relevance trade-off exists because most of the big questions about global politics concern latent variables. Sometimes we care about specific political events because of their direct consequences, but more often we care about those events because of what they reveal to us about deeper forces shaping the world. For example, we can’t just ask if China will become more cooperative or more belligerent, because cooperation and belligerence are abstractions that we can’t directly observe. Instead, we have to find events or processes that (a) we can see and (b) that are diagnostic of that latent quality. For example, we can tell when China issues another statement reiterating its claim to the Senkaku Islands, but that happens a lot, so it doesn’t give us much new information about China’s posture. If China were to fire on Japanese aircraft or vessels in the vicinity of the islands—or, for that matter, to renounce its claim to them—now that would be interesting.

It’s tempting to forego some rigor to ask directly about the latent stuff, but it’s also problematic. For the forecast’s consumers, we need to be able to explain clearly what a forecast does and does not cover, so they can use the information appropriately. As forecasters, we need to understand what we’re being asked to anticipate so we can think clearly about the forces and pathways that might or might not produce the relevant outcome. And then there’s the matter of scoring the results. If we can’t agree on what eventually happened, we won’t agree on the accuracy of the predictions. Then the consumers don’t know how reliable those forecasts are, the producers don’t get the feedback they need, and everyone gets frustrated and demotivated.

It’s harder to formulate rigorous questions than many people realize until they try to do it, even on things that seem like they should be easy to spot. Take coups. It’s not surprising that the U.S. government might be keen on anticipating coups in various countries for various reasons. It is, however, surprisingly hard to define a “coup” in such a way that virtually everyone would agree on whether or not one had occurred.

In the past few years, Egypt has served up a couple of relevant examples. Was the departure of Hosni Mubarak in 2011 a coup? On that question, two prominent scholarly projects that use similar definitions to track coups and coup attempts couldn’t agree. Where one source saw an “overt attempt by the military or other elites within the state apparatus to unseat the sitting head of state using unconstitutional means,” the other saw the voluntary resignation of a chief executive due to a loss of his authority and a prompt return to civilian-led government. And what about the ouster of Mohammed Morsi in July 2013? On that, those academic sources could readily agree, but many Egyptians who applauded Morsi’s removal—and, notably, the U.S. government—could not.

We see something similar on Russian military intervention in Ukraine. Not long after Russia annexed Crimea, GJP posted a question asking whether or not Russian armed forces would invade the eastern Ukrainian cities of Kharkiv or Donetsk before 1 May 2014. The arrival of Russian forces in Ukrainian cities would obviously be relevant to U.S. policy audiences, and with Ukraine under such close international scrutiny, it seemed like that turn of events would be relatively easy to observe as well.

Unfortunately, that hasn’t been the case. As Mark Galeotti explained in a mid-April blog post,

When the so-called “little green men” deployed in Crimea, they were very obviously Russian forces, simply without their insignia. They wore Russian uniforms, followed Russian tactics and carried the latest, standard Russian weapons.

However, the situation in eastern Ukraine is much less clear. U.S. Secretary of State John Kerry has asserted that it was “clear that Russian special forces and agents have been the catalyst behind the chaos of the last 24 hours.” However, it is hard to find categorical evidence of this.

Even evidence that seemed incontrovertible when it emerged, like video of a self-proclaimed Russian lieutenant colonel in the Ukrainian city of Horlivka, has often been debunked.

This doesn’t mean we were wrong to ask about Russian intervention in eastern Ukraine. If anything, the intensity of the debate over whether or not that’s happened simply confirms how relevant this topic was. Instead, it implies that we chose the wrong markers for it. We correctly anticipated that further Russian intervention was possible if not probable, but we—like many others—failed to anticipate the unconventional forms that intervention would take.

Both of these examples show how hard it can be to formulate rigorous questions for forecasting tournaments, even on topics that are of keen interest to everyone involved and seem like naturals for the task. In an ideal world, we could focus exclusively on relevance and ask directly about all the deeper forces we want to understand and anticipate. As usual, though, that ideal world isn’t the one we inhabit. Instead, we struggle to find events and processes whose outcomes we can discern that will also reveal something telling about those deeper forces at play.

 

Alarmed By Iraq

Iraq’s long-running civil war has spread and intensified again over the past year, and the government’s fight against a swelling Sunni insurgency now threatens to devolve into the sort of indiscriminate reprisals that could produce a new episode of state-led mass killing there.

The idea that Iraq could suffer a new wave of mass atrocities at the hands of state security forces or sectarian militias collaborating with them is not far-fetched. According to statistical risk assessments produced for our atrocities early-warning project (here), Iraq is one of the 10 countries worldwide most susceptible to an onset of state-led mass killing, bracketed by places like Syria, Sudan, and the Central African Republic, where large-scale atrocities and even genocide are already underway.

Of course, Iraq is already suffering mass atrocities of its own at the hands of insurgent groups who routinely kill large numbers of civilians in indiscriminate attacks, every one of which would stun American or European publics if it happened there. According to the widely respected Iraq Body Count project, the pace of civilian killings in Iraq accelerated sharply in July 2013 after a several-year lull of sorts in which “only” a few hundred civilians were dying from violence each month. Since the middle of last year, the civilian toll has averaged more than 1,000 fatalities per month. That’s well off the pace of 2006-2007, the peak period of civilian casualties under Coalition occupation, but it’s still an astonishing level of violence.

Monthly Counts of Civilian Deaths from Violence in Iraq (Source: Iraq Body Count)

What seems to be increasing now is the risk of additional atrocities perpetrated by the very government that is supposed to be securing civilians against those kinds of attacks. A Sunni insurgency is gaining steam, and the government, in turn, is ratcheting up its efforts to quash the growing threat to its power in worrisome ways. A recent Reuters story summarized the current situation:

In Buhriz and other villages and towns encircling the capital, a pitched battle is underway between the emboldened Islamic State of Iraq and the Levant, the extremist Sunni group that has led a brutal insurgency around Baghdad for more than a year, and Iraqi security forces, who in recent months have employed Shi’ite militias as shock troops.

And this anecdote from the same Reuters story shows how that battle is sometimes playing out:

The Sunni militants who seized the riverside town of Buhriz late last month stayed for several hours. The next morning, after the Sunnis had left, Iraqi security forces and dozens of Shi’ite militia fighters arrived and marched from home to home in search of insurgents and sympathizers in this rural community, dotted by date palms and orange groves.

According to accounts by Shi’ite tribal leaders, two eyewitnesses and politicians, what happened next was brutal.

“There were men in civilian clothes on motorcycles shouting ‘Ali is on your side’,” one man said, referring to a key figure in Shi’ite tradition. “People started fleeing their homes, leaving behind the elders and young men and those who refused to leave. The militias then stormed the houses. They pulled out the young men and summarily executed them.”

Sadly, this escalatory spiral of indiscriminate violence is not uncommon in civil wars. Ben Valentino, a collaborator of mine in the development of this atrocities early-warning project, has written extensively on this topic (see especially here, here, and here). As Ben explained to me via email,

The relationship between counter-insurgency and mass violence against civilians is one of the most well-established findings in the social science literature on political violence. Not all counter-insurgency campaigns lead to mass killing, but when insurgent groups become large and effective enough to seriously threaten the government’s hold on power and when the rebels draw predominantly on local civilians for support, the risks of mass killing are very high. Usually, large-scale violence against civilians is neither the first nor the only tactic that governments use to defeat insurgencies. They may try to focus operations primarily against armed insurgents, or even offer positive incentives to civilians who collaborate with the government. But when less violent methods fail, the temptation to target civilians in the effort to defeat the rebels increases.

Right now, it’s hard to see what’s going to halt or reverse this trend in Iraq. “Things can get much worse from where we are, and more than likely they will,” Daniel Serwer told IRIN News for a story on Iraq’s escalating conflict (here). Other observers quoted in the same story seemed to think that conflict fatigue would keep the conflict from ballooning further, but that hope is hard to square with the escalation of violence that has already occurred over the past year and the fact that Iraq’s civil war never really ended.

In theory, elections are supposed to be a brake on this process, giving rival factions opportunities to compete for power and influence state policy in nonviolent ways. In practice, this often isn’t the case. Instead, Iraq appears to be following the more conventional path in which election winners focus on consolidating their own power instead of governing well, and excluded factions seek other means to advance their interests. Here’s part of how the New York Times set the scene for this week’s elections, which incumbent prime minister Nouri al-Maliki’s coalition is apparently struggling to win:

American intelligence assessments have found that Mr. Maliki’s re-election could increase sectarian tensions and even raise the odds of a civil war, citing his accumulation of power, his failure to compromise with other Iraqi factions—Sunni or Kurd—and his military failures against Islamic extremists. On his watch, Iraq’s American-trained military has been accused by rights groups of serious abuses as it cracks down on militants and opponents of Mr. Maliki’s government, including torture, indiscriminate roundups of Sunnis and demands of bribes to release detainees.

Because Iraq ranked so high in our last statistical risk assessments, we posted a question about it a few months ago on our “wisdom of (expert) crowds” forecasting system. Our pool of forecasters is still relatively small—89 as I write this—but the ones who have weighed in on this topic so far have put it in what I see as a middle tier of concern, where the risk is seen as substantial but not imminent or inevitable. Since January, the pool’s estimated probability of an onset of state-led mass killing in Iraq in 2014 has hovered around 20 percent, alongside countries like Pakistan (23 percent), Bangladesh (20 percent), and Burundi (19 percent) but well behind South Sudan (above 80 percent since December) and Myanmar (43 percent for the risk of a mass killing targeting the Rohingya in particular).

Notably, though, the estimate for Iraq has ticked up a few notches in the past few days to 27 percent as forecasters (including me) have read and discussed some of the pre-election reports mentioned here. I think we are on to something that deserves more scrutiny than it appears to be getting.
