A New Statistical Approach to Assessing Risks of State-Led Mass Killing

Which countries around the world are currently at greatest risk of an onset of state-led mass killing? At the start of the year, I posted results from a wiki survey that asked this question. Now, here in heat-map form are the latest results from a rejiggered statistical process with the same target. You can find a dot plot of these data at the bottom of the post, and the data and code used to generate them are on GitHub.

Estimated Risk of New Episode of State-Led Mass Killing

These assessments represent the unweighted average of probabilistic forecasts from three separate models trained on country-year data covering the period 1960-2011. In all three models, the outcome of interest is the onset of an episode of state-led mass killing, defined as any episode in which the deliberate actions of state agents or other organizations kill at least 1,000 noncombatant civilians from a discrete group. The three models are:

  • PITF/Harff. A logistic regression model approximating the structural model of genocide/politicide risk developed by Barbara Harff for the Political Instability Task Force (PITF). In its published form, the Harff model only applies to countries already experiencing civil war or adverse regime change and produces a single estimate of the risk of a genocide or politicide occurring at some time during that crisis. To build a version of the model that was more dynamic, I constructed an approximation of the PITF’s global model for forecasting political instability and used the natural log of the predicted probabilities it produces as an additional input to the Harff model. This approach mimics the one used by Harff and Ted Gurr in their ongoing application of the genocide/politicide model for risk assessment (see here).
  • Elite Threat. A logistic regression model that uses the natural log of predicted probabilities from two other logistic regression models—one of civil-war onset, the other of coup attempts—as its only inputs. This model is meant to represent the argument put forth by Matt Krain, Ben Valentino, and others that states usually engage in mass killing in response to threats to ruling elites’ hold on power.
  • Random Forest. A machine-learning technique (see here) applied to all of the variables used in the two previous models, plus a few others of possible relevance, using the ‘randomForest’ package in R. A couple of parameters were tuned on the basis of a gridded comparison of forecast accuracy in 10-fold cross-validation.

The Random Forest proved to be the most accurate of the three models in stratified 10-fold cross-validation. The chart below is a kernel density plot of the areas under the ROC curve for the out-of-sample estimates from that cross-validation drill. As the chart shows, the average AUC for the Random Forest was in the low 0.80s, compared with the high 0.70s for the PITF/Harff and Elite Threat models. As expected, the average of the forecasts from all three performed even better than the best single model, albeit not by much. These out-of-sample accuracy rates aren’t mind-blowing, but they aren’t bad either, and they are as good as or better than many of the ones I’ve seen from similar efforts to anticipate the onset of rare political crises in countries worldwide.


Distribution of Out-of-Sample AUC Scores by Model in 10-Fold Cross-Validation
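
For readers who want to see the mechanics, here is a minimal sketch of the two ingredients of that drill: assigning cases to stratified folds and scoring out-of-sample predictions by AUC. It is written in Python rather than the R used for the actual analysis, and the function names `auc` and `stratified_folds` are hypothetical, not taken from the posted code.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case gets a
    higher score than a randomly chosen negative case (ties count half).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k, seed=0):
    """Assign each case to one of k folds, spreading rare onsets evenly.

    Assumes labels are coded 0 (no onset) or 1 (onset).
    """
    rng = random.Random(seed)
    folds = [None] * len(labels)
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[i] = j % k
    return folds
```

Stratifying matters here because onsets are so rare that a purely random split could leave some folds with no positive cases at all, in which case the AUC for that fold is undefined.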

The decision to use an unweighted average for the combined forecast might seem simplistic, but it’s actually a principled choice in this instance. When examples of the event of interest are hard to come by and we have reason to believe that the process generating those events may be changing over time, sticking with an unweighted average is a reasonable hedge against risks of over-fitting the ensemble to the idiosyncrasies of the test set used to tune it. For a longer discussion of this point, see pp. 7-8 in the last paper I wrote on this work and the paper by Andreas Graefe referenced therein.
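
As a rough illustration of how little machinery the unweighted average requires, here is a Python sketch. The function name and the probabilities are hypothetical and chosen only to make the arithmetic easy to follow; they are not output from the actual models.

```python
def ensemble_average(*model_probs):
    """Unweighted mean of per-country probabilities from several models.

    Each argument is a dict mapping country name -> predicted probability.
    """
    countries = set(model_probs[0])
    for probs in model_probs[1:]:
        if set(probs) != countries:
            raise ValueError("all models must score the same countries")
    n = len(model_probs)
    return {c: sum(p[c] for p in model_probs) / n for c in sorted(countries)}


# Hypothetical values, for illustration only.
harff = {"A": 0.02, "B": 0.10}
elite_threat = {"A": 0.04, "B": 0.06}
random_forest = {"A": 0.03, "B": 0.14}

combined = ensemble_average(harff, elite_threat, random_forest)
```

Because there are no weights to estimate, there is nothing in this step that can be over-fit to a small number of observed onsets.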

Any close readers of my previous work on this topic over the past couple of years (see here and here) will notice that one model has been dropped from the last version of this ensemble, namely, the one proposed by Michael Colaresi and Sabine Carey in their 2008 article, “To Kill or To Protect” (here). As I was reworking my scripts to make regular updating easier (more on that below), I paid closer attention than I had before to the fact that the Colaresi and Carey model requires a measure of the size of state security forces that is missing for many country-years. In previous iterations, I had worked around that problem by using a categorical version of this variable that treated missingness as a separate category, but this time I noticed that there were fewer than 20 mass-killing onsets in country-years for which I had a valid observation of security-force size. With so few examples, we’re not going to get reliable estimates of any pattern connecting the two. As it happened, this model—which, to be fair to its authors, was not designed to be used as a forecasting device—was also by far the least accurate of the lot in 10-fold cross-validation. Putting two and two together, I decided to consign this one to the scrap heap for now. I still believe that measures of military forces could help us assess risks of mass killing, but we’re going to need more and better data to incorporate that idea into our multimodel ensemble.

The bigger and in some ways more novel change from previous iterations of this work concerns the unorthodox approach I’m now using to make the risk assessments as current as possible. All of the models used to generate these assessments were trained on country-year data, because that’s the only form in which most of the requisite data is produced. To mimic the eventual forecasting process, the inputs to those models are all lagged one year at the model-estimation stage—so, for example, data on risk factors from 1985 are compared with outcomes in 1986, 1986 inputs to 1987 outcomes, and so on.
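
The lagging step can be sketched as follows. This is a Python illustration with a hypothetical row layout (`country`, `year`, `inputs`, `onset`), not the structure of the actual data set.

```python
def lag_one_year(rows):
    """Pair each country-year outcome with the previous year's inputs.

    rows: list of dicts with keys 'country', 'year', 'inputs', 'onset'.
    Returns a list of (country, outcome_year, lagged_inputs, onset) tuples.
    Country-years with no record for the prior year are dropped.
    """
    by_key = {(r["country"], r["year"]): r for r in rows}
    paired = []
    for (country, year), row in sorted(by_key.items()):
        prev = by_key.get((country, year - 1))
        if prev is not None:
            paired.append((country, year, prev["inputs"], row["onset"]))
    return paired
```

With records for 1985 through 1987, this yields two training cases: 1985 inputs paired with the 1986 outcome, and 1986 inputs paired with the 1987 outcome.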

If we stick rigidly to that structure at the forecasting stage, then I need data from 2013 to produce 2014 forecasts. Unfortunately, many of the sources for the measures used in these models won’t publish their 2013 data for at least a few more months. Faced with this problem, I could do something like what I aim to do with the coup forecasts I’ll be producing in the next few days—that is, only use data from sources that quickly and reliably update soon after the start of each year. Unfortunately again, though, the only way to do that would be to omit many of the variables most specific to the risk of mass atrocities—things like the occurrence of violent civil conflict or the political salience of elite ethnicity.

So now I’m trying something different. Instead of waiting until every last input has been updated for the previous year and they all neatly align in my rectangular data set, I am simply applying my algorithms to the most recent available observation of each input. It took some trial and error to write, but I now have an R script that automates this process at the country level by pulling the time series for each variable, omitting the missing values, reversing the series order, snipping off the observation at the start of that string, collecting those snippets in a new vector, and running that vector through the previously estimated model objects to get a forecast (see the section of this script starting at line 284).
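
The core of that routine, taking the most recent non-missing value of each variable, might look something like this in Python. This is a sketch of the idea, not a translation of the R script; the function names are hypothetical, and missing values are assumed to be coded as `None`.

```python
def latest_observed(series):
    """Return (year, value) for the most recent non-missing value, or None.

    series: dict mapping year -> value, with None marking missing data.
    """
    for year in sorted(series, reverse=True):
        value = series[year]
        if value is not None:
            return year, value
    return None


def latest_inputs(country_data):
    """Build one input vector from the freshest observation of each variable.

    country_data: dict mapping variable name -> {year: value}.
    Returns (vector, vintage), where vintage records the year each value
    came from so that stale inputs remain visible.
    """
    vector, vintage = {}, {}
    for name, series in country_data.items():
        found = latest_observed(series)
        if found is None:
            vector[name], vintage[name] = None, None
        else:
            vintage[name], vector[name] = found
    return vector, vintage


# Hypothetical example: Polity not yet updated for 2013, growth already current.
data = {
    "polity": {2011: 6, 2012: 7, 2013: None},
    "gdp_growth": {2012: 2.1, 2013: 3.4},
}
vector, vintage = latest_inputs(data)
```

Keeping the `vintage` alongside the input vector is a design choice of this sketch: it makes explicit which inputs are a year or more old when the resulting score is interpreted.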

One implicit goal of this approach is to make it easier to jump to batch processing, where the forecasting engine routinely and automatically pings the data sources online and updates whenever any of the requisite inputs has changed. So, for example, when in a few months the vaunted Polity IV Project releases its 2013 update, my forecasting contraption would catch and ingest the new version and the forecasts would change accordingly. I now have scripts that can do the statistical part but am going to be leaning on other folks to automate the wider routine as part of the early-warning system I’m helping build for the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide.

The big upside of this opportunistic approach to updating is that the risk assessments are always as current as possible, conditional on the limitations of the available data. The way I figure, when you don’t have information that’s as fresh as you’d like, use the freshest information you’ve got.

The downside of this approach is that it’s not clear exactly what the outputs from that process represent. Technically, a forecast is a probabilistic statement about the likelihood of a specific event during a specific time period. The outputs from this process are still probabilistic statements about the likelihood of a specific event, but they are no longer anchored to a specific time period. The probabilities mapped at the top of this post mostly use data from 2012, but the inputs for some variables for some cases are a little older, while the inputs for some of the dynamic variables (e.g., GDP growth rates and coup attempts) are essentially current. So are those outputs forecasts for 2013, or for 2014, or something else?

For now, I’m going with “something else” and am thinking of the outputs from this machinery as the most up-to-date statistical risk assessments I can produce, but not forecasts as such. That description will probably sound like fudging to most statisticians, but it’s meant to be an honest reflection of both the strengths and limitations of the underlying approach.

Any gear heads who’ve read this far, I’d really appreciate hearing your thoughts on this strategy and any ideas you might have on other ways to resolve this conundrum, or any other aspect of this forecasting process. As noted at the top, the data and code used to produce these estimates are posted online. This work is part of a soon-to-launch, public early-warning system, so we hope and expect that they will have some effect on policy and advocacy planning processes. Given that aim, it behooves us to do whatever we can to make them as accurate as possible, so I would very much welcome any suggestions on how to do or describe this better.

Finally and as promised, here is a dot plot of the estimates mapped above. Countries are shown in descending order by estimated risk. The gray dots mark the forecasts from the three component models, and the red dot marks the unweighted average.


PS. In preparation for a presentation on this work at an upcoming workshop, I made a new map of the current assessments that works better, I think, than the one at the top of this post. Instead of coloring by quintiles, this new version (below) groups cases into several bins that roughly represent doublings of risk: less than 1%, 1-2%, 2-4%, 4-8%, and 8-16%. This version more accurately shows that the vast majority of countries are at extremely low risk and more clearly shows variations in risk among the ones that are not.

Estimated Risk of New State-Led Mass Killing
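
The binning rule behind this map can be written down in a few lines. The Python sketch below uses the five bins named above; the function name and the catch-all top bin for values of 16% or more are assumptions added so that the function is defined for any probability.

```python
def risk_bin(p):
    """Map a probability to a bin whose upper edge doubles at each step."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    edges = [0.01, 0.02, 0.04, 0.08, 0.16]
    labels = ["<1%", "1-2%", "2-4%", "4-8%", "8-16%"]
    for edge, label in zip(edges, labels):
        if p < edge:
            return label
    # Assumption: not one of the bins described in the post.
    return ">=16%"
```

Unlike quintiles, these bins do not force an equal number of countries into each color, which is why most of the map ends up in the lowest category.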



  1. qubriq

     /  January 22, 2014

    I’m not sure your forecasting (or, fine, let’s call it risk assessment) process is going to be really ‘the most accurate’ if you use the most recent data from different years. My suggestion here would be to actually forecast your missing 2013 regressors, using repeated one-step- or multi-step-ahead forecasts to do so. So, first, forecast missing regressors; second, use those forecasts in your 2014 forecast. It can’t get messier than it is now, and now you could even incorporate the uncertainty in your regressors’ forecasts IN the model (multiple imputations might be a solution here). Uncertainty IS risk, especially when dealing with rare events!
    What do you think of it?

    • I’m really glad you suggested forecasting the regressors, because it helped crystallize a partially formed thought of mine. The way I see it, my opportunistic approach represents a special case of the approach you suggest. In fact, I’d wager that it produces the most likely case under that approach, and that’s a big part of why I think it’s reasonable.

      The regressors in my models fall into three basic types: 1) slow-changing structural features (e.g., population size, infant mortality rate, trade openness); 2) structural features that change rarely but abruptly, in step-function fashion (e.g., political regime type, recent coup activity); and 3) dynamic and continuous variables (e.g., economic growth rate). Of those three types, there’s not much to be gained from forecasting the first, because the changes from year to year are so small. The third can benefit greatly from forecasting, but that’s where I’m already using fresher data (and, in the case of economic growth, a forecast courtesy of the IMF).

      So, if forecasting the regressors is going to produce something very different from my opportunistic approach, it’s going to have to come from the forecasts of those categorical variables that change rarely but abruptly. Now, I’d like to think I could build predictive models that would reliably anticipate those jumps in things like political regime type a year or two in advance, but I’m not optimistic that I can, at least not in an absolute sense. In my experience, any models I could build of those transitions might do a decent job of accurately sorting cases by *relative* risk, but the rarity of those transitions would ensure that the absolute values of those probabilities would generally be quite low. So, the outcome that would occupy almost all of the probability space–i.e., the tall peak of each distribution–would be the continuation of the status quo. And that continuation of the status quo is exactly the special case that my opportunistic updating strategy represents.

      As with so many things, this is of course an empirical question, and it would be really interesting to try to develop some of those predictive models and see how much variation they generate. Time permitting, I’d like to try that this year and see if that’s a promising avenue for teasing out additional nuances in the forecasts. In the meantime, though, I think the opportunistic approach looks more reasonable when we recognize it as a special case representing the tall peaks of the forecast distributions that a systems-of-equations approach would produce.

      • Jay, first of all thanks for your reply, I’m very glad we are having this conversation! Let me begin by stressing that I think your approach is great and that I don’t think my contribution would substantially change much. I simply thought of it as a refinement: whether it turns out to be useful or not, it seems to me to depend on many factors (mostly cost-effectiveness, forecast efficiency and your own subjective preferences).
        On its merits, my suggestion depends a lot on the type of missing variables that you are left with for your last *observed* year or the last couple of years, and since I didn’t know exactly which ones enter your model and which don’t (although I think I had a good picture, since you’ve been quite explicit about which models you relied upon in order to build yours), the suggestion had to be a little more general than I would have preferred.
        Anyway, some comments.

        First of all, at the cost of being didactic: whenever you use your opportunistic strategy, you are already forecasting regressors; you are simply using the naïve forecast approach, which the forecasting literature uses as one of the benchmarks for any other forecasting method. In your case, you also opted for its deterministic variant (i.e., you don’t calculate a prediction interval for your naïve forecasts).
        For slow-changing variables, one-step-ahead forecasting is so close to the naïve forecast, and the prediction interval so small, that it is true: the naïve deterministic forecast is a nearly-perfect approximation.
        In a special case of your “type 3” variables, i.e. GDP growth, the IMF is already forecasting macroeconomic variables through a very complex, quite beautiful and feedback-y country and regional expert surveys model, so I think your choice is great. Too bad they don’t also provide you with the uncertainty interval around their forecasts, or you could use that too.
        In both “type 1” and “type 3” cases (more in the latter than in the former), however, using multiple imputations would slightly increase the uncertainty around your estimates, but at the same time could help improve your point forecasts.

        [I also have some ideas on using the distribution of missing-not-at-random data directly as a proxy for instability (see e.g. the recent cases of South Sudan and Iraq data from the World Bank database), instead of imputing and creating n-complete datasets, but that's a stretch, and it doesn't enter our discussion here]

        As regards your “type 2” variables, I must say that I agree with your reliance on naïve forecasts only up to a point. Leaving aside uncertainty for a moment and focussing on point forecasts, the first question that comes to mind is: how sensitive are your forecasts to changes to those variables?
        If the answer to that question is “quite a lot”, which I imagine is the case for both coup attempts and abrupt shifts in regime type, my suggestion is to split the problem in two:
        a) as you say, one cannot really use relative risk to forecast transitions –> Naïve forecasts must be relied upon when we are looking at the future. But if I am not mistaken you are only using lagged variables in your model, so if you don’t have the need to use contemporaneous values, then there’s really no need to forecast future, non-observable values;
        b) when you have missing observations for _your past_ (the case in which you still do not have the updated dataset at hand – be it FH political rights scores or Polity IV or any other dataset), ALWAYS try and update it manually with your own observations! I am working on comparative politics projects myself, so we can be quite straightforward: assigning some granular score to observable transitions AND producing dummies for coup attempts and other rare international events is both subjective and, at the same time, not that difficult. My suggestion here would be to update your model daily or monthly with values that mirror your observation of the historical reality, to revise them month by month, and finally use the updated dataset.
        Going back to uncertainty, you could even assign subjective (Bayesian) confidence intervals to your estimate, and again use multiple imputation methods to compute forecast intervals.
        One byproduct of all this is that, once the updated dataset comes out, you can even calculate your own accuracy in ‘mirroring’ the observation-and-coding process, and improve your own (subjective) forecasting year-on-year!

        PS: a final observation about your 10-fold cross-validation approach. If you really think, as you state, that “the process generating [mass killing] events may be changing over time”, I think that pure-and-simple n-folding is not the right approach to follow. Any exponential smoothing model, even one that doesn’t produce the best AIC/BIC/etc., could be preferable, in order to avoid the risk that your model ‘wins’ only because it better fit distant events than recent ones.
        Another way to do it in an n-fold context would be to assign exponentially-increasing weights to your folds the nearer they are to the present, so that nearer test folds are assigned exponentially more importance. Ideally, you could even want to use more folds (let’s say, double them and make it a 20) and again try to assign exponentially-increasing weights the nearer to the present they are. Obviously, the more folds there are, the more you risk overfitting, so you need to strike a balance between the two things.
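
The fold-weighting idea in the comment above could be sketched roughly as follows. This Python illustration assumes the folds are blocks of consecutive years ordered from oldest to newest, which is not how the 10-fold split in the post was built; the function name and the `growth` parameter are hypothetical.

```python
def recency_weighted_mean(scores_oldest_first, growth=2.0):
    """Average per-fold scores with weights that grow toward the present.

    scores_oldest_first: one accuracy score (e.g. AUC) per time-ordered fold.
    growth: ratio between the weights of consecutive folds; 1.0 gives the
            ordinary unweighted mean.
    """
    if not scores_oldest_first:
        raise ValueError("need at least one fold score")
    weights = [growth ** i for i in range(len(scores_oldest_first))]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores_oldest_first)) / total
```

With `growth=3.0` and fold scores of 0.6 (older) and 0.8 (newer), the weighted mean is (0.6 + 3 × 0.8) / 4 = 0.75, compared with an unweighted mean of 0.70.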

      • First off, let me say how much I appreciate the careful thought and time you’re giving to commenting on this work. This conversation is really helpful, and ones like it don’t happen often.

        On the substance, I don’t disagree with anything you’re saying or suggesting. It’s really a matter of thinking about how to allocate my time on this project. I would really like to try building some models to forecast some of those “type 2” regressors, propagating forward the uncertainty in their estimates, and seeing how sensitive the forecasts, er, risk assessments are to the results. Hopefully, I’ll get a chance to do that later this year, and if I do, you’ll definitely hear about it here.

        On hand-coding as an alternative, that’s actually what I did the past couple of years and am now trying to get away from. For some variables, it’s not so hard, but for others it can actually take quite a lot of research, and some of the coding decisions still feel highly uncertain. That’s troubling when the results can have a big impact on the resulting forecasts. Given those facts and our program’s limited resources, I decided to get out of that business this year, which is precisely what led to this opportunistic updating strategy. In an ideal world, I would have a pool of experts who would give me their best guesses on all the relevant inputs at the start of the year, and I could summarize those guesses in a distribution that could be used to generate a distribution of forecasts. Alas, we’re not there (yet).

        Oh, and on cross-validation: my process isn’t so sensitive to the change-over-time issue because I’m not using CV for model or variable selection. Two of the three models are adaptations of previous published work on related topics, and the third (Random Forest) just applies machine learning to the variables from the other two. I’m really just using CV to confirm that the resulting forecasting process should work reasonably well and to get a rough estimate of just how accurate we might expect it to be.

      • Thank you again, Jay, for your follow up!
        I also really enjoyed this exchange, and your replies are leaving me almost completely satisfied. For example, as the conversation carried on I started to forget that you weren’t actually selecting the best model through cross-validation, so you did great in bringing me back on topic.
        As for the one part of our discussion on which I am still not convinced, I still think that handcoding (really meaning: subjectively estimating) should be preferred in your ‘type 2’ cases. As I mentioned earlier, you could assign some uncertainty around your own point estimate (as subjective as they both are), and you could also justify them by referring to different sources. This is most important in the case of dummy variables (presence of coup attempts, civil wars, etc.) and for abrupt transitions (which will be almost equivalent to ordinal jumps even on a supposedly-continuous scale such as Polity IV).
        I mean, what do you gain carrying over to the next year the absence of a coup of the previous year, or even worse, to carry over to the next year the presence of a rare event such as a coup the previous year? The naïve forecast will certainly work well for every country whose regime hasn’t changed or that hasn’t been affected by a coup in both years (the observed one, and the previous one), but what about the others?
        So I think that my proposal in a nutshell could actually be:
        (a) for regime type: DO use naïve forecasts for most countries in the world, but DO try to handcode whenever you believe that something has changed year-on-year;
        (b) for coup attempts (or any other rare-event dummy): the default should always be zeroes, and you should only assign 1 to actual coups, or some non-integer point estimate. In this way you should always get a more accurate forecast, don’t you think?

        Now, as I write, I feel like there is still a grey area in this conversation, so here’s a question you did not answer before: do your models only employ lagged variables, or do they include contemporaneous regressors? Because what you are saying about asking the experts and extrapolating a distribution of beliefs seems to me to be absolutely necessary only if you had to *forecast* future (e.g., 2014) regime transitions or coup occurrences. But you could do everything by yourself if you only had to *estimate* already-occurred (e.g., 2013) coups and transitions. In the latter case, assigning estimates and uncertainties would work much better – and in many cases a coup is highly evident and you shouldn’t be so uncertain as to whether it happened or not, while in others you could assign non-integer estimates as I was saying before.
        Or am I missing something?

      • A couple of variables in the ensemble (country age and post-Cold War indicator) are deterministic. Everything else is lagged one year. So in principle it’s all hand-codable. I should also clarify that coup attempts is something for which I do have up-to-date data, thanks to Jonathan Powell and Clayton Thyne.

        To get more specific, because the details matter here, the things that fall into that “type 2” category for which updating by the original source doesn’t happen for at least a few months are:

        * Political regime type (a categorical treatment of the Polity scale)
        * Political regime duration (a count of years linked to changes in the Polity scale)
        * Autocracy (a binary categorization also based on Polity)
        * Armed civil conflict (a scalar measure of intensity from the Center for Systemic Peace)
        * Armed conflict in the geographic region (a concatenation of a broader scalar measure from CSP)
        * Population share subjected to state-led discrimination
        * Political salience of elite ethnicity (binary)
        * Exclusionary elite ideology (binary)
        * Cumulative “upheaval” (a 10-year prior moving summary of political instability intensity scores from the Political Instability Task Force)

        As you can see, a) the list is long and b) some of those are not at all straightforward to code by hand. Actually, the Polity categorizations are probably the easiest of the lot. Many of the others summarize a lot of careful research that people conduct year-round, and I’m just not confident that my off-the-cuff guesses won’t introduce even more noise into the estimates than a naive extension of the status quo does.

  2. Fascinating work. How would one convert the AUC scores to some more direct measure of predictive accuracy? I realize there will be a tradeoff somewhere between false negatives and false positives, but can you give an example pair? Even better, I’d like to see these curves because I’m having a hard time going from AUC to accuracy.

    • Jonathan, if I could paste a plot of the ROC curves in this comment, I would. I can’t, though, so I will just tell you that in out-of-sample estimates from 10-fold CV, the sensitivity and specificity at the decision threshold that equalizes those two rates is just under 80 percent for the unweighted average. The initial part of the curve is pretty steep, so you get to about 60 percent sensitivity with specificity of about 85 percent. Of course, these events are super-rare, so equal error rates with a binary classification rule would still mean a lot of false positives for every true positive—something like 10 or 12 to 1 each year. That said, the true positives of future years usually come from the false positives of the present.
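
To make the link between rarity and false positives concrete, here is a back-of-the-envelope Python sketch. The counts used in the example (100 cases, 2 onsets) are hypothetical round numbers, not the project's data, and the function name is made up for illustration.

```python
def false_positives_per_true_positive(n_cases, n_onsets, sensitivity, specificity):
    """Expected ratio of false positives to true positives under a binary rule.

    n_cases: total number of cases scored (e.g. countries in one year)
    n_onsets: how many of those cases actually experience an onset
    """
    true_pos = sensitivity * n_onsets
    false_pos = (1.0 - specificity) * (n_cases - n_onsets)
    return false_pos / true_pos


# Hypothetical: 100 cases, 2 onsets, 80% sensitivity and 80% specificity.
ratio = false_positives_per_true_positive(100, 2, 0.80, 0.80)
```

In this made-up example the rule flags about 19.6 non-onset cases for every 1.6 onsets it catches, a ratio of roughly 12 to 1, even though both error rates are 20 percent.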

      • I want to make sure I understand the interpretation, so I’m going to imagine using this model for decision making. I think you’re saying that if we need to predict mass killing events at least 60% of the time (miss no more than 40%) then we must set the threshold such that we should expect to get 10 times more false positives than true positives. Is that correct?

        Seems pretty dismal. Not that other conflict prediction models have done better, AFAIK. Rare events make everything difficult.

        Do you have any idea how this model does versus human predictions? How about versus the naive model of “predict same as last year of data for each country”? I suspect the latter is the appropriate baseline for any sophisticated model, as suggested by http://jayyonamine.com/wp-content/uploads/2013/03/Forecasting_Afghanistan.pdf

      • Your interpretation is numerically correct, but I think the sharp distinction you’re drawing between true and false positives is a little misleading. As I said, the events further in the future usually come from countries that get high scores now, so they aren’t false positives in any crisp sense (e.g., someone who doesn’t have HIV but tests positive for it). So in that sense I think the imprecision is less dismal than you infer.

        As for comparisons to other baselines:

        * Because we’re modeling onsets only and those onsets are so rare, the appropriate naive model is “nothing happens.” This produces a fantastic Brier score and a wonderful true-negative rate but, obviously, never warns on any of these low-probability, high-impact events. That latter problem is the motivation to produce and publish these scores, even if they’re not as precise as we’d like.

        * I have looked hard for them but cannot find any routine, global, probabilistic judgment-based forecasts to which we could compare our statistical results, past or present. We are now pursuing a couple of avenues for producing them in the future, namely, 1) an opinion pool that will generate real-time forecasts throughout the year on selected cases and 2) wiki surveys we run before the start of the year. For an example of the latter, see this post from a few weeks ago.
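
The point about the “nothing happens” baseline in the first item above can be checked with a few lines of Python. The counts are hypothetical (2 onsets among 100 cases), chosen only to show the pattern.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Hypothetical year: 2 onsets among 100 cases, forecast of zero everywhere.
outcomes = [1] * 2 + [0] * 98
naive = [0.0] * 100
naive_brier = brier_score(naive, outcomes)

# The naive rule never exceeds any positive warning threshold.
onsets_flagged = sum(1 for p, y in zip(naive, outcomes) if y == 1 and p > 0.0)
```

The Brier score here is 0.02, which looks excellent on a scale where 0 is perfect, yet the rule flags none of the onsets.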

  3. How Bangladesh is more ripe for state-led mass killings than Sri Lanka and North Korea is beyond me…#TheDangersOfQuantifyingEverything #MarkTwainWasRight

    • It should be noted that North Korea already has an ongoing mass killing against political “enemies,” according to our data. In a case like that, this assessment addresses the risk that another episode of mass killing targeting a different group would start alongside that one. The models aren’t designed to tell us anything about the trajectory of the episode that’s already occurring.

    • Also, a quick word on Sri Lanka: Ongoing armed conflict and coup threats are two huge risk factors for state-led mass killing. Sri Lanka had a civil war that culminated in a mass killing, but that war is over, and the government has successfully (and ruthlessly) consolidated its authority since then. As a result, Sri Lanka now looks like a low-risk case. By contrast, Bangladesh has ongoing factional violence, and as Human Rights Watch has documented, state security forces have demonstrated their willingness to use brutal force against protesters in the past year. So I think the model actually gets these cases about right, at least in relative terms.

