How Circumspect Should Quantitative Forecasters Be?

Yesterday, I participated in a panel discussion on the use of technology to prevent and document mass atrocities as part of an event at American University’s Washington College of Law to commemorate the Rwandan genocide.* In my prepared remarks, I talked about the atrocities early-warning system I’m helping build for the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide. The chief outputs of that system are probabilistic forecasts, some from statistical models and others from a “wisdom of (expert) crowds” system called an opinion pool.

After I’d described that project, one of the other panelists, Patrick Ball, executive director of Human Rights Data Analysis Group, had this to say via Google Hangout:

As someone who uses machine learning to build statistical models—that’s what I do all day long, that’s my job—I’m very skeptical that models about conflict, about highly rare events that have very complicated and situation-unique antecedents are forecastable. I worry about early warning because when we build models we listen to people less. I know that, from my work with the U.N., when we have a room full of people who know an awful lot about what’s going on on the ground, a graph—when someone puts a graph on the table, everybody stops thinking. They just look at the graph. And that worries me a lot.

In 1994, human-rights experts warned the world about what was happening [in Rwanda]. No one listened. So as we, as technologists and people who like technology, when we ask questions of data, we have to make sure that if anybody is going to listen to us, we’d better be giving them the right answers.

Maybe I was being vain, but I heard that part of Patrick’s remarks as a rebuke of our early-warning project and pretty much every other algorithm-driven atrocities and conflict forecasting endeavor out there. I responded by acknowledging that our forecasts are far from perfect, but I also asserted that we have reason to believe they will usually be at least marginally better than the status quo, so they’re worth doing and sharing anyway.

A few minutes later, Patrick came back with this:

When we build technology for human rights, I think we need to be somewhat thoughtful about how our less technical colleagues are going to hear the things that we say. In a lot of meetings over a lot of years, I’ve listened to very sophisticated, thoughtful legal, qualitative, ethnographic arguments about very specific events occurring on the ground. But almost inevitably, when someone proposes some kind of quantitative analysis, all that thoughtful reasoning escapes the room… The practical effect of introducing any kind of quantitative argument is that it displaces the other arguments that are on the table. And we are naive to think otherwise.

What that means is that the stakes for getting these kinds of claims right are very high. If we make quantitative claims and we’re wrong—because our sampling foundations are weak, because our model is inappropriate, because we misinterpreted the error around our claim, or for any other reason—we can do a lot of harm.

From that combination of uncertainty and the possibility for harm, Patrick concludes that quantitative forecasters have a special responsibility to be circumspect in the presentation of their work:

I propose that one of the foundations of any kind of quantitative claims-making is that we need to have very strict validation before we propose a conclusion to be used by our broader community. There are all kinds of rules about validation in model-building. We know a lot about it. We have a lot of contexts in which we have ground truth. We have a lot of historical detail. Some of that historical detail is itself beset by these sampling problems, but we have opportunities to do validation. And I think that any argument, any claim that we make—especially to non-technical audiences—should lead with that validation rather than leaving it to the technical detail. By avoiding discussing the technical problems in front of non-technical audiences, we’re hiding stuff that might not be working. So I warn us all to be much stricter.

Patrick has applied statistical methods to human-rights matters for a long time, and his combined understanding of the statistics and the advocacy issues is as good as, if not better than, almost anyone else’s. Still, what he described about how people respond to quantitative arguments is pretty much the exact opposite of my experience over 15 years of working on statistical forecasts of various forms of political violence and change. Many of the audiences to which I’ve presented that work have been deeply skeptical of efforts to forecast political behavior. Like Patrick, many listeners have asserted that politics is fundamentally unquantifiable and unpredictable. Statistical forecasts in particular are often derided for connoting a level of precision that’s impossible to achieve and for being too far removed from the messy reality of specific places to produce useful information. Even in cases where we can demonstrate that the models are pretty good at distinguishing high-risk cases from low-risk ones, that evidence usually fails to persuade many listeners, who appear to reject the work on principle.

I hear loud echoes of my experiences in Daniel Kahneman’s discussion of clinical psychologists’ hostility to algorithms and enduring prejudice in favor of clinical judgment, even in situations where the former is demonstrably superior to the latter. On p. 228 of Thinking, Fast and Slow, Kahneman observes that this prejudice “is an attitude we can all recognize.”

When a human competes with a machine, whether it is John Henry a-hammerin’ on the mountain or the chess genius Garry Kasparov facing off against the computer Deep Blue, our sympathies lie with our fellow human. The aversion to algorithms making decisions that affect humans is rooted in the strong preference that many people have for the natural over the synthetic or artificial.

Kahneman further reports that

The prejudice against algorithms is magnified when the decisions are consequential. [Psychologist Paul] Meehl remarked, ‘I do not quite know how to alleviate the horror some clinicians seem to experience when they envisage a treatable case being denied treatment because a ‘blind, mechanical’ equation misclassifies him.’ In contrast, Meehl and other proponents of algorithms have argued strongly that it is unethical to rely on intuitive judgments for important decisions if an algorithm is available that will make fewer mistakes. Their rational argument is compelling, but it runs against a stubborn psychological reality: for most people, the cause of a mistake matters. The story of a child dying because an algorithm made a mistake is more poignant than the story of the same tragedy occurring as a result of human error, and the difference in emotional intensity is readily translated into a moral preference.

If our distaste for algorithms is more emotional than rational, then why do forecasters who use them have a special obligation, as Patrick asserts, to lead presentations of their work with a discussion of the “technical problems” when experts offering intuitive judgments almost never do? I’m uncomfortable with that requirement, because I think it unfairly handicaps algorithmic forecasts in what is, frankly, a competition for attention against approaches that are often demonstrably less reliable but also have real-world consequences. This isn’t a choice between action and inaction; it’s a trolley problem. Plenty of harm is already happening on the current track, and better forecasts could help reduce that harm. Under these circumstances, I think we behave ethically when we encourage the use of our forecasts in honest but persuasive ways.

If we could choose between forecasting and not forecasting, then I would be happier to set a high bar for predictive claims-making and let the validation to which Patrick alluded determine whether or not we’re going to try forecasting at all. Unfortunately, that’s not the world we inhabit. Instead, we live in a world in which governments and other organizations are constantly making plans, and those plans incorporate beliefs about future states of the world.

Conventionally, those beliefs are heavily influenced by the judgments of a small number of experts elicited in unstructured ways. That approach probably works fine in some fields, but geopolitics is not one of them. In this arena, statistical models and carefully designed procedures for eliciting and combining expert judgments will also produce forecasts that are uncertain and imperfect, but those algorithm-driven forecasts will usually be more accurate than the conventional approach of querying one or a few experts and blending their views in our heads (see here and here for some relevant evidence).

We also know that most of those subject-matter experts don’t abide by the rules Patrick proposes for quantitative forecasters. Anyone who’s ever watched cable news or read an op-ed—or, for that matter, attended a panel discussion—knows that experts often convey their judgments with little or no discussion of their cognitive biases and sources of uncertainty.

As it happens, that confidence is persuasive. As Kahneman writes (p. 263),

Experts who acknowledge the full extent of their ignorance may expect to be replaced by more confident competitors who are better able to gain the trust of clients. An unbiased appreciation of uncertainty is a cornerstone of rationality—but it is not what people and organizations want. Extreme uncertainty is paralyzing under dangerous circumstances, and the admission that one is merely guessing is especially unacceptable when the stakes are high. Acting on pretended knowledge is often the preferred solution.

The allure of confidence is dysfunctional in many analytic contexts, but it’s also not something we can wish away. And if confidence often trumps content, then I think we do our work and our audiences a disservice when we hem and haw about the validity of our forecasts as long as the other guys don’t. Instead, I believe we are behaving ethically when we present imperfect but carefully derived forecasts in a confident manner. We should be transparent about the limitations of the data and methods, and we should assess the accuracy of our forecasts and share what we learn. Until we all agree to play by the same rules, though, I don’t think quantitative forecasters have a special obligation to lead with the limitations of their work, thus conceding a persuasive advantage to intuitive forecasters who will fill that space and whose prognostications we can expect to be less reliable than ours.

* You can replay a webcast of that event here. Our panel runs from 1:00:00 to 2:47:00.

Demography, Democracy, and Complexity

Five years ago, demographer Richard Cincotta claimed in a piece for Foreign Policy that a country’s age structure is a powerful predictor of its prospects for attempting and sustaining liberal democracy. “A country’s chances for meaningful democracy increase,” he wrote, “as its population ages.” Applying that superficially simple hypothesis to the data at hand, he ventured a forecast:

The first (and perhaps most surprising) region that promises a shift to liberal democracy is a cluster along Africa’s Mediterranean coast: Morocco, Algeria, Tunisia, Libya, and Egypt, none of which has experienced democracy in the recent past. The other area is in South America: Ecuador, Colombia, and Venezuela, each of which attained liberal democracy demographically “early” but was unable to sustain it. Interpreting these forecasts conservatively, we can expect there will be one, maybe two, in each group that will become stable democracies by 2020.

I read that article when it was published, and I recall being irritated by it. At the time, I had been studying democratization for more than 15 years and was building statistical models to forecast transitions to and from democracy as part of my paying job. Seen through those goggles, Cincotta’s construct struck me as simplistic to the point of naiveté. Democratization is a hard theoretical problem. States have arrived at and departed from democracy by many different pathways, so how could what amounts to a one-variable model possibly have anything useful to say about it?

Revisiting Cincotta’s work in 2014, I like it a lot more, for a couple of reasons. First, I have come to see it as an elegant representation of a larger idea. As Cincotta argues in that Foreign Policy article and another piece he published around the same time, demographic structure is one component of a much broader and more complex syndrome in which demography is both effect and cause. Changes in fertility rates, and through them age structure, are strongly shaped by other social changes like education and urbanization, which are correlated with, but hardly determined by, increases in national wealth.

Of course, that syndrome is what we conventionally call “development,” and the pattern Cincotta observes has a strong affinity with modernization theory. Cincotta’s innovation was to move the focus away from wealth, which has turned out to be unreliable as a driver and thus as a proxy for development in a larger sense, to demographic structure, which is arguably a more sensitive indicator of it. As I see it now, what we now call development is part of a “state shift” occurring in human society at the global level that drives and is reinforced by long-term trends in democratization and violent conflict. As in any complex system, though, the visible consequences of that state shift aren’t evenly distributed.

In this sense, Cincotta’s argument is similar to one I often find myself making about the value of using infant mortality rates instead of GDP per capita as a powerful summary measure in models of a country’s susceptibility to insurgency and civil war. The idea isn’t that dead children motivate people to attack their governments, although that may be one part of the story. Instead, the idea is that infant mortality usefully summarizes a number of other things that are all related to conflict risk. Among those things are the national wealth we can observe directly (if imperfectly) with GDP, but also the distribution of that wealth and the state’s will and ability to deliver basic social services to its citizens. Seen through this lens, higher-than-average infant mortality helps us identify states suffering from a broader syndrome that renders them especially susceptible to violent conflict.

Second, I have also come to appreciate more what Cincotta was and is doing because I respect his willingness to apply his model to generate and publish probabilistic forecasts in real time. In professional and practical terms, that’s not always easy for scholars to do, but doing it long enough to generate a real track record can yield valuable scientific dividends.

In this case, it doesn’t hurt that the predictions Cincotta made six years ago are looking pretty good right now, especially in contrast to the conventional wisdom of the late 2000s on the prospects for democratization in North Africa. None of the five states he lists there yet qualifies as a liberal democracy on his terms (a “free” designation from Freedom House). Still, it’s only 2014, one of them (Tunisia) has moved considerably in that direction, and two others (Egypt and Libya) have seen seemingly frozen political regimes crumble and substantial attempts at democratization ensue. Meanwhile, the long-dominant paradigm in comparative democratization would have left us watching for splits among ruling elites that really only happened in those places as their regimes collapsed, and many area experts were telling us in 2008 to expect more of the same in North Africa as far as the mind could see. Not bad for a “one-variable model.”

On Prediction

It is surprisingly hard to find cogent statements about why prediction is important for developing theory in political science. This is primarily a cultural artifact, I think—political science has mostly eschewed prediction for decades, and many colleagues and reviewers remain openly hostile to it, so it’s not something we spend much time writing and talking about—but there’s a hard philosophy-of-science question lurking there, too. It’s the one Karl Popper implies when he argues in “Prediction and Prophecy in the Social Sciences” that:

Long-term prophecies can be derived from scientific conditional predictions only if they apply to systems which can be described as well-isolated, stationary, and recurrent. These systems are very rare in nature; and modern society is not one of them.

In other words, the causal processes social scientists aim to discover could be moving targets, so the theories we develop through our research will typically be bounded and contingent. If we’re not sure a priori how general we expect our theories to be, then how much predictive power should we expect them to have? To which cases is a theory meant to apply, and therefore to predict? Without an answer to that question, we can’t confidently judge how informative our predictive accuracy or errors are, and we can only answer that question with another layer of theory.

Statistical modeling suggests a practical rationale to prefer out-of-sample to in-sample “prediction” as a means of validation, namely, overfitting. That’s what we call it when we build models that chase after the idiosyncrasies in the data on which they’re trained and lose sight along the way of the wider regularities we seek to uncover. As a technical matter, though, the problem of overfitting is specific to the method, and as a philosophical point it’s another handwave. In particular, it doesn’t deal with the possibility that a theory works wonderfully for the sample from which it’s derived but poorly elsewhere. That proposition must sound bizarre to biologists or physicists, but as Popper and others would argue, it’s not a crazy idea in political science. Do we really think that the causes of war between states are the same now as they were 100 years ago? Of democratization? Human physiology may not evolve that fast, but human society arguably does, at least recently.

After all that hand-wringing, I land on a pragmatic rationale for emphasizing prediction as a means of validation: it works better than the alternatives. Science is a method for deepening our understanding of the world we inhabit. Deepening implies directional movement. For science to move, we have to try to assess the validity of, and adjudicate between, different ideas. From psychology, we know that the sheer plausibility of a story—and at some level, that’s really what all social-science theories are, whichever “language” we use to represent them—is not a reliable guide to its truth. Just because something makes sense or feels right does not mean that it is, and our brains are terrible about (or excellent at) filtering evidence to confirm our expectations.

Under those circumstances, we need some other way to assess whether an explanation is plausible—or, when more than one explanation is available, to determine which version is more plausible than the others. We can try to do that with reference to the evidence from which the explanation was derived, but the result is a tautology: evidence A suggests theory B; we know theory B is correct because it implies evidence A, and A is what we observe.

The alternative that remains is prediction. To determine whether or not our mental models are getting at something “real,” we have to apply them to situations yet unseen and see if the regularities we thought we had uncovered hold. The results will rarely be binary, but they will still provide new information about the usefulness of the model. Crucially, those results can also be compared to ones from competing theories to help us determine which explanation covers more. That’s the engine of accumulation. Under certain conditions, that new information may even reveal something about the scope conditions of the initial theory. When a model routinely predicts well in some kinds of cases but not others, then we have uncovered a new pattern that we can add to the initial construct and then continue to test in the same way.

For reasons to which Popper alludes, I don’t believe these iterations will reveal laws that consistently explain and anticipate political behavior writ large. Still, we have seen this process produce great advances in other fields, so I prefer it to the alternative of not trying. And, absent some version of this process, the theories we construct about politics are epistemologically indistinguishable from fiction. Fiction can be satisfying to write and read and even be “true” in a fashion, but it is not science because, among other things, it does not aspire to accumulation.

I use the word “aspire” in that last sentence advisedly. Directionality does not necessarily imply eventual arrival, or even reliable navigation. I think it’s perfectly reasonable to establish the accumulation of knowledge as the basic point of the endeavor and still understand and experience science as neuroscientist Stuart Firestein describes it in his wonderful book, Ignorance. Firestein opens the book with a proverb—”It is very difficult to find a black cat in a dark room, especially when there is no cat”—and goes on to say that

[Science] is not facts and rules. It’s black cats in dark rooms. As the Princeton mathematician Andrew Wiles describes it: It’s groping and probing and poking, and some bumbling and bungling, and then a switch is discovered, often by accident, and the light is lit, and everyone says, “Oh, wow, so that’s how it looks,” and then it’s off into the next dark room, looking for the next mysterious black feline. If this all sounds depressing, perhaps some bleak Beckett-like scenario of existential endlessness, it’s not. In fact, it’s somehow exhilarating.

From my own experience, I’d say it can be both depressing and exhilarating, but the point still stands.

The Wisdom of Crowds, Oscars Edition

Forecasts derived from prediction markets did an excellent job predicting last night’s Academy Awards.

PredictWise uses odds from online bookmaker Betfair for its Oscars forecasts, and it nearly ran the table. PredictWise assigned the highest probability to the eventual winner in 21 of 24 awards, and its three “misses” came in less prominent categories (Best Documentary, Best Short, Best Animated Short). Even more telling, its calibration was excellent. The probability assigned to the eventual winner in each category averaged 87 percent, and most winners were correctly identified as nearly sure things.

Inkling Markets also did quite well. This public, play-money prediction market has a lot less liquidity than Betfair, but it still assigned the highest probability of winning to the eventual winner in 17 of the 18 categories it covered—it “missed” on Best Original Song—and for extra credit it correctly identified Gravity as the film that would probably win the most Oscars. Just by eyeballing, it’s clear that Inkling’s calibration wasn’t as good as PredictWise’s, but that’s what we’d expect from a market with a much smaller pool of participants. In any case, you still probably would have won your Oscars pool if you’d relied on it.

This is the umpteen-gajillionth reminder that crowds are powerful forecasters. “When our imperfect judgments are aggregated in the right way,” James Surowiecki wrote (p. xiv), “our collective intelligence is often excellent.” Or, as PredictWise’s David Rothschild said in his live blog last night,

This is another case of pundits and insiders advertising a close event when the proper aggregation of data said it was not. As I noted on Twitter earlier, my acceptance speech is short. I would like to thank prediction markets for efficiently aggregating dispersed and idiosyncratic data.

How’d Those Football Forecasts Turn Out?

Yes, it’s February, and yes, the Winter Olympics are on, but it’s a cold Sunday so I’ve got football on the brain. Here’s where that led today:

Last August, I used a crowdsourcing technique called a wiki survey to generate a set of preseason predictions on who would win Super Bowl 48 (see here). I did this fun project to get a better feel for how wiki surveys work so I could start applying them to more serious things, but I’m also a pro football fan who wanted to know what the season portended.

Now that Super Bowl 48’s in the books, I thought I would see how those forecasts fared. One way to do that is to take the question and results at face value and see if the crowd picked the right winner. The short answer is “no,” but it didn’t miss by a lot. The dot plot below shows teams in descending order by their final score on the preseason survey. My crowd picked New England to win, but Seattle was second by just a whisker, and the four teams that made the conference championship games occupied the top four slots.

nflpostmortem.dotplot

So the survey did great, right? Well, maybe not if you look a little further down the list. The Atlanta Falcons, who finished the season 4-12, ranked fifth in the wiki survey, and the Houston Texans—widely regarded as the worst team in the league this year—also landed in the top 10. Meanwhile, the 12-4 Carolina Panthers and 11-5 KC Chiefs got stuck in the basement. Poke around a bit more, and I’m sure you can find a few other chuckles.

Still, the results didn’t look crazy, and I was intrigued enough to want to push it further. To get a fuller picture of how well this survey worked as a forecasting tool, I decided to treat the results as power rankings and compare them across the board to postseason rankings. In other words, instead of treating this as a classification problem (find the Super Bowl winner), I thought I’d treat it as a calibration problem, where the latent variable I was trying to observe before and after is relative team strength.

That turned out to be surprisingly difficult—not because it’s hard to compare preseason and postseason scores, but because it’s hard to measure team strength, even after the season’s over. I asked Trey Causey and Sean J. Taylor, a couple of professional acquaintances who know football analytics, to point me toward an off-the-shelf “ground truth,” and neither one could. Lots of people publish ordered lists, but those lists don’t give us any information about the distance between rungs on the ladder, a critical piece of any calibration question. (Sean later produced and emailed me a set of postseason Bradley-Terry rankings that look excellent, but I’m going to leave the presentation of that work to him.)

About ready to give up on the task, it occurred to me that I could use the same instrument, a wiki survey, to convert those ordered lists into a set of scores that would meet my criteria. Instead of pinging the crowd, I would put myself in the shoes of those lists’ authors for a while, using their rankings to guide my answers to the pairwise comparisons the wiki survey requires. Basically, I would kluge my way to a set of rankings that amalgamated the postseason judgments of several supposed experts. The results would have the added advantage of being on the same scale as my preseason assessments, so the two series could be directly compared.

To get started, I Googled “nfl postseason power rankings” and found four lists that showed up high in the search results and had been updated since the Super Bowl (here, here, here, and here). Then I set up a wiki survey and started voting as List Author #1. My initial thought was to give each list 100 votes, but when I got to 100, the results of the survey in progress didn’t look as much like the original list as I’d expected. Things were a little better at 200 but still not terrific. In the end, I decided to give each survey 320 votes, or the equivalent of 10 votes for each item (team) on the list. When I got to 320 with List 1, the survey results were nearly identical to the original, so I declared victory and stuck with that strategy. That meant 1,280 votes in all, with equal weight for each of the four list-makers.

The plot below compares my preseason wiki survey’s ratings with the results of this Mechanical Turk-style amalgamation of postseason rankings. Teams in blue scored higher than the preseason survey anticipated (i.e., over-performed), while teams in red scored lower (i.e., under-performed).

nflpostmortemplot

Looking at the data this way, it’s even clearer that the preseason survey did well at the extremes and less well in the messy middle. The only stinkers the survey badly overlooked were Houston and Atlanta, and I think it’s fair to say that a lot of people were surprised by how dismal their seasons were. Ditto the Washington [bleep]s and Minnesota Vikings, albeit to a lesser extent. On the flip side, Carolina stands out as a big miss, and KC, Philly, Arizona, and the Colts can also thumb their noses at me and my crowd. Statistically minded readers might want to know that the root mean squared error (RMSE) here is about 27, where the observations are on a 0-100 scale. That 27 is better than random guessing, but it’s certainly not stellar.
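
For concreteness, here is a minimal sketch of that RMSE calculation in R. The team scores below are made up but sit on the same 0-100 scale as the survey results; the real numbers are in the data file linked at the end of this post.

    # Root mean squared error between preseason and postseason survey scores.
    # 'preseason' and 'postseason' stand in for the real results, aligned by team.
    rmse <- function(predicted, observed) sqrt(mean((observed - predicted)^2))
    preseason  <- c(SEA = 95, NE = 97, CAR = 20, ATL = 80)   # made-up values
    postseason <- c(SEA = 100, NE = 85, CAR = 75, ATL = 15)
    rmse(preseason, postseason)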

A single season doesn’t offer a robust test of a forecasting technique. Still, as a proof of concept, I think this exercise was a success. My survey only drew about 1,800 votes from a few hundred respondents whom I recruited casually through my blog and Twitter feed, which focuses on international affairs and features very little sports talk. When that crowd was voting, the only information they really had was the previous season’s performance and whatever they knew about off-season injuries and personnel changes. Under the circumstances, I’d say an RMSE of 27 ain’t terrible.

It’d be fun to try this again in August 2014 with a bigger crowd and see how that turns out. Before and during the season, it would also be neat to routinely rerun that Mechanical Turk exercise to produce up-to-date “wisdom of the (expert) crowd” power rankings and see if they can help improve predictions about the coming week’s games. Better yet, we could write some code to automate the ingestion of those lists, simulate their pairwise voting, and apply All Our Ideas’ hierarchical model to the output. In theory, this approach could scale to incorporate as many published lists as we can find, culling the purported wisdom of our hand-selected crowd without the hassle of all that recruiting and voting.
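
As a rough sketch of what that vote-simulation step might look like in R: the code below assumes each published list arrives as an ordered vector of team names and that the higher-ranked team always wins its simulated pairings. The function name and arguments are my own invention, not part of any existing package.

    # Convert one published power ranking into simulated pairwise votes, the
    # raw input a wiki survey expects.
    simulate_votes <- function(ranking, votes_per_item = 10) {
      n_votes <- votes_per_item * length(ranking)
      votes <- data.frame(winner = character(n_votes), loser = character(n_votes),
                          stringsAsFactors = FALSE)
      for (i in seq_len(n_votes)) {
        pair <- sample(ranking, 2)                       # draw a random matchup
        winner <- pair[which.min(match(pair, ranking))]  # higher-ranked team wins
        votes$winner[i] <- winner
        votes$loser[i]  <- setdiff(pair, winner)
      }
      votes
    }
    # head(simulate_votes(c("SEA", "DEN", "NE", "SF", "CAR")))

The votes simulated from each list could then be pooled and passed to the same pairwise model that All Our Ideas uses, which is the part this sketch leaves out.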

Unfortunately, that crystal palace was a bit too much for me to build on this dim and chilly Sunday. And now, back to our regularly scheduled programming…

PS If you’d like to tinker with the data, you can find it here.

Watch Experts’ Beliefs Evolve Over Time

On 15 December 2013, “something” happened in South Sudan that quickly began to spiral into a wider conflict. Prior research tells us that mass killings often occur on the heels of coup attempts and during civil wars, and at the time South Sudan ranked among the world’s countries at greatest risk of state-led mass killing.

Motivated by these two facts, I promptly added a question about South Sudan to the opinion pool we’re running as part of a new atrocities early-warning system for the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide (see this recent post for more on that). As it happened, we already had one question running about the possibility of a state-led mass killing in South Sudan targeting the Murle, but the spiraling conflict clearly implied a host of other risks. Posted on 18 December 2013, the new question asked, “Before 1 January 2015, will an episode of mass killing occur in South Sudan?”

The criteria we gave our forecasters to understand what we mean by “mass killing” and how we would decide if one has happened appear under the Background Information header at the bottom of this post. Now, shown below is an animated sequence of kernel density plots of each day’s forecasts from all participants who’d chosen to answer this question. A kernel density plot is like a histogram, but with some nonparametric estimation thrown in to try to get at the distribution of a variable’s “true” values from the sample of observations we’ve got. If that sounds like gibberish to you, just think of the peaks in the plots as clumps of experts who share similar beliefs about the likelihood of mass killing in South Sudan. The taller the peak, the bigger the clump. The farther right the peak, the more likely that clump thinks a mass killing is.

kplot.ssd.20140205
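
For readers who want to see the mechanics, here is a minimal sketch in R of how a single frame of that animation could be drawn. The forecast values below are invented for illustration; the real ones come from the opinion pool.

    # Kernel density estimate of one day's forecasts, one value per participant,
    # expressed as probabilities in percent.
    forecasts <- c(85, 90, 75, 95, 80, 88, 60, 92)   # made-up values
    d <- density(forecasts, from = 0, to = 100)       # nonparametric density estimate
    plot(d, main = "Will an episode of mass killing occur in South Sudan?",
         xlab = "Estimated probability (percent)")
    abline(v = mean(forecasts), lty = 2)              # that day's average forecast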

I see a couple of interesting patterns in those plots. The first is the rapid rightward shift in the distribution’s center of gravity. As the fighting escalated and reports of atrocities began to trickle in (see here for one much-discussed article from the time), many of our forecasters quickly became convinced that a mass killing would occur in South Sudan in the coming year, if one wasn’t occurring already. On 23 December—the date that aforementioned article appeared—the average forecast jumped to approximately 80 percent, and it hasn’t fallen below that level since.

The second pattern that catches my eye is the appearance in January of a long, thin tail in the distribution that reaches into the lower ranges. That shift in the shape of the distribution coincides with stepped-up efforts by U.N. peacekeepers to stem the fighting and the start of direct talks between the warring parties. I can’t say for sure what motivated that shift, but it looks like our forecasters split in their response to those developments. While most remained convinced that a mass killing would occur or had already, a few forecasters were apparently more optimistic about the ability of those peacekeepers or talks or both to avert a full-blown mass killing. A few weeks later, it’s still not clear which view is correct, although a forthcoming report from the U.N. Mission in South Sudan may soon shed more light on this question.

I think this set of plots is interesting on its face for what it tells us about the urgent risk of mass atrocities in South Sudan. At the same time, I also hope this exercise demonstrates the potential to extract useful information from an opinion pool beyond a point-estimate forecast. We know from prior and ongoing research that those point estimates can be quite informative in their own right. Still, by looking at the distribution of participants’ forecasts on a particular question, we can glean something about the degree of uncertainty around an event of interest or concern. By looking for changes in that distribution over time, we can also get a more complete picture of how the group’s beliefs evolve in response to new information than a simple line plot of the average forecast could ever tell us. Look for more of this work as our early-warning system comes online, hopefully in the next few months.

UPDATE (7 Feb): At the urging of Trey Causey, I tried making another version of this animation in which the area under the density plot is filled in. I also decided to add a vertical line to show each day’s average forecast, which is what we currently report as the single-best forecast at any given time. Here’s what that looks like, using data from a question on the risk of a mass killing occurring in the Central African Republic before 2015. We closed this question on 19 December 2013, when it became clear through reporting by Human Rights Watch and others that an episode of mass killing had occurred.

kplot2.car.20140207

Background Information

We will consider a mass killing to have occurred when the deliberate actions of state security forces or other armed groups result in the deaths of at least 1,000 noncombatant civilians over a period of one year or less.

  • A noncombatant civilian is any person who is not a current member of a formal or irregular military organization and who does not apparently pose an immediate threat to the life, physical safety, or property of other people.
  • The reference to deliberate actions distinguishes mass killing from deaths caused by natural disasters, infectious diseases, the accidental killing of civilians during war, or the unanticipated consequences of other government policies. Fatalities should be considered intentional if they result from actions designed to compel or coerce civilian populations to change their behavior against their will, as long as the perpetrators could have reasonably expected that these actions would result in widespread death among the affected populations. Note that this definition also covers deaths caused by other state actions, if, in our judgment, perpetrators enacted policies/actions designed to coerce civilian populations and could have expected that these policies/actions would lead to large numbers of civilian fatalities. Examples of such actions include, but are not limited to: mass starvation or disease-related deaths resulting from the intentional confiscation or destruction of food, medicines, or other healthcare supplies; and deaths occurring during forced relocation or forced labor.
  • To distinguish mass killing from large numbers of unrelated civilian fatalities, the victims of mass killing must appear to be perceived by the perpetrators as belonging to a discrete group. That group may be defined communally (e.g., ethnic or religious), politically (e.g., partisan or ideological), socio-economically (e.g., class or professional), or geographically (e.g., residents of specific villages or regions). In this way, apparently unrelated executions by police or other state agents would not qualify as mass killing, but capital punishment directed against members of a specific political or communal group would.

The determination of whether or not a mass killing has occurred will be made by the administrators of this system using publicly available secondary sources and in consultation with subject-matter experts. Relevant evidence will be summarized in a blog post published when the determination is announced, and any dissenting views will be discussed as well.

Will Unarmed Civilians Soon Get Massacred in Ukraine?

According to one pool of forecasters, most probably not.

As part of a public atrocities early-warning system I am currently helping to build for the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide (see here), we are running a kind of always-on forecasting survey called an opinion pool. An opinion pool is similar in spirit to a prediction market, but instead of having participants trade shares tied to the occurrence of some future event, we simply ask participants to estimate the probability of each event’s occurrence. In contrast to a traditional survey, every question remains open until the event occurs or the forecasting window closes. This way, participants can update their forecasts as often as they like, as they see or hear relevant information or just change their minds.

With generous support from Inkling, we started up our opinion pool in October, aiming to test and refine it before our larger early-warning system makes its public debut this spring (we hope). So far, we have only recruited opportunistically among colleagues and professional acquaintances, but we already have more than 70 registered participants. In the first four months of operation, we have used the system to ask more than two dozen questions, two of which have since closed because the relevant events occurred (mass killing in CAR and the Geneva II talks on Syria).

Over the next few years, we aim to recruit a large and diverse pool of volunteer forecasters from around the world with some claim to topical expertise or relevant local knowledge. The larger and more diverse our pool, the more accurate we expect our forecasts to be, and the wider the array of questions we can ask. (If you are interested in participating, please drop me a line at ulfelder <at> gmail <dot> com.)

A few days ago, prompted by a couple of our more active members, I posted a question to our pool asking, “Before 1 March 2014, will any massacres occur in Ukraine?” As of this morning, our pool had made a total of 13 forecasts, and the unweighted average of the latest of those estimates from each participating forecaster was just 15 percent. Under the criteria we specified (see Background Information below), this forecast does not address the risk of large-scale violence against or among armed civilians, nor does it exclude the possibility of a series of small but violent encounters that cumulatively produce a comparable or larger death toll. Still, for those of us concerned that security forces or militias will soon kill nonviolent protesters in Ukraine on a large scale, our initial forecast implies that those fears are probably unwarranted.
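
The aggregation step itself is simple. Here is a minimal sketch in base R, with a made-up forecast log standing in for the real one; the column names are placeholders.

    # One row per forecast: who made it, when, and the estimated probability
    # (in percent). The headline number is the unweighted mean of each
    # forecaster's most recent estimate.
    pool <- data.frame(
      forecaster = c("a", "a", "b", "c"),
      date = as.Date(c("2014-02-01", "2014-02-05", "2014-02-03", "2014-02-04")),
      estimate = c(30, 20, 10, 15)
    )
    latest <- do.call(rbind, lapply(split(pool, pool$forecaster),
                                    function(x) x[x$date == max(x$date), ]))
    mean(latest$estimate)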

Crowd-Estimated Probability of Any Massacres in Ukraine Before 1 March 2014

Obviously, we don’t have a crystal ball, and this is just an aggregation of subjective estimates from a small pool of people, none of whom (I think) is on the scene in Ukraine or has inside knowledge of the decision-making of relevant groups. Still, a growing body of evidence shows that aggregations of subjective forecasts like this one can often be usefully accurate (see here), even with a small number of contributing forecasters (see here). On this particular question, I very much hope our crowd is right. Whatever happens in Ukraine over the next few weeks, though, principle and evidence suggest that the method is sound, and we soon expect to be using this system to help assess risks of mass atrocities all over the world in real time.

Background Information

We define a “massacre” as an event that has the following features:

  • At least 10 noncombatant civilians are killed in one location (e.g., neighborhood, town, or village) in less than 48 hours. A noncombatant civilian is any person who is not a current member of a formal or irregular military organization and who does not apparently pose an immediate threat to the life, physical safety, or property of other people.
  • The victims appear to have been the primary target of the violence that killed them.
  • The victims do not appear to have been engaged in violent action or criminal activity when they were killed, unless that violent action was apparently in self-defense.
  • The relevant killings were carried out by individuals affiliated with a social group or organization engaged in a wider political conflict and appear to be connected to each other and to that wider conflict.

Those features will not always be self-evident or uncontroversial, so we use the following series of ad hoc rules to make more consistent judgments about ambiguous events.

  • Police, soldiers, prison guards, and other agents of state security are never considered noncombatant civilians, even if they are killed while off duty or out of uniform.
  • State officials and bureaucrats are not considered civilians when they are apparently targeted because of their professional status (e.g., assassinated).
  • Civilian deaths that occur in the context of operations by uniformed military-service members against enemy combatants are considered collateral damage, not atrocities, and should be excluded unless there is strong evidence that the civilians were targeted deliberately. We will err on the side of assuming that they were not.
  • Deaths from state repression of civilians engaged in nonviolent forms of protest are considered atrocities. Deaths resulting from state repression targeting civilians who were clearly engaged in rioting, looting, attacks on property, or other forms of collective aggression or violence are not.
  • Non-state militant or paramilitary groups, such as militias, gangs, vigilante groups, or raiding parties, are considered combatants, not civilians.

We will use contextual knowledge to determine whether or not a discrete event is linked to a wider conflict or campaign of violence, and we will err on the side of assuming that it is.

Determinations of whether or not a massacre has occurred will be made by the administrator of this system using publicly available secondary sources. Relevant evidence will be summarized in a blog post published when the determination is announced, and any dissenting views will be discussed as well.

Disclosure

I have argued on this blog that scholars have an obligation to disclose potential conflicts of interest when discussing their research, so let me do that again here: For the past two years, I have been paid as a contractor by the U.S. Holocaust Memorial Museum for my work on the atrocities early-warning system discussed in this post. Since the spring of 2013, I have also been paid to write questions for the Good Judgment Project, in which I participated as a forecaster the year before. To the best of my knowledge, I have no financial interests in, and have never received any payments from, any companies that commercially operate prediction markets or opinion pools.

What the U.S. Intelligence Community Says About Mass Atrocities in 2014

Here’s what Director of National Intelligence James Clapper said about the risk of mass atrocities this year in the Worldwide Threat Assessment he delivered today to the Senate Select Committee on Intelligence:

The overall risk of mass atrocities worldwide will probably increase in 2014 and beyond. Trends driving this increase include more social mobilization, violent conflict, including communal violence, and other forms of instability that spill over borders and exacerbate ethnic and religious tensions; diminished or stagnant quality of governance; and widespread impunity for past abuses. Many countries at risk of mass atrocities will likely be open to influence to prevent or mitigate them. This is because they are dependent on Western assistance or multilateral missions in their countries, have the political will to prevent mass atrocities, or would be responsive to international scrutiny. Overall international will and capability to prevent or mitigate mass atrocities will likely diminish in 2014 and beyond, although support for human rights norms to prevent atrocities will almost certainly deepen among some non-government organizations. Much of the world will almost certainly turn to the United States for leadership to prevent and respond to mass atrocities.

That’s a ton of analysis crammed into a single paragraph, and I suspect a lot of person-hours went into the construction of those six sentences.

However many hours it was, I think the results are largely correct. After two decades of relative quiescence, we’ve seen a troubling rebound in the occurrence of mass atrocities in the past few years, and the systemic forces that seem to be driving that rebound don’t yet show signs of abating.

One point on which I disagree with the IC’s analysis, though, is the claim that “widespread impunity for past abuses” is helping to fuel the upward trend in mass atrocities. I don’t think this assertion is flat-out false; I just think it’s overblown and over-confident. As Mark Kersten argued last week in a blog post on the debate over whether or not the situation in Syria should be referred to the International Criminal Court (ICC),

Any suggestion that international criminal justice should be pursued in the context of ongoing hostilities in Syria leads us to the familiar “peace versus justice” debate. Within this debate, there are broadly two camps: one which views international criminal justice as a necessary and useful tool which can deter crimes, marginalize perpetrators and even be conducive to peace negotiations; and a second camp which sees judicial interventions as deleterious to peace talks and claims that it creates disincentives for warring parties to negotiate and leads to increased levels of violence.

So who’s right? I think Kersten is when he says this:

It remains too rarely conceded that the Court’s effects are mixed and, even more rarely, that they might be negligible. This points to the ongoing need to reimagine how we study and assess the effects of the ICC on ongoing and active conflicts. There is little doubt that the Court can have negative and positive effects on the ability of warring parties and interested actors to transform conflicts and establish peace. But this shouldn’t lead to a belief that the ICC must have these effects across cases. In some instances, the Court may actually have minimal or even inconsequential effects. As importantly, in many if not most cases, the ICC won’t be the be-all and end-all of peace processes. Even when the Court has palpable effects, peace processes aren’t likely to flourish or perish on the hill of international criminal justice.

Finally, I’m not sure what the Threat Assessment’s drafters had in mind when they wrote that “overall international will and capability to prevent or mitigate mass atrocities will likely diminish in 2014 and beyond.” I suspect that statement is a nod in the direction of declinists who worry that a recalcitrant Russia and rising China spell trouble for the supposed Pax Americana, but that’s just a guess.

In any case, I think the assertion is wrong. Syria is the horror that seems to lurk behind this point, and there’s no question that the escalation and spread of that war represents one of the greatest failures of global governance in modern times. Even as the war in Syria continues, though, international forces have mobilized to stem fighting in the Central African Republic and South Sudan, two conflicts that are already terrible but could also get much, much worse. Although the long-term effects of those mobilizations remain unclear, the very fact of their occurrence undercuts the claim that international will and capability to respond to mass atrocities are flagging.

Coup Forecasts for 2014

This year, I’ll start with the forecasts, then describe the process. First, though, a couple of things to bear in mind as you look at the map and dot plot:

  1. Coup attempts rarely occur, so the predicted probabilities are all on the low side, and most are approximately zero. The fact that a country shows up in dark red on the map or ranks high on an ordered list does not mean that we should anticipate a coup occurring there. It just means that country is at relatively high risk compared to the rest of the world. Statistically speaking, the safest bet for any country almost any year is that a coup attempt won’t occur. The point of this exercise is to try to get a better handle on where the few coup attempts we can expect to see this year are most likely to happen.
  2. These forecasts are estimates based on noisy data, so they are highly imprecise, and small differences are not terribly meaningful. The fact that one country lands a few notches higher or lower than another on an ordered list does not imply a significant difference in risk.

Okay, now the forecasts. First, the heat-map version, which sorts the world into fifths. From cross-validation in the historical data, we can expect nearly 80 percent of the countries with coup attempts this year to be somewhere in that top fifth. So, if there are four countries with coup attempts in 2014, three of them are probably in dark red on that map, and the other one is probably dark orange.

forecast.heatmap.2014
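
The binning behind that heat map is just a quantile cut of the averaged predicted probabilities. Here is a minimal sketch in R, with simulated probabilities standing in for the real forecasts.

    # Sort countries into fifths by predicted probability of any coup attempt.
    set.seed(20140101)
    fcast <- data.frame(country = paste0("country", 1:160),
                        p = rbeta(160, 1, 30))   # simulated predicted probabilities
    fcast$tier <- cut(fcast$p,
                      breaks = quantile(fcast$p, probs = seq(0, 1, by = 0.2)),
                      include.lowest = TRUE,
                      labels = c("bottom", "fourth", "third", "second", "top"))
    table(fcast$tier)   # roughly 32 countries per tier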

Now, a dot plot of the Top 40, which is a slightly larger set than the top fifth in the heat map. Here, the gray dots show the forecasts from the two component models (see below), while the red dots are the unweighted average of those two—what I consider the single-best forecast.

forecast.dotplot.2014

A lot of food for thought in there, but I’m going to leave interpretation of these results to future posts and to you.

Now, on the process: As statistical forecasters are wont to do, I have tinkered with the models again this year. As I said in a blogged research note a couple of weeks ago, this year’s tinkering was driven by a combination of data practicalities and the usual sense of, “Hey, let’s see if we can do a little better this time.” Predictably, though, I also ended up doing things a little differently than I’d expected in December. Specifically:

  • I trained and validated the models on an amalgamation of two coup data sets—as described in a November post that showed an animated map of coup attempts worldwide since 1946—instead of just using the Powell and Thyne list. So that map and the bar plots with it should give you a clearer sense of what these forecasts are (and aren’t) trying to anticipate.
  • After waiting for Freedom House to update its Freedom in the World data, which it did a few days ago, I decided to go back to using Polity after all because the forecasts based on it were noticeably more accurate in cross-validation. The models include a categorical measure of regime type based on the Polity scale and a “clock” counting years since the last significant change in that score. I hard-coded updates to those measures, which are much coarser (and therefore easier to update) than the Polity scale or its component variables.
  • As with coup events, I used an amalgamation of GDP growth data from the World Bank and IMF instead of picking one. I also went back to summarizing this feature in the models with a binary indicator for slow growth of less than 2 percent (annual, per capita).
  • Finally, I did not include GDELT summaries in the models because they only slightly improved forecast accuracy, and they did not cover a country of great interest to me (South Sudan). The latter is surely a fixable glitch, but it’s not fixed now, and I really wanted to have a forecast for that particular country in this year’s list for reasons that should now be evident from the results. On the accuracy part, I should note that I’ve only done a little bit of checking, and there are still plenty of ways to try to squeeze more forecasting power out of those data, not the least of them being to build more dynamic models that use monthly instead of annual summaries.

The forecasts are an unweighted average of predicted probabilities from a logistic regression model and a Random Forest that use more or less the same inputs. Both models were trained on data covering the period 1960-2010; applied to data from 2011 to 2013 to assess their predictive performance; and then applied to the newest data to generate forecasts for 2014. Variable selection was based mostly on my prior experience working this problem. As noted above, I did a little bit of model checking—using stratified 10-fold cross-validation—to make sure the process worked reasonably well, and to help choose between some different measures for the same concept. In that cross-validation, the unweighted average got good but not great accuracy scores, with an area under the ROC curve in the low 0.80s. Here are the variables used in the models:

  • Geographic Region. Per the U.S. Department of State (and only in the Random Forest).
  • Last Colonizer. Indicators for former French, British, and Spanish colonies.
  • Country Age. Years since independence, logged.
  • Post-Cold War Period. Indicator marking country-years since 1991, when coup activity has generally slowed.
  • Infant Mortality Rate. Relative to the annual global median, logged, and courtesy of the U.S. Census Bureau. The latest version ends in 2012, so I’ve simply pulled those values forward a year here (see the sketch just after this list for how that rescaling might look in code).
  • Political Regime Type. Four-way categorization based on the Polity scale into autocracies, “anocracies,” democracies, and transitional, collapsed, or occupied cases.
  • Political Stability. Count of years since a significant change in the Polity scale, logged.
  • Political Salience of Elite Ethnicity. Yes or no, per a data set on elite characteristics produced by the Center for Systemic Peace (CSP) for the Political Instability Task Force (PITF), with hard-coded updates for 2013 (no changes). This one is not posted on CSP’s data page and was obtained from PITF and shared with their permission.
  • Violent Civil Conflict. Yes or no, per CSP’s Major Episodes of Political Violence data set (here), with hard-coded updates for 2013 (a few changes).
  • Election Year. Yes-or-no indicator for any national elections—executive, legislative, or constituent assembly—courtesy of the NELDA project, with hard-coded updates for 2012 through 2014 (scheduled).
  • Slow Economic Growth. Yes-or-no indicator for less than 2 percent, as described above.
  • Domestic Coup Activity. Yes-or-no indicator for countries with any attempts in the past 5 years, successful or failed.
  • Regional Coup Activity. A count of other countries in the same region with any coup attempts the previous year, logged.
  • Global Coup Activity. Same as the previous item, but counted across the whole world.

All of the predictors are lagged one year except for region, last colonizer, country age, post-Cold War period, and the election-year indicator. The fact that a variable appears on this list does not necessarily mean that it has a significant effect on the risk of any coup attempts. As I said earlier, I drew up a roster of variables to include based on a sense of what might matter (a.k.a. theory) and on past experience, and I did not try to do much winnowing.
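For readers who would rather see the mechanics in miniature than dig through the full script, here is a rough sketch of that two-model average. The data frame name, the outcome name, and the predictor names below are placeholders I made up for illustration, not the ones used in the posted code.

```r
library(randomForest)

# Rough sketch of the two-model average described above; "coups" and its
# column names are made-up placeholders. Assumes coup.attempt is a two-level
# factor and keeps complete cases only, for simplicity.
coups <- na.omit(coups)
train <- subset(coups, year <= 2010)                 # estimation period
test  <- subset(coups, year >= 2011 & year <= 2013)  # out-of-sample check

f <- coup.attempt ~ reg.type + polity.clock + infant.mortality + slow.growth +
  civil.conflict + election.year + recent.coup + regional.coups + global.coups

fit.logit <- glm(f, data = train, family = binomial)
fit.rf    <- randomForest(f, data = train, ntree = 1000)

# Predicted probabilities from each model, then the unweighted average
p.logit <- predict(fit.logit, newdata = test, type = "response")
p.rf    <- predict(fit.rf, newdata = test, type = "prob")[, 2]
test$p.mean <- (p.logit + p.rf) / 2
```

At the forecasting stage, the same predict() calls get applied to the newest available data instead of the test set.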

If you are interested in exploring the results in more detail or just trying to do this better, you can replicate my analysis using code I’ve put on GitHub (here). The posted script includes a Google Drive link with the requisite data. If you tinker and find something useful, I only ask that you return the favor and let me know. [N.B. As its name implies, the generation of a Random Forest is partially stochastic, so the results will vary slightly each time the process is repeated. If you run the posted script on the posted data, you can expect to see some small differences in the final estimates. I think these small differences are actually a nice representation of the forecasts’ inherent uncertainty, so I have not attempted to eliminate them by, for example, setting the random number seed within the R script.]

For the 2013 version of this post, see here. For 2012, here.

UPDATE: In response to a comment, I tried to produce another version of the heat map that more clearly differentiates the quantiles and better reflects the fact that the predicted probabilities for cases outside the top two fifths are all pretty close to zero. The result is shown below. Here, the differences in the shades of gray represent differences in the average predicted probabilities across the five tiers. You can decide if it’s clearer or not.

[Figure: grayscale heat map of the 2014 coup forecasts by quintile]

A New Statistical Approach to Assessing Risks of State-Led Mass Killing

Which countries around the world are currently at greatest risk of an onset of state-led mass killing? At the start of the year, I posted results from a wiki survey that asked this question. Now, here in heat-map form are the latest results from a rejiggered statistical process with the same target. You can find a dot plot of these data at the bottom of the post, and the data and code used to generate them are on GitHub.

[Map: Estimated Risk of New Episode of State-Led Mass Killing]

These assessments represent the unweighted average of probabilistic forecasts from three separate models trained on country-year data covering the period 1960-2011. In all three models, the outcome of interest is the onset of an episode of state-led mass killing, defined as any episode in which the deliberate actions of state agents or other organizations kill at least 1,000 noncombatant civilians from a discrete group. The three models are:

  • PITF/Harff. A logistic regression model approximating the structural model of genocide/politicide risk developed by Barbara Harff for the Political Instability Task Force (PITF). In its published form, the Harff model only applies to countries already experiencing civil war or adverse regime change and produces a single estimate of the risk of a genocide or politicide occurring at some time during that crisis. To build a version of the model that was more dynamic, I constructed an approximation of the PITF’s global model for forecasting political instability and used the natural log of the predicted probabilities it produces as an additional input to the Harff model. This approach mimics the one used by Harff and Ted Gurr in their ongoing application of the genocide/politicide model for risk assessment (see here).
  • Elite Threat. A logistic regression model that uses the natural log of predicted probabilities from two other logistic regression models—one of civil-war onset, the other of coup attempts—as its only inputs. This model is meant to represent the argument put forth by Matt Krain, Ben Valentino, and others that states usually engage in mass killing in response to threats to ruling elites’ hold on power. (A sketch of this two-stage setup appears right after this list.)
  • Random Forest. A machine-learning technique (see here) applied to all of the variables used in the two previous models, plus a few others of possible relevance, using the ‘randomForest’ package in R. A couple of parameters were tuned on the basis of a gridded comparison of forecast accuracy in 10-fold cross-validation.
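To make the Elite Threat setup concrete, here is a stripped-down sketch of that kind of two-stage logistic regression. The data frame, the outcome, and the first-stage formulas are made-up stand-ins, not the ones in the posted code.

```r
# First stage: separate logits for civil-war onset and coup attempts
# (made-up formulas and column names)
fit.cwar <- glm(civil.war.onset ~ infant.mortality + reg.type + regional.conflict,
                data = d, family = binomial)
fit.coup <- glm(coup.attempt ~ reg.type + slow.growth + recent.coup,
                data = d, family = binomial)

# Second stage: the logged predicted probabilities from the first stage are
# the only inputs to the mass-killing model
d$lp.cwar <- log(predict(fit.cwar, newdata = d, type = "response"))
d$lp.coup <- log(predict(fit.coup, newdata = d, type = "response"))
fit.threat <- glm(mk.onset ~ lp.cwar + lp.coup, data = d, family = binomial)
```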

The Random Forest proved to be the most accurate of the three models in stratified 10-fold cross-validation. The chart below is a kernel density plot of the areas under the ROC curve for the out-of-sample estimates from that cross-validation drill. As the chart shows, the average AUC for the Random Forest was in the low 0.80s, compared with the high 0.70s for the PITF/Harff and Elite Threat models. As expected, the average of the forecasts from all three performed even better than the best single model, albeit not by much. These out-of-sample accuracy rates aren’t mind-blowing, but they aren’t bad either, and they are as good as or better than many of the ones I’ve seen from similar efforts to anticipate the onset of rare political crises in countries worldwide.

[Figure: kernel density plot of out-of-sample AUC scores by model in 10-fold cross-validation]
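For anyone who wants to poke at the validation mechanics, here is a bare-bones sketch of stratified 10-fold cross-validation with a fold-level AUC for a single logistic regression. The data frame, the 0/1 outcome, and the formula f are made-up placeholders; the posted script does the equivalent for all three models and the ensemble.

```r
set.seed(1)  # for the sketch only; the posted script may handle this differently

# Assign folds separately within onsets and non-onsets so every fold keeps
# roughly the same share of the (rare) positive cases
d$fold <- NA
for (y in c(0, 1)) {
  idx <- which(d$mk.onset == y)
  d$fold[idx] <- sample(rep(1:10, length.out = length(idx)))
}

# Rank-based (Mann-Whitney) estimate of the area under the ROC curve
auc <- function(obs, pred) {
  r  <- rank(pred)
  n1 <- sum(obs == 1)
  n0 <- sum(obs == 0)
  (sum(r[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

fold.auc <- sapply(1:10, function(k) {
  fit <- glm(f, data = d[d$fold != k, ], family = binomial)
  p   <- predict(fit, newdata = d[d$fold == k, ], type = "response")
  auc(d$mk.onset[d$fold == k], p)
})

plot(density(fold.auc))  # kernel density of fold-level AUCs, as in the chart above
```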

The decision to use an unweighted average for the combined forecast might seem simplistic, but it’s actually a principled choice in this instance. When examples of the event of interest are hard to come by and we have reason to believe that the process generating those events may be changing over time, sticking with an unweighted average is a reasonable hedge against risks of over-fitting the ensemble to the idiosyncrasies of the test set used to tune it. For a longer discussion of this point, see pp. 7-8 in the last paper I wrote on this work and the paper by Andreas Graefe referenced therein.

Any close readers of my previous work on this topic over the past couple of years (see here and here) will notice that one model has been dropped from the last version of this ensemble, namely, the one proposed by Michael Colaresi and Sabine Carey in their 2008 article, “To Kill or To Protect” (here). As I was reworking my scripts to make regular updating easier (more on that below), I paid closer attention than I had before to the fact that the Colaresi and Carey model requires a measure of the size of state security forces that is missing for many country-years. In previous iterations, I had worked around that problem by using a categorical version of this variable that treated missingness as a separate category, but this time I noticed that there were fewer than 20 mass-killing onsets in country-years for which I had a valid observation of security-force size. With so few examples, we’re not going to get reliable estimates of any pattern connecting the two. As it happened, this model—which, to be fair to its authors, was not designed to be used as a forecasting device—was also by far the least accurate of the lot in 10-fold cross-validation. Putting two and two together, I decided to consign this one to the scrap heap for now. I still believe that measures of military forces could help us assess risks of mass killing, but we’re going to need more and better data to incorporate that idea into our multimodel ensemble.
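The check that flagged the problem is easy to reproduce; a cross-tabulation along these lines, with made-up column names, is all it takes.

```r
# How many mass-killing onsets fall in country-years with a valid observation
# of security-force size? Column names are made-up placeholders.
with(d, table(onset = mk.onset, has.force.size = !is.na(sec.force.size)))
```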

The bigger and in some ways more novel change from previous iterations of this work concerns the unorthodox approach I’m now using to make the risk assessments as current as possible. All of the models used to generate these assessments were trained on country-year data, because that’s the only form in which most of the requisite data is produced. To mimic the eventual forecasting process, the inputs to those models are all lagged one year at the model-estimation stage—so, for example, data on risk factors from 1985 are compared with outcomes in 1986, 1986 inputs to 1987 outcomes, and so on.
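For concreteness, here is a stripped-down version of that one-year lag, applied within countries to a hypothetical country-year data frame with made-up, numeric input names.

```r
# Lag each input one year within its country so that values from year t line
# up with the outcome in year t + 1. Assumes d has country and year columns
# and that the lagged inputs are numeric; the names are placeholders.
lag.vars <- c("infant.mortality", "civil.conflict", "gdp.growth")
d <- d[order(d$country, d$year), ]
for (v in lag.vars) {
  d[[paste0(v, ".lag1")]] <- ave(d[[v]], d$country,
                                 FUN = function(x) c(NA, head(x, -1)))
}
```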

If we stick rigidly to that structure at the forecasting stage, then I need data from 2013 to produce 2014 forecasts. Unfortunately, many of the sources for the measures used in these models won’t publish their 2013 data for at least a few more months. Faced with this problem, I could do something like what I aim to do with the coup forecasts I’ll be producing in the next few days—that is, only use data from sources that quickly and reliably update soon after the start of each year. Unfortunately again, though, the only way to do that would be to omit many of the variables most specific to the risk of mass atrocities—things like the occurrence of violent civil conflict or the political salience of elite ethnicity.

So now I’m trying something different. Instead of waiting until every last input has been updated for the previous year and they all neatly align in my rectangular data set, I am simply applying my algorithms to the most recent available observation of each input. It took some trial and error to write, but I now have an R script that automates this process at the country level by pulling the time series for each variable, omitting the missing values, reversing the series order, snipping off the observation now at the front of the reversed series (i.e., the most recent non-missing value), collecting those snippets in a new vector, and running that vector through the previously estimated model objects to get a forecast (see the section of this script starting at line 284).
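In case that verbal description is hard to follow, here is a stripped-down sketch of the idea, assuming a long country-year data frame called dat, made-up input names, and at least one non-missing value of each input per country; the posted script remains the authoritative version.

```r
# For each country, take the most recent non-missing value of every input and
# assemble a one-row data frame to feed to the previously estimated models.
input.vars <- c("infant.mortality", "polity.cat", "civil.conflict", "gdp.growth")

latest.row <- function(df, vars) {
  df <- df[order(df$year), ]                                # chronological order
  as.data.frame(lapply(df[vars], function(x) tail(x[!is.na(x)], 1)))
}

newest <- do.call(rbind, lapply(split(dat, dat$country), latest.row, vars = input.vars))

# Run the snapshot through a previously estimated model object
# (fit.logit is a made-up name for one of the fitted models)
newest$p.hat <- predict(fit.logit, newdata = newest, type = "response")
```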

One implicit goal of this approach is to make it easier to jump to batch processing, where the forecasting engine routinely and automatically pings the data sources online and updates whenever any of the requisite inputs has changed. So, for example, when in a few months the vaunted Polity IV Project releases its 2013 update, my forecasting contraption would catch and ingest the new version and the forecasts would change accordingly. I now have scripts that can do the statistical part but am going to be leaning on other folks to automate the wider routine as part of the early-warning system I’m helping build for the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide.

The big upside of this opportunistic approach to updating is that the risk assessments are always as current as possible, conditional on the limitations of the available data. The way I figure, when you don’t have information that’s as fresh as you’d like, use the freshest information you’ve got.

The downside of this approach is that it’s not clear exactly what the outputs from that process represent. Technically, a forecast is a probabilistic statement about the likelihood of a specific event during a specific time period. The outputs from this process are still probabilistic statements about the likelihood of a specific event, but they are no longer anchored to a specific time period. The probabilities mapped at the top of this post mostly use data from 2012, but the inputs for some variables for some cases are a little older, while the inputs for some of the dynamic variables (e.g., GDP growth rates and coup attempts) are essentially current. So are those outputs forecasts for 2013, or for 2014, or something else?

For now, I’m going with “something else” and am thinking of the outputs from this machinery as the most up-to-date statistical risk assessments I can produce, but not forecasts as such. That description will probably sound like fudging to most statisticians, but it’s meant to be an honest reflection of both the strengths and limitations of the underlying approach.

To any gearheads who’ve read this far: I’d really appreciate hearing your thoughts on this strategy and any ideas you might have on other ways to resolve this conundrum, or on any other aspect of this forecasting process. As noted at the top, the data and code used to produce these estimates are posted online. This work is part of a soon-to-launch, public early-warning system, so we hope and expect that these assessments will have some effect on policy and advocacy planning processes. Given that aim, it behooves us to do whatever we can to make them as accurate as possible, so I would very much welcome any suggestions on how to do or describe this better.

Finally and as promised, here is a dot plot of the estimates mapped above. Countries are shown in descending order by estimated risk. The gray dots mark the forecasts from the three component models, and the red dot marks the unweighted average.

[Figure: dot plot of estimated risks by country, with component-model forecasts in gray and the unweighted average in red]

PS. In preparation for a presentation on this work at an upcoming workshop, I made a new map of the current assessments that works better, I think, than the one at the top of this post. Instead of coloring by quintiles, this new version (below) groups cases into several bins that roughly represent doublings of risk: less than 1%, 1-2%, 2-4%, 4-8%, and 8-16%. This version more accurately shows that the vast majority of countries are at extremely low risk and more clearly shows variations in risk among the ones that are not.
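The binning itself is a one-liner; here is a sketch, assuming a vector of averaged predicted probabilities called p.mean and no estimate above 16 percent.

```r
# Bins that roughly double at each step: <1%, 1-2%, 2-4%, 4-8%, 8-16%
risk.bin <- cut(p.mean,
                breaks = c(0, 0.01, 0.02, 0.04, 0.08, 0.16),
                labels = c("<1%", "1-2%", "2-4%", "4-8%", "8-16%"),
                include.lowest = TRUE)
table(risk.bin)
```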

[Map: Estimated Risk of New State-Led Mass Killing, binned by approximate doublings of risk]
