Introducing A New Venue for Atrocities Early Warning

Starting today, the bits of this blog on forecasting and monitoring mass atrocities are moving to their proper home, or at least the initial makings of it. Say hi to the (interim) blog of the Early Warning Project.

Since 2012, I have been working as a consultant to the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide (CPG) to help build a new global early-warning system for mass atrocities. As usual, that process is taking longer than we had expected. We now have working versions of the project’s two main forecasting streams—statistical risk assessments and a “wisdom of (expert) crowds” system called an opinion pool—and CPG has hired a full-time staffer (hi, Ali) to manage their day-to-day workings. Unfortunately, though, the web site that will present, discuss, and invite discussion of those forecasts is still under construction. Thanks to Dartmouth’s DALI Lab, we’ve got a great prototype, but there’s finishing work to be done, and doing it takes a while.

Well, delays be damned. We think the content we're producing is useful now, so we're not waiting for that site to get finished to start sharing it. Instead, we're launching this interim blog to get that sharing started.

When the project’s full-blown web site finally goes up, it will feature a blog, too, and all of the content from this interim venue will migrate there. Until then, if you’re interested in atrocities early warning and prevention—or applied forecasting more generally—please come see what we’re doing, share what you find interesting, and help us think about how to do it even better.

Meanwhile, Dart-Throwing Chimp will keep plugging along on its core themes of democratization, political instability, and forecasting. If you’ve got the interest and the bandwidth, I hope you’ll find time to watch and engage with both channels.

Early Results from a New Atrocities Early Warning System

For the past couple of years, I have been working as a consultant to the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide to help build a new early-warning system for mass atrocities around the world. Six months ago, we started running the second of our two major forecasting streams, a “wisdom of (expert) crowds” platform that aggregates probabilistic forecasts from a pool of topical and area experts on potential events of concern. (See this conference paper for more detail.)

The chart below summarizes the output from that platform on most of the questions we’ve asked so far about potential new episodes of mass killing before 2015. For our early-warning system, we define a mass killing as an episode of sustained violence in which at least 1,000 noncombatant civilians from a discrete group are intentionally killed, usually in a period of a year or less. Each line in the chart shows change over time in the daily average of the inputs from all of the participants who choose to make a forecast on that question. In other words, the line is a mathematical summary of the wisdom of our assembled crowd—now numbering nearly 100—on the risk of a mass killing beginning in each case before the end of 2014. Also:

  • Some of the lines (e.g., South Sudan, Iraq, Pakistan) start further to the right than others because we did not ask about those cases when the system launched but instead added them later, as we continue to do.
  • Two lines—Central African Republic and South Sudan—end early because we saw onsets of mass-killing episodes in those countries. The asterisks indicate the dates on which we made those declarations and therefore closed the relevant questions.
  • Most but not all of these questions ask specifically about state-led mass killings, and some focus on specific target groups (e.g., the Rohingya in Burma) or geographic regions (the North Caucasus in Russia) as indicated.
Crowd-Estimated Probabilities of Mass-Killing Onset Before 1 January 2015

I look at that chart and conclude that this process is working reasonably well so far. In the six months since we started running this system, the two countries that have seen onsets of mass killing are both ones that our forecasters promptly and consistently put on the high side of 50 percent. Nearly all of the other cases, where mass killings haven’t yet occurred this year, have stuck on the low end of the scale.

I’m also gratified to see that the system is already generating the kind of dynamic output we’d hoped it would, even with fewer than 100 forecasters in the pool. In the past several weeks, the forecasts for both Burma and Iraq have risen sharply, apparently in response to shifts in relevant policies in the former and the escalation of the civil war in the latter. Meanwhile, the forecast for Uighurs in China has risen steadily over the year as a separatist rebellion in Xinjiang Province has escalated and, with it, concerns about a harsh government response. These inflection points and trends can help identify changes in risk that warrant attention from organizations and individuals concerned about preventing or mitigating these potential atrocities.

Finally, I’m also intrigued to see that our opinion pool seems to be sorting cases into a few clusters that could be construed as distinct tiers of concern. Here’s what I have in mind:

  • Above the 50-percent threshold are the high risk cases, where forecasters assess that mass killing is likely to occur during the specified time frame. These cases won't necessarily be surprising. Some observers had been warning about the risk of mass atrocities in CAR and South Sudan for months before those episodes began, and the plight of the Rohingya in Burma has been a focal point for many advocacy groups in the past year. Even in supposedly "obvious" cases, however, this system can help by providing a sharper estimate of that risk and giving a sense of how it is trending over time. In the case of Burma, for example, it is the sharp upward separation of the past several weeks that marks the shift from possible to likely and thus adds a degree of urgency to that warning.
  • A little farther down the y-axis are the moderate risk cases—ones that probably won’t suffer mass killing during the period in question but could more readily tip in that direction. In the chart above, Iraq, Sudan, Pakistan, Bangladesh, and Burundi all land in this tier, although Iraq now appears to be sliding into the high risk group.
  • Clustered toward the bottom are the low risk cases, where the forecasters seem fairly confident that mass killing will not occur in the near future. In the chart above, Russia, Afghanistan, and Ethiopia land firmly in this set. China (Uighurs) remains closer to this group than to the moderate risk tier, but it appears to be creeping upward. We are also running a question about the risk of state-led mass killing in Rwanda before 2015, and it currently lands in this tier, with a forecast of 14 percent.

The system that generates the data behind this chart is password protected, but the point of our project is to make these kinds of forecasts freely available to the global public. We are currently building the web site that will display the forecasts from this opinion pool in real time to all comers and hope to have it ready this fall.

In the meantime, if you think you have relevant knowledge or expertise—maybe you study or work on this topic, or maybe you live or work in parts of the world where risks tend to be higher—and are interested in volunteering as a forecaster, please send an email to us at ewp@ushmm.org.

How Circumspect Should Quantitative Forecasters Be?

Yesterday, I participated in a panel discussion on the use of technology to prevent and document mass atrocities as part of an event at American University’s Washington College of Law to commemorate the Rwandan genocide.* In my prepared remarks, I talked about the atrocities early-warning system I’m helping build for the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide. The chief outputs of that system are probabilistic forecasts, some from statistical models and others from a “wisdom of (expert) crowds” system called an opinion pool.

After I’d described that project, one of the other panelists, Patrick Ball, executive director of Human Rights Data Analysis Group, had this to say via Google Hangout:

As someone who uses machine learning to build statistical models—that’s what I do all day long, that’s my job—I’m very skeptical that models about conflict, about highly rare events that have very complicated and situation-unique antecedents are forecastable. I worry about early warning because when we build models we listen to people less. I know that, from my work with the U.N., when we have a room full of people who know an awful lot about what’s going on on the ground, a graph—when someone puts a graph on the table, everybody stops thinking. They just look at the graph. And that worries me a lot.

In 1994, human-rights experts warned the world about what was happening [in Rwanda]. No one listened. So as we, as technologists and people who like technology, when we ask questions of data, we have to make sure that if anybody is going to listen to us, we’d better be giving them the right answers.

Maybe I was being vain, but I heard that part of Patrick’s remarks as a rebuke of our early-warning project and pretty much every other algorithm-driven atrocities and conflict forecasting endeavor out there. I responded by acknowledging that our forecasts are far from perfect, but I also asserted that we have reason to believe they will usually be at least marginally better than the status quo, so they’re worth doing and sharing anyway.

A few minutes later, Patrick came back with this:

When we build technology for human rights, I think we need to be somewhat thoughtful about how our less technical colleagues are going to hear the things that we say. In a lot of meetings over a lot of years, I’ve listened to very sophisticated, thoughtful legal, qualitative, ethnographic arguments about very specific events occurring on the ground. But almost inevitably, when someone proposes some kind of quantitative analysis, all that thoughtful reasoning escapes the room… The practical effect of introducing any kind of quantitative argument is that it displaces the other arguments that are on the table. And we are naive to think otherwise.

What that means is that the stakes for getting these kinds of claims right are very high. If we make quantitative claims and we’re wrong—because our sampling foundations are weak, because our model is inappropriate, because we misinterpreted the error around our claim, or for any other reason—we can do a lot of harm.

From that combination of uncertainty and the possibility for harm, Patrick concludes that quantitative forecasters have a special responsibility to be circumspect in the presentation of their work:

I propose that one of the foundations of any kind of quantitative claims-making is that we need to have very strict validation before we propose a conclusion to be used by our broader community. There are all kinds of rules about validation in model-building. We know a lot about it. We have a lot of contexts in which we have ground truth. We have a lot of historical detail. Some of that historical detail is itself beset by these sampling problems, but we have opportunities to do validation. And I think that any argument, any claim that we make—especially to non-technical audiences—should lead with that validation rather than leaving it to the technical detail. By avoiding discussing the technical problems in front of non-technical audiences, we’re hiding stuff that might not be working. So I warn us all to be much stricter.

Patrick has applied statistical methods to human-rights matters for a long time, and his combined understanding of the statistics and the advocacy issues is as good as, if not better than, almost anyone else's. Still, what he described about how people respond to quantitative arguments is pretty much the exact opposite of my experience over 15 years of working on statistical forecasts of various forms of political violence and change. Many of the audiences to which I've presented that work have been deeply skeptical of efforts to forecast political behavior. Like Patrick, many listeners have asserted that politics is fundamentally unquantifiable and unpredictable. Statistical forecasts in particular are often derided for connoting a level of precision that's impossible to achieve and for being too far removed from the messy reality of specific places to produce useful information. Even in cases where we can demonstrate that the models are pretty good at distinguishing high-risk cases from low-risk ones, that evidence usually fails to persuade many listeners, who appear to reject the work on principle.

I hear loud echoes of my experiences in Daniel Kahneman's discussion of clinical psychologists' hostility to algorithms and their enduring prejudice in favor of clinical judgment, even in situations where the former is demonstrably superior to the latter. On p. 228 of Thinking, Fast and Slow, Kahneman observes that this prejudice "is an attitude we can all recognize."

When a human competes with a machine, whether it is John Henry a-hammerin’ on the mountain or the chess genius Garry Kasparov facing off against the computer Deep Blue, our sympathies lie with our fellow human. The aversion to algorithms making decisions that affect humans is rooted in the strong preference that many people have for the natural over the synthetic or artificial.

Kahneman further reports that

The prejudice against algorithms is magnified when the decisions are consequential. [Psychologist Paul] Meehl remarked, ‘I do not quite know how to alleviate the horror some clinicians seem to experience when they envisage a treatable case being denied treatment because a ‘blind, mechanical’ equation misclassifies him.’ In contrast, Meehl and other proponents of algorithms have argued strongly that it is unethical to rely on intuitive judgments for important decisions if an algorithm is available that will make fewer mistakes. Their rational argument is compelling, but it runs against a stubborn psychological reality: for most people, the cause of a mistake matters. The story of a child dying because an algorithm made a mistake is more poignant than the story of the same tragedy occurring as a result of human error, and the difference in emotional intensity is readily translated into a moral preference.

If our distaste for algorithms is more emotional than rational, then why do forecasters who use them have a special obligation, as Patrick asserts, to lead presentations of their work with a discussion of the "technical problems" when experts offering intuitive judgments almost never do? I'm uncomfortable with that requirement, because I think it unfairly handicaps algorithmic forecasts in what is, frankly, a competition for attention against approaches that are often demonstrably less reliable but also have real-world consequences. This isn't a choice between action and inaction; it's a trolley problem. Plenty of harm is already happening on the current track, and better forecasts could help reduce that harm. Under these circumstances, I think we behave ethically when we encourage the use of our forecasts in honest but persuasive ways.

If we could choose between forecasting and not forecasting, then I would be happier to set a high bar for predictive claims-making and let the validation to which Patrick alluded determine whether or not we’re going to try forecasting at all. Unfortunately, that’s not the world we inhabit. Instead, we live in a world in which governments and other organizations are constantly making plans, and those plans incorporate beliefs about future states of the world.

Conventionally, those beliefs are heavily influenced by the judgments of a small number of experts elicited in unstructured ways. That approach probably works fine in some fields, but geopolitics is not one of them. In this arena, statistical models and carefully designed procedures for eliciting and combining expert judgments will also produce forecasts that are uncertain and imperfect, but those algorithm-driven forecasts will usually be more accurate than the conventional approach of querying one or a few experts and blending their views in our heads (see here and here for some relevant evidence).

We also know that most of those subject-matter experts don’t abide by the rules Patrick proposes for quantitative forecasters. Anyone who’s ever watched cable news or read an op-ed—or, for that matter, attended a panel discussion—knows that experts often convey their judgments with little or no discussion of their cognitive biases and sources of uncertainty.

As it happens, that confidence is persuasive. As Kahneman writes (p. 263),

Experts who acknowledge the full extent of their ignorance may expect to be replaced by more confident competitors who are better able to gain the trust of clients. An unbiased appreciation of uncertainty is a cornerstone of rationality—but it is not what people and organizations want. Extreme uncertainty is paralyzing under dangerous circumstances, and the admission that one is merely guessing is especially unacceptable when the stakes are high. Acting on pretended knowledge is often the preferred solution.

The allure of confidence is dysfunctional in many analytic contexts, but it’s also not something we can wish away. And if confidence often trumps content, then I think we do our work and our audiences a disservice when we hem and haw about the validity of our forecasts as long as the other guys don’t. Instead, I believe we are behaving ethically when we present imperfect but carefully derived forecasts in a confident manner. We should be transparent about the limitations of the data and methods, and we should assess the accuracy of our forecasts and share what we learn. Until we all agree to play by the same rules, though, I don’t think quantitative forecasters have a special obligation to lead with the limitations of their work, thus conceding a persuasive advantage to intuitive forecasters who will fill that space and whose prognostications we can expect to be less reliable than ours.

* You can replay a webcast of that event here. Our panel runs from 1:00:00 to 2:47:00.

Watch Experts’ Beliefs Evolve Over Time

On 15 December 2013, “something” happened in South Sudan that quickly began to spiral into a wider conflict. Prior research tells us that mass killings often occur on the heels of coup attempts and during civil wars, and at the time South Sudan ranked among the world’s countries at greatest risk of state-led mass killing.

Motivated by these two facts, I promptly added a question about South Sudan to the opinion pool we’re running as part of a new atrocities early-warning system for the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide (see this recent post for more on that). As it happened, we already had one question running about the possibility of a state-led mass killing in South Sudan targeting the Murle, but the spiraling conflict clearly implied a host of other risks. Posted on 18 December 2013, the new question asked, “Before 1 January 2015, will an episode of mass killing occur in South Sudan?”

The criteria we gave our forecasters to understand what we mean by "mass killing" and how we would decide if one has happened appear under the Background Information header at the bottom of this post. Now, shown below is an animated sequence of kernel density plots of each day's forecasts from all participants who'd chosen to answer this question. A kernel density plot is like a histogram, but with some nonparametric estimation thrown in to try to get at the distribution of a variable's "true" values from the sample of observations we've got. If that sounds like gibberish to you, just think of the peaks in the plots as clumps of experts who share similar beliefs about the likelihood of mass killing in South Sudan. The taller the peak, the bigger the clump. The farther right the peak, the more likely that clump thinks a mass killing is.
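For readers who want to see the mechanics, here is a minimal sketch in R of how one frame of that animation can be built; the data and the 'probability' column are hypothetical stand-ins for our platform's actual export.

```r
# A minimal sketch of one frame of the animation: a kernel density plot of a
# single day's forecasts. The data frame and its 'probability' column (on a
# 0-100 scale) are hypothetical stand-ins for the opinion pool's export.
forecasts <- data.frame(probability = c(55, 60, 70, 75, 80, 85, 85, 90, 95))

d <- density(forecasts$probability, from = 0, to = 100)  # nonparametric density estimate
plot(d, main = "Crowd forecasts on the South Sudan question",
     xlab = "Estimated probability of mass killing (%)", ylab = "Density")
abline(v = mean(forecasts$probability), lty = 2)  # dashed line at the day's average forecast
```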

[Animation: kernel density plots of the crowd's daily forecasts on the South Sudan mass-killing question]

I see a couple of interesting patterns in those plots. The first is the rapid rightward shift in the distribution's center of gravity. As the fighting escalated and reports of atrocities began to trickle in (see here for one much-discussed article from the time), many of our forecasters quickly became convinced that a mass killing would occur in South Sudan in the coming year, if one wasn't occurring already. On 23 December—the date the aforementioned article appeared—the average forecast jumped to approximately 80 percent, and it hasn't fallen below that level since.

The second pattern that catches my eye is the appearance in January of a long, thin tail in the distribution that reaches into the lower ranges. That shift in the shape of the distribution coincides with stepped-up efforts by U.N. peacekeepers to stem the fighting and the start of direct talks between the warring parties. I can’t say for sure what motivated that shift, but it looks like our forecasters split in their response to those developments. While most remained convinced that a mass killing would occur or had already, a few forecasters were apparently more optimistic about the ability of those peacekeepers or talks or both to avert a full-blown mass killing. A few weeks later, it’s still not clear which view is correct, although a forthcoming report from the U.N. Mission in South Sudan may soon shed more light on this question.

I think this set of plots is interesting on its face for what it tells us about the urgent risk of mass atrocities in South Sudan. At the same time, I also hope this exercise demonstrates the potential to extract useful information from an opinion pool beyond a point-estimate forecast. We know from prior and ongoing research that those point estimates can be quite informative in their own right. Still, by looking at the distribution of participants' forecasts on a particular question, we can glean something about the degree of uncertainty around an event of interest or concern. By looking for changes in that distribution over time, we can also get a more complete picture of how the group's beliefs evolve in response to new information than a simple line plot of the average forecast could ever tell us. Look for more of this work as our early-warning system comes online, hopefully in the next few months.

UPDATE (7 Feb): At the urging of Trey Causey, I tried making another version of this animation in which the area under the density plot is filled in. I also decided to add a vertical line to show each day's average forecast, which is what we currently report as the single-best forecast at any given time. Here's what that looks like, using data from a question on the risk of a mass killing occurring in the Central African Republic before 2015. We closed this question on 19 December 2013, when it became clear through reporting by Human Rights Watch and others that an episode of mass killing had occurred.

[Animation: filled kernel density plots, with a vertical line marking each day's average forecast, for the Central African Republic question]

Background Information

We will consider a mass killing to have occurred when the deliberate actions of state security forces or other armed groups result in the deaths of at least 1,000 noncombatant civilians over a period of one year or less.

  • A noncombatant civilian is any person who is not a current member of a formal or irregular military organization and who does not apparently pose an immediate threat to the life, physical safety, or property of other people.
  • The reference to deliberate actions distinguishes mass killing from deaths caused by natural disasters, infectious diseases, the accidental killing of civilians during war, or the unanticipated consequences of other government policies. Fatalities should be considered intentional if they result from actions designed to compel or coerce civilian populations to change their behavior against their will, as long as the perpetrators could have reasonably expected that these actions would result in widespread death among the affected populations. Note that this definition also covers deaths caused by other state actions, if, in our judgment, perpetrators enacted policies/actions designed to coerce civilian populations and could have expected that these policies/actions would lead to large numbers of civilian fatalities. Examples of such actions include, but are not limited to: mass starvation or disease-related deaths resulting from the intentional confiscation, destruction, or withholding of food, medicines, or other healthcare supplies; and deaths occurring during forced relocation or forced labor.
  • To distinguish mass killing from large numbers of unrelated civilian fatalities, the victims of mass killing must appear to be perceived by the perpetrators as belonging to a discrete group. That group may be defined communally (e.g., ethnic or religious), politically (e.g., partisan or ideological), socio-economically (e.g., class or professional), or geographically (e.g., residents of specific villages or regions). In this way, apparently unrelated executions by police or other state agents would not qualify as mass killing, but capital punishment directed against members of a specific political or communal group would.

The determination of whether or not a mass killing has occurred will be made by the administrators of this system using publicly available secondary sources and in consultation with subject-matter experts. Relevant evidence will be summarized in a blog post published when the determination is announced, and any dissenting views will be discussed as well.

Will Unarmed Civilians Soon Get Massacred in Ukraine?

According to one pool of forecasters, most probably not.

As part of a public atrocities early-warning system I am currently helping to build for the U.S. Holocaust Memorial Museum's Center for the Prevention of Genocide (see here), we are running a kind of always-on forecasting survey called an opinion pool. An opinion pool is similar in spirit to a prediction market, but instead of having participants trade shares tied to the occurrence of some future event, we simply ask participants to estimate the probability of each event's occurrence. In contrast to a traditional survey, every question remains open until the event occurs or the forecasting window closes. This way, participants can update their forecasts as often as they like, as they see or hear relevant information or just change their minds.

With generous support from Inkling, we started up our opinion pool in October, aiming to test and refine it before our larger early-warning system makes its public debut this spring (we hope). So far, we have only recruited opportunistically among colleagues and professional acquaintances, but we already have more than 70 registered participants. In the first four months of operation, we have used the system to ask more than two dozen questions, two of which have since closed because the relevant events occurred (mass killing in CAR and the Geneva II talks on Syria).

Over the next few years, we aim to recruit a large and diverse pool of volunteer forecasters from around the world with some claim to topical expertise or relevant local knowledge. The larger and more diverse our pool, the more accurate we expect our forecasts to be, and the wider the array of questions we can ask. (If you are interested in participating, please drop me a line at ulfelder <at> gmail <dot> com.)

A few days ago, prompted by a couple of our more active members, I posted a question to our pool asking, "Before 1 March 2014, will any massacres occur in Ukraine?" As of this morning, our pool had made a total of 13 forecasts, and the unweighted average of the latest of those estimates from each participating forecaster was just 15 percent. Under the criteria we specified (see Background Information below), this forecast does not address the risk of large-scale violence against or among armed civilians, nor does it exclude the possibility of a series of small but violent encounters that cumulatively produce a comparable or larger death toll. Still, for those of us concerned that security forces or militias will soon kill nonviolent protesters in Ukraine on a large scale, our initial forecast implies that those fears are probably unwarranted.
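For the curious, here is a minimal sketch in R of that aggregation rule; the column names are hypothetical and not our platform's actual export format.

```r
# A minimal sketch of the aggregation described above: keep each forecaster's most
# recent estimate on the question, then take the unweighted average. Column names
# ('forecaster', 'timestamp', 'probability') are hypothetical.
votes <- data.frame(
  forecaster  = c("a", "a", "b", "c"),
  timestamp   = as.Date(c("2014-02-01", "2014-02-08", "2014-02-05", "2014-02-07")),
  probability = c(0.30, 0.20, 0.10, 0.15)
)

votes  <- votes[order(votes$forecaster, votes$timestamp), ]
latest <- votes[!duplicated(votes$forecaster, fromLast = TRUE), ]  # last estimate per forecaster
mean(latest$probability)  # unweighted crowd forecast: here, 0.15
```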

Crowd-Estimated Probability of Any Massacres in Ukraine Before 1 March 2014

Obviously, we don’t have a crystal ball, and this is just an aggregation of subjective estimates from a small pool of people, none of whom (I think) is on the scene in Ukraine or has inside knowledge of the decision-making of relevant groups. Still, a growing body of evidence shows that aggregations of subjective forecasts like this one can often be usefully accurate (see here), even with a small number of contributing forecasters (see here). On this particular question, I very much hope our crowd is right. Whatever happens in Ukraine over the next few weeks, though, principle and evidence suggest that the method is sound, and we soon expect to be using this system to help assess risks of mass atrocities all over the world in real time.

Background Information

We define a “massacre” as an event that has the following features:

  • At least 10 noncombatant civilians are killed in one location (e.g., neighborhood, town, or village) in less than 48 hours. A noncombatant civilian is any person who is not a current member of a formal or irregular military organization and who does not apparently pose an immediate threat to the life, physical safety, or property of other people.
  • The victims appear to have been the primary target of the violence that killed them.
  • The victims do not appear to have been engaged in violent action or criminal activity when they were killed, unless that violent action was apparently in self-defense.
  • The relevant killings were carried out by individuals affiliated with a social group or organization engaged in a wider political conflict and appear to be connected to each other and to that wider conflict.

Those features will not always be self-evident or uncontroversial, so we use the following series of ad hoc rules to make more consistent judgments about ambiguous events.

  • Police, soldiers, prison guards, and other agents of state security are never considered noncombatant civilians, even if they are killed while off duty or out of uniform.
  • State officials and bureaucrats are not considered civilians when they are apparently targeted because of their professional status (e.g., assassinated).
  • Civilian deaths that occur in the context of operations by uniformed military-service members against enemy combatants are considered collateral damage, not atrocities, and should be excluded unless there is strong evidence that the civilians were targeted deliberately. We will err on the side of assuming that they were not.
  • Deaths from state repression of civilians engaged in nonviolent forms of protest are considered atrocities. Deaths resulting from state repression targeting civilians who were clearly engaged in rioting, looting, attacks on property, or other forms of collective aggression or violence are not.
  • Non-state militant or paramilitary groups, such as militias, gangs, vigilante groups, or raiding parties, are considered combatants, not civilians.

We will use contextual knowledge to determine whether or not a discrete event is linked to a wider conflict or campaign of violence, and we will err on the side of assuming that it is.

Determinations of whether or not a massacre has occurred will be made by the administrator of this system using publicly available secondary sources. Relevant evidence will be summarized in a blog post published when the determination is announced, and any dissenting views will be discussed as well.

Disclosure

I have argued on this blog that scholars have an obligation to disclose potential conflicts of interest when discussing their research, so let me do that again here: For the past two years, I have been paid as a contractor by the U.S. Holocaust Memorial Museum for my work on the atrocities early-warning system discussed in this post. Since the spring of 2013, I have also been paid to write questions for the Good Judgment Project, in which I participated as a forecaster the year before. To the best of my knowledge, I have no financial interests in, and have never received any payments from, any companies that commercially operate prediction markets or opinion pools.

A New Statistical Approach to Assessing Risks of State-Led Mass Killing

Which countries around the world are currently at greatest risk of an onset of state-led mass killing? At the start of the year, I posted results from a wiki survey that asked this question. Now, here in heat-map form are the latest results from a rejiggered statistical process with the same target. You can find a dot plot of these data at the bottom of the post, and the data and code used to generate them are on GitHub.

Estimated Risk of New Episode of State-Led Mass Killing

These assessments represent the unweighted average of probabilistic forecasts from three separate models trained on country-year data covering the period 1960-2011. In all three models, the outcome of interest is the onset of an episode of state-led mass killing, defined as any episode in which the deliberate actions of state agents or other organizations kill at least 1,000 noncombatant civilians from a discrete group. The three models are:

  • PITF/Harff. A logistic regression model approximating the structural model of genocide/politicide risk developed by Barbara Harff for the Political Instability Task Force (PITF). In its published form, the Harff model only applies to countries already experiencing civil war or adverse regime change and produces a single estimate of the risk of a genocide or politicide occurring at some time during that crisis. To build a version of the model that was more dynamic, I constructed an approximation of the PITF’s global model for forecasting political instability and use the natural log of the predicted probabilities it produces as an additional input to the Harff model. This approach mimics the one used by Harff and Ted Gurr in their ongoing application of the genocide/politicide model for risk assessment (see here).
  • Elite Threat. A logistic regression model that uses the natural log of predicted probabilities from two other logistic regression models—one of civil-war onset, the other of coup attempts—as its only inputs. This model is meant to represent the argument put forth by Matt Krain, Ben Valentino, and others that states usually engage in mass killing in response to threats to ruling elites’ hold on power.
  • Random Forest. A machine-learning technique (see here) applied to all of the variables used in the two previous models, plus a few others of possible relevance, using the 'randomForest' package in R. A couple of parameters were tuned on the basis of a gridded comparison of forecast accuracy in 10-fold cross-validation. (A minimal sketch of how these three forecasts are combined appears after this list.)
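To make the ensemble arithmetic explicit, here is a minimal sketch in R that mimics the combination step with simulated data and stand-in models; the real scripts and data are in the GitHub repository linked above.

```r
# A minimal sketch of the combination step: fit three models, generate a predicted
# probability from each, and average them with equal weights. The data and model
# specifications are simulated stand-ins, not the real country-year inputs.
library(randomForest)

set.seed(1)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$onset <- factor(rbinom(300, 1, plogis(-2 + d$x1 + 0.5 * d$x2)))

m1 <- glm(onset ~ x1, data = d, family = binomial)   # stand-in for PITF/Harff
m2 <- glm(onset ~ x2, data = d, family = binomial)   # stand-in for Elite Threat
m3 <- randomForest(onset ~ x1 + x2, data = d)        # stand-in for Random Forest

p1 <- predict(m1, newdata = d, type = "response")
p2 <- predict(m2, newdata = d, type = "response")
p3 <- predict(m3, newdata = d, type = "prob")[, "1"]

d$ensemble <- (p1 + p2 + p3) / 3  # unweighted average of the three forecasts
```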

The Random Forest proved to be the most accurate of the three models in stratified 10-fold cross-validation. The chart below is a kernel density plot of the areas under the ROC curve for the out-of-sample estimates from that cross-validation drill. As the chart shows, the average AUC for the Random Forest was in the low 0.80s, compared with the high 0.70s for the PITF/Harff and Elite Threat models. As expected, the average of the forecasts from all three performed even better than the best single model, albeit not by much. These out-of-sample accuracy rates aren’t mind blowing, but they aren’t bad either, and they are as good or better than many of the ones I’ve seen from similar efforts to anticipate the onset of rare political crises in countries worldwide.
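Here is a minimal sketch in R of that validation drill, using simulated data and a single stand-in model; the real version, applied to the country-year data and all three models, lives in the scripts on GitHub.

```r
# A minimal sketch of stratified 10-fold cross-validation with a rank-based AUC
# computed on each fold's held-out cases. The data and model are simulated stand-ins.
set.seed(1)
d <- data.frame(x = rnorm(500))
d$y <- rbinom(500, 1, plogis(-3 + d$x))  # rare binary outcome

# assign folds separately within events and non-events so each fold is stratified
d$fold <- NA
for (v in unique(d$y)) d$fold[d$y == v] <- sample(rep(1:10, length.out = sum(d$y == v)))

auc <- function(y, p) {  # area under the ROC curve via the Mann-Whitney statistic
  r <- rank(p); n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

fold.auc <- sapply(1:10, function(k) {
  fit <- glm(y ~ x, data = d[d$fold != k, ], family = binomial)
  auc(d$y[d$fold == k], predict(fit, newdata = d[d$fold == k, ], type = "response"))
})

plot(density(fold.auc), main = "Out-of-sample AUC across folds")
```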

Distribution of Out-of-Sample AUC Scores by Model in 10-Fold Cross-Validation

The decision to use an unweighted average for the combined forecast might seem simplistic, but it’s actually a principled choice in this instance. When examples of the event of interest are hard to come by and we have reason to believe that the process generating those events may be changing over time, sticking with an unweighted average is a reasonable hedge against risks of over-fitting the ensemble to the idiosyncrasies of the test set used to tune it. For a longer discussion of this point, see pp. 7-8 in the last paper I wrote on this work and the paper by Andreas Graefe referenced therein.

Any close readers of my previous work on this topic over the past couple of years (see here and here) will notice that one model has been dropped from the last version of this ensemble, namely, the one proposed by Michael Colaresi and Sabine Carey in their 2008 article, “To Kill or To Protect” (here). As I was reworking my scripts to make regular updating easier (more on that below), I paid closer attention than I had before to the fact that the Colaresi and Carey model requires a measure of the size of state security forces that is missing for many country-years. In previous iterations, I had worked around that problem by using a categorical version of this variable that treated missingness as a separate category, but this time I noticed that there were fewer than 20 mass-killing onsets in country-years for which I had a valid observation of security-force size. With so few examples, we’re not going to get reliable estimates of any pattern connecting the two. As it happened, this model—which, to be fair to its authors, was not designed to be used as a forecasting device—was also by far the least accurate of the lot in 10-fold cross-validation. Putting two and two together, I decided to consign this one to the scrap heap for now. I still believe that measures of military forces could help us assess risks of mass killing, but we’re going to need more and better data to incorporate that idea into our multimodel ensemble.

The bigger and in some ways more novel change from previous iterations of this work concerns the unorthodox approach I’m now using to make the risk assessments as current as possible. All of the models used to generate these assessments were trained on country-year data, because that’s the only form in which most of the requisite data is produced. To mimic the eventual forecasting process, the inputs to those models are all lagged one year at the model-estimation stage—so, for example, data on risk factors from 1985 are compared with outcomes in 1986, 1986 inputs to 1987 outcomes, and so on.
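In code, that lag structure amounts to pairing each country-year's inputs with the following year's outcome, roughly as in this sketch with hypothetical column names.

```r
# A minimal sketch of the one-year lag structure: within each country, pair year
# t's risk factors with year t+1's outcome. Column names are hypothetical.
d <- data.frame(country    = rep(c("A", "B"), each = 3),
                year       = rep(1985:1987, 2),
                gdp.growth = c(2.1, -0.5, 1.7, 3.0, 2.4, 0.8),
                onset      = c(0, 1, 0, 0, 0, 1))

d <- d[order(d$country, d$year), ]
d$onset.lead <- ave(d$onset, d$country, FUN = function(x) c(x[-1], NA))  # outcome in year t+1

# at the estimation stage, next year's onset is regressed on this year's inputs
glm(onset.lead ~ gdp.growth, data = d, family = binomial)
```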

If we stick rigidly to that structure at the forecasting stage, then I need data from 2013 to produce 2014 forecasts. Unfortunately, many of the sources for the measures used in these models won’t publish their 2013 data for at least a few more months. Faced with this problem, I could do something like what I aim to do with the coup forecasts I’ll be producing in the next few days—that is, only use data from sources that quickly and reliably update soon after the start of each year. Unfortunately again, though, the only way to do that would be to omit many of the variables most specific to the risk of mass atrocities—things like the occurrence of violent civil conflict or the political salience of elite ethnicity.

So now I'm trying something different. Instead of waiting until every last input has been updated for the previous year and they all neatly align in my rectangular data set, I am simply applying my algorithms to the most recent available observation of each input. It took some trial and error to write, but I now have an R script that automates this process at the country level by pulling the time series for each variable, omitting the missing values, reversing the series order, snipping off the observation at the start of that string, collecting those snippets in a new vector, and running that vector through the previously estimated model objects to get a forecast (see the section of this script starting at line 284).
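Here is a minimal sketch in R of that step for a single country, with hypothetical variable names; the real version lives in the script linked above.

```r
# A minimal sketch of the "most recent available observation" step for one
# country. 'ts.data' is a hypothetical country slice with one row per year and
# NA values where a source has not yet been updated.
ts.data <- data.frame(year       = 2008:2012,
                      polity     = c(5, 5, 6, 6, NA),   # last year not yet published
                      gdp.growth = c(2.1, -0.5, 1.7, 3.0, 2.4),
                      civil.war  = c(0, 0, 1, 1, 1))

latest <- sapply(ts.data[, -1], function(x) tail(na.omit(x), 1))  # last non-missing value of each input
latest <- as.data.frame(t(latest))
latest

# 'fitted.model' is a hypothetical previously estimated model object; the real
# script runs 'latest' through each model in the ensemble.
# predict(fitted.model, newdata = latest, type = "response")
```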

One implicit goal of this approach is to make it easier to jump to batch processing, where the forecasting engine routinely and automatically pings the data sources online and updates whenever any of the requisite inputs has changed. So, for example, when in a few months the vaunted Polity IV Project releases its 2013 update, my forecasting contraption would catch and ingest the new version and the forecasts would change accordingly. I now have scripts that can do the statistical part but am going to be leaning on other folks to automate the wider routine as part of the early-warning system I’m helping build for the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide.

The big upside of this opportunistic approach to updating is that the risk assessments are always as current as possible, conditional on the limitations of the available data. The way I figure, when you don’t have information that’s as fresh as you’d like, use the freshest information you’ve got.

The downside of this approach is that it’s not clear exactly what the outputs from that process represent. Technically, a forecast is a probabilistic statement about the likelihood of a specific event during a specific time period. The outputs from this process are still probabilistic statements about the likelihood of a specific event, but they are no longer anchored to a specific time period. The probabilities mapped at the top of this post mostly use data from 2012, but the inputs for some variables for some cases are a little older, while the inputs for some of the dynamic variables (e.g., GDP growth rates and coup attempts) are essentially current. So are those outputs forecasts for 2013, or for 2014, or something else?

For now, I’m going with “something else” and am thinking of the outputs from this machinery as the most up-to-date statistical risk assessments I can produce, but not forecasts as such. That description will probably sound like fudging to most statisticians, but it’s meant to be an honest reflection of both the strengths and limitations of the underlying approach.

Any gear heads who’ve read this far, I’d really appreciate hearing your thoughts on this strategy and any ideas you might have on other ways to resolve this conundrum, or any other aspect of this forecasting process. As noted at the top, the data and code used to produce these estimates are posted online. This work is part of a soon-to-launch, public early-warning system, so we hope and expect that they will have some effect on policy and advocacy planning processes. Given that aim, it behooves us to do whatever we can to make them as accurate as possible, so I would very much welcome any suggestions on how to do or describe this better.

Finally and as promised, here is a dot plot of the estimates mapped above. Countries are shown in descending order by estimated risk. The gray dots mark the forecasts from the three component models, and the red dot marks the unweighted average.

[Dot plot: estimated risk of state-led mass-killing onset by country, showing the three component-model forecasts and their unweighted average]

PS. In preparation for a presentation on this work at an upcoming workshop, I made a new map of the current assessments that works better, I think, than the one at the top of this post. Instead of coloring by quintiles, this new version (below) groups cases into several bins that roughly represent doublings of risk: less than 1%, 1-2%, 2-4%, 4-8%, and 8-16%. This version more accurately shows that the vast majority of countries are at extremely low risk and more clearly shows variations in risk among the ones that are not.

Estimated Risk of New State-Led Mass Killing

Using GDELT to Monitor Atrocities, Take 2

Last May, I wrote a post about my preliminary efforts to use a new data set called GDELT to monitor reporting on atrocities around the world in near-real time. Those efforts represent one part of the work I’m doing on a public early-warning system for the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide, and they have continued in fits and starts over the ensuing eight months. With help from Dartmouth’s Dickey Center, Palantir, and the GDELT crew, we’ve made a lot of progress. I thought I’d post an update now because I’m excited about the headway we’ve made; I think others might benefit from seeing what we’re doing; and I hope this transparency can help us figure out how to do this task even better.

So, let's cut to the chase: Here is a screenshot of an interactive map locating the nine events captured in GDELT in the first week of January 2014 that looked like atrocities to us and occurred in a place that the Google Maps API recognized when queried. (A tenth event was left off the map because Google Maps didn't recognize its reported location.) The size of the bubbles corresponds to the number of civilian deaths, which in this map range from one to 31. To really get a feel for what we're trying to do, though, head over to the original visualization on CartoDB (here), where you can zoom in and out and click on the bubbles to see a hyperlink to the story from which each event was identified.

[Screenshot: interactive map of atrocities identified in GDELT during the first week of January 2014]

Looks simple, right? Well, it turns out it isn’t, not by a long shot.

As this blog’s regular readers know, GDELT uses software to scour the web for new stories about political interactions all around the world and parses those stories to identify and record information about who did or said what to whom, when, and where. It currently covers the period 1979-present and is now updated every day, and each of those daily updates contains some 100,000-140,000 new records. Miraculously and crucial to a non-profit pilot project like ours, GDELT is also available for free. 

The nine events plotted in the map above were sifted from the tens of thousands of records GDELT dumped on us in the first week of 2014. Unfortunately, that data-reduction process is only partially automated.

The first step in that process is the quickest. As originally envisioned back in May, we are using an R script (here) to download GDELT’s daily update file and sift it for events that look, from the event type and actors involved, like they might involve what we consider to be an atrocity—that is, deliberate, deadly violence against one or more noncombatant civilians in the context of a wider political conflict.
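To give a flavor of that first pass, here is a minimal sketch in R of the filtering logic, applied to a tiny hand-made stand-in for a daily update; the specific CAMEO and actor codes shown are illustrative assumptions, and the real script linked above works on GDELT's full field set.

```r
# A minimal sketch of the automated filtering pass, applied to a made-up stand-in
# for one day's GDELT update. The field names come from GDELT's event codebook;
# the specific codes used in the filter are illustrative assumptions.
gdelt <- data.frame(
  EventRootCode = c("18", "20", "04", "19"),
  Actor1Code    = c("MIL", "REB", "GOV", "MIL"),
  Actor2Code    = c("CVL", "NGACVL", "BUS", "REB"),
  SOURCEURL     = paste0("http://example.com/story", 1:4),
  stringsAsFactors = FALSE
)

# keep records whose event type suggests deadly violence (CAMEO roots 18-20) and
# whose target actor string contains a civilian (CVL) code
keep <- gdelt$EventRootCode %in% c("18", "19", "20") & grepl("CVL", gdelt$Actor2Code)
candidates <- gdelt[keep, ]
candidates  # rows 1 and 2 survive this pass
```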

Unfortunately, the stack of records that filtering script returns—something like 100-200 records per day—still includes a lot of stuff that doesn’t interest us. Some records are properly coded but involve actions that don’t meet our definition of an atrocity (e.g., clashes between rioters and police or rebels and troops); some involve atrocities but are duplicates of events we’ve already captured; and some are just miscoded (e.g., a mention of the film industry “shooting” movies that gets coded as soldiers shooting civilians).

After we saw how noisy our data set would be if we stopped screening there, we experimented with a monitoring system that would acknowledge GDELT’s imperfections and try to work with them. As Phil Schrodt recommended at the recent GDELT DC Hackathon, we looked to “embrace the suck.” Instead of trying to use GDELT to generate a reliable chronicle of atrocities around the world, we would watch for interesting and potentially relevant perturbations in the information stream, noise and all, and those perturbations would produce alerts that users of our system could choose to investigate further. Working with Palantir, we built a system that would estimate country-specific prior moving averages of daily event counts returned by our filtering script and would generate an alert whenever a country’s new daily count landed more than two standard deviations above or below that average.
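Here is a minimal sketch in R of that alerting rule for a single country, using simulated counts; the system we actually prototyped was built with Palantir rather than in R.

```r
# A minimal sketch of the alert rule: compare today's count of filtered records
# to a trailing average for the same country and flag days that land more than
# two standard deviations away from it. Counts are simulated.
set.seed(1)
counts <- data.frame(date = seq(as.Date("2014-01-01"), by = "day", length.out = 60),
                     n    = rpois(60, lambda = 5))
counts$n[60] <- 25  # an unusually noisy final day

window  <- 30
history <- counts$n[(60 - window):(60 - 1)]  # trailing 30-day window
alert   <- abs(counts$n[60] - mean(history)) > 2 * sd(history)
alert  # TRUE -> flag this country-day for review
```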

That system sounded great to most of the data pros in our figurative room, but it turned out to be a non-starter with some other constituencies of importance to us. The issue was credibility. Some of the events causing those perturbations in the GDELT stream were exactly what we were looking for, but others—a pod of beached whales in Brazil, or Congress killing a bill on healthcare reform—were laughably far from the mark. If our supposedly high-tech system confused beached whales and Congressional procedures for mass atrocities, we would risk undercutting the reputation for reliability and technical acumen that we are striving to achieve.

So, back to the drawing board we went. To separate the signal from the static and arrive at something more like that valid chronicle we’d originally envisioned, we decided that we needed to add a second, more laborious step to our data-reduction process. After our R script had done its work, we would review each of the remaining records by hand to decide if it belonged in our data set or not and, when necessary, to correct any fields that appeared to have been miscoded. While we were at it, we would also record the number of deaths each event produced. We wrote a set of rules to guide those decisions; had two people (a Dartmouth undergraduate research assistant and I) apply those rules to the same sets of daily files; and compared notes and made fixes. After a few iterations of that process over a few months, we arrived at the codebook we’re using now (here).

This process radically reduces the amount of data involved. Each of those two steps drops us down multiple orders of magnitude: from 100,000-140,000 records in the daily updates, to about 150 in our auto-filtered set, to just one or two in our hand-filtered set. The figure below illustrates the extent of that reduction. In effect, we’re treating GDELT as a very powerful but error-prone search and coding tool, a source of raw ore that needs refining to become the thing we’re after. This isn’t the only way to use GDELT, of course, but for our monitoring task as presently conceived, it’s the one that we think will work best.

[Figure: the scale of the data reduction, from 100,000-140,000 records in each daily GDELT update to one or two hand-validated events]

Once that second data-reduction step is done, we still have a few tasks left to enable the kind of mapping and analysis we aim to do. We want to trim the data set to keep only the atrocities we’ve identified, and we need to consolidate the original and corrected fields in those remaining records and geolocate them. All of that work gets done with a second R script (here), which is applied to the spreadsheet the coder saves after completing her work. The much smaller file that script produces is then ready to upload to a repository where it can be combined with other days’ outputs to produce the global chronicle our monitoring project aims to produce.
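Here is a minimal sketch in R of what that second script does, using hypothetical column names for the coder's spreadsheet; the ggmap package's geocode() function is one way to query the Google Maps API (it may require an API key), and the real script is linked above.

```r
# A minimal sketch of the post-coding cleanup: keep confirmed atrocities, prefer
# the coder's corrected location where one exists, then geolocate. Column names
# are hypothetical stand-ins for the coder's spreadsheet.
library(ggmap)

coded <- data.frame(
  is.atrocity        = c(1, 0, 1),
  location           = c("Maiduguri, Nigeria", "Hollywood, California", "Bangui"),
  location.corrected = c(NA, NA, "Bangui, Central African Republic"),
  deaths             = c(12, NA, 31),
  stringsAsFactors   = FALSE
)

events <- coded[coded$is.atrocity == 1, ]                          # drop non-atrocities
events$location.final <- ifelse(is.na(events$location.corrected),  # prefer the corrected field
                                events$location, events$location.corrected)
events <- cbind(events, geocode(events$location.final))            # adds lon/lat columns
```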

From start to finish, each daily update now takes about 45 minutes, give or take 15. We’d like to shrink that further if we can but don’t see any real opportunities to do so at the moment. Perhaps more important, we still have to figure out the bureaucratic procedures that will allow us to squeeze daily updates from a “human in the loop” process in a world where there are weekends and holidays and people get sick and take vacations and sometimes even quit. Finally, we also have not yet built the dashboard that will display and summarize and provide access to these data on our program’s web site, which we expect to launch some time this spring.

We know that the data set this process produces will be incomplete. I am 100-percent certain that during the first week of January 2014, more than 10 events occurred around the world that met our definition of an atrocity. Unfortunately, we can only find things where GDELT looks, and even a scan of every news story produced every day everywhere in the world would fail to see the many atrocities that never make the news.

On the whole, though, I’m excited about the progress we’ve made. As soon as we can launch it, this monitoring process should help advocates and analysts more efficiently track atrocities globally in close to real time. As our data set grows, we also hope it will serve as the foundation for new research on forecasting, explaining, and preventing this kind of violence. Even with its evident shortcomings, we believe this data set will prove to be useful, and as GDELT’s reach continues to expand, so will ours.

PS For a coda discussing the great ideas people had in response to this post, go here.

[Erratum: The original version of this post said there were about 10,000 records in each daily update from GDELT. The actual figure is 100,000-140,000. The error has been corrected and the illustration of data reduction updated accordingly.]

Relative Risks of State-Led Mass Killing Onset in 2014: Results from a Wiki Survey

In early December, as part of our ongoing work for the Holocaust Museum’s Center for the Prevention of Genocide, Ben Valentino and I launched a wiki survey to help assess risks of state-led mass killing onsets in 2014 (here).

The survey is now closed and the results are in. Here, according to our self-selected crowd on five continents and the nearly 5,000 pairwise votes it cast, is a map of how the world looks right now on this score. The darker the shade of gray, the greater the relative risk that in 2014 we will see the start of an episode of mass killing in which the deliberate actions of state agents or other groups acting at their behest result in the deaths of at least 1,000 noncombatant civilians from a discrete group over a period of a year or less.

[Map: relative risk of state-led mass-killing onset in 2014, by country]

Smaller countries are hard to find on that map, and it’s difficult to compare colors across regions, so here is a dot plot of the same data in rank order. Countries with red dots are ones that had ongoing episodes of state-led mass killing at the end of 2013: DRC, Egypt, Myanmar, Nigeria, North Korea, Sudan, and Syria. It’s possible that these countries will experience additional onsets in 2014, but we wonder if some of our respondents didn’t also conflate the risk of a new onset with the presence or intensity of an ongoing one. Also, there’s an ongoing episode in CAR that was arguably state-led for a time in 2013, but the Séléka militias no longer appear to be acting at the behest of the supposed government, so we didn’t color that dot. And, of course, there are at least a few ongoing episodes of mass killing being perpetrated by non-state actors (see this recent post for some ideas), but that’s not what we asked our crowd to consider in this survey.

[Dot plot: wiki survey scores in rank order, with ongoing state-led episodes marked in red]

It is very important to understand that the scores being mapped and plotted here are not probabilities of mass-killing onset. Instead, they are model-based estimates of the probability that the country in question is at greater risk than any other country chosen at random. In other words, these scores tell us which countries our crowd thinks we should worry about more, not how likely our crowd thinks a mass-killing onset is.

We think the results of this survey are useful in their own right, but we also plan to compare them to, and maybe even combine them with, other forecasts of mass killing onsets as part of the public early-warning system we expect to launch later this year.

In the meantime, if you’re interested in tinkering with the scores and our plots of them, you can find the code I used to make the map and dot plot on GitHub (here) and the data in .csv format on my Google Drive (here). If you have better ideas on how to visualize this information, please let us know and share your code.

UPDATE: Bad social scientist! With a tweet, Alex Hanna reminded me that I really need to say more about the survey method and respondents. So:

We used All Our Ideas to conduct this survey, and we embedded that survey in a blog post that defined our terms and explained the process. The blog post was published on December 1, and we publicized it through a few channels, including: a note to participants in a password-protected opinion pool we’re running to forecast various mass atrocities-related events; a posting to a Conflict Research group on Facebook; an email to the president of the American Association of Genocide Scholars asking him to announce it on their listserv; and a few tweets from my Twitter account at the beginning and end of the month. Some of those tweets were retweeted, and I saw a few other people post or tweet their own links to the blog post or survey as well.

As for Alex’s specific question about who comprised our crowd, the short answer is that we don’t and can’t know. Participation in All Our Ideas surveys is anonymous, and our blog post was not private. From the vote-level data (here), I can see that we ended the month with 4,929 valid votes from 147 unique voting sessions. I know for a fact that some people voted in more than one session—I cast a small number of votes on a few occasions, and I know at least one colleague voted more than once—so the number of people who participated was some unknown number smaller than 147 who found their way to the survey through those postings and tweets.

Trends over Time in State-Sponsored Mass Killing

UPDATE: After posting this on July 25, I discovered some version-related bugs in the R script used to generate the maps and animate them. I fixed those bugs in the public script on Github (see below) and generated a new version of the animation that runs through 2012.

As part of the work I’m doing for the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide, I’m scheduled to present a paper at next month’s Annual Meeting of the American Political Science Association on an ensemble of statistical models that can be used to assess countries’ risks of state-sponsored mass killing. I plan to post the paper and associated data and code soon in an attempt to vet the work I’ve already done and to catalyze new work on the topic by other researchers. In the meantime, I thought I would post some visualizations of the historical data on episodes of state-sponsored mass killing in hopes of piquing people’s curiosity about this subject and the early-warning system we’re developing.

The data set I’m using for this project is based on one that Dartmouth’s Ben Valentino created for the Political Instability Task Force several years ago. I’ve written an R script that puts the published list version of Ben’s data—see the appendix to this paper—into the usual country-year format and, in consultation with Ben, extends the data through 2012. Of course, I am solely responsible for the results. You can download the country-year data in .csv format from my Google Drive, here.
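
To give a sense of what that reshaping involves, here is a toy version of the list-to-country-year conversion. The example rows and column names are invented, and the real script handles many more wrinkles.

    # Toy version of converting an episode list to a country-year panel. Example
    # rows and column names are invented; the real script handles more wrinkles
    # (country codes, censoring, multiple episodes per country, etc.).
    episodes <- data.frame(country = c("A", "B"),
                           start   = c(1975, 1992),
                           end     = c(1979, NA),   # NA = still ongoing
                           stringsAsFactors = FALSE)
    years <- 1945:2012
    panel <- expand.grid(country = unique(episodes$country), year = years,
                         stringsAsFactors = FALSE)
    panel$onset <- 0
    panel$ongoing <- 0
    for (i in seq_len(nrow(episodes))) {
      e <- episodes[i, ]
      stop.yr <- ifelse(is.na(e$end), max(years), e$end)
      live <- panel$country == e$country & panel$year >= e$start & panel$year <= stop.yr
      panel$ongoing[live] <- 1
      panel$onset[panel$country == e$country & panel$year == e$start] <- 1
    }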

For purposes of that PITF project, Ben defined a mass killing as any episode in which the actions of state agents result in the intentional death of at least 1,000 noncombatants from a discrete group in a period of sustained violence. Mass killing episodes are considered to have begun in the first year in which at least 100 intentional noncombatant fatalities occur. If fewer than 100 total fatalities are recorded annually for any three consecutive years during the episode, the episode is considered to have ended during the first year within that three-year run, even if killings continue at lower levels in later years. If the state were to resume killing more than 100 civilians per year from that same group and that new wave of killings eventually produced more than 1,000 deaths in total, that resumption would be identified as a new onset.
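
Those thresholds are easier to see in code. The toy function below applies the onset and termination rules to a single series of annual fatality counts; it ignores the separate requirement that cumulative deaths eventually reach 1,000 and is only meant to illustrate the logic, not reproduce the actual coding.

    # Toy illustration of the onset/termination rules for one country-group
    # series. Ignores the requirement that cumulative deaths reach 1,000.
    code_episode <- function(years, fatalities) {
      # Onset: first year with at least 100 intentional noncombatant deaths
      hits <- which(fatalities >= 100)
      if (length(hits) == 0) return(list(onset = NA, end = NA))
      onset <- years[hits[1]]
      # Termination: first year of any run of three consecutive post-onset years
      # with fewer than 100 deaths each
      low <- fatalities < 100 & years > onset
      for (i in seq_len(max(0, length(years) - 2))) {
        if (all(low[i:(i + 2)])) return(list(onset = onset, end = years[i]))
      }
      list(onset = onset, end = NA)
    }

    # Example: onset coded in 1992, termination coded in 1996
    code_episode(1990:1999, c(20, 50, 400, 800, 300, 150, 40, 30, 10, 5))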

The stack of charts below shows trends over time in episodes of state-sponsored mass killing by looking at 1) annual counts of onsets, 2) annual rates of onsets (i.e., the raw count divided by the number of countries in the world), and 3) the proportion of countries worldwide that had ongoing episodes. As the bar chart at the top shows, since World War II, most years have seen no more than two onsets worldwide, and these events have become even less common in the two decades since the disintegration of the Soviet Union. As the middle chart shows, those counts translate to annual incidences under 2 percent in most years, with peak rates of just 4 to 6 percent. In other words, these onsets are very rare events. In the bottom chart, we see that the global prevalence has historically been much higher, because these episodes often last for many years after they start. The share of countries with ongoing episodes of state-sponsored mass killing hovered at around 15 percent for most of the Cold War era, peaked at nearly 25 percent in the years immediately after the Cold War’s end, and then declined to recent historical lows by the start of the 2010s.
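
Given a country-year panel like the toy one above, with 0/1 'onset' and 'ongoing' columns, all three series reduce to simple aggregation. This is an illustration of the arithmetic, not the plotting script itself.

    # Annual onset counts, onset rates, and prevalence from a country-year panel
    # with 0/1 'onset' and 'ongoing' columns. Illustration only.
    counts  <- aggregate(onset ~ year, data = panel, FUN = sum)
    ncty    <- aggregate(country ~ year, data = panel,
                         FUN = function(x) length(unique(x)))
    names(ncty)[2] <- "n.countries"
    ongoing <- aggregate(ongoing ~ year, data = panel, FUN = sum)

    trend <- Reduce(function(x, y) merge(x, y, by = "year"),
                    list(counts, ncty, ongoing))
    trend$onset.rate <- trend$onset / trend$n.countries    # middle chart
    trend$prevalence <- trend$ongoing / trend$n.countries  # bottom chart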

[Charts: Occurrences of State-Sponsored Mass Killing Worldwide, 1945-2012]

We can also look at changes over time in the geographic distribution of these episodes by mapping them. The GIF below animates a sequence of yearly maps from 1946 through 2012 that show where new episodes erupted (brown) and where ongoing ones continued (orange). You’ll need to click on the image to get it to play. I made the maps with R’s ‘rworldmap‘ package and used the aptly named ‘animation‘ package to animate them. As the maps show, in the immediate aftermath of World War II, virtually all of the identified episodes of state-sponsored mass killing were occurring in Europe. Since the 1950s, though, the locus of these events has shifted, mostly to Asia, Africa, and the former Soviet Union, and a few clusters of onsets have occurred in conjunction with phases of decolonization or state dissolution.
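
For anyone who wants to roll their own animation, the basic recipe looks roughly like the sketch below. The 'iso3' and 'status' column names are my placeholders, not necessarily what's in the posted data, the data frame 'cy' stands in for the country-year file, and the details differ from the script on Github; saveGIF also needs ImageMagick installed.

    # Rough recipe for the animated map. 'cy' stands in for the country-year data,
    # with placeholder columns 'iso3' (ISO3 country code) and 'status'
    # (0 = none, 1 = ongoing, 2 = onset). Requires ImageMagick for saveGIF.
    library(rworldmap)
    library(animation)

    saveGIF({
      for (yr in 1946:2012) {
        slice <- cy[cy$year == yr, ]
        sPDF <- joinCountryData2Map(slice, joinCode = "ISO3", nameJoinColumn = "iso3")
        mapCountryData(sPDF, nameColumnToPlot = "status", catMethod = "categorical",
                       colourPalette = c("gray90", "orange", "brown"),
                       addLegend = FALSE,
                       mapTitle = paste("State-Sponsored Mass Killing,", yr))
      }
    }, movie.name = "masskilling19462012.gif", interval = 0.5)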

[Animated map: episodes of state-sponsored mass killing by country, 1946-2012]

Watch this space for more on the multimodel ensemble I’m building to forecast these onsets and the pilot early-warning program those forecasts will inform.

Update: You can now find the R script I used to produce these visualizations on Github, here. The data are on my Google Drive, here.

Road-Testing GDELT as a Resource for Monitoring Atrocities

As I said here a few weeks ago, I think the Global Database of Events, Language, and Tone (GDELT) is a fantastic new resource that really embodies some of the ways in which technological changes are coming together to open lots of new doors for social-scientific research. GDELT’s promise is obvious: more than 200 million political events from around the world over the past 30 years, all spotted and coded by well-trained software instead of the traditional armies of undergrad RAs, and with daily updates coming online soon. Or, as Adam Elkus’ t-shirt would have it, “200 million observations. Only one boss.”

BUT! Caveat emptor! Like every other data-collection effort ever, GDELT is not alchemy, and it’s important that people planning to use the data, or even just to consume analysis based on it, understand what its limitations are.

I’m starting to get a better feel for those limitations from my own efforts to use GDELT to help observe atrocities around the world, as part of a consulting project I’m doing for the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide. The core task of that project is to develop plans for a public early-warning system that would allow us to assess the risk of onsets of atrocities in countries worldwide more accurately and earlier than current practice.

When I heard about GDELT last fall, though, it occurred to me that we could use it (and similar data sets in the pipeline) to support efforts to monitor atrocities as well. The CAMEO coding scheme on which GDELT is based includes a number of event types that correspond to various forms of violent attack and other variables indicating who was attacking whom. If we could develop a filter that reliably pulled events of interest to us from the larger stream of records, we could produce something like a near-real-time bulletin on recent violence against civilians around the world. Our record would surely have some blind spots—GDELT only tracks a limited number of news sources, and some atrocities just don’t get reported, period—but I thought it would reliably and efficiently alert us to new episodes of violence against civilians and help us identify trends in ongoing ones.

Well, you know what they say about plans and enemies and first contact. After digging into GDELT, I still think we can accomplish those goals, but it’s going to take more human effort than I originally expected. Put bluntly, GDELT is noisier than I had anticipated, and for the time being the only way I can see to sharpen that signal is to keep a human in the loop.

Imagine (fantasize?) for a moment that there’s a perfect record somewhere of all the political interactions GDELT is trying to identify. For kicks, let’s call it the Encyclopedia Eventum (EE). Like any detection system, GDELT can mess up in two basic ways: 1) errors of omission, in which GDELT fails to spot something that’s in the EE; and 2) errors of commission, in which it mistakenly records an event that isn’t in the EE (or, relatedly, is in the EE but in a different place). We might also call these false negatives and false positives, respectively.

At this point, I can’t say anything about how often GDELT is making errors of omission, because I don’t have that Encyclopedia Eventum handy. A more realistic strategy for assessing the rate of errors of omission would involve comparing a subset of GDELT to another event data set that’s known to be a fairly reliable measure for some time and place of something GDELT is meant to track—say, protest and coercion in Europe—and see how well they match up, but that’s not a trivial task, and I haven’t tried it yet.

Instead, the noise I’m seeing is on the other side of that coin: the errors of commission, or false positives. Here’s what I mean:

To start developing my atrocities-monitoring filter, I downloaded the reduced and compressed version of GDELT recently posted on the Penn State Event Data Project page and pulled the tab-delimited text files for a couple of recent years. I’ve worked with event data before, so I’m familiar with basic issues in their analysis, but every data set has its own idiosyncrasies. After trading emails with a few CAMEO pros and reading Jay Yonamine’s excellent primer on event aggregation strategies, I started tinkering with a function in R that would extract the subset of events that appeared to involve lethal force against civilians. That function would involve rules to select on three features: event type, source (the doer), and target.

  • Event Type. For observing atrocities, type 20 (“Engage in Unconventional Mass Violence”) was an obvious choice. Based on advice from those CAMEO pros, I also focused on 18 (“Assault”) and 19 (“Fight”) but was expecting that I would need to be more restrictive about the subtypes, sources, and targets in those categories to avoid errors of commission.
  • Source. I’m trying to track violence by state and non-state agents, so I focused on GOV (government), MIL (Military), COP (police), and intelligence agencies (SPY) for the former and REB (militarized opposition groups) and SEP (separatist groups) for the latter. The big question mark was how to handle records with just a country code (e.g., “SYR” for Syria) and no indication of the source’s type. My CAMEO consultants told me these would usually refer in some way to the state, so I should at least consider including them.
  • Target. To identify violence against civilians, I figured I would get the most mileage out of the OPP (non-violent political opposition), CVL (“civilians,” people in general), and REF (refugees) codes, but I wanted to see if the codes for more specific non-state actors (e.g., LAB for labor, EDU for schools or students, HLH for health care) would also help flag some events of interest.

After tinkering with the data a bit, I decided to write two separate functions, one for events with state perpetrators and another for events with non-state perpetrators. If you’re into that sort of thing, you can see the state-perpetrator version of that filtering function on Github, here.
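
For the curious, the gist of the state-perpetrator filter looks something like the sketch below. The 'code', 'source', and 'target' column names are placeholders for however the tab-delimited file gets read in, and the function posted on Github is more involved (event subtypes, bare country codes, and so on).

    # Gist of the state-perpetrator filter. Column names are placeholders for
    # however the reduced GDELT file is read in; the posted function is more
    # involved (event subtypes, bare country codes, etc.).
    filter_state_violence <- function(events) {
      state.types <- c("GOV", "MIL", "COP", "SPY")    # state or state-affiliated agents
      civ.types   <- c("OPP", "CVL", "REF")           # civilian-ish target codes
      root <- substr(as.character(events$code), 1, 2) # CAMEO root category
      violent  <- root %in% c("18", "19", "20")       # assault, fight, mass violence
      by.state <- grepl(paste(state.types, collapse = "|"), events$source)
      on.civs  <- grepl(paste(civ.types, collapse = "|"), events$target)
      events[violent & by.state & on.civs, ]
    }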

When I ran the more than 9 million records in the “2011.reduced.txt” file through that function, I got back 2,958 events. So far, so good. As soon as I started poking around in the results, though, I saw a lot of records that looked dubious. The current release of GDELT doesn’t include text from or links to the source material, so it’s hard to say for sure what real-world event any one record describes. Still, some of the perpetrator-and-target combos looked odd to me, and web searches for relevant stories either came up empty or reinforced my suspicions that the records were probably errors of commission. Here are a few examples, showing the date, event type, source, and target:

  • 1/8/2011 193 USAGOV USAMED. Type 193 is “Fight with small arms and light weapons,” but I don’t think anyone from the U.S. government actually got in a shootout or knife fight with American journalists that day. In fact, that event-source-target combination popped up a lot in my subset.
  • 1/9/2011 202 USAMIL VNMCVL. Taken on its face, this record says that U.S. military forces killed Vietnamese civilians on January 9, 2011. My hunch is that the story on which this record is based was actually talking about something from the Vietnam War.
  • 4/11/2011 202 RUSSPY POLCVL. This record seems to suggest that Russian intelligence agents “engaged in mass killings” of Polish civilians in central Siberia two years ago. I suspect the story behind this record was actually talking about the Katyn Massacre and associated mass deportations that occurred in April 1940.

That’s not to say that all the records looked wacky. Interleaved with these suspicious cases were records representing exactly the kinds of events I was trying to find. For example, my filter also turned up a 202 GOV SYRCVL for June 10, 2011, a day on which one headline blared “Dozens Killed During Syrian Protests.”

Still, it’s immediately clear to me that GDELT’s parsing process is not quite at the stage where we can peruse the codebook like a menu, identify the morsels we’d like to consume, phone our order in, and expect to have exactly the meal we imagined waiting for us when we go to pick it up. There’s lots of valuable information in there, but there’s plenty of chaff, too, and for the time being it’s on us as researchers to take time to try to sort the two out. This sorting will get easier to do if and when the posted version adds information about the source article and relevant text, but “easier” in this case will still require human beings to review the results and do the cross-referencing.

Over time, researchers who work on specific topics—like atrocities, or interstate war, or protest activity in specific countries—will probably be able to develop supplemental coding rules and tweak their filters to automate some of what they learn. I’m also optimistic that the public release of GDELT will accelerate improvements to the software and dictionaries it uses, expanding its reach while shrinking the error rates. In the meantime, researchers are advised to stick to the same practices they’ve always used (or should have, anyway): take time to get to know your data; parse it carefully; and, when there’s no single parsing that’s obviously superior, check the sensitivity of your results to different permutations.

PS. If you have any suggestions on how to improve the code I’m using to spot potential atrocities or otherwise improve the monitoring process I’ve described, please let me know. That’s an ongoing project, and even marginal improvements in the fidelity of the filter would be a big help.

PPS. For more on these issues and the wider future of automated event coding, see this ensuing post from Phil Schrodt on his blog.
