A Useful Data Set on Political Violence that Almost No One Is Using

For the past 10 years, the CIA has overtly funded the production of a publicly available data set on certain atrocities around the world that now covers the period from January 1995 to early 2014 and is still updated on a regular basis. If you work in a relevant field but didn’t know that, you’re not alone.

The data set in question is the Political Instability Task Force’s Worldwide Atrocities Dataset, which records information from several international press sources about situations in which five or more civilians are deliberately killed in the context of some wider political conflict. Each record includes information about who did what to whom, where, and when, along with a brief text description of the event, a citation for the source article(s), and, where relevant, comments from the coder. The data are updated monthly, although those updates are posted on a four-month lag (e.g., data from January become available in May).

The decision to limit collection to events involving at least five fatalities was a pragmatic one. As the data set’s codebook notes,

We attempted at one point to lower this threshold to one and the data collection demands proved completely overwhelming, as this involved assessing every murder and ambiguous accidental death reported anywhere in the world in the international media. “Five” has no underlying theoretical justification; it merely provides a threshold above which we can confidently code all of the reported events given our available resources.

For the past three years, the data set has also fudged this rule to include targeted killings that appear to have a political motive, even when only a single victim is killed. So, for example, killings of lawyers, teachers, religious leaders, election workers, and medical personnel are nearly always recorded, and these events are distinguished from ones involving five or more victims by a “Yes” in a field identifying “Targeted Assassinations” under a “Related Tactics” header.

The data set is compiled from stories appearing in a handful of international press sources that are accessed through Factiva. It is a computer-assisted process. A Boolean keyword search is used to locate potentially relevant articles, and then human coders read those stories and make data from the ones that turn out actually to be relevant. From the beginning, the PITF data set has pulled from Reuters, Agence France-Presse, Associated Press, and the New York Times. Early in the process, BBC World Monitor and CNN were added to the roster, and All Africa was also added a few years ago to improve coverage of that region.

The decision to restrict collection to a relatively small number of sources was also a pragmatic one. Unlike GDELT, for example—the routine production of which is fully automated—the Atrocities Data Set is hand-coded by people reading news stories identified through a keyword search. With people doing the coding, the cost of broadening the search to local and web-based sources is prohibitive. The hope is eventually to automate the process, either as a standalone project or as part of a wider automated event data collection effort. As GDELT shows, though, that’s hard to do well, and that day hasn’t arrived yet.

Computer-assisted coding is far more labor intensive than fully automated coding, but it also carries some advantages. Human coders can still discern better than the best automated coding programs when numerous reports are all referring to the same event, so the PITF data set does a very good job eliminating duplicate records. Also, the “where” part of each record in the PITF data set includes geocoordinates, and its human coders can accurately resolve the location of nearly every event to at least the local administrative area, a task over which fully automated processes sometimes still stumble.

Of course, press reports only capture a fraction of all the atrocities that occur in most conflicts, and journalists writing about hard-to-cover conflicts often describe these situations with stories that summarize episodes of violence (e.g., “Since January, dozens of villagers have been killed…”). The PITF data set tries to accommodate this pattern by recording two distinct kinds of events: 1) incidents, which occur in a single place in a short period of time, usually a single day; and 2) campaigns, which involve the same perpetrator and target group but may occur in multiple places over a longer period of time, usually days but sometimes weeks or months.

The inclusion of these campaigns alongside discrete events allows the data set to capture more information, but it also requires careful attention when using the results. Most statistical applications of data sets like this one involve cross-tabulations of events or deaths at a particular level during some period of time—say, countries and months. That’s relatively easy to do with data on discrete events located in specific places and days. Here, though, researchers have to decide ahead of time if and how they are going to blend information about the two event types. There are two basic options: 1) ignore the campaigns and focus exclusively on the incidents, treating that subset of the data set like a more traditional one and ignoring the additional information; or 2) make a convenient assumption about the distribution of the incidents of which campaigns are implicitly composed and apportion them accordingly.

For example, if we are trying to count monthly deaths from atrocities at the country level, we could assume that deaths from campaigns are distributed evenly over time and assign equal fractions of those deaths to all months over which they extend. So, a campaign in which 30 people were reportedly killed in Somalia between January and March would add 10 deaths to the monthly totals for that country in each of those three months. Alternatively, we could include all of the deaths from a campaign in the month or year in which it began. Either approach takes advantage of the additional information contained in those campaign records, but there is also a risk of double counting, as some of the events recorded as incidents might be part of the violence summarized in the campaign report.
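To make those choices concrete, here is a minimal Python sketch of the three ways of handling campaigns described above: ignoring them, spreading their deaths evenly over the months they span, or assigning all of their deaths to the month in which they began. The record layout (`type`, `country`, `start`, `end`, `deaths`) and the function names are assumptions made for illustration; they are not the actual field names in the PITF files, and this is not the R script mentioned below. The year attached to the Somalia campaign and the extra incident are invented as well.

```python
from collections import defaultdict
from datetime import date


def months_spanned(start: date, end: date) -> list[tuple[int, int]]:
    """Return every (year, month) from start's month through end's month, inclusive."""
    if end < start:
        raise ValueError("end date precedes start date")
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def monthly_deaths(records, campaign_rule="spread"):
    """Sum deaths by (country, year, month).

    campaign_rule:
      "ignore" - drop campaigns and count incidents only
      "spread" - split a campaign's deaths evenly over the months it spans
      "start"  - assign all of a campaign's deaths to its first month
    """
    if campaign_rule not in ("ignore", "spread", "start"):
        raise ValueError(f"unknown campaign_rule: {campaign_rule!r}")
    totals = defaultdict(float)
    for rec in records:
        first_month = (rec["start"].year, rec["start"].month)
        if rec["type"] == "campaign":
            if campaign_rule == "ignore":
                continue
            if campaign_rule == "spread":
                months = months_spanned(rec["start"], rec["end"])
            else:
                months = [first_month]
        else:
            # Incidents are always counted in the month they occurred.
            months = [first_month]
        share = rec["deaths"] / len(months)
        for year, month in months:
            totals[(rec["country"], year, month)] += share
    return dict(totals)


# Hypothetical records. The campaign mirrors the Somalia example in the text
# (30 deaths, January through March); the year and the incident are invented.
records = [
    {"type": "campaign", "country": "Somalia", "start": date(2013, 1, 10),
     "end": date(2013, 3, 20), "deaths": 30},
    {"type": "incident", "country": "Somalia", "start": date(2013, 2, 5),
     "end": date(2013, 2, 5), "deaths": 7},
]

spread = monthly_deaths(records, "spread")
at_start = monthly_deaths(records, "start")
incidents_only = monthly_deaths(records, "ignore")
```

With the "spread" rule, the campaign adds 10 deaths to each of the three months, as in the Somalia example. Note that nothing in this sketch addresses double counting: if the invented February incident were actually part of the violence summarized by the campaign, both the "spread" and "start" totals would overstate the toll.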

It is also important to note that this data set does not record atrocities in which the United States is either the alleged perpetrator or the target (e.g., 9/11) because of legal restrictions on the activities of the CIA, which funds the data set’s production. This constraint presumably has a bigger impact on some cases, such as Iraq and Afghanistan, than others.

To provide a sense of what the data set contains and to make it easier for other researchers to use it, I wrote an R script that ingests and cross-tabulates the latest iteration of the data in country-month and country-year bins and then plots some of the results. That script is now posted on GitHub (here).

One way to see how well the data set is capturing the trends we hope it will capture is to compare the figures it produces with ones from data sets in which we already have some confidence. While I was writing this post, Colombian “data enthusiast” Miguel Olaya tweeted a pair of graphs summarizing data on massacres in that country’s long-running civil war. The data behind his graphs come from the Rutas de Conflicto project, an intensive and well-reputed effort to document as many as possible of the massacres that have occurred in Colombia since 1980. Here is a screenshot of Olaya’s graph of the annual death counts from massacres in the Rutas data set since 1995, when the PITF data pick up the story:

Annual Deaths from Massacres in Colombia by Perpetrator (Source: Rutas de Conflicto)


Now here is a graph of deaths from the incidents in the PITF data set:

[Figure: annual deaths from incidents in Colombia, PITF data]

Just eyeballing the two charts, the correlation looks pretty good. Both show a sharp increase in the tempo of killing in the mid-1990s; a sustained peak around 2000; a steady decline over the next several years; and a relatively low level of lethality since the mid-2000s. The annual counts from the Rutas data are two or three times larger than the ones from the PITF data during the high-intensity years, but that makes sense when we consider how much deeper a search that project has conducted. There’s also a dip in the PITF totals in 1999 and 2000 that doesn’t appear in the Rutas data, but the comparisons over the larger span hold up. All things considered, this comparison makes the PITF data look quite good, I think.

Of course, the insurgency in Colombia has garnered better coverage from the international press than conflicts in parts of the world that are even harder to reach or less safe for correspondents than the Colombian highlands. On a couple of recent crises in exceptionally under-covered areas, the PITF data also seem to do a decent job capturing surges in violence, but only when we include campaigns as well as incidents in the counting.

The plots below show monthly death totals from a) incidents only and b) incidents and campaigns combined in the Central African Republic since 1995 and South Sudan since its independence in mid-2011. Here, deaths from campaigns have been assigned to the month in which the campaign reportedly began. In CAR, the data set identifies the upward trend in atrocities through 2013 and into 2014, but the real surge in violence that apparently began in late 2013 is only captured when we include campaigns in the cross-tabulation (the dotted line).

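[Figure: monthly deaths in the Central African Republic, incidents only vs. incidents and campaigns combined, PITF data]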

The same holds in South Sudan. There, the incident-level data available so far miss the explosion of civilian killings that began in December 2013 and reportedly continues, but the combination of campaign and incident data appears to capture a larger fraction of it, along with a notable spike in July 2013 related to clashes in Jonglei State.

[Figure: monthly deaths in South Sudan, incidents only vs. incidents and campaigns combined, PITF data]

These examples suggest that the PITF Worldwide Atrocities Dataset is doing a good job of capturing trends over time in lethal violence against civilians, even in some of the hardest-to-cover cases. To my knowledge, though, this data set has not been widely used by researchers interested in atrocities or political violence more broadly. Probably its most prominent use to date was in the Model component of the Tech Challenge for Atrocities Prevention, a 2013 crowdsourced competition funded by USAID and Humanity United. That challenge produced some promising results, but it remains one of the few applications of this data set on a subject for which reliable data are scarce. Here’s hoping this post helps to rectify that.

Disclosure: I was employed by SAIC as research director of PITF from 2001 until 2011. During that time, I helped to develop the initial version of this data set and was involved in decisions to fund its continued production. Since 2011, however, I have not been involved in either the production of the data or decisions about its continued funding. I am part of a group that is trying to secure funding for a follow-on project to the Model part of the Tech Challenge for Atrocities Prevention, but that effort would not necessarily depend on this data set.

Demography, Democracy, and Complexity

Six years ago, demographer Richard Cincotta claimed in a piece for Foreign Policy that a country’s age structure is a powerful predictor of its prospects for attempting and sustaining liberal democracy. “A country’s chances for meaningful democracy increase,” he wrote, “as its population ages.” Applying that superficially simple hypothesis to the data at hand, he ventured a forecast:

The first (and perhaps most surprising) region that promises a shift to liberal democracy is a cluster along Africa’s Mediterranean coast: Morocco, Algeria, Tunisia, Libya, and Egypt, none of which has experienced democracy in the recent past. The other area is in South America: Ecuador, Colombia, and Venezuela, each of which attained liberal democracy demographically “early” but was unable to sustain it. Interpreting these forecasts conservatively, we can expect there will be one, maybe two, in each group that will become stable democracies by 2020.

I read that article when it was published, and I recall being irritated by it. At the time, I had been studying democratization for more than 15 years and was building statistical models to forecast transitions to and from democracy as part of my paying job. Seen through those goggles, Cincotta’s construct struck me as simplistic to the point of naiveté. Democratization is a hard theoretical problem. States have arrived at and departed from democracy by many different pathways, so how could what amounts to a one-variable model possibly have anything useful to say about it?

Revisiting Cincotta’s work in 2014, I like it a lot more for a couple of reasons. First, I like the work better now because I have come to see it as an elegant representation of a larger idea. As Cincotta argues in that Foreign Policy article and another piece he published around the same time, demographic structure is one component of a much broader and more complex syndrome in which demography is both effect and cause. Changes in fertility rates, and through them age structure, are strongly shaped by other social changes like education and urbanization, which are correlated with, but hardly determined by, increases in national wealth.

Of course, that syndrome is what we conventionally call “development,” and the pattern Cincotta observes has a strong affinity with modernization theory. Cincotta’s innovation was to move the focus away from wealth, which has turned out to be unreliable as a driver and thus as a proxy for development in a larger sense, to demographic structure, which is arguably a more sensitive indicator of it. As I see it now, what we call development is part of a “state shift” occurring in human society at the global level that drives and is reinforced by long-term trends in democratization and violent conflict. As in any complex system, though, the visible consequences of that state shift aren’t evenly distributed.

In this sense, Cincotta’s argument is similar to one I often find myself making about the value of using infant mortality rates instead of GDP per capita as a powerful summary measure in models of a country’s susceptibility to insurgency and civil war. The idea isn’t that dead children motivate people to attack their governments, although that may be one part of the story. Instead, the idea is that infant mortality usefully summarizes a number of other things that are all related to conflict risk. Among those things are the national wealth we can observe directly (if imperfectly) with GDP, but also the distribution of that wealth and the state’s will and ability to deliver basic social services to its citizens. Seen through this lens, higher-than-average infant mortality helps us identify states suffering from a broader syndrome that renders them especially susceptible to violent conflict.

Second, I have also come to appreciate more what Cincotta was and is doing because I respect his willingness to apply his model to generate and publish probabilistic forecasts in real time. In professional and practical terms, that’s not always easy for scholars to do, but doing it long enough to generate a real track record can yield valuable scientific dividends.

In this case, it doesn’t hurt that the predictions Cincotta made six years ago are looking pretty good right now, especially in contrast to the conventional wisdom of the late 2000s on the prospects for democratization in North Africa. None of the five states he lists there yet qualifies as a liberal democracy on his terms (a “free” designation from Freedom House). Still, it’s only 2014, one of them (Tunisia) has moved considerably in that direction, and two others (Egypt and Libya) have seen seemingly frozen political regimes crumble and substantial attempts at democratization ensue. Meanwhile, the long-dominant paradigm in comparative democratization would have left us watching for splits among ruling elites that really only happened in those places as their regimes collapsed, and many area experts were telling us in 2008 to expect more of the same in North Africa as far as the mind could see. Not bad for a “one-variable model.”
