A Useful Data Set on Political Violence that Almost No One Is Using

For the past 10 years, the CIA has overtly funded the production of a publicly available data set on certain atrocities around the world that now covers the period from January 1995 to early 2014 and is still updated on a regular basis. If you work in a relevant field but didn’t know that, you’re not alone.

The data set in question is the Political Instability Task Force’s Worldwide Atrocities Dataset, which records information from several international press sources about situations in which five or more civilians are deliberately killed in the context of some wider political conflict. Each record includes information about who did what to whom, where, and when, along with a brief text description of the event, a citation for the source article(s), and, where relevant, comments from the coder. The data are updated monthly, although those updates are posted on a four-month lag (e.g., data from January become available in May).

The decision to limit collection to events involving at least five fatalities was a pragmatic one. As the data set’s codebook notes,

We attempted at one point to lower this threshold to one and the data collection demands proved completely overwhelming, as this involved assessing every murder and ambiguous accidental death reported anywhere in the world in the international media. “Five” has no underlying theoretical justification; it merely provides a threshold above which we can confidently code all of the reported events given our available resources.

For the past three years, the data set has also fudged this rule to include targeted killings that appear to have a political motive, even when only a single victim is killed. So, for example, killings of lawyers, teachers, religious leaders, election workers, and medical personnel are nearly always recorded, and these events are distinguished from ones involving five or more victims by a “Yes” in a field identifying “Targeted Assassinations” under a “Related Tactics” header.

The data set is compiled from stories appearing in a handful of international press sources that are accessed through Factiva. It is a computer-assisted process. A Boolean keyword search is used to locate potentially relevant articles, and then human coders read those stories and code records from the ones that turn out to be relevant. From the beginning, the PITF data set has pulled from Reuters, Agence France-Presse, Associated Press, and the New York Times. Early in the process, BBC World Monitor and CNN were added to the roster, and All Africa was added a few years ago to improve coverage of that region.

The decision to restrict collection to a relatively small number of sources was also a pragmatic one. Unlike GDELT, for example—the routine production of which is fully automated—the Atrocities Data Set is hand-coded by people reading news stories identified through a keyword search. With people doing the coding, the cost of broadening the search to local and web-based sources is prohibitive. The hope is eventually to automate the process, either as a standalone project or as part of a wider automated event data collection effort. As GDELT shows, though, that’s hard to do well, and that day hasn’t arrived yet.

Computer-assisted coding is far more labor intensive than fully automated coding, but it also carries some advantages. Human coders can still discern better than the best automated coding programs when numerous reports are all referring to the same event, so the PITF data set does a very good job eliminating duplicate records. Also, the “where” part of each record in the PITF data set includes geocoordinates, and its human coders can accurately resolve the location of nearly every event to at least the local administrative area, a task over which fully automated processes sometimes still stumble.

Of course, press reports only capture a fraction of all the atrocities that occur in most conflicts, and journalists writing about hard-to-cover conflicts often describe these situations with stories that summarize episodes of violence (e.g., “Since January, dozens of villagers have been killed…”). The PITF data set tries to accommodate this pattern by recording two distinct kinds of events: 1) incidents, which occur in a single place in a short period of time, usually a single day; and 2) campaigns, which involve the same perpetrator and target group but may occur in multiple places over a longer period of time, usually days but sometimes weeks or months.

The inclusion of these campaigns alongside discrete events allows the data set to capture more information, but it also requires careful attention when using the results. Most statistical applications of data sets like this one involve cross-tabulations of events or deaths at a particular level during some period of time—say, countries and months. That’s relatively easy to do with data on discrete events located in specific places and days. Here, though, researchers have to decide ahead of time if and how they are going to blend information about the two event types. There are two basic options: 1) ignore the campaigns and focus exclusively on the incidents, treating that subset of the data set like a more traditional one and ignoring the additional information; or 2) make a convenient assumption about the distribution of the incidents of which campaigns are implicitly composed and apportion them accordingly.

For example, if we are trying to count monthly deaths from atrocities at the country level, we could assume that deaths from campaigns are distributed evenly over time and assign equal fractions of those deaths to all months over which they extend. So, a campaign in which 30 people were reportedly killed in Somalia between January and March would add 10 deaths to the monthly totals for that country in each of those three months. Alternatively, we could include all of the deaths from a campaign in the month or year in which it began. Either approach takes advantage of the additional information contained in those campaign records, but there is also a risk of double counting, as some of the events recorded as incidents might be part of the violence summarized in the campaign report.
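
The two apportionment rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`type`, `country`, `deaths`, `start`, `end`) and the dates are hypothetical and do not reflect the actual column names in the PITF files. It does not address the double-counting risk, which would require matching incidents to the campaigns that may already summarize them.

```python
from collections import defaultdict
from datetime import date


def months_spanned(start, end):
    """Return (year, month) pairs from start to end, inclusive."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def monthly_deaths(records, spread_campaigns=True):
    """Sum deaths by (country, year, month).

    Incidents are counted in the month they start. Campaigns are either
    spread evenly over the months they span or assigned to their first month.
    """
    totals = defaultdict(float)
    for rec in records:
        if rec["type"] == "campaign" and spread_campaigns:
            months = months_spanned(rec["start"], rec["end"])
        else:
            months = [(rec["start"].year, rec["start"].month)]
        share = rec["deaths"] / len(months)
        for year, month in months:
            totals[(rec["country"], year, month)] += share
    return dict(totals)


# Hypothetical records mirroring the Somalia example in the text;
# the year and exact days are made up for illustration.
records = [
    {"type": "campaign", "country": "Somalia", "deaths": 30,
     "start": date(2013, 1, 10), "end": date(2013, 3, 20)},
    {"type": "incident", "country": "Somalia", "deaths": 7,
     "start": date(2013, 2, 5), "end": date(2013, 2, 5)},
]

even = monthly_deaths(records, spread_campaigns=True)    # option 1: spread evenly
front = monthly_deaths(records, spread_campaigns=False)  # option 2: start month
```

Comparing `even` and `front` for the same country shows how much the choice of rule moves the monthly totals, which is worth checking before settling on one.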

It is also important to note that this data set does not record atrocities in which the United States is either the alleged perpetrator or the target (e.g., 9/11) because of legal restrictions on the activities of the CIA, which funds the data set’s production. This constraint presumably has a bigger impact on some cases, such as Iraq and Afghanistan, than others.

To provide a sense of what the data set contains and to make it easier for other researchers to use it, I wrote an R script that ingests and cross-tabulates the latest iteration of the data in country-month and country-year bins and then plots some of the results. That script is now posted on GitHub (here).

One way to see how well the data set is capturing the trends we hope it will capture is to compare the figures it produces with ones from data sets in which we already have some confidence. While I was writing this post, Colombian “data enthusiast” Miguel Olaya tweeted a pair of graphs summarizing data on massacres in that country’s long-running civil war. The data behind his graphs come from the Rutas de Conflicto project, an intensive and well-reputed effort to document as many as possible of the massacres that have occurred in Colombia since 1980. Here is a screenshot of Olaya’s graph of the annual death counts from massacres in the Rutas data set since 1995, when the PITF data pick up the story:

Annual Deaths from Massacres in Colombia by Perpetrator (Source: Rutas de Conflicto)


Now here is a graph of deaths from the incidents in the PITF data set:

[Figure: annual deaths from incidents in Colombia, PITF data]

Just eyeballing the two charts, the correlation looks pretty good. Both show a sharp increase in the tempo of killing in the mid-1990s; a sustained peak around 2000; a steady decline over the next several years; and a relatively low level of lethality since the mid-2000s. The annual counts from the Rutas data are two or three times larger than the ones from the PITF data during the high-intensity years, but that makes sense when we consider how much deeper a search that project has conducted. There’s also a dip in the PITF totals in 1999 and 2000 that doesn’t appear in the Rutas data, but the comparisons over the larger span hold up. All things considered, this comparison makes the PITF data look quite good, I think.

Of course, the insurgency in Colombia has garnered better coverage from the international press than conflicts in parts of the world that are even harder to reach or less safe for correspondents than the Colombian highlands. On a couple of recent crises in exceptionally under-covered areas, the PITF data also seem to do a decent job capturing surges in violence, but only when we include campaigns as well as incidents in the counting.

The plots below show monthly death totals from a) incidents only and b) incidents and campaigns combined in the Central African Republic since 1995 and South Sudan since its independence in mid-2011. Here, deaths from campaigns have been assigned to the month in which the campaign reportedly began. In CAR, the data set identifies the upward trend in atrocities through 2013 and into 2014, but the real surge in violence that apparently began in late 2013 is only captured when we include campaigns in the cross-tabulation (the dotted line).

[Figure: monthly deaths in the Central African Republic, incidents only vs. incidents and campaigns, PITF data]

The same holds in South Sudan. There, the incident-level data available so far miss the explosion of civilian killings that began in December 2013 and reportedly continues, but the combination of campaign and incident data appears to capture a larger fraction of it, along with a notable spike in July 2013 related to clashes in Jonglei State.

[Figure: monthly deaths in South Sudan, incidents only vs. incidents and campaigns, PITF data]

These examples suggest that the PITF Worldwide Atrocities Dataset is doing a good job at capturing trends over time in lethal violence against civilians, even in some of the hardest-to-cover cases. To my knowledge, though, this data set has not been widely used by researchers interested in atrocities or political violence more broadly. Probably its most prominent use to date was in the Model component of the Tech Challenge for Atrocities Prevention, a 2013 crowdsourced competition funded by USAID and Humanity United. That challenge produced some promising results, but it remains one of the few applications of this data set on a subject for which reliable data are scarce. Here’s hoping this post helps to rectify that.

Disclosure: I was employed by SAIC as research director of PITF from 2001 until 2011. During that time, I helped to develop the initial version of this data set and was involved in decisions to fund its continued production. Since 2011, however, I have not been involved in either the production of the data or decisions about its continued funding. I am part of a group that is trying to secure funding for a follow-on project to the Model part of the Tech Challenge for Atrocities Prevention, but that effort would not necessarily depend on this data set.

Is the World Boiling Over or Just Getting Back to Normal?

Here’s a plot of observed and “predicted” rates of political instability onset around the world from 1956 to 2012, the most recent year for which I now have data. The dots are the annual rates, and the lines are smoothing curves fitted from those annual rates using local regression (or loess).

  • The observed rates come from the U.S. government-funded Political Instability Task Force (PITF), which identifies political instability through the occurrence of civil war, state collapse, contested state break-up, abrupt declines in democracy, or genocide or politicide. The observed rate is just the number of onsets that occurred that year divided by the number of countries in the world at the time.
  • The “predicted” probabilities come from an approximation of a model the PITF developed to assess risks of instability onset in countries worldwide. That model includes measures of infant mortality, political regime type, state-led communal discrimination, armed conflict in nearby states, and geographic region. (See this 2010 journal article on which I was a co-author for more info.) In the plot, the “predicted” rate (green) is the sum of the predicted probabilities for the year divided by the number of countries with predicted probabilities that year. I put predicted in quotes because these are in-sample estimates and not actual forecasts.
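
The two rates defined in the bullets above reduce to simple arithmetic. Here is a minimal Python sketch of that arithmetic; the function names and all of the numbers are made up for illustration and are not taken from the PITF data or model.

```python
def observed_rate(n_onsets, n_countries):
    """Onsets in a year divided by the number of countries that year."""
    return n_onsets / n_countries


def predicted_rate(probabilities):
    """Mean of the predicted probabilities for the countries that have one."""
    return sum(probabilities) / len(probabilities)


# Hypothetical single-year inputs: 4 onsets among 160 countries, and
# in-sample predicted probabilities for just four countries.
obs = observed_rate(4, 160)
pred = predicted_rate([0.01, 0.02, 0.05, 0.04])
```

Note that the denominators can differ: the observed rate uses every country in the world that year, while the predicted rate uses only the countries for which the model produced a probability.
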
Observed and Predicted Rates of Political Instability Onset Worldwide, 1956-2012


I see a couple of interesting things in that plot.

First, these data suggest that the anomaly we need to work harder to explain isn’t the present but the recent past. As the right-most third of the plot shows, the observed incidence of political instability was unusually low in the 1990s and 2000s. For the previous several decades, the average annual rate of instability onset was about 4 percent. Apart from some big spikes around decolonization and the end of the Cold War, the trend over time was pretty flat. Then came the past 20 years, when the annual rate has hovered around 2 percent, and the peaks have barely reached the Cold War–era average. In the context of the past half-century, then, any upticks we’ve seen in the past few years don’t seem so unusual. To answer the question in this post’s title, it looks like the world isn’t boiling over after all. Instead, it looks more like we’re returning to a state of affairs that was, until recently, normal.

Second, the differences between the observed and “predicted” rates suggest that the recent window of comparative stability can’t be explained by generic trends in the structural factors that best predict instability. If anything, the opposite is true. According to our structural model of instability risk, we should have seen an increase in the rate of these crises in the past 20 years, as more countries moved from dictatorial regimes to various transitional and hybrid forms of government. Instead, we saw the opposite. He or she who can explain why that’s so with a theory that accurately predicts where this trend is now headed deserves a…well, whatever prize political scientists would get if we had our own Fields Medal.

For the latest data on the political instability events PITF tracks, see the Center for Systemic Peace’s data page. For the data and code used to approximate the PITF’s global instability model, see this GitHub repository of mine.

A Coda to “Using GDELT to Monitor Atrocities, Take 2”

I love doing research in the Internet Age. As I’d hoped it would, my post yesterday on the latest iteration of our atrocities-monitoring system in the works has already sparked a lot of really helpful responses. Some of those responses are captured in comments on the post, but not all of them are. So, partly as a public good and partly for my own record-keeping, I thought I’d write a coda to that post enumerating the leads it generated and some of my reactions to them.

Give the Machines Another Shot at It

As a way to reduce or even eliminate the burden placed on our human(s) in the loop, several people suggested something we’ve been considering for a while: use machine-learning techniques to develop classifiers that can be used to further reduce the data left after our first round of filtering. These classifiers could consider all of the features in GDELT, not just the event and actor types we’re using in our R script now. If we’re feeling really ambitious, we could go all the way back to the source stories and use natural-language processing to look for additional discriminatory power there. This second round might not eliminate the need for human review, but it certainly could lighten the load.

The comment threads on this topic (here and here) nicely capture what I see as the promise and likely limitations of this strategy, so I won’t belabor it here. For now, I’ll just note that how well this would work is an empirical question, and it’s one we hope to get a chance to answer once we’ve accumulated enough screened data to give those classifiers a fighting chance.

Leverage GDELT’s Global Knowledge Graph

Related to the first idea, GDELT co-creator Kalev Leetaru has suggested on a couple of occasions that we think about ways to bring the recently created GDELT Global Knowledge Graph (GKG) to bear on our filtering task. As Kalev describes in a post on the GDELT blog, GKG consists of two data streams, one that records mentions of various counts and another that captures connections in each day’s news between “persons, organizations, locations, emotions, themes, counts, events, and sources.” That second stream in particular includes a bunch of data points that we can connect to specific event records and thus use as additional features in the kind of classifiers described under the previous header. In response to my post, Kalev sent this email to me and a few colleagues:

I ran some very very quick numbers on the human coding results Jay sent me where a human coded 922 articles covering 9 days of GDELT events and coded 26 of them as atrocities. Of course, 26 records isn’t enough to get any kind of statistical latch onto to build a training model, but the spectral response of the various GKG themes is quite informative. For events tagged as being an atrocity, themes such as ETHNICITY, RELIGION, HUMAN_RIGHTS, and a variety of functional actors like Villagers, Doctors, Prophets, Activists, show up in the top themes, whereas in the non-atrocities the roles are primarily political leaders, military personnel, authorities, etc. As just a simple example, the HUMAN_RIGHTS theme appeared in just 6% of non-atrocities, but 30% of atrocities, while Activists show up in 33% of atrocities compared with just 4% of non-atrocities, and the list goes on.

Again, 26 articles isn’t enough to build a model on, but just glancing over the breakdown of the GKG themes for the two there is a really strong and clear breakage between the two across the entire set of themes, and the breakdown fits precisely what baysean classifiers like (they are the most accurate for this kind of separation task and outperform SVM and random forest).

So, Jay, the bottom line is that if you can start recording each day the list of articles that you guys review and the ones you flag as an atrocity and give me a nice dataset over time, should be pretty easy to dramatically filter these down for you at the very least.

As I’ve said throughout this process, its not that event data can’t do what is needed, its that often you have to bring additional signals into the mix to accomplish your goals when the thing you’re after requires signals beyond what the event records are capturing.

What Kalev suggests at the end there—keep a record of all the events we review and the decisions we make on them—is what we’re doing now, and I hope we can expand on his experiment in the next several months.
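
To make the idea concrete, here is a rough Python sketch of the two steps Kalev describes: comparing how often each theme shows up in atrocity and non-atrocity records, and then fitting a simple Bayesian classifier on theme presence. Everything in it is an assumption for illustration. The records are invented, MILITARY and LEADER are made-up stand-ins for the kinds of role tags he mentions, and the Bernoulli naive Bayes with add-one smoothing is just one plausible choice, not a description of what he ran.

```python
import math
from collections import Counter


def theme_rates(records, label):
    """Share of records with the given label in which each theme appears."""
    subset = [themes for themes, y in records if y == label]
    counts = Counter(theme for themes in subset for theme in set(themes))
    return {theme: n / len(subset) for theme, n in counts.items()}


def train_naive_bayes(records, vocabulary):
    """Fit a Bernoulli naive Bayes model with add-one smoothing."""
    model = {}
    for label in (True, False):
        subset = [set(themes) for themes, y in records if y == label]
        prior = len(subset) / len(records)
        likelihood = {
            theme: (sum(theme in s for s in subset) + 1) / (len(subset) + 2)
            for theme in vocabulary
        }
        model[label] = (prior, likelihood)
    return model


def atrocity_probability(model, themes, vocabulary):
    """Posterior probability that a record with these themes is an atrocity."""
    themes = set(themes)
    log_scores = {}
    for label, (prior, likelihood) in model.items():
        score = math.log(prior)
        for theme in vocabulary:
            p = likelihood[theme]
            score += math.log(p if theme in themes else 1 - p)
        log_scores[label] = score
    top = max(log_scores.values())
    weights = {k: math.exp(v - top) for k, v in log_scores.items()}
    return weights[True] / (weights[True] + weights[False])


# Invented human-coded records: (themes on the record, coded as atrocity?)
records = [
    (["HUMAN_RIGHTS", "ETHNICITY"], True),
    (["HUMAN_RIGHTS", "RELIGION"], True),
    (["ETHNICITY"], True),
    (["MILITARY"], False),
    (["LEADER", "MILITARY"], False),
    (["LEADER"], False),
    (["HUMAN_RIGHTS", "LEADER"], False),
]
vocabulary = sorted({t for themes, _ in records for t in themes})
model = train_naive_bayes(records, vocabulary)

p_rights = atrocity_probability(model, ["HUMAN_RIGHTS", "ETHNICITY"], vocabulary)
p_military = atrocity_probability(model, ["MILITARY", "LEADER"], vocabulary)
```

With only a handful of records the probabilities mean very little, which is the same caveat Kalev raises about the 26 coded atrocities. The point of logging every screening decision is to grow `records` until a model like this has something to learn from.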

Crowdsource It

Jim Walsh left a thoughtful comment suggesting that we crowdsource the human coding:

Seems to me like a lot of people might be willing to volunteer their time for this important issue–human rights activists and NGO types, area experts, professors and their students (who might even get some credit and learn about coding). If you had a large enough cadre of volunteers, could assign many (10 or more?) to each day’s data and generate some sort of average or modal response. Would need someone to organize the volunteers, and I’m not sure how this would be implemented online, but might be do-able.

As I said in my reply to him, this is an approach we’ve considered but rejected for now. We’re eager to take advantage of the wisdom of interested crowds and are already doing so in big ways on other parts of our early-warning system, but I have two major concerns about how well it would work for this particular task.

The first is the recruiting problem, and here I see a Catch-22: people are less inclined to do this if they don’t believe the system works, but it’s hard to convince them that the system works if we don’t already have a crowd involved to make it go. This recruiting problem becomes especially acute in a system with time-sensitive deliverables. If we promise daily updates, we need to produce daily updates, and it’s hard to do that reliably if we depend on self-organized labor.

My second concern is the principal-agent problem. Our goal is to make reliable and valid data in a timely way, but there are surely people out there who would bring goals to the process that might not align with ours. Imagine, for example, that Absurdistan appears in the filtered-but-not-yet-coded data to be committing atrocities, but citizens (or even paid agents) of Absurdistan don’t like that idea and so organize to vote those events out of the data set. It’s possible that our project would be too far under the radar for anyone to bother, but our ambitions are larger than that, so we don’t want to assume that will be true. If we succeed at attracting the kind of attention we hope to attract, the deeply political and often controversial nature of our subject matter would make crowdsourcing this task more vulnerable to this kind of failure.
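
For what it’s worth, the “modal response” rule in Jim’s comment is easy to write down, and writing it down makes the vulnerability clear. The sketch below is hypothetical: the labels, the function name, and the agreement threshold are all my own choices for illustration.

```python
from collections import Counter


def modal_response(votes, min_share=0.5):
    """Return (label, share) for the most common vote.

    The label is None when the top vote does not exceed min_share,
    which flags the event for expert review instead of auto-acceptance.
    """
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    share = n / len(votes)
    return (label if share > min_share else None, share)


# Ten hypothetical volunteers coding the same event.
clear_case = modal_response(["atrocity"] * 7 + ["not atrocity"] * 3)
split_case = modal_response(["atrocity"] * 5 + ["not atrocity"] * 5)
```

A rule like this only counts heads, so a bloc that controls a majority of the votes on an event controls its coding. Raising `min_share` or sending low-agreement events to a trusted reviewer helps at the margins, but it doesn’t remove the underlying incentive problem.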

Use Mechanical Turk

Both of the concerns I have about the downsides of crowdsourcing the human-coding stage could be addressed by Ryan Briggs’ suggestion via Twitter to have Amazon Mechanical Turk do it. A hired crowd is there when you need it and (usually) doesn’t bring political agendas to the task. It’s also relatively cheap, and you only pay for work performed.

Thanks to our collaboration with Dartmouth’s Dickey Center, the marginal cost of the human coding isn’t huge, so it’s not clear that Mechanical Turk would offer much advantage on that front. Where it could really help is in routinizing the daily updates. As I mentioned in the initial post, when you depend on human action and have just one or a few people involved, it’s hard to establish a set of routines that covers weekends and college breaks and sick days and is robust to periodic changes in personnel. Primarily for this reason, I hope we’ll be able to run an experiment with Mechanical Turk where we can compare its cost and output to what we’re paying and getting now and see if this strategy might make sense for us.

Don’t Forget About Errors of Omission

Last but not least, a longtime colleague had this to say in an email reacting to the post (hyperlinks added):

You are effectively describing a method for reducing errors of commission, events coded by GDELT as atrocities that, upon closer inspection, should not be. It seems like you also need to examine errors of omission. This is obviously harder. Two possible opportunities would be to compare to either [the PITF Worldwide Atrocities Event Data Set] or to ACLED.  There are two questions. Is GDELT “seeing” the same source info (and my guess is that it is and more, though ACLED covers more than just English sources and I’m not sure where GDELT stands on other languages). Then if so (and there are errors of omission) why aren’t they showing up (coded as different types of events or failed to trigger any coding at all)[?]

It’s true that our efforts so far have focused almost exclusively on avoiding errors of commission, with the important caveat that it’s really our automated filtering process, not GDELT, that commits most of these errors. The basic problem for us is that GDELT, or really the CAMEO scheme on which it’s based, wasn’t designed to spot atrocities per se. As a result, most of what we filter out in our human-coding second stage aren’t things that were miscoded by GDELT. Instead, they’re things that were properly coded by GDELT as various forms of violent action but upon closer inspection don’t appear to involve the additional features of atrocities as we define them.

Of course, that still leaves us with this colleague’s central concern about errors of omission, and on that he’s absolutely right. I have experimented with different actor and event-type criteria to make sure we’re not missing a lot of events of interest in GDELT, but I haven’t yet compared what we’re finding in GDELT to what related databases that use different sources are seeing. Once we accumulate a few months’ worth of data, I think this is something we’re really going to need to do.
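
A first pass at that comparison could look something like the Python sketch below. It treats one source (say, the PITF data or ACLED) as the reference and asks which of its events have no counterpart in our screened GDELT records. The field names, the matching rule (same country, dates within a couple of days), and the sample events are all assumptions for illustration; real matching would probably need location and actor information as well.

```python
from datetime import date, timedelta


def find_omissions(reference, candidates, window_days=2):
    """Return reference events with no candidate event in the same
    country within window_days of the reference date."""
    window = timedelta(days=window_days)
    missed = []
    for ref in reference:
        matched = any(
            cand["country"] == ref["country"]
            and abs(cand["date"] - ref["date"]) <= window
            for cand in candidates
        )
        if not matched:
            missed.append(ref)
    return missed


# Hypothetical events from a reference data set and from our filtered output.
reference = [
    {"country": "Absurdistan", "date": date(2014, 1, 5)},
    {"country": "Absurdistan", "date": date(2014, 1, 20)},
]
candidates = [
    {"country": "Absurdistan", "date": date(2014, 1, 6)},
]

missed = find_omissions(reference, candidates)
```

Each event in `missed` then raises the colleague’s second question: was it never in GDELT’s sources, coded as a different event type, or dropped by our own filter?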

Stay tuned for Take 3…

Common Predictive Analytics Screw-Ups, Social Science Edition

Computerworld ran a great piece last week called “12 Predictive Analytics Screw-Ups” that catalogs some mistakes commonly made in statistical forecasting projects. Unfortunately, the language and examples are all from “industry”—what we used to call the private sector, I guess—so social scientists might read it and struggle to see the relevance to their work. To make that relevance clearer, I thought I’d give social-science-specific examples of the blunders that looked most familiar to me.

1. Begin without the end in mind.

I read this one as an admonition to avoid forecasting something just because you can, even if it’s not clear what those forecasts are useful for. More generally, though, I think it can also be read as a warning against fishing expeditions. If you poke around enough in a large data set on interstate wars, you’re probably going to find some variables that really boost your R-squared or Area under the ROC Curve, but the models you get from that kind of dredging will often perform a lot worse when you try to use them to forecast in real time.

2. Define the project around a foundation your data can’t support.

It’s really important to think early and often about how your forecasting process will work in real time. The data you have in hand when generating a forecast will often be incomplete and noisier than the nice, tidy matrix you had when you estimated the model(s) you’re applying, and it’s a good idea to plan around that fact when you’re doing the estimating.

Here’s the sort of thing I have in mind: Let’s say you discover that a country’s score on CIRI’s physical integrity index at the end of one year is a useful predictor of its risk of civil war onset during the next year. Awesome…until you remember that CIRI isn’t updated until late in the calendar year. Now what? You can lag it further, but that’s liable to weaken the predictive signal if the variable is dynamic and recent changes are informative. Alternatively, you can keep the one-year lag and try to update by hand, but then you risk adding even more noise to an already-noisy set of inputs. There’s a reason you use the data made by people who’ve spent a lot of time working on the topic and the coding procedures.

Unfortunately, there’s no elegant escape from this dilemma. The only general rules I can see are 1) to try to anticipate these problems and address them in the design phase and, where possible, 2) to check the impact of these choices on the accuracy of your forecasts and revise when the evidence suggests something else would work better.

3. Don’t proceed until your data is the best it can be.

4. When reviewing data quality, don’t bother to take out the garbage.

6. Don’t just proceed but rush the process because you know your data is perfect.

These three screw-ups underscore the tension between the need to avoid garbage in, garbage out (GIGO) modeling on the one hand and the need to be opportunistic on the other.

On many topics of interest to social scientists, we either have no data or the data we do have are sparse or lousy or both (see here for more on this point). Under these circumstances, you need to find ways to make the most of the information you’ve got, but you don’t want to pretend that you can spin straw into gold.

Again, there’s no elegant escape from these trade-offs. That said, I think it’s generally true that there’s almost always a significant payoff to be had from getting familiar with the data you’re using, and from doing what you can to make those data cleaner or more complete without setting yourself up for failure at the forecasting stage (e.g., your multiple-imputation algorithm might expand your historical sample, but it won’t give you the latest values of the predictors it was used to infill).

5. Use data from the future to predict the future.

So you’ve estimated a logistic regression model and, using cross-validation, discovered that it works really well out of sample. Then you look at the estimated coefficients and discover that one variable really seems to be driving that result. Then you look closer and discover that this variable is actually a consequence of the dependent variable. Doh!

I had this happen once when I was trying to develop a model that could be used to forecast transitions to and from democracy (see here for the results). At the exploratory stage, I found that a variable which counts the number of democracy episodes a country has experienced was a really powerful predictor of transitions to democracy. Then I remembered that this counter—which I’d coded—ticks up in the year that a transition occurs, so of course higher values were associated with a higher probability of transition. In this case, the problem was easily solved by lagging the predictor, but the problem and its solution won’t always be that obvious. Again, knowing your data should go a long way toward protecting you against this error.
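
Here is a small Python sketch of that fix, lagging a predictor by one year within each country so that the value used for year t was already known at the end of year t-1. The column names and the toy country-year rows are hypothetical, not the variables from my actual data.

```python
def lag_within_country(rows, column, periods=1):
    """Return copies of rows with `column`_lag set to the value of `column`
    from `periods` years earlier in the same country (None if unavailable)."""
    lookup = {(r["country"], r["year"]): r[column] for r in rows}
    lagged = []
    for r in rows:
        new_row = dict(r)
        new_row[column + "_lag"] = lookup.get((r["country"], r["year"] - periods))
        lagged.append(new_row)
    return lagged


# Toy country-years: the episode counter ticks up in 2002, the same year
# the transition occurs, so the unlagged counter leaks the outcome.
rows = [
    {"country": "A", "year": 2000, "transition": 0, "episodes": 0},
    {"country": "A", "year": 2001, "transition": 0, "episodes": 0},
    {"country": "A", "year": 2002, "transition": 1, "episodes": 1},
    {"country": "A", "year": 2003, "transition": 0, "episodes": 1},
]

lagged = lag_within_country(rows, "episodes")
```

In the lagged version, the 2002 row carries the counter’s 2001 value, so the predictor no longer contains information produced by the event it is supposed to predict.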

8. Ignore the subject-matter experts when building your model.

For me, a forecasting tournament we conducted as part of the work of the Political Instability Task Force (PITF) really drove this point home (see here). We got better results when we restricted our vision to a smaller set of variables selected by researchers who’d been immersed in the material for a long time than we did when we applied those same methods to a much larger set of variables that we happened to have available to us.

This is probably less likely to be a problem for academics, who are more likely to try to forecast on topics they know and care about, than it is for “data scientists” and “hackers” who are often asked to throw the methods they know at problems on all sorts of unfamiliar topics. Still, even when you’re covering territory that seems familiar, it’s always a good idea to brush up on the relevant literature and ask around as you get started. A single variable often makes a significant difference in the predictive power of a forecasting algorithm.

9. Just assume that the keepers of the data will be fully on board and cooperative.

Data sets that are licensed and are therefore either too expensive to keep buying or impossible to include in replication files. Boutique data sets that cover really important topics but were created once and aren’t updated. Data that are embargoed while someone waits to publish from them. Data sets whose creators start acting differently when they hear that their data are useful for forecasting.

These are all problems I’ve run into, and they can effectively kill an applied forecasting project. Better to clear them up early or set aside the relevant data sets than to paint yourself into this kind of corner, which can be very frustrating.

10. If you build it, they will come; don’t worry about how to serve it up.

I still don’t feel like I have a great handle on how to convey probabilistic forecasts to non-statistical audiences, or which parts of those forecasts are most useful to which kinds of audiences. This is a really hard problem that has a huge effect on the impact of the work, and in my experience, having forecasts that are demonstrably accurate doesn’t automatically knock these barriers down (just ask Nate Silver).

The two larger lessons I take from my struggles with this problem are 1) to incorporate thinking about how the forecasts will be conveyed into the research design and 2) to consider presenting the forecasts in different ways to different audiences.

Regarding the first, the idea is to avoid methods that your intended audience won’t understand or at least tolerate. For example, if your audience is going to want information about the relative influence of various predictors on the forecasts in specific cases, you’re going to want to avoid “black box” algorithms that make it hard or impossible to recover that information.

Regarding the second, my point is not to assume that you know the single “right” way to communicate your forecasts. In fact, I think it’s a good idea to be experimental if you can. Try presenting the forecasts a few different ways (maps or dot plots, point estimates or distributions, cross-sectional comparisons or time series), see which ones resonate with which audiences, and tailor your publications and presentations accordingly.

11. If the results look obvious, throw out the model.

Even if it’s not generating a lot of counter-intuitive results, a reasonably accurate forecasting model can still be really valuable in a couple of ways. First, it’s a great baseline for further research. Second, when a model like that occasionally does serve up a counter-intuitive result, that forecast will often reward a closer review. Closer review of the cases that do land far from the regression line may hold some great leads on variables your initial model overlooked.

This often comes up in my work on forecasting rare events like coups and mass killings. Most years, most of the countries that my forecasts identify as riskiest are pretty obvious. It doesn’t take a model to tell me that Sweden probably won’t have a coup this year but Mali or Sudan might, so people often respond to the forecasts by saying, “I already knew all that.” When they slow down and give the forecasts a closer look, though, they’ll usually find at least a few cases on either tail of the distribution that don’t match their priors. To my mind, these handfuls of surprises are really the point of the exercise. The conversations that start in response to these counter-intuitive results are the reason we use statistical models instead of just asking people what they think.

If I had to sum up all of these lessons into a single suggestion, it would be to learn by doing. Applied forecasting is a very different problem from hypothesis testing and even from data mining. You have to live the process a few times to really appreciate its difficulties, and those difficulties can vary widely across different forecasting problems. Ideally, you’ll pick a problem, work it, and generate forecasts in real time for a while so you get feedback not just on your accuracy, but also on how sustainable your process is. To avoid hindsight bias, make the forecasts public as you produce them, or at least share them with some colleagues as you go.

Lost in the Fog of Civil War in Syria

On Twitter a couple of days ago, Adam Elkus called out a recent post on Time magazine’s World blog as evidence of the way that many people’s expectations about the course of Syria’s civil war have zigged and zagged over the past couple of years. “Last year press was convinced Assad was going to fall,” Adam tweeted. “Now it’s that he’s going to win. Neither perspective useful.” To which the eminent civil-war scholar Stathis Kalyvas replied simply, “Agreed.”

There’s a lesson here for anyone trying to glean hints about the course of a civil war from press accounts of a war’s twists and turns. In this case, it’s a lesson I’m learning through negative feedback.

Since early 2012, I’ve been a participant/subject in the Good Judgment Project (GJP), a U.S. government-funded experiment in “wisdom of crowds” forecasting. Over the past year, GJP participants have been asked to estimate the probability of several events related to the conflict in Syria, including the likelihood that Bashar al-Assad would leave office and the likelihood that opposition forces would seize control of the city of Aleppo.

I wouldn’t describe myself as an expert on civil wars, but during my decade of work for the Political Instability Task Force, I spent a lot of time looking at data on the onset, duration, and end of civil wars around the world. From that work, I have a pretty good sense of the typical dynamics of these conflicts. Most of the civil wars that have occurred in the past half-century have lasted for many years. A very small fraction of those wars flared up and then ended within a year. The ones that didn’t end quickly—in other words, the vast majority of these conflicts—almost always dragged on for several more years at least, sometimes even for decades. (I don’t have my own version handy, but see Figure 1 in this paper by Paul Collier and Anke Hoeffler for a graphical representation of this pattern.)

On the whole, I’ve done well in the Good Judgment Project. In the year-long season that ended last month, I ranked fifth among the 303 forecasters in my experimental group, all while the project was producing fairly accurate forecasts on many topics. One thing that’s helped me do well is my adherence to what you might call the forecaster’s version of the Golden Rule: “Don’t neglect the base rate.” And, as I just noted, I’m also quite familiar with the base rates of civil-war duration.

So what did I do when asked by GJP to think about what would happen in Syria? I chucked all that background knowledge out the window and chased the very narrative that Elkus and Kalyvas rightly decry as misleading.

Here’s a chart showing how I assessed the probability that Assad wouldn’t last as president beyond the end of March 2013, starting in June 2012. The actual question asked us to divide the probability of his exiting office across several time periods, but for simplicity’s sake I’ve focused here on the part indicating that he would stick around past April 1. This isn’t the same thing as the probability that the war would end, of course, but it’s closely related, and I considered the two events as tightly linked. As you can see, until early 2013, I was pretty confident that Assad’s fall was imminent. In fact, I was so confident that at a couple of points in 2012, I gave him zero chance of hanging on past March of this year—something a trained forecaster really never should do.

gjp assad chart

Now here’s another chart showing my estimates of the likelihood that rebels would seize control of Aleppo before May 1, 2013. The numbers are a little different, but the basic pattern is the same. I started out very confident that the rebels would win the war soon and only swung hard in the opposite direction in early 2013, as the boundaries of the conflict seemed to harden.

gjp aleppo chart

It’s impossible to say what the true probabilities were in this or any other uncertain situation. Maybe Assad and Aleppo really were on the brink of falling for a while and then the unlikely-but-still-possible version happened anyway.

That said, there’s no question that forecasts more tightly tied to the base rate would have scored a lot better in this case. Here’s a chart showing what my estimates might have looked like had I followed that rule, using approximations of the hazard rate from the chart in the Collier and Hoeffler paper. If anything, these numbers overstate the likelihood that a civil war will end at a given point in time.

gjp baserate chart
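The base-rate logic behind that last chart can be sketched in a few lines. This is only an illustration: it assumes a constant monthly hazard of war termination, and the 2% figure is a made-up placeholder, not a value taken from the Collier and Hoeffler paper. The function names are likewise hypothetical.

```python
def prob_still_ongoing(monthly_hazard: float, months: int) -> float:
    """Probability a conflict has NOT ended after `months` months,
    assuming the same chance of ending (the hazard) in every month."""
    if not 0.0 <= monthly_hazard <= 1.0:
        raise ValueError("monthly_hazard must be between 0 and 1")
    if months < 0:
        raise ValueError("months must be non-negative")
    return (1.0 - monthly_hazard) ** months


def prob_ends_within(monthly_hazard: float, months: int) -> float:
    """Probability the conflict ends at some point in the next `months` months."""
    return 1.0 - prob_still_ongoing(monthly_hazard, months)


# ASSUMPTION: an illustrative 2% chance per month that the war ends.
ASSUMED_MONTHLY_HAZARD = 0.02

for horizon in (3, 6, 10):
    p = prob_ends_within(ASSUMED_MONTHLY_HAZARD, horizon)
    print(f"{horizon:>2} months ahead: P(war ends) = {p:.3f}")
```

With any small hazard, the implied probability of an ending inside a window of a few months stays low and shrinks steadily as the deadline approaches, which is the shape a base-rate forecast would have had.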

I didn’t keep a log spelling out my reasoning at each step, but I’m pretty confident that my poor performance here is an example of motivated reasoning. I wanted Assad to fall and the pro-democracy protesters who dominated the early stages of the uprising to win, and that desire shaped what I read and then remembered when it came time to forecast. I suspect that many of the pieces I was reading were slanted by similar hopes, creating a sort of analytic cascade similar to the herd behavior thought to drive many financial-market booms and busts. I don’t have the data to prove it, but I’m pretty sure the ups and downs in my forecasts track the evolving narrative in the many newspaper and magazine stories I was reading about the Syrian conflict.

Of course, that kind of herding happens on a lot of topics, and I was usually good at avoiding it. For example, when tensions ratcheted up on the Korean Peninsula earlier this year, I hewed to the base rate and didn’t substantially change my assessment of the risk that real clashes would follow.

What got me in the case of Syria was, I think, a sense of guilt. The Assad government has responded to a legitimate popular challenge with mass atrocities that we routinely read about and sometimes even see. In parts of the country, the resulting conflict is producing scenes of absurd brutality. This isn’t a “problem from hell,” as Samantha Power’s book title would have it; it is a glimpse of hell. And yet, in the face of that horror, I have publicly advocated against American military intervention. Upon reflection, I wonder if my wildly optimistic forecasting about the imminence of Assad’s fall wasn’t my unconscious attempt to escape the discomfort of feeling complicit in the prolongation of that suffering.

As a forecaster, if I were doing these questions over, I would try to discipline myself to attend to the base rate, but I wouldn’t necessarily stop there. As I’ve pointed out in a previous post, the base rate is a valuable anchoring device, but attending to it doesn’t mean automatically ignoring everything else. My preferred approach, when I remember to have one, is to take that base rate as a starting point and then use Bayes’ theorem to update my forecasts in a more disciplined way. Still, I’ll bring a newly skeptical eye to the flurry of stories predicting that Assad’s forces will soon defeat Syria’s rebels and keep their patron in power. Now that we’re a couple years into the conflict, quantified history tells us that the most likely outcome in any modest slice of time (say, months rather than years) is, tragically, more of the same.

And, as a human, I’ll keep hoping the world will surprise us and take a different turn.
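For readers who want to see the mechanics of the approach described above (start from the base rate, then update with Bayes’ theorem), here is a minimal sketch. The prior and the two likelihoods are invented for illustration; in practice, judging how much more likely a given news report is under one outcome than the other is the hard part.

```python
def bayes_update(prior: float,
                 p_evidence_if_true: float,
                 p_evidence_if_false: float) -> float:
    """Return P(hypothesis | evidence) via Bayes' theorem.

    prior               -- P(hypothesis) before seeing the evidence (the base rate)
    p_evidence_if_true  -- P(evidence | hypothesis is true)
    p_evidence_if_false -- P(evidence | hypothesis is false)
    """
    # Refuse priors of exactly 0 or 1: no evidence could ever move them.
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    numerator = prior * p_evidence_if_true
    denominator = numerator + (1.0 - prior) * p_evidence_if_false
    if denominator == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / denominator


# ASSUMPTIONS (all hypothetical): a 10% base rate that the regime falls within
# the forecast window, and a news story judged twice as likely to appear if a
# fall were really coming (0.6) as if it were not (0.3).
posterior = bayes_update(prior=0.10, p_evidence_if_true=0.6, p_evidence_if_false=0.3)
print(f"Updated probability: {posterior:.3f}")
```

Even evidence judged twice as likely under the hypothesis moves a 10% prior to less than 20%, which is a long way from the near-certainty I was reporting in 2012.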

Do Elections Trigger Mass Atrocities?

Kenya plans to hold general elections in early March this year, and many observers fear those contests will spur a reprise of the mass violence that swept parts of that country after balloting in December 2007. The Sentinel Project for Genocide Prevention says Kenya is at “high risk” of genocide in 2013, and a recent contingency-planning memo from Joel Barkan for the Council on Foreign Relations asserts that “there will almost certainly be further incidents of violence in the run-up to the 2013 elections.” As a recent Africa Initiative backgrounder points out, this violence has roots that stretch much deeper than the 2007 elections, but the fear that mass violence will flare again around this year’s balloting seems well founded.

All of which got me wondering: is this a generic problem? We know that election-related violence is a real and multifaceted thing. We also have works by Jack Snyder and Amy Chua, among others, arguing that democratization actually makes some countries more susceptible to ethnic and nationalist conflict rather than less, as democracy promoters often claim. What I’m wondering, though—as someone who has long studied democratization and is currently working on tools to forecast genocide and other forms of mass killing—is whether or not elections substantially increase the risk of mass atrocities in particular, where “mass atrocities” means the deliberate killing of large numbers of unarmed civilians for apparently political ends.

Best I can tell, the short answer is no. After applying a few different statistical-modeling strategies to a few measures of atrocities, I see little evidence that elections commonly trigger the onset or intensification of this type of political violence. The absence of evidence isn’t the same thing as evidence of absence, but these results convince me that national elections aren’t a major risk factor for mass killing.

If you’re interested in the technical details, here’s what I did and what I found:

My first cut at the problem looked for a connection between national elections and the onset of state-sponsored mass killings, defined as “a period of sustained violence” in which “the actions of state agents result in the intentional death of at least 1,000 noncombatants from a discrete group.” That latter definition comes from work Ben Valentino and I did for my old research program, the Political Instability Task Force, and it restricts the analysis to episodes of large-scale killing by states or other groups acting at their behest. Defined as such, mass killings are akin to genocide in their scale, and there have only been about 110 of them since 1945.

So, do national elections help trigger this type of mass killing? To try to answer this question, I thought of elections as a kind of experimental “treatment” that some country-years get and others don’t. I used the National Elections Across Democracy and Autocracy (NELDA) data set to identify country-years since 1945 with national elections for chief executive or legislature or both, regardless of how competitive those elections were. I then used the MatchIt package in R to set up a comparison of country-years with and without elections within 107 groups that matched exactly on several other variables identified by prior research as risk factors for mass-killing onset: autocracy vs. democracy, exclusionary elite ideology (yes/no), salient elite ethnicity (yes/no), ongoing armed conflict (yes/no), any mass killing since 1945 (yes/no), and Cold War vs. post-Cold War period. Finally, I used conditional logistic regression to estimate the difference in risk between election and non-election years within those groups.

The results? In my data, mass-killing episodes were 80% as likely to begin in election years as non-election years, other things being equal. The 95% confidence interval for this association was wide (45% to 145%), but the result suggests that, if anything, countries are actually somewhat less prone to suffer onsets of mass killing in election years than in non-election years.
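To make the within-group comparison concrete, here is a rough sketch of the general idea in Python. To be clear about what it is not: it does not use MatchIt or conditional logistic regression. It swaps in the Mantel-Haenszel pooled odds ratio, a simpler classical estimator for a yes/no treatment compared within exactly matched strata. The strata labels and counts are invented, and the function names are hypothetical.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Pooled odds ratio across strata.

    records -- iterable of (stratum, treated, outcome) tuples, where
               treated and outcome are 0/1 (e.g., election year, onset).
    """
    # Per-stratum 2x2 counts: [treated & onset, treated & no onset,
    #                          untreated & onset, untreated & no onset]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for stratum, treated, outcome in records:
        index = (0 if outcome else 1) if treated else (2 if outcome else 3)
        tables[stratum][index] += 1

    numerator = denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0.0:
        raise ValueError("odds ratio undefined for these strata")
    return numerator / denominator


def expand(stratum, a, b, c, d):
    """Turn 2x2 cell counts into individual (stratum, treated, outcome) records."""
    return ([(stratum, 1, 1)] * a + [(stratum, 1, 0)] * b
            + [(stratum, 0, 1)] * c + [(stratum, 0, 0)] * d)


# HYPOTHETICAL data: two matched groups of country-years.
toy_records = expand("group A", 1, 9, 2, 8) + expand("group B", 2, 18, 3, 17)
print(f"Pooled odds ratio: {mantel_haenszel_odds_ratio(toy_records):.2f}")
```

A pooled ratio below 1 means onsets were less common in treated (election) years than in comparable untreated years, which is how to read the 80% estimate reported above.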

I wondered if the risk might differ by regime type, so I reran the analysis on the subset of cases that were plausibly democratic. The estimate was effectively unchanged (80%, CI of 35% to 185%). Then I thought it might be a post-Cold War thing and reran the analysis using only country-years from 1991 forward. The estimate moved, but in the opposite of the anticipated direction. Now it was down to 60%, with a CI of 17% to 215%.

These estimates got me worried that something had gone wacky in my data, so I reran the matching and conditional logistic regression using coup attempts (successful or failed) instead of elections as the “treatment” of interest. Several theorists have identified threats to incumbents’ power as a cause of mass atrocities, and coups are a visible and discrete manifestation of such threats. My analysis strongly confirmed this view, indicating that mass-killing episodes were nearly five times as likely to start in years with coup attempts as years without, other things being equal. More important for present purposes, this result increased my confidence in the reliability of my earlier finding on elections, as did the similar estimates I got from models with country fixed effects, country-specific intercepts (a.k.a. random effects), and interaction terms that allowed the effects of elections to vary across regime types and historical eras.

Then I wondered if this negative finding wasn’t an artifact of the measure I was using for mass atrocities. The 1,000-death threshold for “mass killing” is quite high, and the restriction to killings by states or their agents ignores situations of grave concern in which rebel groups or other non-state actors are the ones doing the murdering. Maybe the danger of election years would be clearer if I looked at atrocities on a smaller scale and ones perpetrated by non-state actors.

To do this, I took the UCDP One-Sided Violence Dataset v1.4 and wrote an R script that aggregated its values for specific conflicts into annual death counts by country and perpetrator (government or non-government). Then I used R’s ‘pscl’ package to estimate zero-inflated negative binomial regression (ZINB) models that treat the death counts as the observable results of a two-stage process: one that determines whether or not a country has any one-sided killing in a particular year, and then another that determines how many deaths occur, conditional on there being any. In addition to my indicator for election years, these models included all the risk factors used in the earlier matching exercise, plus population size and the logged counts of deaths from one-sided violence by government and non-government actors (separately) in the previous year. All of these variables were included in the logistic regression “hurdle” model; only elections, population size, and the lagged death counts were included in the conditional count models.
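The aggregation step is simple enough to sketch. The version below is in Python rather than R, and it is not the script I actually used. The field names (`country`, `year`, `is_government`, `deaths`) are assumptions made for illustration and do not match the UCDP file’s real column names, and the rows are made up.

```python
from collections import defaultdict


def annual_death_counts(rows):
    """Sum conflict-level death estimates into totals keyed by
    (country, year, perpetrator type)."""
    totals = defaultdict(int)
    for row in rows:
        perpetrator = "government" if row["is_government"] else "non-government"
        totals[(row["country"], row["year"], perpetrator)] += row["deaths"]
    return dict(totals)


# HYPOTHETICAL conflict-level records; not real UCDP values.
toy_rows = [
    {"country": "Country A", "year": 2005, "is_government": True, "deaths": 100},
    {"country": "Country A", "year": 2005, "is_government": True, "deaths": 50},
    {"country": "Country A", "year": 2005, "is_government": False, "deaths": 30},
    {"country": "Country B", "year": 2006, "is_government": False, "deaths": 75},
]

for key, deaths in sorted(annual_death_counts(toy_rows).items()):
    print(key, deaths)
```

Country-years with no recorded one-sided violence do not appear in this output at all. They have to be added back as zeros before modeling, since those zeros are exactly what the first stage of the two-stage model is meant to explain.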

To my surprise once again, the results suggested that, if anything, the risk of mass atrocities is actually lower in years with national elections. In the model of government-perpetrated violence, the coefficient for the election indicator in the hurdle model was barely distinguishable from zero (0.04), and the association in the count portion was modestly negative (-0.20, s.e. of 0.20). In the model of violence perpetrated by other groups, the effect in the hurdle portion was modestly negative (-0.25, s.e. of 0.20), and the effect in the count portion was decidedly negative (-0.82, s.e. of 0.19). When I reran the models with separate indicators for executive and legislative elections, the results bounced around a little bit, but the basic patterns remained unchanged. None of the models showed a substantial, positive association between either type of election and the occurrence or scale of one-sided violence against civilians.
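If coefficients on this scale are hard to read, a quick conversion helps. The sketch below assumes the usual link functions (logit for the yes/no portion, log for the count portion) and that the yes/no portion is parameterized so that positive values mean higher odds of any killing, as my reading above implies. Under those assumptions, exponentiating a coefficient gives a multiplicative ratio, and dividing it by its standard error gives a rough z-statistic. The helper name is hypothetical, and the interval is a simple normal approximation.

```python
import math


def summarize_coefficient(coef: float, se: float) -> dict:
    """Convert a log-scale coefficient into a ratio, a z-statistic, and an
    approximate 95% interval for the ratio (normal approximation)."""
    if se <= 0:
        raise ValueError("standard error must be positive")
    return {
        "ratio": math.exp(coef),
        "z": coef / se,
        "ci_95": (math.exp(coef - 1.96 * se), math.exp(coef + 1.96 * se)),
    }


# Coefficients and standard errors as reported in the text above.
reported = {
    "government, count portion": (-0.20, 0.20),
    "non-government, hurdle portion": (-0.25, 0.20),
    "non-government, count portion": (-0.82, 0.19),
}

for label, (coef, se) in reported.items():
    s = summarize_coefficient(coef, se)
    low, high = s["ci_95"]
    print(f"{label}: ratio={s['ratio']:.2f}, z={s['z']:.2f}, "
          f"95% CI=({low:.2f}, {high:.2f})")
```

Only the last of the three has an approximate interval that excludes a ratio of 1, which is why I describe that effect as decidedly negative and the other two as modest.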

In light of the weakness of the observed effects, the noisiness of the measures employed, and my prior beliefs about the effects of elections on risks of mass killing—shaped in part by the Kenyan case I discussed at the start of this post—I’m not quite ready to assert that election years actually reduce the risk of mass atrocities. What I am more comfortable doing, however, is ignoring elections in statistical models meant to forecast mass atrocities across large numbers of countries.

If you’re interested in replicating or tweaking this analysis, please email me at ulfelder@gmail.com, and I’ll be happy to send you the data and R scripts (one to get country-year summaries of the UCDP data, another to run the matching and modeling) I used to do it. [UPDATE: I've put the scripts and data in a publicly accessible folder on Google Drive. If you try that link and it doesn't work, please let me know.] Ideally, I would cut out the middleman by putting them in a GitHub repository, but I haven’t quite figured out how to do that yet. If you’re in the DC area and interested in getting paid to walk me through that process, please let me know.

Forecasting Round-Up No. 2

N.B. This is the second in an occasional series of posts I’m expecting to do on forecasting miscellany. You can find the first one here.

1. Over at Bad Hessian a few days ago, Trey Causey asked, “Where are the predictions in sociology?” After observing how the accuracy of some well-publicized forecasts of this year’s U.S. elections has produced “growing public recognition that quantitative forecasting models can produce valid results,” Trey wonders:

If the success of these models in forecasting the election results is seen as a victory for social science, why don’t sociologists emphasize the value of prediction and forecasting more? As far as I can tell, political scientists are outpacing sociologists in this area.

I gather that Trey intended his post to stimulate discussion among sociologists about the value of forecasting as an element of theory-building, and I’m all for that. As a political scientist, though, I found myself focusing on the comparison Trey drew between the two disciplines, and that got me thinking again about the state of forecasting in political science. On that topic, I had two brief thoughts.

First, my simple answer to why forecasting is getting more attention from political scientists than it used to is: money! In the past 20 years, arms of the U.S. government dealing with defense and intelligence seem to have taken a keener interest in using tools of social science to try to anticipate various calamities around the world. The research program I used to help manage, the Political Instability Task Force (PITF), got its start in the mid-1990s for that reason, and it’s still alive and kicking. PITF draws from several disciplines, but there’s no question that it’s dominated by political scientists, in large part because the events it tries to forecast (civil wars, mass killings, state collapses, and such) are traditionally the purview of political science.

I don’t have hard data to back this up, but I get the sense that the number and size of government contracts funding similar work has grown substantially since the mid-1990s, especially in the past several years. Things like the Department of Defense’s Minerva Initiative; IARPA’s ACE Program; the ICEWS program that started under DARPA and is now funded by the Office of Naval Research; and Homeland Security’s START consortium come to mind. Like PITF, all of these programs are interdisciplinary by design, but many of the topics they cover have their theoretical centers of gravity in political science.

In other words, through programs like these, the U.S. government is now spending millions of dollars each year to generate forecasts of things political scientists like to think about. Some of that money goes to private-sector contractors, but some of it is also flowing to research centers at universities. I don’t think any political scientists are getting rich off these contracts, but I gather there are bureaucratic and career incentives (as well as intellectual ones) that make the contracts rewarding to pursue. If that’s right, it’s not hard to understand why we’d be seeing more forecasting come out of political science than we used to.

My second reaction to Trey’s question is to point out that there actually isn’t a whole lot of forecasting happening in political science, either. That might seem like it contradicts the first, but it really doesn’t. The fact is that forecasting has long been pooh-poohed in academic social sciences, and even if that’s changing at the margins in some corners of the discipline, it’s still a peripheral endeavor.

The best evidence I have for this assertion is the brief history of the American Political Science Association’s Political Forecasting Group. To my knowledge—which comes from my participation in the group since its establishment—the Political Forecasting Group was only formed several years ago, and its membership is still too small to bump it up to the “organized section” status that groups representing more established subfields enjoy. What’s more, almost all of the panels the group has sponsored so far have focused on forecasts of U.S. elections. That’s partly because those papers are popular draws in election years, but it’s also because the group’s leadership has had a really hard time finding enough scholars doing forecasting on other topics to assemble panels.

If the discipline’s flagship association in one of the countries most culturally disposed to doing this kind of work has trouble cobbling together occasional panels on forecasts of things other than elections, then I think it’s fair to say that forecasting still isn’t a mainstream pursuit in political science, either.

2. Speaking of U.S. election forecasting, Drew Linzer recently blogged a clinic in how statistical forecasts should be evaluated. Via his web site, Votamatic, Drew:

1) began publishing forecasts about the 2012 elections well in advance of Election Day (so there couldn’t be any post hoc hemming and hawing about what his forecasts really were);

2) described in detail how his forecasting model works;

3) laid out a set of criteria he would use to judge those forecasts after the election; and then

4) walked us through his evaluations soon after the results were (mostly) in.

Oh, and in case you’re wondering: Drew’s model performed very well, thank you.

3. But you know what worked a little better than Drew’s election-forecasting model, and pretty much everyone else’s, too? An average of the forecasts from several of them. As it happens, this pattern is pretty robust. A well-designed statistical model is great for forecasting, but an average of forecasts from a number of them is usually going to be even better. Just ask the weather guys.
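A toy example shows why averaging tends to help. The sketch below scores three made-up sets of probabilistic forecasts with the Brier score (mean squared error between forecast and outcome, lower is better) and compares them with their simple average. The forecasts, outcomes, and function names are all invented for illustration.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)


def average_forecasts(model_forecasts):
    """Unweighted average of several models' forecasts, question by question."""
    n_models = len(model_forecasts)
    return [sum(ps) / n_models for ps in zip(*model_forecasts)]


# HYPOTHETICAL outcomes and forecasts for four yes/no questions.
outcomes = [1, 0, 1, 0]
models = {
    "model A": [0.9, 0.4, 0.6, 0.2],
    "model B": [0.6, 0.1, 0.8, 0.5],
    "model C": [0.7, 0.3, 0.4, 0.1],
}

individual = {name: brier_score(ps, outcomes) for name, ps in models.items()}
ensemble = brier_score(average_forecasts(list(models.values())), outcomes)
mean_individual = sum(individual.values()) / len(individual)

for name, score in individual.items():
    print(f"{name}: {score:.4f}")
print(f"average of individual scores: {mean_individual:.4f}")
print(f"score of averaged forecast:   {ensemble:.4f}")
```

Because squared error is convex, the averaged forecast can never score worse than the average of the individual scores. It is not guaranteed to beat the single best model, though, which is why the claim above says "usually" rather than "always."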

4. Finally, for those of you—like me—who want to keep holding pundits’ feet to the fire long after the election’s over, rejoice that Pundit Tracker is now up and running, and they even have a stream devoted specifically to politics. Among other things, they’ve got John McLaughlin on the record predicting that Hillary Clinton will win the presidency in 2016, and that President Obama will not nominate Susan Rice to be Secretary of State. McLaughlin’s hit rate so far is a rather mediocre 49 percent (18 of 37 graded calls correct), so make of those predictions what you will.

Forecasting Political Instability: Results from a Tournament of Methods

I’ve just posted to SSRN a report describing a statistical forecasting “tournament” undertaken by the CIA-funded Political Instability Task Force (PITF) in 2009–2010. I was PITF’s research director from 2001 until the start of 2011, and I designed and participated in this melee. You can download the full report here. As the abstract states,

The purpose of the tournament was to evaluate systematically the relative merits of several statistical techniques for forecasting various forms of political change in countries worldwide. Among other things, the tournament confirmed our belief that domain expertise and familiarity with relevant data help lead to more accurate forecasts. When knowledge of theory and data were held constant, the forecasts produced by most of the techniques we tried did not diverge by much. Unsurprisingly, this tournament also confirmed that forecasting rare forms of political instability as far as two years in advance is hard to do well. The forecasting tools the participants produced were generally quite good at discriminating high-risk cases from low-risk ones, but none was very precise.

The idea for the tournament came in 2009 from a story about the Netflix Prize, and I was really gratified to get to implement something a little bit like that process within PITF. I hope the report is useful to other practicing forecasters and would love to hear what folks make of the results.

House Votes to Defund Political Science Program: The Irony, It Burns

From the Monkey Cage this morning:

The Flake amendment Henry wrote about appears to have passed the House last night with a 218-208 vote. The amendment prohibits funding for NSF’s political science program, which among others funds many valuable data collection efforts including the National Election Studies. No other program was singled out like this…This is obviously not the last word on this. The provision may be scrapped in the conference committee (Sara?). But it is clear that political science research is in real danger of a very serious setback.

There’s real irony here in a Republican-controlled House of Representatives voting to defund a political-science program at a time when the Department of Defense and “intelligence community” are apparently increasing spending on similar work. With things like the Minerva Initiative, the Political Instability Task Force (on which I worked for 10 years), ICEWS, and IARPA’s Open Source Indicators programs, the parts of the government concerned with protecting national security seem to find growing value in social-science research and are spending accordingly. Meanwhile, the party that claims to be the stalwart defender of national security pulls in the opposite direction, like the opposing head on Dr. Dolittle’s Pushmi-pullyu. Nice work, fellas.

Building a Public Early-Warning System for Genocide and Mass Atrocities

Can we see genocides and other mass atrocities coming? If so, how, and how far in advance? And would public dissemination of those forecasts help policy-makers, advocates, and affected societies prevent those atrocities from occurring?

In October 2011, the U.S. Holocaust Memorial Museum (USHMM) convened a group of advocates and academics for a one-day seminar to ruminate on these questions. These are big and difficult problems, and the event really had a more practical goal at its heart: to help the Museum and other civil-society groups assess the potential for, and value of, a new public early-warning system focused on genocide and other mass atrocities.

Based on that conversation and the recommendations of USHMM Fellow and Dartmouth professor Ben Valentino, the Museum decided that the need and opportunity were sufficient to start considering what such a system might look like and how to build it. In March 2012, the Museum hired me for an eight-month consulting project, to finish in October, that’s meant to push this process forward.

My project has two main parts. First and most important, I’ve been asked to write a prospectus detailing the elements and funding this program would require. Second, I’ve been asked to build a statistical tool that could produce one set of forecasts for this program, if it gets built. Under Ben’s proposal, a second set of forecasts would come from some form of expert survey, and the two could be compared and combined to useful effect.

As I get deeper into the project, I expect to blog occasionally about what I’m working on and where I could use some help. I’ve already had very helpful exchanges with numerous people engaged in related projects, including former Political Instability Task Force colleagues Ted Gurr and Barbara Harff, who produces her own global genocide risk list each year, and Sentinel Project founder Christopher Tuckwood. I’m also slated to present results from a preliminary version of my statistical analysis at NYU’s Northeast Methods Program (NEMP) in early May, and my work will surely benefit from the constructive criticism that esteemed audience can provide.

In the meantime, I wanted to spread the word about the Museum’s interest in this endeavor and invite your reactions and suggestions. If you know of any relevant research or advocacy projects or might be interested in supporting this work in some fashion, please post a comment or drop me a line at ulfelder <at> gmail <dot> com.
