ACLED in R

The Armed Conflict Location & Event Data Project, a.k.a. ACLED, produces up-to-date event data on certain kinds of political conflict in Africa and, as of 2015, parts of Asia. In this post, I’m not going to dwell on the project’s sources and methods, which you can read about on ACLED’s About page, in the 2010 journal article that introduced the project, or in the project’s user’s guides. Nor am I going to dwell on the necessity of using all political event data sets, including ACLED, with care—understanding the sources of bias in how they observe events and error in how they code them and interpreting (or, in extreme cases, ignoring) the resulting statistics accordingly.

Instead, my only aim here is to share an R script I’ve written that largely automates the process of downloading and merging ACLED’s historical and current Africa data and then creates a new data frame with counts of events by type at the country-month level. If you use ACLED in R, this script might save you some time and some space on your hard drive.

You can find the R script on GitHub, here.

The chief problem with this script is that the URLs and file names of ACLED’s historical and current data sets change with every update, so the code will need to be modified each time that happens. If the names were modular and the changes to them predictable, it would be easy to rewrite the code to keep up with those changes automatically. Unfortunately, they aren’t, so the best I can do for now is to give step-by-step instructions in comments embedded in the script on how to update the relevant four fields by hand. As long as the basic structure of the .csv files posted by ACLED doesn’t change, though, the rest should keep working.

[UPDATE: I revised the script so it now scrapes the link addresses from the ACLED website and parses the file names from them. The new version worked after ACLED updated its real-time file earlier today, when the old version would have broken. Unless ACLED changes its file-naming conventions or the structure of its website, the new version should work for the rest of 2015. In case it does fail, instructions on how to hard-code a workaround are included as comments at the bottom of the script.]
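
As a rough illustration of that scrape-and-parse idea (not the script's actual code), the sketch below pulls every link out of a chunk of HTML, keeps the ones that point at data files, and takes the file name from the end of each address. The helper name extract_data_links(), the example.org URLs, and the page content are all made up for illustration; the real ACLED page is laid out differently.

```r
# Hypothetical helper: find links to data files in a single string of raw HTML.
extract_data_links <- function(html, pattern = "\\.(csv|zip|xlsx?)$") {
  # Grab every href="..." attribute in the text
  m <- gregexpr('href="[^"]+"', html, ignore.case = TRUE)
  hrefs <- regmatches(html, m)[[1]]
  # Strip the attribute wrapper to leave the bare address
  urls <- sub('^href="', "", hrefs, ignore.case = TRUE)
  urls <- sub('"$', "", urls)
  # Keep only addresses that end in a data-file extension
  urls <- urls[grepl(pattern, urls, ignore.case = TRUE)]
  # The file name is the last piece of the address
  data.frame(url = urls, file = basename(urls), stringsAsFactors = FALSE)
}

# Made-up page content standing in for the downloaded HTML
page <- paste0(
  '<p><a href="http://example.org/about">About</a></p>',
  '<p><a href="http://example.org/files/historical-1997-2014.zip">Historical</a></p>',
  '<p><a href="http://example.org/files/realtime-2015-07.csv">Real-time</a></p>'
)

links <- extract_data_links(page)
print(links)
```

In real use the HTML would come from the live page, for example via paste(readLines(url), collapse = "\n"), and a proper HTML parser would be sturdier than a regular expression.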

It should also be easy to adapt the part of the script that generates country-month event counts to slice the data even more finely, or to count by something other than event type. To do that, you would just need to add variables to the group_by() part of the block of code that produces the object ACLED.cm. For example, if you wanted to get counts of events by type at the level of the state or province, you would revise that line to read group_by(gwno, admin1, year, month, event_type). Or, if you wanted country-month counts of events by the type(s) of actor involved, you could use group_by(gwno, year, month, interaction) and then see this user’s guide to decipher those codes. You get the drift.
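
To make that concrete, here is a minimal sketch using 'dplyr' on a toy data frame. The column names are the ones mentioned above, but the rows are invented and the object names (events, cm, am) are placeholders, not the ones used in the script.

```r
library(dplyr)

# Toy event-level data; the values are invented for illustration
events <- data.frame(
  gwno = c(475, 475, 475, 475, 625, 625),
  admin1 = c("Province A", "Province B", "Province B",
             "Province A", "Province C", "Province C"),
  year = rep(2015, 6),
  month = c(1, 1, 1, 2, 1, 1),
  event_type = c("Battle", "Battle", "Riots/Protests",
                 "Battle", "Battle", "Riots/Protests"),
  stringsAsFactors = FALSE
)

# Country-month counts of events by type
cm <- events %>%
  group_by(gwno, year, month, event_type) %>%
  summarise(n = n(), .groups = "drop")

# Same counts at the state or province level: add admin1 to the grouping
am <- events %>%
  group_by(gwno, admin1, year, month, event_type) %>%
  summarise(n = n(), .groups = "drop")

print(cm)
print(am)
```

The only thing that changes between the two versions is the list of variables passed to group_by(); everything downstream stays the same.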

The script also shows a couple of examples of how to use ‘ggplot2’ to generate time-series plots of those monthly counts. Here’s one I made of monthly counts of battle events by country for the entire period covered by ACLED as of this writing: January 1997–June 2015. A production-ready version of this plot would require some more tinkering with the size of the country names and the labeling of the x-axis, but this kind of small-multiples chart offers a nice way to explore the data before analysis.

Monthly counts of battle events, January 1997-June 2015

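For readers who haven't made a small-multiples chart before, the basic recipe is a line plot plus facet_wrap(). The sketch below uses simulated counts for three placeholder countries, so it shows the technique rather than reproducing the chart above; the data frame battles and its columns are assumptions made for the example.

```r
library(ggplot2)

# Simulated country-month battle counts for three placeholder countries
set.seed(1)
battles <- expand.grid(
  date = seq(as.Date("2014-01-01"), as.Date("2014-12-01"), by = "month"),
  country = c("Country A", "Country B", "Country C"),
  stringsAsFactors = FALSE
)
battles$n <- rpois(nrow(battles), lambda = 5)

# One panel per country, all sharing the same axes
p <- ggplot(battles, aes(x = date, y = n)) +
  geom_line() +
  facet_wrap(~ country) +
  labs(x = NULL, y = "Battle events per month") +
  theme_bw()

print(p)
```

With dozens of countries, the size of the strip text and the x-axis breaks are the first things that need adjusting.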

If you use the script and find flaws in it or have ideas on how to make it work better or do more, please email me at ulfelder <at> gmail <dot> com.

Down the Country-Month Rabbit Hole

Some big things happened in the world this week. Iran and the P5+1 agreed on a framework for a nuclear deal, and the agreement looks good. In a presidential election in Nigeria—the world’s seventh–most populous country, and one that few observers would have tagged as a democracy before last weekend—incumbent Goodluck Jonathan lost and then promptly and peacefully conceded defeat. The trickle of countries joining China’s new Asian Infrastructure Investment Bank turned into a torrent.

All of those things happened, but you won’t read more about them here, because I have spent the better part of the past week down a different rabbit hole. Last Friday, after years of almosts and any-time-nows, the event data produced for the Integrated Conflict Early Warning System (ICEWS) finally landed in the public domain, and I have been busy trying to figure out how to put them to use.

ICEWS isn’t the first publicly available trove of political event data, but it compares favorably to the field’s first mover, GDELT, and it currently covers a much longer time span than the other recent entrant, Phoenix.

The public release of ICEWS is exciting because it opens the door wider to dynamic modeling of world politics. Right now, nearly all of the data sets employed in statistical studies of politics around the globe use country-years as their units of observation. That’s not bad if you’re primarily interested in the effects or predictive power of structural features, but it’s pretty awful for explaining and anticipating faster-changing phenomena, like social unrest or violent conflict. GDELT broke the lock on that door, but its high noise-to-signal ratio and the opacity of its coding process have deterred me from investing too much time in developing monitoring or forecasting systems that depend on it.

With ICEWS on the Dataverse, that changes. I think we now have a critical mass of data sets in the public domain that: a) reliably cover important topics for the whole world over many years; b) are routinely updated; and, crucially, c) can be parsed to the month or even the week or day to reward investments in more dynamic modeling. Other suspects fitting this description include:

  • The spell-file version of Polity, which measures national patterns of political authority;
  • Lists of coup attempts maintained by Jonathan Powell and Clayton Thyne (here) and the Center for Systemic Peace (here); and
  • The PITF Worldwide Atrocities Event Dataset, which records information about events involving the deliberate killing of five or more noncombatant civilians (more on it here).

We also have high-quality data sets on national elections (here) and leadership changes (here, described here) that aren’t routinely updated by their sources but would be relatively easy to code by hand for applied forecasting.

With ICEWS, there is, of course, a catch. The public version of the project’s event data set will be updated monthly, but on a one-year delay. For example, when the archive was first posted in March, it ran through February 2014. On April 1, the Lockheed team added March 2014. This delay won’t matter much for scholars doing retrospective analyses, but it’s a critical flaw, if not a fatal one, for applied forecasters who can’t afford to pay—what, probably hundreds of thousands of dollars?—for a real-time subscription.

Fortunately, we might have a workaround. Phil Schrodt has played a huge role in the creation of the field of machine-coded political event data, including GDELT and ICEWS, and he is now part of the crew building Phoenix. In a blog post published the day ICEWS dropped, Phil suggested that Phoenix and ICEWS data will probably look enough alike to allow substituting the former for the latter, perhaps with some careful calibration. As Phil says, we won’t know for sure until we have a wider overlap between the two and can see how well this works in practice, but the possibility is promising enough for me to dig in.

And what does that mean? Well, a week has now passed since ICEWS hit the Dataverse, and so far I have:

  • Written an R function that creates a table of valid country-months for a user-specified time period, to use as scaffolding in the construction and agglomeration of country-month data sets;
  • Written scripts that call that function and some others to ingest and then parse or aggregate the other data sets I mentioned to the country-month level;
  • Worked out a strategy, and written the code, to partition the data into training and test sets for a project on predicting violence against civilians; and
  • Spent a lot of time staring at the screen thinking about, and a little time coding, ways to aggregate, reduce, and otherwise pre-process the ICEWS events and Polity data for that work on violence against civilians and beyond.
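
To give a flavor of the first and third items, here is a stripped-down sketch in base R, not the functions I actually wrote. The helper name country_months(), the country codes, and the counts are invented. The sketch also skips the check that a country existed in a given month, which is what makes a country-month "valid," and it assumes a split by date, which is only one way to partition the data.

```r
# Hypothetical helper: one row per country per month between two dates.
# A real version would also drop months in which a country did not exist.
country_months <- function(countries, start, end) {
  months <- seq(as.Date(start), as.Date(end), by = "month")
  grid <- expand.grid(date = months, country = countries,
                      stringsAsFactors = FALSE)
  grid$year <- as.integer(format(grid$date, "%Y"))
  grid$month <- as.integer(format(grid$date, "%m"))
  grid[, c("country", "year", "month", "date")]
}

scaffold <- country_months(c("AAA", "BBB"), "2013-11-01", "2014-02-01")

# Invented monthly event counts that cover only some country-months
counts <- data.frame(
  country = c("AAA", "AAA", "BBB"),
  year = c(2013L, 2014L, 2014L),
  month = c(12L, 1L, 2L),
  events = c(4L, 7L, 2L),
  stringsAsFactors = FALSE
)

# Left-join the counts onto the scaffold; months with no recorded events get 0
panel <- merge(scaffold, counts, by = c("country", "year", "month"),
               all.x = TRUE)
panel$events[is.na(panel$events)] <- 0L
panel <- panel[order(panel$country, panel$date), ]

# Assumed split rule: everything before the cutoff is training data,
# everything from the cutoff onward is held out for testing
cutoff <- as.Date("2014-01-01")
train_set <- panel[panel$date < cutoff, ]
test_set <- panel[panel$date >= cutoff, ]

print(panel)
```

The scaffold matters because event data only contain rows for months in which something happened; merging onto a complete grid turns the missing months into explicit zeros.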

What I haven’t done yet—T plus seven days and counting—is any modeling. How’s that for push-button, Big Data magic?

The State of the Art in the Production of Political Event Data

Peter Nardulli, Scott Althaus, and Matthew Hayes have a piece forthcoming in Sociological Methodology (PDF) that describes what I now see as the cutting edge in the production of political event data: machine-human hybrid systems.

If you have ever participated in the production of political event data, you know that having people find, read, and code data from news stories and other texts takes a tremendous amount of work. Even boutique data sets on narrowly defined topics for short time periods in single cases usually require hundreds or thousands of person-hours to create, and the results still aren’t as pristine as we’d like or often believe.

Contrary to my premature congratulation on GDELT a couple of years ago, however, fully automated systems are not quite ready to take over the task, either. Once a machine-coding system has been built, the data come fast and cheap, but those data are, inevitably, still pretty noisy. (On that point, see here for some of my own experiences with GDELT and here, here, here, here, and here for other relevant discussions.)

I’m now convinced that the best current solution is one that borrows strength from both approaches—in other words, a hybrid. As Nardulli, Althaus, and Hayes argue in their forthcoming article, “Machine coding is no simple substitute for human coding.”

Until fully automated approaches can match the flexibility and contextual richness of human coding, the best option for generating near-term advances in social science research lies in hybrid systems that rely on both machines and humans for extracting information from unstructured texts.

As you might expect, Nardulli & co. have built and are operating such a system, the Social, Political, and Economic Event Database (SPEED), to code data on a bunch of interesting things, including coups and civil unrest. Their hybrid process goes beyond supervised learning, where an algorithm gets trained on a data set carefully constructed by human coders and then put in the traces to make new data from fresh material. Instead, they adopt a “progressive supervised-learning system,” which basically means two things:

  1. They keep humans in the loop for all steps where the error rate from their machine-only process remains intolerably high, making the results as reliable as possible; and
  2. They use those humans’ coding decisions as new training sets to continually check and refine their algorithms, gradually shrinking the load borne by the humans and mitigating the substantial risk of concept drift that attaches to any attempt to automate the extraction of data from a constantly evolving news-media ecosystem.

I think SPEED exemplifies the state of the art in a couple of big ways. The first is the process itself. Machine-learning processes have made tremendous gains in the past several years (see here, h/t Steve Mills), but we still haven’t arrived at the point where we can write algorithms that reliably recognize and extract the information we want from the torrent of news stories coursing through the Internet. As long as that’s the case—and I expect it will be for at least another several years—we’re going to need to keep humans in the loop to get data sets we really trust and understand. (And, of course, even then the results will still suffer from biases that even a perfect coding process can’t avoid; see here for Will Moore’s thoughtful discussion of that point.)

The second way in which SPEED exemplifies the state of the art is what Nardulli, Althaus, and Hayes’ paper explicitly and implicitly tells us about the cost and data-sharing constraints that come with building and running a system of this kind at that scale. Nardulli & co. don’t report exactly how much money has been spent on SPEED so far or how much it costs to keep running it, but they do say this:

The Cline Center began assembling its news archive and developing SPEED’s workflow system in 2006, but lacked an operational cyberinfrastructure until 2009. Seven years and well over a million dollars later, the Cline Center released its first SPEED data set.

Partly because of those high costs and partly because of legal issues attached to data coded from many news stories, the data SPEED produces are not freely available to the public. The project shares some historical data sets on its web site, but the content of those sets is limited, and the near-real-time data coveted by applied researchers like me are not made public. Here’s how the authors describe their situation:

While agreements with commercial vendors and intellectual property rights prohibit the Center from distributing its news archive, efforts are being made to provide non-consumptive public access to the Center’s holdings. This access will allow researchers to evaluate the utility of the Center’s digital archive for their needs and construct a research design to realize those needs. Based on that design, researchers can utilize the Center’s various subcenters of expertise (document classification, training, coding, etc.) to implement it.

I’m not happy about those constraints, but as someone who has managed large and costly social-science research projects, I certainly understand them. I also don’t expect them to go away any time soon, for SPEED or for any similar undertaking.

So that’s the state of the art in the production of political event data: Thanks to the growth of the Internet and advances in computing hardware and software, we can now produce political event data on a scale and at a pace that would have had us drooling a decade ago, but the task still can’t be fully automated without making sacrifices in data quality that most social scientists should be uncomfortable making. The best systems we can build right now blend machine learning and automation with routine human involvement and oversight. Those systems are still expensive to build and run, and partly because of that, we should not expect their output to stream onto our virtual desktops for free, like manna raining down from digital heaven.
