Down the Country-Month Rabbit Hole

Some big things happened in the world this week. Iran and the P5+1 agreed on a framework for a nuclear deal, and the agreement looks good. In a presidential election in Nigeria—the world’s seventh–most populous country, and one that few observers would have tagged as a democracy before last weekend—incumbent Goodluck Jonathan lost and then promptly and peacefully conceded defeat. The trickle of countries joining China’s new Asian Infrastructure Investment Bank turned into a torrent.

All of those things happened, but you won’t read more about them here, because I have spent the better part of the past week down a different rabbit hole. Last Friday, after years of almosts and any-time-nows, the event data produced for the Integrated Conflict Early Warning System (ICEWS) finally landed in the public domain, and I have been busy trying to figure out how to put them to use.

ICEWS isn’t the first publicly available trove of political event data, but it compares favorably to the field’s first mover, GDELT, and it currently covers a much longer time span than the other recent entrant, Phoenix.

The public release of ICEWS is exciting because it opens the door wider to dynamic modeling of world politics. Right now, nearly all of the data sets employed in statistical studies of politics around the globe use country-years as their units of observation. That’s not bad if you’re primarily interested in the effects or predictive power of structural features, but it’s pretty awful for explaining and anticipating faster-changing phenomena, like social unrest or violent conflict. GDELT broke the lock on that door, but its high noise-to-signal ratio and the opacity of its coding process have deterred me from investing too much time in developing monitoring or forecasting systems that depend on it.

With ICEWS on the Dataverse, that changes. I think we now have a critical mass of data sets in the public domain that: a) reliably cover important topics for the whole world over many years; b) are routinely updated; and, crucially, c) can be parsed to the month or even the week or day to reward investments in more dynamic modeling. Other suspects fitting this description include:

  • The spell-file version of Polity, which measures national patterns of political authority;
  • Lists of coup attempts maintained by Jonathan Powell and Clayton Thyne (here) and the Center for Systemic Peace (here); and
  • The PITF Worldwide Atrocities Event Dataset, which records information about events involving the deliberate killing of five or more noncombatant civilians (more on it here).

We also have high-quality data sets on national elections (here) and leadership changes (here, described here) that aren’t routinely updated by their sources but would be relatively easy to code by hand for applied forecasting.

With ICEWS, there is, of course, a catch. The public version of the project’s event data set will be updated monthly, but on a one-year delay. For example, when the archive was first posted in March, it ran through February 2014. On April 1, the Lockheed team added March 2014. This delay won’t matter much for scholars doing retrospective analyses, but it’s a critical flaw, if not a fatal one, for applied forecasters who can’t afford to pay—what, probably hundreds of thousands of dollars?—for a real-time subscription.

Fortunately, we might have a workaround. Phil Schrodt has played a huge role in the creation of the field of machine-coded political event data, including GDELT and ICEWS, and he is now part of the crew building Phoenix. In a blog post published the day ICEWS dropped, Phil suggested that Phoenix and ICEWS data will probably look enough alike to allow substituting the former for the latter, perhaps with some careful calibration. As Phil says, we won’t know for sure until we have a wider overlap between the two and can see how well this works in practice, but the possibility is promising enough for me to dig in.

And what does that mean? Well, a week has now passed since ICEWS hit the Dataverse, and so far I have:

  • Written an R function that creates a table of valid country-months for a user-specified time period, to use as scaffolding in the construction and agglomeration of country-month data sets;
  • Written scripts that call that function and some others to ingest and then parse or aggregate the other data sets I mentioned to the country-month level;
  • Worked out a strategy, and written the code, to partition the data into training and test sets for a project on predicting violence against civilians; and
  • Spent a lot of time staring at the screen thinking about, and a little time coding, ways to aggregate, reduce, and otherwise pre-process the ICEWS events and Polity data for that work on violence against civilians and beyond.
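
To make the first three items concrete, here is a minimal sketch of the scaffolding idea, written in Python rather than the R mentioned above and using only the standard library. Everything in it is an illustrative assumption: the function names (`country_month_scaffold`, `count_events`, `build_panel`, `split_by_time`) are hypothetical, the events are toy data, and the time-based train/test split is just one plausible partitioning strategy, not necessarily the one used for the violence-against-civilians project.

```python
from collections import Counter
from datetime import date


def month_range(start, end):
    """Return (year, month) tuples from start to end, inclusive."""
    (y, m), stop = start, end
    months = []
    while (y, m) <= stop:
        months.append((y, m))
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return months


def country_month_scaffold(countries, start, end):
    """One row per valid country-month.

    `countries` maps a country code to None (exists throughout) or to a
    (first_month, last_month) window for states that did not.
    """
    rows = []
    for code, window in sorted(countries.items()):
        first, last = window if window else (start, end)
        for y, m in month_range(max(first, start), min(last, end)):
            rows.append((code, y, m))
    return rows


def count_events(events):
    """Tally (country, date) event records by (country, year, month)."""
    return Counter((c, d.year, d.month) for c, d in events)


def build_panel(scaffold, counts):
    """Join counts onto the scaffold so months with no events show up as 0."""
    return [(c, y, m, counts.get((c, y, m), 0)) for c, y, m in scaffold]


def split_by_time(panel, first_test_month):
    """Assumed strategy: train on earlier months, test on later ones."""
    train = [r for r in panel if (r[1], r[2]) < first_test_month]
    test = [r for r in panel if (r[1], r[2]) >= first_test_month]
    return train, test


# Toy inputs: South Sudan only enters the scaffold at independence (July 2011).
countries = {"NGA": None, "SSD": ((2011, 7), (9999, 12))}
events = [
    ("NGA", date(2011, 5, 3)),
    ("NGA", date(2011, 5, 19)),
    ("SSD", date(2011, 8, 2)),
]

scaffold = country_month_scaffold(countries, (2011, 5), (2011, 8))
panel = build_panel(scaffold, count_events(events))
train, test = split_by_time(panel, (2011, 8))
```

The point of building the scaffold first is that event data only record what happened. Joining counts onto a complete table of country-months is what turns the absence of events into explicit zeros, and restricting each country to the months in which it existed keeps those zeros from being invented for states that were not yet there.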

What I haven’t done yet—T plus seven days and counting—is any modeling. How’s that for push-button, Big Data magic?


7 Comments

  1. Late, very late, to the party, Jay. I always counted you among the die-hard country-year guys. But now that the government has switched gears, it's time to try this approach?

    • Naw, man, you got the wrong guy. I wrote a dissertation 20 years ago that included continuous-time hazard models with subnational units of observation. I’ve been stuck using country-years ever since then because those are the data we’ve had that covered the cases we needed to cover and were reliably updated. That’s changed in the past several years, but only for people working on the contracts that have created or bought access to the new data sets. That doesn’t include the Early Warning Project, and all the rest of my modeling since PITF has been on my own time and dime. Kinda hard to buy access to ICEWS on that budget.

  2. Cyrus / April 4, 2015

    I haven’t had time to go through the ICEWS data, but I have been fiddling around with the current Phoenix release (thanks to all the great people working on that project). I must say the geocoding is virtually no better (or worse) than GDELT’s, which is unfortunate (i.e., PETRARCH is no better than TABARI). The problem is that only the first 4 lines of a news article are scraped. The geolocation information we are after (such as the city, town, or village of an attack), however, is typically mentioned only after the initial summary of events. What the scraper does pick up is typically the location from which the report was filed. As an example, I downloaded and cleaned the data for events occurring in Lebanon (cooperative, conflict, neutral) over the course of the last 3 months. 79% were coded as Beirut. I then subset that sample to a single month (June 2014, the first month of the current release). For that month there were 16 events geocoded to Lebanon, and 12 of those 16 were geocoded to Beirut. Of the 12 geocoded to Beirut, only 2 actually took place in Beirut; the rest either took place in other cities across Lebanon (and even Syria) or were not real events of a codable type. Analysis of Turkey, Jordan, and Iraq yielded similar findings. I’m hoping things will change with future releases as more and more people contribute to the formation of the data.

