My family is riding the flu carousel right now, and my turn came this week. So, in lieu of trying to write from scratch, I wanted to pick up where my last post—on moving from country-year to country-month modeling—left off.
As many of you know, this notion is hardly new. For at least the past decade, many political scientists who use statistical tools to study violent conflict have been advocating and sometimes implementing research designs that shrink their units of observation on various dimensions, including time. The Journal of Conflict Resolution published a special issue on “disaggregating civil war” in 2009. At the time, that publication felt (to me) more like the cresting of a wave of new work than the start of one, and it was motivated, in part, by frustration over all the questions that a preceding wave of country-year civil-war modeling had inevitably left unanswered. Over the past several years, Mike Ward and his WardLab collaborators at Duke have been using ICEWS and other higher-resolution data sets to develop predictive models of various kinds of political instability at the country-month level. Their work has used designs that deal thoughtfully with the many challenges this approach entails, including spatial and temporal interdependence and the rarity of the events of interest. So have others.
Meanwhile, sociologists who study protests and social movements have been pushing in this direction even longer. Scholars trying to use statistical methods to help understand the dynamic interplay between mobilization, activism, repression, and change recognized that those processes can take important turns in weeks, days, or even hours. So, researchers in that field started trying to build event data sets that recorded as exactly as possible when and where various actions occurred, and they often use event history models and other methods that “take time seriously” to analyze the results. (One of them sat on my dissertation committee and had a big influence on my work at the time.)
As far as I can tell, there are two main reasons that all research in these fields hasn’t stampeded in the direction of disaggregation, and one of them is a doozy. The first and lesser one is computing power. It’s no simple thing to estimate models of mutually causal processes occurring across many heterogeneous units observed at high frequency. We still aren’t great at it, but accelerating improvements in computational processing, storage, software—and co-evolving improvements in statistical methods—have made it more tractable than it was even five or 10 years ago.
The second, more important, and more persistent impediment to disaggregated analysis is data, or the lack thereof. Data sets used by statistically minded political scientists come in two basic flavors: global, and case– or region-specific. Almost all of the global data sets of which I’m aware have always used, and continue to use, country-years as their units of observation.
That’s partly a function of the research questions they were built to help answer, but it’s also a function of cost. Data sets were (and mostly still are) encoded by hand by people sifting through or poring over relevant documents. All that labor takes a lot of time and therefore costs a lot of money. One can make (or ask RAs to make) a reasonably reliable summary judgment about something like whether or not a civil war was occurring in a particular country during particular year much quicker than one can do that for each month of that year, or each district in that country, or both. This difficulty hasn’t stopped everyone from trying, but the exceptions have been few and often case-specific. In a better world, we could have patched together those case-specific sets to make a larger whole, but they often use idiosyncratic definitions and face different informational constraints, making cross-case comparison difficult.
That’s why I’ve been so excited about the launch of GDELT and Phoenix and now the public release of the ICEWS event data. These are, I think, the leading edge of efforts to solve those data-collection problems in an efficient and durable way. ICEWS data have been available for several years to researchers working on a few contracts, but they haven’t been accessible to most of us until now. At first I thought GDELT had rendered that problem moot, but concerns about its reliability have encouraged me to keep looking. I think Phoenix’s open-source-software approach holds more promise for the long run, but, as its makers describe, it’s still in “beta release” and “under active development.” ICEWS is a more mature project that has tried carefully to solve some of the problems, like event duplication and errors in geolocation, that diminish GDELT’s utility. (Many millions of dollars help.) So, naturally, I and many others have been eager to start exploring it. And now we can. Hooray!
To really open up analysis at this level, though, we’re going to need comparable and publicly (or at least cheaply) available data sets on a lot more of things our theories tell us to care about. As I said in the last post, we have a few of those now, but not many. Some of the work I’ve done over the past couple of years—this, especially—was meant to help fill those gaps, and I’m hoping that work will continue. But it’s just a drop in a leaky bucket. Here’s hoping for a hard turn of the spigot.