I recently wrote about how data sets just starting to come online are going to throw open doors to projects that political scientists have been hoping to do for a while but haven’t had the evidence to handle. Well, one of those shiny new trains just pulled into the station: the Global Dataset of Events, Language, and Tone, a.k.a. GDELT, is now in the public domain.
GDELT is the primarily work of Kalev Leetaru, a a University Fellow at the University of Illinois Graduate School of Library and Information Science, but its intellectual and practical origins—and its journey into the public domain—also owe a lot to the great Phil Schrodt. The data set includes records summarizing more than 200 million events that have occurred around the world from 1979 to the present. Those records are created by software that grabs and parses news reports from a number of international sources, including Agence France Press, the Associated Press, and Xinhua. Each record indicates who did or said what to whom, where, and when.
The “did what” part of each record is based on the CAMEO coding scheme, which sorts actions into a fairly detailed set of categories covering many different forms of verbal and material cooperation and conflict, from public statements of support to attacks with weapons of mass destruction. The “who” and “to whom” parts use carefully constructed dictionaries to identify specific actors and targets by type and proper name. So, for example, “Philippine soldiers” gets identified as Philippines military (PHLMIL), while “Philippine Secretary of Agriculture” gets tagged as Philippine government (PHLGOV). The “where” part uses place names and other clues in the stories to geolocate each event as specifically as possible.
I try to avoid using words like “revolutionary” when talking about the research process, but in this case I think it fits. I suspect this is going to be the data set that launches a thousand dissertations. As Josh Keating noted on his War of Ideas blog at Foreign Policy,
Similar event databases have been built for particular regions, and DARPA has been working along similar lines for the Pentagon with a project known as ICEWS, but for a publicly accessible program…GDELT is unprecedented in it geographic and historic scale.
To Keating’s point about the data set’s scale, I would add two other ways that GDELT is a radical departure from past practice in the discipline. First, it’s going to be updated daily (watch this space). Second, it’s freely available to the public.
Yes, you read that right: a global data set summarizing all sorts of political cooperation and conflict with daily updates is now going to available to anyone with an Internet connection at no charge. As in: FREE. As I said in a tweet-versation about GDELT this afternoon, contractors have been trying for years (and probably succeeding) to sell closed systems like this to the U.S. government for hundreds of thousands or millions of dollars. If I’m not mistaken, that market just crashed, or at the very least shrank by a whole lot.
GDELT isn’t perfect, of course. I’ve already been tinkering with it a bit as part of a project I’m doing for the Holocaust Museum’s Center for the Prevention of Genocide, on monitoring and predicting mass atrocities, and the data on the “Engage in Unconventional Mass Violence” events I’m hoping to use as a marker of atrocities look more reliable in some cases than others. Still, getting a data set of this size and quality in the public domain is a tremendous leap forward for empirical political science, and the fact that it’s open will allow lots of other smart people to find the flaws and work on eliminating or mitigating them.
Last but not least, I think it’s worth noting that GDELT was made possible, in part, through support from the National Science Foundation. It may be free to you, and it’s orders of magnitude cheaper to produce than the artisanal, hand-crafted event data of yesteryear (like, yesterday). But that doesn’t mean it’s been free to develop, produce, or share, and you can thank the NSF for helping various parts of that process happen.