I recently wrote about how data sets just starting to come online are going to throw open doors to projects that political scientists have been hoping to do for a while but haven’t had the evidence to handle. Well, one of those shiny new trains just pulled into the station: the Global Dataset of Events, Language, and Tone, a.k.a. GDELT, is now in the public domain.
GDELT is primarily the work of Kalev Leetaru, a University Fellow at the University of Illinois Graduate School of Library and Information Science, but its intellectual and practical origins—and its journey into the public domain—also owe a lot to the great Phil Schrodt. The data set includes records summarizing more than 200 million events that have occurred around the world from 1979 to the present. Those records are created by software that grabs and parses news reports from a number of international sources, including Agence France Presse, the Associated Press, and Xinhua. Each record indicates who did or said what to whom, where, and when.
The “did what” part of each record is based on the CAMEO coding scheme, which sorts actions into a fairly detailed set of categories covering many different forms of verbal and material cooperation and conflict, from public statements of support to attacks with weapons of mass destruction. The “who” and “to whom” parts use carefully constructed dictionaries to identify specific actors and targets by type and proper name. So, for example, “Philippine soldiers” gets identified as Philippines military (PHLMIL), while “Philippine Secretary of Agriculture” gets tagged as Philippine government (PHLGOV). The “where” part uses place names and other clues in the stories to geolocate each event as specifically as possible.
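To make that record structure concrete, here's a minimal sketch in Python of how one might represent and parse an event record of this shape. The field names, column order, and sample row are illustrative assumptions for this post, not the actual GDELT column schema—consult the official codebook for the real layout.

```python
# Hypothetical sketch of a "who did what to whom, where, when" event record.
# Field names and column order are illustrative, NOT the official GDELT schema.

from dataclasses import dataclass


@dataclass
class Event:
    date: str        # when: e.g. "19790101" (YYYYMMDD)
    actor1: str      # who: CAMEO-style actor code, e.g. "PHLMIL"
    actor2: str      # to whom: e.g. "PHLGOV"
    cameo_code: str  # did what: CAMEO event category
    lat: float       # where: geolocated latitude
    lon: float       # where: geolocated longitude


def parse_event(line: str) -> Event:
    """Parse one tab-delimited record into an Event."""
    date, actor1, actor2, cameo, lat, lon = line.rstrip("\n").split("\t")
    return Event(date, actor1, actor2, cameo, float(lat), float(lon))


row = "19790101\tPHLMIL\tPHLGOV\t051\t14.6\t121.0"
event = parse_event(row)
print(event.actor1, event.cameo_code)  # PHLMIL 051
```

With records in this form, filtering for a given actor type or CAMEO category across the full file becomes a one-line comprehension, which is roughly how most early analyses of the data will start.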
I try to avoid using words like “revolutionary” when talking about the research process, but in this case I think it fits. I suspect this is going to be the data set that launches a thousand dissertations. As Josh Keating noted on his War of Ideas blog at Foreign Policy,
Similar event databases have been built for particular regions, and DARPA has been working along similar lines for the Pentagon with a project known as ICEWS, but for a publicly accessible program…GDELT is unprecedented in its geographic and historic scale.
To Keating’s point about the data set’s scale, I would add two other ways that GDELT is a radical departure from past practice in the discipline. First, it’s going to be updated daily (watch this space). Second, it’s freely available to the public.
Yes, you read that right: a global data set summarizing all sorts of political cooperation and conflict with daily updates is now going to be available to anyone with an Internet connection at no charge. As in: FREE. As I said in a tweet-versation about GDELT this afternoon, contractors have been trying for years (and probably succeeding) to sell closed systems like this to the U.S. government for hundreds of thousands or millions of dollars. If I’m not mistaken, that market just crashed, or at the very least shrank by a whole lot.
GDELT isn’t perfect, of course. I’ve already been tinkering with it a bit as part of a project I’m doing for the Holocaust Museum’s Center for the Prevention of Genocide, on monitoring and predicting mass atrocities, and the data on the “Engage in Unconventional Mass Violence” events I’m hoping to use as a marker of atrocities look more reliable in some cases than others. Still, getting a data set of this size and quality in the public domain is a tremendous leap forward for empirical political science, and the fact that it’s open will allow lots of other smart people to find the flaws and work on eliminating or mitigating them.
Last but not least, I think it’s worth noting that GDELT was made possible, in part, through support from the National Science Foundation. It may be free to you, and it’s orders of magnitude cheaper to produce than the artisanal, hand-crafted event data of yesteryear (like, yesterday). But that doesn’t mean it’s been free to develop, produce, or share, and you can thank the NSF for helping various parts of that process happen.
Andrew / April 10, 2013
Great post, thanks. This stuff seems to have immense potential.
Do we know how good the automation is? I wonder if we’re talking about 99.9% of the data being accurate or more like 95%.
Also, does the ‘market crash’ mean PITF’s dataset is obsolete?
dartthrowingchimp / April 10, 2013
On GDELT’s accuracy (and other things), you can find out more about the data set from the paper at the top of this page:
http://eventdata.psu.edu/papers.dir/automated.html
When I was talking about a “market crash”, I specifically meant the market for event data production, not data in general. As far as I know, this doesn’t overlap with most of the data sets PITF produces, which are observed at the level of the country-year and focus on structural conditions or big episodes instead of events. The one exception to that is the Worldwide Atrocities Event Dataset. It would be interesting to see a systematic comparison of similar events from the two.
Fr. / April 11, 2013
It is not yet clear whether the dataset is actually in the public domain: that will depend on the license attached to the data. It is not yet clear, for instance, whether a company will be allowed to sell specific parsings of the data. I’m sure KL knows this, so it should get fixed pretty soon.
dartthrowingchimp / April 11, 2013
Important point, thank you.
Fr. / April 11, 2013
I’m now thinking that NSF funding might come with contractual strings attached with regard to data release. The Python and R scripts by Phil Schrodt are GPL, so perhaps KL will go for that too.
dartthrowingchimp / April 11, 2013
To the best of my knowledge, GDELT itself wasn’t built with NSF funding. It’s that GDELT built off of CAMEO and TABARI, which were.
Oral Hazard / April 12, 2013
If anything, NSF funding would tend to require the public availability of the data.
Grant / April 14, 2013
If you’ve used it already, are you of the opinion it’d be a good tool for a college political science class to use?
dartthrowingchimp / April 15, 2013
It’s just a data set, no user interface, and it’s a very large and relatively unstructured one. So, no, probably not, unless it’s a class in stats or coding.
Mary Manjikian / April 17, 2013
Do you know if there is or will be any training available on using this? It seems a bit . . . intimidating . . .
dartthrowingchimp / April 17, 2013
It’s not a product, it’s a data set for research purposes, so, no, I don’t imagine any kind of training unless and until someone develops an interface or something for commercial purposes.
sylvia kronstadt / May 31, 2013
To answer your site’s initial question, my response would be “bloviation,” based on my own unforgettable, and apparently immortal, relationship with Kalev Leetaru:
Is it artificial intelligence, or authentic stupidity?
http://kronstantinople.blogspot.com/2011/11/does-big-oil-think-youre-slick.html