The Future of Political Science Just Showed Up

I recently wrote about how data sets just starting to come online are going to throw open doors to projects that political scientists have been hoping to do for a while but haven’t had the evidence to handle. Well, one of those shiny new trains just pulled into the station: the Global Dataset of Events, Language, and Tone, a.k.a. GDELT, is now in the public domain.

GDELT is primarily the work of Kalev Leetaru, a University Fellow at the University of Illinois Graduate School of Library and Information Science, but its intellectual and practical origins, and its journey into the public domain, also owe a lot to the great Phil Schrodt. The data set includes records summarizing more than 200 million events that have occurred around the world from 1979 to the present. Those records are created by software that grabs and parses news reports from a number of international sources, including Agence France-Presse, the Associated Press, and Xinhua. Each record indicates who did or said what to whom, where, and when.

The “did what” part of each record is based on the CAMEO coding scheme, which sorts actions into a fairly detailed set of categories covering many different forms of verbal and material cooperation and conflict, from public statements of support to attacks with weapons of mass destruction. The “who” and “to whom” parts use carefully constructed dictionaries to identify specific actors and targets by type and proper name. So, for example, “Philippine soldiers” gets identified as Philippines military (PHLMIL), while “Philippine Secretary of Agriculture” gets tagged as Philippine government (PHLGOV). The “where” part uses place names and other clues in the stories to geolocate each event as specifically as possible.
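To make the actor coding a little more concrete, here is a minimal Python sketch of how codes like the two above can be pulled apart. It is an illustration only, not GDELT’s or CAMEO’s own software: the function name `split_actor_code` and the `ActorCode` class are hypothetical, and the sketch assumes that a code is a string of three-letter chunks whose first chunk is a country. That assumption holds for PHLMIL and PHLGOV, but it may not hold for every code in the dictionaries.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActorCode:
    """Hypothetical container for a split actor code (not part of GDELT)."""
    country: str       # first three-letter chunk, e.g. "PHL"
    roles: List[str]   # any remaining three-letter chunks, e.g. ["MIL"]


def split_actor_code(code: str) -> ActorCode:
    """Split an actor code into three-letter chunks.

    Assumption: the first chunk is a country code and the rest describe
    the actor's type or role. This matches the examples in the post but
    is not guaranteed for every real code.
    """
    if not code or len(code) % 3 != 0:
        raise ValueError(f"expected a multiple of three letters, got {code!r}")
    chunks = [code[i:i + 3] for i in range(0, len(code), 3)]
    return ActorCode(country=chunks[0], roles=chunks[1:])


if __name__ == "__main__":
    for example in ("PHLMIL", "PHLGOV"):
        print(example, "->", split_actor_code(example))
```

Under those assumptions, both example codes share the country chunk PHL and differ only in the role chunk (MIL versus GOV), which is what lets you group events by country, by actor type, or by both.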

I try to avoid using words like “revolutionary” when talking about the research process, but in this case I think it fits. I suspect this is going to be the data set that launches a thousand dissertations. As Josh Keating noted on his War of Ideas blog at Foreign Policy,

Similar event databases have been built for particular regions, and DARPA has been working along similar lines for the Pentagon with a project known as ICEWS, but for a publicly accessible program…GDELT is unprecedented in its geographic and historic scale.

To Keating’s point about the data set’s scale, I would add two other ways that GDELT is a radical departure from past practice in the discipline. First, it’s going to be updated daily (watch this space). Second, it’s freely available to the public.

Yes, you read that right: a global data set summarizing all sorts of political cooperation and conflict with daily updates is now going to be available to anyone with an Internet connection at no charge. As in: FREE. As I said in a tweet-versation about GDELT this afternoon, contractors have been trying for years (and probably succeeding) to sell closed systems like this to the U.S. government for hundreds of thousands or millions of dollars. If I’m not mistaken, that market just crashed, or at the very least shrank by a whole lot.

GDELT isn’t perfect, of course. I’ve already been tinkering with it a bit as part of a project I’m doing for the Holocaust Museum’s Center for the Prevention of Genocide, on monitoring and predicting mass atrocities, and the data on the “Engage in Unconventional Mass Violence” events I’m hoping to use as a marker of atrocities look more reliable in some cases than others. Still, getting a data set of this size and quality in the public domain is a tremendous leap forward for empirical political science, and the fact that it’s open will allow lots of other smart people to find the flaws and work on eliminating or mitigating them.

Last but not least, I think it’s worth noting that GDELT was made possible, in part, through support from the National Science Foundation. It may be free to you, and it’s orders of magnitude cheaper to produce than the artisanal, hand-crafted event data of yesteryear (like, yesterday). But that doesn’t mean it’s been free to develop, produce, or share, and you can thank the NSF for helping various parts of that process happen.


31 Comments

  1. Andrew

     /  April 10, 2013

    Great post, thanks. This stuff seems to have immense potential.

    Do we know how good the automation is? I wonder if we’re talking about 99.9% of the data being accurate or more like 95%.

    Also, does the ‘market crash’ mean PITF’s dataset is obsolete?

    • On GDELT’s accuracy (and other things), you can find out more about the data set from the paper at the top of this page:

      http://eventdata.psu.edu/papers.dir/automated.html

      When I was talking about a “market crash”, I specifically meant the market for event data production, not data in general. As far as I know, this doesn’t overlap with most of the data sets PITF produces, which are observed at the level of the country-year and focus on structural conditions or big episodes instead of events. The one exception to that is the Worldwide Atrocities Event Dataset. It would be interesting to see a systematic comparison of similar events from the two.

  2. Fr.

     /  April 11, 2013

    It is not yet clear whether the dataset is actually in the public domain: that will depend on the license attached to the data. It is not yet clear, for instance, if a company will be allowed to sell specific parsings of the data. I’m sure KL knows this, so it should get fixed pretty soon.

    • Important point, thank you.

      • Fr.

         /  April 11, 2013

        I’m now thinking that NSF funding might come with contractual strings attached with regard to data release. The Python and R scripts by Phil Schrodt are GPL, so perhaps KL will go for that too.

      • To the best of my knowledge, GDELT itself wasn’t built with NSF funding. Rather, GDELT built on CAMEO and TABARI, which were.

      • Oral Hazard

         /  April 12, 2013

        If anything, NSF funding would tend to require the public availability of the data.

  3. Grant

     /  April 14, 2013

    If you’ve used it already are you of the opinion it’d be a good tool for a college political science class to use?

    • It’s just a data set, no user interface, and it’s a very large and relatively unstructured one. So, no, probably not, unless it’s a class in stats or coding.

  4. Do you know if there is or will be any training available on using this? It seems a bit . . intimidating . .

    • It’s not a product, it’s a data set for research purposes, so, no, I don’t imagine any kind of training unless and until someone develops an interface or something for commercial purposes.

  5. To answer your site’s initial question, my response would be “bloviation,” based on my own unforgettable, and apparently immortal, relationship with Kalev Leetaru:
    Is it artificial intelligence, or authentic stupidity?
    http://kronstantinople.blogspot.com/2011/11/does-big-oil-think-youre-slick.html

