Another Note on the Limitations of Event Data

Last week, Foreign Policy ran a blog post by Kalev Leetaru that used GDELT to try to identify trends over time in protest activity around the world. That’s a fascinating and important question, but it’s also a really hard one, and I don’t think Kalev’s post succeeds in answering it. I wanted to use this space to explain why, because the issues involved are fundamental to efforts to answer many similar and important questions about patterns in human social behavior over time.

To me, the heart of Kalev’s post is his attempt to compare the intensity of protest activity worldwide over the past 35 years, the entirety of the period covered by GDELT. Ideally, we would do this with some kind of index that accounted for things like the number of protest events that occurred, the number of people who participated in them, and the things those people did.

Unfortunately, the data set that includes all of that information for all relevant events around the world doesn’t exist and never will. Although it might feel like we now live in a Panopticon, we don’t. In reality, we can still only see things that get reported in sources to which we have access; those reports aren’t always “true,” sometimes conflict, and are always incomplete; and, even in 2014, it’s still hard to reliably locate, parse, and encode data from the stories that we do see.

GDELT is the most ambitious effort to date to overcome these problems, and that ambition is helping to pull empirical social science in some new and productive directions. GDELT uses software to scour the web for media stories that contain information about a large but predetermined array of verbal and physical interactions. These interactions range from protests, threats, and attacks to more positive things like requests for aid and expressions of support. When GDELT’s software finds text that describes one of those interactions, it creates a record that includes numeric representations of words or phrases indicating what kind of interaction it was, who was involved, and where and when it took place. Each of those records becomes one tiny layer in an ever-growing stack. GDELT was only created in the 2010s, but its software has been applied to archival material to extend its coverage all the way back to 1979. The current version includes well over 200 million records, and that number now grows by tens of thousands every day.
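To make the structure of those records concrete, here is a minimal sketch in R of reading one day’s GDELT 1.0 daily export and looking at a few of the fields just described. The file names and column names are drawn from GDELT’s documented 1.0 format and should be treated as assumptions rather than a tested recipe.

```r
# Minimal sketch, assuming a locally downloaded GDELT 1.0 daily export and its
# published header file; adjust names if the format on hand differs.
cols  <- strsplit(readLines("CSV.header.dailyupdates.txt"), "\t")[[1]]
gdelt <- read.delim("20140523.export.CSV", header = FALSE,
                    col.names = cols, stringsAsFactors = FALSE, quote = "")
# Each row is one coded interaction: who (actor codes), did what (CAMEO event
# code), to whom, and where and when it happened.
head(gdelt[, c("SQLDATE", "Actor1Code", "Actor2Code",
               "EventCode", "ActionGeo_FullName")])
```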

GDELT grows out of a rich tradition of event data production in social science, and its coding process mimics many of the procedures that scholars have long used to try to catalog various events of interest—or, at least, to capture reasonably representative samples of them. As such, it’s tempting to treat GDELT’s records as markers of discrete events that can be counted and cross-tabulated to identify trends over time and other patterns of interest.

That temptation should be assiduously resisted for two reasons that Leetaru and others involved in GDELT’s original creation have frequently acknowledged. First, GDELT can only create records from stories that it sees, and the volume and nature of media coverage and its digitized renderings have changed radically over the past 30 years. This change continues and may still be accelerating. One result of this change is exponential growth over time in the volume of GDELT records, as shown in the chart below (borrowed from an informative post on the Ward Lab blog). Under these circumstances, it’s unclear what comparisons across years, and especially decades, are getting at. Are we seeing meaningful changes in the phenomena of interest, or are we really just seeing traces of change in the volume and nature of reporting on them?

Change Over Time in the Volume of GDELT Records, 1979-2011 (Source: Ward Lab)

Second, GDELT has not fully worked out how to de-duplicate its records. When the same event is reported in more than one media source, GDELT can’t always tell that they are the same event, sometimes even when it’s the same story appearing verbatim in more than one outlet. As a result, events that attract more attention are likely to generate more records. Under these circumstances, the whole idea of treating counts of records in certain categories as counts of certain event types becomes deeply problematic.
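One crude way to see how much duplication can inflate the counts is to collapse records that share the same day, actors, event code, and location. This is not GDELT’s own procedure, just an illustrative pass over a data frame like the one read in above:

```r
# Illustrative only: treat records sharing the same day, actor codes, CAMEO
# code, and action location as one event and see how much the counts shrink.
key <- with(gdelt, paste(SQLDATE, Actor1Code, Actor2Code,
                         EventCode, ActionGeo_FullName, sep = "|"))
deduped <- gdelt[!duplicated(key), ]
c(raw = nrow(gdelt), collapsed = nrow(deduped))
```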

Kalev knows these things and tries to address them in his recent FP post on trends over time in protest activity. Here is how he describes what he does and the graph that results:

The number of protests each month is divided by the total number of all events recorded in GDELT that month to create a “protest intensity” score that tracks just how prevalent worldwide protest activity has been month-by-month over the last quarter-century (this corrects for the exponential rise in media coverage over the last 30 years and the imperfect nature of computer processing of the news). To make it easier to spot the macro-level patterns, a black 12-month moving average trend line is drawn on top of the graph to help clarify the major temporal shifts.

Intensity of protest activity worldwide 1979-April 2014 (black line is 12-month moving average) (Source: Kalev Leetaru via FP)
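For readers who want to see the mechanics, here is a minimal sketch of the index as Kalev describes it, assuming a hypothetical data frame of monthly counts; the object and column names are mine, not GDELT’s.

```r
# Minimal sketch, assuming a data frame `monthly` with one row per month and
# hypothetical columns `protest_records` and `all_records` holding the counts.
monthly$intensity <- monthly$protest_records / monthly$all_records
# Trailing 12-month moving average, like the black line on the graph.
monthly$ma12 <- as.numeric(stats::filter(monthly$intensity,
                                         rep(1 / 12, 12), sides = 1))
plot(monthly$intensity, type = "l", col = "grey",
     xlab = "Month", ylab = "Protest share of all GDELT records")
lines(monthly$ma12, lwd = 2)
```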

Unfortunately, I don’t think Kalev’s normalization strategy addresses either of the aforementioned problems enough to make the kind of inferences he wants to make about trends over time in the intensity of protest activity around the world.

Let’s start at the top. The numerator of Kalev’s index is the monthly count of records in a particular set of categories. This is where the lack of de-duplication can really skew the picture, and the index Kalev uses does nothing to directly address it.

Without better de-duplication, we can’t fix this problem, but we might be less worried about it if we thought that duplication were a reliable marker of event intensity. Unfortunately, it almost certainly isn’t. Certain events catch the media’s eye for all kinds of reasons. Some are related to the nature of the event itself, but many aren’t. The things that interest us change over time, as do the ways we talk about them and the motivations of the corporations and editors who partially mediate that conversation. Under these circumstances, it would strain credulity to assume that the frequency of reports on a particular event reliably represents the intensity, or even the salience, of that event. There are just too many other possible explanations to make that inferential leap.

And there’s trouble in the bottom, too. Kalev’s decision to use the monthly volume of all records in the denominator is a reasonable one, but it doesn’t fully solve the problem it’s meant to address, either.

What we get from this division is a proportion: protest-related records as a share of all records. The problem with comparing these proportions across time slices is that they can differ for more than one reason, and that’s true even if we (heroically) assume that the lack of de-duplication isn’t a concern. A change from one month to the next might result from a change in the frequency or intensity of protest activity, but it could also result from a change in the frequency or intensity of some other event type also being tallied. Say, for example, that a war breaks out and produces a big spike in GDELT records related to violent conflict. Under these circumstances, the number of protest-related records could stay the same or even increase, and we would still see a drop in the “protest intensity score” Kalev uses.
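A toy calculation makes the point. Suppose protest records hold steady at 2,000 a month while a new war adds 20,000 conflict records to the monthly total:

```r
# Toy numbers only: protest records are unchanged, but the denominator grows.
protest <- c(before = 2000, after = 2000)
total   <- c(before = 40000, after = 60000)
round(protest / total, 3)  # "intensity" falls from 0.050 to 0.033 with no
                           # change at all in protest activity
```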

In the end, what we get from Kalev’s index isn’t a reliable measure of the intensity of protest activity around the world and its change over time. What we get instead is a noisy measure of relative media attention to protest activity over a period of time when the nature of media attention itself has changed a great deal in ways that we still don’t fully understand. That quantity is potentially interesting in its own right. Frustratingly, though, it cannot answer seemingly simple questions like “How much protest activity are we seeing now?” or “How has the frequency or intensity of protest activity changed over the past 30 years?”

I’ll wrap this up by saying that I am still really, really excited about the new possibilities for social scientific research opening up as a result of projects like GDELT and, now, the Open Event Data Alliance it helped to spawn. At the same time, I think we social scientists have to be very cautious in our use of these shiny new things. As excited as we may be, we’re also the ones with the professional obligation to check the impulse to push them harder than they’re ready to go.

A Coda to “Using GDELT to Monitor Atrocities, Take 2”

I love doing research in the Internet Age. As I’d hoped it would, my post yesterday on the latest iteration of the atrocities-monitoring system we have in the works has already sparked a lot of really helpful responses. Some of those responses are captured in comments on the post, but not all of them are. So, partly as a public good and partly for my own record-keeping, I thought I’d write a coda to that post enumerating the leads it generated and some of my reactions to them.

Give the Machines Another Shot at It

As a way to reduce or even eliminate the burden placed on our human(s) in the loop, several people suggested something we’ve been considering for a while: use machine-learning techniques to develop classifiers that can be used to further reduce the data left after our first round of filtering. These classifiers could consider all of the features in GDELT, not just the event and actor types we’re using in our R script now. If we’re feeling really ambitious, we could go all the way back to the source stories and use natural-language processing to look for additional discriminatory power there. This second round might not eliminate the need for human review, but it certainly could lighten the load.
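For what it’s worth, here is a rough sketch of what that second round might look like, using a simple logistic regression in R on a handful of GDELT-derived features. All of the object and column names are hypothetical, and the feature set and threshold are placeholders, not a method we have settled on.

```r
# Rough sketch: `screened` is a hypothetical data frame of first-round hits
# with a human-coded 0/1 column `is_atrocity` and a few GDELT-derived features.
fit <- glm(is_atrocity ~ NumMentions + NumSources + AvgTone + GoldsteinScale,
           data = screened, family = binomial)
# Score the next day's filtered records (`new_hits`, also hypothetical) and
# keep only those worth sending to human review; the cutoff is arbitrary.
new_hits$p_hat <- predict(fit, newdata = new_hits, type = "response")
review_queue  <- subset(new_hits, p_hat >= 0.2)
```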

The comment threads on this topic (here and here) nicely capture what I see as the promise and likely limitations of this strategy, so I won’t belabor it here. For now, I’ll just note that how well this would work is an empirical question, and it’s one we hope to get a chance to answer once we’ve accumulated enough screened data to give those classifiers a fighting chance.

Leverage GDELT’s Global Knowledge Graph

Related to the first idea, GDELT co-creator Kalev Leetaru has suggested on a couple of occasions that we think about ways to bring the recently created GDELT Global Knowledge Graph (GKG) to bear on our filtering task. As Kalev describes in a post on the GDELT blog, GKG consists of two data streams, one that records mentions of various counts and another that captures connections in each day’s news between “persons, organizations, locations, emotions, themes, counts, events, and sources.” That second stream in particular includes a bunch of data points that we can connect to specific event records and thus use as additional features in the kind of classifiers described under the previous header. In response to my post, Kalev sent this email to me and a few colleagues:

I ran some very very quick numbers on the human coding results Jay sent me where a human coded 922 articles covering 9 days of GDELT events and coded 26 of them as atrocities. Of course, 26 records isn’t enough to get any kind of statistical latch onto to build a training model, but the spectral response of the various GKG themes is quite informative. For events tagged as being an atrocity, themes such as ETHNICITY, RELIGION, HUMAN_RIGHTS, and a variety of functional actors like Villagers, Doctors, Prophets, Activists, show up in the top themes, whereas in the non-atrocities the roles are primarily political leaders, military personnel, authorities, etc. As just a simple example, the HUMAN_RIGHTS theme appeared in just 6% of non-atrocities, but 30% of atrocities, while Activists show up in 33% of atrocities compared with just 4% of non-atrocities, and the list goes on.

Again, 26 articles isn’t enough to build a model on, but just glancing over the breakdown of the GKG themes for the two there is a really strong and clear breakage between the two across the entire set of themes, and the breakdown fits precisely what Bayesian classifiers like (they are the most accurate for this kind of separation task and outperform SVM and random forest).

So, Jay, the bottom line is that if you can start recording each day the list of articles that you guys review and the ones you flag as an atrocity and give me a nice dataset over time, should be pretty easy to dramatically filter these down for you at the very least.

As I’ve said throughout this process, it’s not that event data can’t do what is needed, it’s that often you have to bring additional signals into the mix to accomplish your goals when the thing you’re after requires signals beyond what the event records are capturing.

What Kalev suggests at the end there—keep a record of all the events we review and the decisions we make on them—is what we’re doing now, and I hope we can expand on his experiment in the next several months.
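As a sketch of the kind of model Kalev has in mind, a naive Bayes classifier over binary GKG theme indicators might look something like this in R. The data frame and theme column names are assumptions based on the themes he mentions, not a description of the GKG files themselves.

```r
# Sketch only: `reviewed` is a hypothetical data frame of screened articles
# with a "label" column ("atrocity" or "other") and 0/1 columns flagging GKG
# themes; the theme column names are assumed from Kalev's email.
library(e1071)  # provides naiveBayes()
themes <- c("HUMAN_RIGHTS", "RELIGION", "ETHNICITY", "ACTIVISTS")
reviewed[themes] <- lapply(reviewed[themes], factor)  # treat flags as categorical
nb <- naiveBayes(x = reviewed[themes], y = factor(reviewed$label), laplace = 1)
head(predict(nb, newdata = reviewed[themes], type = "raw"))  # class probabilities
```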

Crowdsource It

Jim Walsh left a thoughtful comment suggesting that we crowdsource the human coding:

Seems to me like a lot of people might be willing to volunteer their time for this important issue–human rights activists and NGO types, area experts, professors and their students (who might even get some credit and learn about coding). If you had a large enough cadre of volunteers, could assign many (10 or more?) to each day’s data and generate some sort of average or modal response. Would need someone to organize the volunteers, and I’m not sure how this would be implemented online, but might be do-able.

As I said in my reply to him, this is an approach we’ve considered but rejected for now. We’re eager to take advantage of the wisdom of interested crowds and are already doing so in big ways on other parts of our early-warning system, but I have two major concerns about how well it would work for this particular task.

The first is the recruiting problem, and here I see a Catch-22: people are less inclined to do this if they don’t believe the system works, but it’s hard to convince them that the system works if we don’t already have a crowd involved to make it go. This recruiting problem becomes especially acute in a system with time-sensitive deliverables. If we promise daily updates, we need to produce daily updates, and it’s hard to do that reliably if we depend on self-organized labor.

My second concern is the principal-agent problem. Our goal is to make reliable and valid data in a timely way, but there are surely people out there who would bring goals to the process that might not align with ours. Imagine, for example, that Absurdistan appears in the filtered-but-not-yet-coded data to be committing atrocities, but citizens (or even paid agents) of Absurdistan don’t like that idea and so organize to vote those events out of the data set. It’s possible that our project would be too far under the radar for anyone to bother, but our ambitions are larger than that, so we don’t want to assume that will be true. If we succeed at attracting the kind of attention we hope to attract, the deeply political and often controversial nature of our subject matter would make crowdsourcing this task more vulnerable to this kind of failure.

Use Mechanical Turk

Both of the concerns I have about the downsides of crowdsourcing the human-coding stage could be addressed by Ryan Briggs’ suggestion via Twitter to have Amazon Mechanical Turk do it. A hired crowd is there when you need it and (usually) doesn’t bring political agendas to the task. It’s also relatively cheap, and you only pay for work performed.

Thanks to our collaboration with Dartmouth’s Dickey Center, the marginal cost of the human coding isn’t huge, so it’s not clear that Mechanical Turk would offer much advantage on that front. Where it could really help is in routinizing the daily updates. As I mentioned in the initial post, when you depend on human action and have just one or a few people involved, it’s hard to establish a set of routines that covers weekends and college breaks and sick days and is robust to periodic changes in personnel. Primarily for this reason, I hope we’ll be able to run an experiment with Mechanical Turk where we can compare its cost and output to what we’re paying and getting now and see if this strategy might make sense for us.

Don’t Forget About Errors of Omission

Last but not least, a longtime colleague had this to say in an email reacting to the post (hyperlinks added):

You are effectively describing a method for reducing errors of commission, events coded by GDELT as atrocities that, upon closer inspection, should not be. It seems like you also need to examine errors of omission. This is obviously harder. Two possible opportunities would be to compare to either [the PITF Worldwide Atrocities Event Data Set] or to ACLED.  There are two questions. Is GDELT “seeing” the same source info (and my guess is that it is and more, though ACLED covers more than just English sources and I’m not sure where GDELT stands on other languages). Then if so (and there are errors of omission) why aren’t they showing up (coded as different types of events or failed to trigger any coding at all)[?]

It’s true that our efforts so far have focused almost exclusively on avoiding errors of commission, with the important caveat that it’s really our automated filtering process, not GDELT, that commits most of these errors. The basic problem for us is that GDELT, or really the CAMEO scheme on which it’s based, wasn’t designed to spot atrocities per se. As a result, most of what we filter out in our human-coding second stage aren’t things that were miscoded by GDELT. Instead, they’re things that were properly coded by GDELT as various forms of violent action but upon closer inspection don’t appear to involve the additional features of atrocities as we define them.

Of course, that still leaves us with this colleague’s central concern about errors of omission, and on that he’s absolutely right. I have experimented with different actor and event-type criteria to make sure we’re not missing a lot of events of interest in GDELT, but I haven’t yet compared what we’re finding in GDELT to what related databases that use different sources are seeing. Once we accumulate a few months’ worth of data, I think this is something we’re really going to need to do.
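When we do, one simple starting point would be to aggregate both sources to country-months and flag the months where a comparison data set records relevant events but our GDELT-based screen finds none. A minimal sketch, assuming hypothetical data frames with country and month columns:

```r
# `gdelt_hits` and `acled` are hypothetical data frames of events, each with
# `country` and `month` columns; count events per country-month in each.
gdelt_hits$one <- 1
acled$one <- 1
g <- aggregate(one ~ country + month, data = gdelt_hits, FUN = sum)
a <- aggregate(one ~ country + month, data = acled, FUN = sum)
names(g)[3] <- "gdelt_n"
names(a)[3] <- "acled_n"
both <- merge(a, g, by = c("country", "month"), all.x = TRUE)
both$gdelt_n[is.na(both$gdelt_n)] <- 0
subset(both, acled_n > 0 & gdelt_n == 0)  # candidate errors of omission
```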

Stay tuned for Take 3…

The Future of Political Science Just Showed Up

I recently wrote about how data sets just starting to come online are going to throw open doors to projects that political scientists have been hoping to do for a while but haven’t had the evidence to handle. Well, one of those shiny new trains just pulled into the station: the Global Dataset of Events, Language, and Tone, a.k.a. GDELT, is now in the public domain.

GDELT is the primary work of Kalev Leetaru, a University Fellow at the University of Illinois Graduate School of Library and Information Science, but its intellectual and practical origins—and its journey into the public domain—also owe a lot to the great Phil Schrodt. The data set includes records summarizing more than 200 million events that have occurred around the world from 1979 to the present. Those records are created by software that grabs and parses news reports from a number of international sources, including Agence France-Presse, the Associated Press, and Xinhua. Each record indicates who did or said what to whom, where, and when.

The “did what” part of each record is based on the CAMEO coding scheme, which sorts actions into a fairly detailed set of categories covering many different forms of verbal and material cooperation and conflict, from public statements of support to attacks with weapons of mass destruction. The “who” and “to whom” parts use carefully constructed dictionaries to identify specific actors and targets by type and proper name. So, for example, “Philippine soldiers” gets identified as Philippines military (PHLMIL), while “Philippine Secretary of Agriculture” gets tagged as Philippine government (PHLGOV). The “where” part uses place names and other clues in the stories to geolocate each event as specifically as possible.
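As a small illustration of how those codes get used in practice, here is how one might pull protest events aimed at government actors from a day’s worth of records in R, assuming a hypothetical data frame with the GDELT 1.0 column names:

```r
# `gdelt` is a hypothetical data frame of daily records with GDELT 1.0 column
# names. CAMEO root code 14 covers protest events; "GOV" in the actor code
# marks government actors.
protests    <- subset(gdelt, EventRootCode == 14)
gov_targets <- subset(protests, grepl("GOV", Actor2Code))
nrow(gov_targets)
```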

I try to avoid using words like “revolutionary” when talking about the research process, but in this case I think it fits. I suspect this is going to be the data set that launches a thousand dissertations. As Josh Keating noted on his War of Ideas blog at Foreign Policy,

Similar event databases have been built for particular regions, and DARPA has been working along similar lines for the Pentagon with a project known as ICEWS, but for a publicly accessible program…GDELT is unprecedented in its geographic and historic scale.

To Keating’s point about the data set’s scale, I would add two other ways that GDELT is a radical departure from past practice in the discipline. First, it’s going to be updated daily (watch this space). Second, it’s freely available to the public.

Yes, you read that right: a global data set summarizing all sorts of political cooperation and conflict with daily updates is now going to be available to anyone with an Internet connection at no charge. As in: FREE. As I said in a tweet-versation about GDELT this afternoon, contractors have been trying for years (and probably succeeding) to sell closed systems like this to the U.S. government for hundreds of thousands or millions of dollars. If I’m not mistaken, that market just crashed, or at the very least shrank by a whole lot.

GDELT isn’t perfect, of course. I’ve already been tinkering with it a bit as part of a project I’m doing for the Holocaust Museum’s Center for the Prevention of Genocide, on monitoring and predicting mass atrocities, and the data on the “Engage in Unconventional Mass Violence” events I’m hoping to use as a marker of atrocities look more reliable in some cases than others. Still, getting a data set of this size and quality in the public domain is a tremendous leap forward for empirical political science, and the fact that it’s open will allow lots of other smart people to find the flaws and work on eliminating or mitigating them.

Last but not least, I think it’s worth noting that GDELT was made possible, in part, through support from the National Science Foundation. It may be free to you, and it’s orders of magnitude cheaper to produce than the artisanal, hand-crafted event data of yesteryear (like, yesterday). But that doesn’t mean it’s been free to develop, produce, or share, and you can thank the NSF for helping various parts of that process happen.
