Another Note on the Limitations of Event Data

Last week, Foreign Policy ran a blog post by Kalev Leetaru that used GDELT to try to identify trends over time in protest activity around the world. That’s a fascinating and important question, but it’s also a really hard one, and I don’t think Kalev’s post succeeds in answering it. I wanted to use this space to explain why, because the issues involved are fundamental to efforts to answer many similar and important questions about patterns in human social behavior over time.

To me, the heart of Kalev’s post is his attempt to compare the intensity of protest activity worldwide over the past 35 years, the entirety of the period covered by GDELT. Ideally, we would do this with some kind of index that accounted for things like the number of protest events that occurred, the number of people who participated in them, and the things those people did.

Unfortunately, the data set that includes all of that information for all relevant events around the world doesn’t exist and never will. Although it might feel like we now live in a Panopticon, we don’t. In reality, we can still only see things that get reported in sources to which we have access; those reports aren’t always “true,” sometimes conflict, and are always incomplete; and, even in 2014, it’s still hard to reliably locate, parse, and encode data from the stories that we do see.

GDELT is the most ambitious effort to date to overcome these problems, and that ambition is helping to pull empirical social science in some new and productive directions. GDELT uses software to scour the web for media stories that contain information about a large but predetermined array of verbal and physical interactions. These interactions range from protests, threats, and attacks to more positive things like requests for aid and expressions of support. When GDELT’s software finds text that describes one of those interactions, it creates a record that includes numeric representations of words or phrases indicating what kind of interaction it was, who was involved, and where and when it took place. Each of those records becomes one tiny layer in an ever-growing stack. GDELT was only created in the 2010s, but its software has been applied to archival material to extend its coverage all the way back to 1979. The current version includes roughly 2.5 million records, and that number now grows by tens of thousands every day.

GDELT grows out of a rich tradition of event data production in social science, and its coding process mimics many of the procedures that scholars have long used to try to catalog various events of interest—or, at least, to capture reasonably representative samples of them. As such, it’s tempting to treat GDELT’s records as markers of discrete events that can be counted and cross-tabulated to identify trends over time and other patterns of interest.

That temptation should be assiduously resisted for two reasons that Leetaru and others involved in GDELT’s original creation have frequently acknowledged. First, GDELT can only create records from stories that it sees, and the volume and nature of media coverage and its digitized renderings have changed radically over the past 30 years. This change continues and may still be accelerating. One result of this change is exponential growth over time in the volume of GDELT records, as shown in the chart below (borrowed from an informative post on the Ward Lab blog). Under these circumstances, it’s unclear what comparisons across years, and especially decades, are getting at. Are we seeing meaningful changes in the phenomenon of interest, or are we really just seeing traces of change in the volume and nature of reporting on them?

Change Over Time in the Volume of GDELT Records, 1979-2011 (Source: Ward Lab)

Second, GDELT has not fully worked out how to de-duplicate its records. When the same event is reported in more than one media source, GDELT can’t always tell that they are the same event, sometimes even when it’s the same story appearing verbatim in more than one outlet. As a result, events that attract more attention are likely to generate more records. Under these circumstances, the whole idea of treating counts of records in certain categories as counts of certain event types becomes deeply problematic.

Kalev knows these things and tries to address them in his recent FP post on trends over time in protest activity. Here is how he describes what he does and the graph that results:

The number of protests each month is divided by the total number of all events recorded in GDELT that month to create a “protest intensity” score that tracks just how prevalent worldwide protest activity has been month-by-month over the last quarter-century (this corrects for the exponential rise in media coverage over the last 30 years and the imperfect nature of computer processing of the news). To make it easier to spot the macro-level patterns, a black 12-month moving average trend line is drawn on top of the graph to help clarify the major temporal shifts.

Intensity of protest activity worldwide 1979-April 2014 (black line is 12-month moving average) (Source: Kalev Leetaru via FP)

Unfortunately, I don’t think Kalev’s normalization strategy addresses either of the aforementioned problems enough to make the kind of inferences he wants to make about trends over time in the intensity of protest activity around the world.

Let’s start at the top. The numerator of Kalev’s index is the monthly count of records in a particular set of categories. This is where the lack of de-duplication can really skew the picture, and the index Kalev uses does nothing to directly address it.

Without better de-duplication, we can’t fix this problem, but we might be less worried about it if we thought that duplication were a reliable marker of event intensity. Unfortunately, it almost certainly isn’t. Certain events catch the media’s eyes for all kinds of reasons. Some are related to the nature of the event itself, but many aren’t. The things that interest us change over time, as do the ways we talk about them and the motivations of the corporations and editors who partially mediate that conversation. Under these circumstances, it would strain credulity to assume that the frequency of reports on a particular event reliably represents the intensity, or even the salience, of that event. There are just too many other possible explanations to make that inferential leap.

And there’s trouble in the bottom, too. Kalev’s decision to use the monthly volume of all records in the denominator is a reasonable one, but it doesn’t fully solve the problem it’s meant to address, either.

What we get from this division is a proportion: protest-related records as a share of all records. The problem with comparing these proportions across time slices is that they can differ for more than one reason, and that’s true even if we (heroically) assume that the lack of de-duplication isn’t a concern. A change from one month to the next might result from a change in the frequency or intensity of protest activity, but it could also result from a change in the frequency or intensity of some other event type also being tallied. Say, for example, that a war breaks out and produces a big spike in GDELT records related to violent conflict. Under these circumstances, the number of protest-related records could stay the same or even increase, and we would still see a drop in the “protest intensity score” Kalev uses.

In the end, what we get from Kalev’s index isn’t a reliable measure of the intensity of protest activity around the world and its change over time. What we get instead is a noisy measure of relative media attention to protest activity over a period of time when the nature of media attention itself has changed a great deal in ways that we still don’t fully understand. That quantity is potentially interesting in its own right. Frustratingly, though, it cannot answer seemingly simple questions like “How much protest activity are we seeing now?” or “How has the frequency or intensity of protest activity changed over the past 30 years?”

I’ll wrap this up by saying that I am still really, really excited about the new possibilities for social scientific research opening up as a result of projects like GDELT and, now, the Open Event Data Alliance it helped to spawn. At the same time, I think we social scientists have to be very cautious in our use of these shiny new things. As excited as we may be, we’re also the ones with the professional obligation to check the impulse to push them harder than they’re ready to go.