Another Note on the Limitations of Event Data

Last week, Foreign Policy ran a blog post by Kalev Leetaru that used GDELT to try to identify trends over time in protest activity around the world. That’s a fascinating and important question, but it’s also a really hard one, and I don’t think Kalev’s post succeeds in answering it. I wanted to use this space to explain why, because the issues involved are fundamental to efforts to answer many similar and important questions about patterns in human social behavior over time.

To me, the heart of Kalev’s post is his attempt to compare the intensity of protest activity worldwide over the past 35 years, the entirety of the period covered by GDELT. Ideally, we would do this with some kind of index that accounted for things like the number of protest events that occurred, the number of people who participated in them, and the things those people did.

Unfortunately, the data set that includes all of that information for all relevant events around the world doesn’t exist and never will. Although it might feel like we now live in a Panopticon, we don’t. In reality, we can still only see things that get reported in sources to which we have access; those reports aren’t always “true,” sometimes conflict, and are always incomplete; and, even in 2014, it’s still hard to reliably locate, parse, and encode data from the stories that we do see.

GDELT is the most ambitious effort to date to overcome these problems, and that ambition is helping to pull empirical social science in some new and productive directions. GDELT uses software to scour the web for media stories that contain information about a large but predetermined array of verbal and physical interactions. These interactions range from protests, threats, and attacks to more positive things like requests for aid and expressions of support. When GDELT’s software finds text that describes one of those interactions, it creates a record that includes numeric representations of words or phrases indicating what kind of interaction it was, who was involved, and where and when it took place. Each of those records becomes one tiny layer in an ever-growing stack. GDELT was only created in the 2010s, but its software has been applied to archival material to extend its coverage all the way back to 1979. The current version includes roughly a quarter of a billion records, and that number now grows by tens of thousands every day.
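For readers who haven’t looked under the hood, here is a minimal sketch of what one of those records might look like. The field names and values are illustrative stand-ins rather than GDELT’s actual schema, but they capture the basic idea: each record reduces a snippet of news text to a handful of categorical and numeric fields.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class EventRecord:
    """A simplified stand-in for a machine-coded event record.

    Field names are illustrative, not GDELT's actual column names.
    """
    event_date: date   # when the interaction reportedly took place
    event_code: str    # category of interaction (a CAMEO-style code)
    actor1: str        # who initiated the interaction
    actor2: str        # whom it was directed at
    lat: float         # approximate latitude of the event
    lon: float         # approximate longitude of the event
    source_url: str    # the story the record was coded from

# One hypothetical record: a protest-type event coded from a single story.
example = EventRecord(
    event_date=date(2014, 5, 1),
    event_code="14",   # "14" is the protest root category in the CAMEO taxonomy
    actor1="PROTESTERS",
    actor2="GOVERNMENT",
    lat=50.45,
    lon=30.52,
    source_url="http://example.com/story",
)
```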

GDELT grows out of a rich tradition of event data production in social science, and its coding process mimics many of the procedures that scholars have long used to try to catalog various events of interest—or, at least, to capture reasonably representative samples of them. As such, it’s tempting to treat GDELT’s records as markers of discrete events that can be counted and cross-tabulated to identify trends over time and other patterns of interest.

That temptation should be assiduously resisted for two reasons that Leetaru and others involved in GDELT’s original creation have frequently acknowledged. First, GDELT can only create records from stories that it sees, and the volume and nature of media coverage and its digitized renderings have changed radically over the past 30 years. This change continues and may still be accelerating. One result of this change is exponential growth over time in the volume of GDELT records, as shown in the chart below (borrowed from an informative post on the Ward Lab blog). Under these circumstances, it’s unclear what comparisons across years, and especially decades, are getting at. Are we seeing meaningful changes in the phenomena of interest, or are we really just seeing traces of change in the volume and nature of reporting on them?

Change Over Time in the Volume of GDELT Records, 1979-2011 (Source: Ward Lab)
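To see what that confounding looks like in practice, here is a toy simulation (mine, not derived from GDELT): the underlying number of events is held perfectly constant, but the share of events that generate a digitized report grows over time. The raw count of observed records still climbs steeply, even though nothing about the underlying behavior has changed.

```python
import random

random.seed(1)

TRUE_EVENTS_PER_YEAR = 1000  # assume the real world is perfectly flat

for year in range(1979, 2015):
    # Hypothetical coverage rate: the probability that any given event
    # generates at least one digitized report, growing over time.
    coverage = min(1.0, 0.02 * 1.15 ** (year - 1979))
    observed = sum(random.random() < coverage for _ in range(TRUE_EVENTS_PER_YEAR))
    print(year, observed)
```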

Second, GDELT has not fully worked out how to de-duplicate its records. When the same event is reported in more than one media source, GDELT can’t always tell that they are the same event, sometimes even when it’s the same story appearing verbatim in more than one outlet. As a result, events that attract more attention are likely to generate more records. Under these circumstances, the whole idea of treating counts of records in certain categories as counts of certain event types becomes deeply problematic.
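A crude illustration of why this is hard: the most obvious de-duplication key (something like date, event type, actors, and location) will collapse verbatim wire copy but will still split a single event whose reports differ on any one field, and it can wrongly merge distinct events that happen to share them all. The sketch below is my own toy example, not GDELT’s actual procedure.

```python
from collections import Counter

# Hypothetical records coded from three stories about the SAME protest.
# Two stories agree on every detail; the third uses a neighboring town's dateline.
records = [
    ("2014-05-01", "14", "PROTESTERS", "GOVERNMENT", "Kyiv"),
    ("2014-05-01", "14", "PROTESTERS", "GOVERNMENT", "Kyiv"),     # verbatim wire copy
    ("2014-05-01", "14", "PROTESTERS", "GOVERNMENT", "Brovary"),  # same event, different dateline
]

# Naive de-duplication: collapse records that share the full key.
deduped = Counter(records)
print(len(deduped), "apparent events from", len(records), "records")
# -> 2 apparent events, even though only one protest took place
```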

Kalev knows these things and tries to address them in his recent FP post on trends over time in protest activity. Here is how he describes what he does and the graph that results:

The number of protests each month is divided by the total number of all events recorded in GDELT that month to create a “protest intensity” score that tracks just how prevalent worldwide protest activity has been month-by-month over the last quarter-century (this corrects for the exponential rise in media coverage over the last 30 years and the imperfect nature of computer processing of the news). To make it easier to spot the macro-level patterns, a black 12-month moving average trend line is drawn on top of the graph to help clarify the major temporal shifts.

Intensity of protest activity worldwide 1979-April 2014 (black line is 12-month moving average) (Source: Kalev Leetaru via FP)
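For concreteness, the computation Kalev describes amounts to something like the sketch below. It assumes a hypothetical table `events` with one row per GDELT record, a monthly period column, and a boolean flag for protest-type records; none of that is GDELT’s native layout, and the function is my reconstruction of the described procedure, not Kalev’s code.

```python
import pandas as pd

def protest_intensity(events: pd.DataFrame) -> pd.DataFrame:
    """Monthly protest share of all records, plus a 12-month moving average.

    Assumes `events` has a 'month' column (e.g., a pandas Period) and a
    boolean 'is_protest' column.
    """
    monthly = events.groupby("month").agg(
        protest_records=("is_protest", "sum"),
        all_records=("is_protest", "size"),
    )
    monthly["intensity"] = monthly["protest_records"] / monthly["all_records"]
    monthly["intensity_ma12"] = monthly["intensity"].rolling(12).mean()
    return monthly
```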

Unfortunately, I don’t think Kalev’s normalization strategy addresses either of the aforementioned problems enough to make the kind of inferences he wants to make about trends over time in the intensity of protest activity around the world.

Let’s start at the top. The numerator of Kalev’s index is the monthly count of records in a particular set of categories. This is where the lack of de-duplication can really skew the picture, and the index Kalev uses does nothing to directly address it.

Without better de-duplication, we can’t fix this problem, but we might be less worried about it if we thought that duplication were a reliable marker of event intensity. Unfortunately, it almost certainly isn’t. Certain events catch the media’s eye for all kinds of reasons. Some are related to the nature of the event itself, but many aren’t. The things that interest us change over time, as do the ways we talk about them and the motivations of the corporations and editors who partially mediate that conversation. Under these circumstances, it would strain credulity to assume that the frequency of reports on a particular event reliably represents the intensity, or even the salience, of that event. There are just too many other possible explanations to make that inferential leap.

And there’s trouble in the bottom, too. Kalev’s decision to use the monthly volume of all records in the denominator is a reasonable one, but it doesn’t fully solve the problem it’s meant to address, either.

What we get from this division is a proportion: protest-related records as a share of all records. The problem with comparing these proportions across time slices is that they can differ for more than one reason, and that’s true even if we (heroically) assume that the lack of de-duplication isn’t a concern. A change from one month to the next might result from a change in the frequency or intensity of protest activity, but it could also result from a change in the frequency or intensity of some other event type also being tallied. Say, for example, that a war breaks out and produces a big spike in GDELT records related to violent conflict. Under these circumstances, the number of protest-related records could stay the same or even increase, and we would still see a drop in the “protest intensity score” Kalev uses.
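A back-of-the-envelope example (numbers invented for illustration) makes the point: hold the protest-related records perfectly constant and let a war add records of other kinds, and the index still falls.

```python
# Month 1: 1,000 protest records out of 10,000 total records.
# Month 2: still 1,000 protest records, but a war adds 5,000 conflict records.
month1 = 1_000 / 10_000   # "protest intensity" of 0.100
month2 = 1_000 / 15_000   # "protest intensity" of about 0.067
print(month1, month2)
# The index drops by a third even though protest-related records did not change at all.
```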

In the end, what we get from Kalev’s index isn’t a reliable measure of the intensity of protest activity around the world and its change over time. What we get instead is a noisy measure of relative media attention to protest activity over a period of time when the nature of media attention itself has changed a great deal in ways that we still don’t fully understand. That quantity is potentially interesting in its own right. Frustratingly, though, it cannot answer seemingly simple questions like “How much protest activity are we seeing now?” or “How has the frequency or intensity of protest activity changed over the past 30 years?”

I’ll wrap this up by saying that I am still really, really excited about the new possibilities for social scientific research opening up as a result of projects like GDELT and, now, the Open Event Data Alliance it helped to spawn. At the same time, I think we social scientists have to be very cautious in our use of these shiny new things. As excited as we may be, we’re also the ones with the professional obligation to check the impulse to push them harder than they’re ready to go.

The Steep Slope of the Data Revolution’s Second Derivative

Most of the talk about a social science “data revolution” has emphasized rapid increases in the quantity of data available to us. Some of that talk has also focused on changes in the quality of those data, including new ideas about how to separate the wheat from the chaff in situations where there’s a lot of grain to thresh. So far, though, we seem to be talking much less about the rate of change in those changes, or what calculus calls the second derivative.
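In symbols, and with the obvious caveat that this is just a gloss on the metaphor: let N(t) stand for whatever measure of data volume or quality you prefer at time t.

```latex
\[
  N(t) \quad \text{(how much data we have)}, \qquad
  N'(t) = \frac{dN}{dt} \quad \text{(how fast that is growing)}, \qquad
  N''(t) = \frac{d^{2}N}{dt^{2}} \quad \text{(how fast the growth itself is changing)}
\]
```

Most of the conversation so far has been about the first two quantities; the claim here is that the third one is large.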

Lately, the slope of this second derivative has been pretty steep. It’s not just that we now have much more, and in some cases much better, data. The sources and content of those data sets are often fast-moving targets, too. The whole environment is growing and churning at an accelerating pace, and that’s simultaneously exhilarating and frustrating.

It’s frustrating because data sets that evolve as we use them create a number of analytical problems that we don’t get from stable measurement tools. Most important, evolving data sets make it hard to compare observations across time, and longitudinal analysis is the crux of most social-scientific research. Paul Pierson explains why in his terrific 2004 book, Politics in Time:

Why do social scientists need to focus on how processes unfold over significant stretches of time? First, because many social processes are path dependent, in which case the key causes are temporally removed from their continuing effects… Second, because sequencing—the temporal order of events or processes—can be a crucial determinant of important social outcomes. Third, because many important social causes and outcomes are slow-moving—they take place over quite extended periods of time and are only likely to be adequately explained (or in some cases even observed in the first place) if analysts are specifically attending to that possibility.

When our measurement systems evolve as we use them, changes in the data we receive might reflect shifts in the underlying phenomenon. They also might reflect changes in the methods and mechanisms by which we observe and record information about that phenomenon, however, and it’s often impossible to tease the one out from the other.

A recent study by David Lazer, Gary King, Ryan Kennedy, and Alessandro Vespignani on what Google Flu Trends (GFT) teaches us about “traps in Big Data analysis” offers a nice case in point. Developed in the late 2000s by Google engineers and researchers at the Centers for Disease Control and Prevention, GFT uses data on Google search queries to help detect flu epidemics (see this paper). As Lazer and his co-authors describe, GFT initially showed great promise as a forecasting tool, and its success spurred excitement about the power of new data streams to shed light on important social processes. For the past few years, though, the tool has worked poorly on its own, and Lazer & co. believe changes in Google’s search software are the reason. The problem—for researchers, anyway—is that

The Google search algorithm is not a static entity—the company is constantly testing and improving search. For example, the official Google search blog reported 86 changes in June and July 2012 alone (SM). Search patterns are the result of thousands of decisions made by the company’s programmers in various sub-units and by millions of consumers worldwide.

Google keeps tinkering with its search software because that’s what its business entails, but we can expect to see more frequent changes in some data sets specific to social science, too. One of the developments about which I’m most excited is the recent formation of the Open Event Data Alliance (OEDA) and the initial release of the machine-coded political event data it plans to start producing soon, hopefully this summer. As its name implies, OEDA plans to make not just its data but also its code freely available to the public in order to grow a community of users who can help improve and expand the software. That crowdsourcing will surely accelerate the development of the scraping and coding machinery, but it also ensures that the data OEDA produces will be a moving target for a while in ways that will complicate attempts to analyze it.

If these accelerated changes are challenging for basic researchers, they’re even tougher on applied researchers, who have to show and use their work in real time. So what’s an applied researcher to do when their data-gathering instruments are frequently changing, and often in opaque and unpredictable ways?

First, it seems prudent to build systems that are modular, so that a failure in one part of the system can be identified and corrected without having to rebuild the whole edifice. In the atrocities early-warning system I’m helping to build right now, we’re doing this by creating a few subsystems with some overlap in their functions. If one part doesn’t pan out or suddenly breaks, we can lean on the others while we repair or retool.
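As a purely illustrative sketch of that kind of modularity (my own toy example, not the actual architecture of the system mentioned above): several independent source modules that can each fail on their own, with a collector that logs failures and keeps whatever still works.

```python
def fetch_from_source_a():
    """One hypothetical data-gathering subsystem."""
    raise ConnectionError("source A changed its format again")

def fetch_from_source_b():
    """A second subsystem whose coverage partially overlaps with the first."""
    return [{"country": "XYZ", "indicator": 0.42}]

def gather_all(sources):
    """Run every subsystem, note failures, and keep whatever succeeds."""
    results = []
    for fetch in sources:
        try:
            results.extend(fetch())
        except Exception as err:
            print(f"{fetch.__name__} failed ({err}); leaning on the other sources")
    return results

print(gather_all([fetch_from_source_a, fetch_from_source_b]))
```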

Second, it’s also a good idea to embed those technical systems in organizational procedures that emphasize frequent checking and fast adaptation. One way to do this is to share your data and code and to discuss your work often with outsiders as you go, so you can catch mistakes, spot alternatives, and see these changes coming before you get too far down any one path. Using open-source statistical software like R is also helpful in this regard, because it lets you take advantage of new features and crowd fixes as they bubble up.

Last and fuzziest, I think it helps to embrace the idea that your work doesn’t really belong to you or your organization but is just one tiny part of a larger ecosystem that you’re hoping to see evolve in a particular direction. What worked one month might not work the next, and you’ll never know exactly what effect you’re having, but that’s okay if you recognize that it’s not really supposed to be about you. Just keep up as best you can, don’t get too heavily invested in any one approach or idea, and try to enjoy the ride.
