The Steep Slope of the Data Revolution’s Second Derivative

Most of the talk about a social science “data revolution” has emphasized rapid increases in the quantity of data available to us. Some of that talk has also focused on changes in the quality of those data, including new ideas about how to separate the wheat from the chaff in situations where there’s a lot of grain to thresh. So far, though, we seem to be talking much less about the rate of change in those changes, or what calculus calls the second derivative.

Lately, the slope of this second derivative has been pretty steep. It’s not just that we now have much more, and in some cases much better, data. The sources and content of those data sets are often fast-moving targets, too. The whole environment is growing and churning at an accelerating pace, and that’s simultaneously exhilarating and frustrating.

It’s frustrating because data sets that evolve as we use them create a number of analytical problems that we don’t get from stable measurement tools. Most important, evolving data sets make it hard to compare observations across time, and longitudinal analysis is the crux of most social-scientific research. Paul Pierson explains why in his terrific 2004 book, Politics in Time:

Why do social scientists need to focus on how processes unfold over significant stretches of time? First, because many social processes are path dependent, in which case the key causes are temporally removed from their continuing effects… Second, because sequencing—the temporal order of events or processes—can be a crucial determinant of important social outcomes. Third, because many important social causes and outcomes are slow-moving—they take place over quite extended periods of time and are only likely to be adequately explained (or in some cases even observed in the first place) if analysts are specifically attending to that possibility.

When our measurement systems evolve as we use them, changes in the data we receive might reflect shifts in the underlying phenomenon. They also might reflect changes in the methods and mechanisms by which we observe and record information about that phenomenon, however, and it’s often impossible to tease the one out from the other.

A recent study by David Lazer, Gary King, Ryan Kennedy, and Alessandro Vespignani on what Google Flu Trends (GFT) teaches us about “traps in Big Data analysis” offers a nice case in point. Developed in the late 2000s by Google engineers and researchers at the Centers for Disease Control and Prevention, GFT uses data on Google search queries to help detect flu epidemics (see this paper). As Lazer and his co-authors describe, GFT initially showed great promise as a forecasting tool, and its success spurred excitement about the power of new data streams to shed light on important social processes. For the past few years, though, the tool has worked poorly on its own, and Lazer & co. believe changes in Google’s search software are the reason. The problem—for researchers, anyway—is that

The Google search algorithm is not a static entity—the company is constantly testing and improving search. For example, the official Google search blog reported 86 changes in June and July 2012 alone (SM). Search patterns are the result of thousands of decisions made by the company’s programmers in various sub-units and by millions of consumers worldwide.
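
To see why that kind of churn is such a headache, here is a toy simulation in Python. It is mine, not Lazer & co.’s, and it is nothing like the real GFT model; the point is just that a proxy measure can track the true signal nicely until the platform generating the proxy changes, at which point a model calibrated on the old regime quietly goes wrong.

```python
# Toy illustration of measurement drift (hypothetical numbers throughout).
import numpy as np

np.random.seed(42)

# "True" weekly flu prevalence for two years, in arbitrary units
weeks = 104
flu = 2 + np.sin(np.linspace(0, 4 * np.pi, weeks)) + np.random.normal(0, 0.1, weeks)

# Search volume is a noisy function of flu, but halfway through, the
# (hypothetical) search engine starts suggesting flu-related queries,
# inflating volume by 50% with no change in the underlying epidemic.
search = 10 * flu + np.random.normal(0, 1, weeks)
search[52:] *= 1.5

# Calibrate a simple linear model on year one only
slope, intercept = np.polyfit(search[:52], flu[:52], 1)

# Apply it to year two: predictions are now badly inflated
pred_year2 = slope * search[52:] + intercept
print("Mean true prevalence, year two:      %.2f" % flu[52:].mean())
print("Mean predicted prevalence, year two: %.2f" % pred_year2.mean())
```

Nothing in the data stream itself tells you that the jump came from the measurement machinery rather than from the flu; you only find out when the predictions start diverging from some independent benchmark.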

Google keeps tinkering with its search software because that’s what its business entails, but we can expect to see more frequent changes in some data sets specific to social science, too. One of the developments about which I’m most excited is the recent formation of the Open Event Data Alliance (OEDA) and the initial release of the machine-coded political event data it plans to start producing soon, hopefully this summer. As its name implies, OEDA plans to make not just its data but also its code freely available to the public in order to grow a community of users who can help improve and expand the software. That crowdsourcing will surely accelerate the development of the scraping and coding machinery, but it also ensures that the data OEDA produces will be a moving target for a while in ways that will complicate attempts to analyze it.

If these accelerated changes are challenging for basic researchers, they’re even tougher on applied researchers, who have to show and use their work in real time. So what are applied researchers to do when their data-gathering instruments keep changing, often in opaque and unpredictable ways?

First, it seems prudent to build systems that are modular, so that a failure in one part of the system can be identified and corrected without having to rebuild the whole edifice. In the atrocities early-warning system I’m helping to build right now, we’re doing this by creating a few subsystems with some overlap in their functions. If one part doesn’t pan out or suddenly breaks, we can lean on the others while we repair or retool.
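
As a rough sketch of what I mean by modular, here is a minimal example in Python. The data sources and function names are invented for illustration; they are not the actual components of our system.

```python
# A minimal sketch of the overlapping-subsystems idea: each data source is an
# independent module, and the pipeline falls back to the next one if a module
# breaks. Source names and functions here are hypothetical.

def fetch_from_event_feed():
    # e.g., pull the latest batch of machine-coded event data
    raise ConnectionError("upstream format changed")  # simulate a breakage

def fetch_from_news_scraper():
    # e.g., a slower in-house scraper with overlapping coverage
    return [{"date": "2014-03-01", "event": "protest", "country": "XYZ"}]

def fetch_events(sources):
    """Try each source in turn; log failures instead of crashing the system."""
    for name, source in sources:
        try:
            data = source()
            print("using data from %s" % name)
            return data
        except Exception as err:
            print("%s failed (%s); falling back" % (name, err))
    raise RuntimeError("all data sources failed")

events = fetch_events([("event_feed", fetch_from_event_feed),
                       ("news_scraper", fetch_from_news_scraper)])
```

The payoff is that a breakage gets logged and contained instead of silently corrupting everything downstream while we repair the broken piece.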

Second, it’s also a good idea to embed those technical systems in organizational procedures that emphasize frequent checking and fast adaptation. One way to do this is to share your data and code and to discuss your work often with outsiders as you go, so you can catch mistakes, spot alternatives, and see these changes coming before you get too far down any one path. Using open-source statistical software like R is also helpful in this regard, because it lets you take advantage of new features and crowd fixes as they bubble up.
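
In practice, “frequent checking” can be as mundane as a handful of automated sanity tests that run every time a new batch of data arrives. Here is a hypothetical example; the column names and thresholds are placeholders, not anything from a real system.

```python
# Simple automated checks to run whenever a new batch of data arrives.
# Column names and thresholds are placeholders, not a real schema.

def check_new_batch(new_batch, previous_batch, expected_columns):
    problems = []
    # 1. Did the schema change out from under us?
    missing = expected_columns - set(new_batch[0].keys())
    if missing:
        problems.append("missing columns: %s" % sorted(missing))
    # 2. Did the volume of records shift suspiciously?
    ratio = len(new_batch) / float(len(previous_batch))
    if ratio < 0.5 or ratio > 2.0:
        problems.append("record count changed by a factor of %.1f" % ratio)
    return problems

old_batch = [{"date": "2014-02-01", "event": "protest"}] * 100
new_batch = [{"date": "2014-03-01", "event": "protest"}] * 30  # sudden drop
print(check_new_batch(new_batch, old_batch, {"date", "event"}))
```

Checks like these won’t tell you why a feed changed, but they will flag the change before it quietly works its way into your forecasts.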

Last and fuzziest, I think it helps to embrace the idea that your work doesn’t really belong to you or your organization but is just one tiny part of a larger ecosystem that you’re hoping to see evolve in a particular direction. What worked one month might not work the next, and you’ll never know exactly what effect you’re having, but that’s okay if you recognize that it’s not really supposed to be about you. Just keep up as best you can, don’t get too heavily invested in any one approach or idea, and try to enjoy the ride.

4 Comments

  1. Jay,

    I think you’ve highlighted some really good points here. The point that hits closest to my experience is about the contingent nature of a lot of this data. As the marginal costs and time needed to produce data fall (which is what I think is at the root of the acceleration you describe), the infrastructure surrounding data production gets smaller and smaller. That, in turn, means that there aren’t always solid institutional homes for the data in the way that the older “artisanal, hand-crafted” datasets had (COW, for instance). While those light startup costs are great for spurring innovation, they also mean that data projects can fall apart just as quickly as they come together. That leaves a lot of people who were counting on the data in the lurch, as I and I suspect many other readers experienced in January.

    The churn in the data itself over time is thankfully something I haven’t had to deal with yet, but as innovation goes faster and anyone can contribute improvements to coding software on GitHub, it’s going to be something we have to figure out how to accommodate. It’s great to improve accuracy and add new features, but for production uses of the data, including forecasts, new improvements can’t break the old models.

    I think there are a few overlapping solutions to these two problems.
    One step is to make everything about the process as open as possible. This does a few things: it distributes the code so anyone can pick things back up if the coding project fails completely, it makes any changes to the code visible so you can figure out exactly how the new version is different from the old, and more broadly it lets users peer into what would otherwise be a black box and see how coding decisions are made. I think people’s default position should be skepticism toward data generating systems they can’t inspect openly.

    A second step is to build institutional frameworks around data generating systems. One example, which you allude to, is the Open Event Data Alliance (http://openeventdata.org/) (note: I’m involved in setting it up but I don’t speak on its behalf). Having an institution run the system, rather than an individual, makes the whole process more open, less about personalities, and helps solve the indelicately-put “bus problem” (i.e., if all of the methods for generating a dataset are only in one person’s head, what happens if they get hit by a bus?). OEDA also plans to provide gold-standard codings and guidance on how to maintain and compare event datasets.

    Finally, I think producers and users of datasets should get comfortable versioning their datasets the same way software developers version their releases. Bleeding-edge features can get incorporated into newer versions so people who want them have quick access, but people using the data for forecasting or for comparison over time should have access to stable releases and be confident that those releases will remain available for years. Maintaining, hosting, and explaining those versions is another role an organization like OEDA should play.
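
    For what it’s worth, the pinning side of that could be as simple as the sketch below; the release tags and URLs are invented.

    ```python
    # A sketch of the versioning idea: analyses pin to a tagged release of the
    # dataset rather than grabbing whatever is newest, so old forecasts stay
    # reproducible. The URLs and version tags here are hypothetical.
    DATA_RELEASES = {
        "v1.0": "http://example.org/events/events_v1.0.csv",
        "v1.1": "http://example.org/events/events_v1.1.csv",  # new coder, new features
    }

    def load_events(version):
        """Always request an explicit version; never silently take the latest."""
        url = DATA_RELEASES[version]
        print("loading %s from %s" % (version, url))
        # ... download and parse here ...

    load_events("v1.0")  # stable release behind the production forecasts
    load_events("v1.1")  # bleeding-edge release for experimentation
    ```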

