Most of the talk about a social science “data revolution” has emphasized rapid increases in the quantity of data available to us. Some of that talk has also focused on changes in the quality of those data, including new ideas about how to separate the wheat from the chaff in situations where there’s a lot of grain to thresh. So far, though, we seem to be talking much less about the rate of change in those changes, or what calculus calls the second derivative.
Lately, that second derivative has been large and positive. It’s not just that we now have much more, and in some cases much better, data. The sources and content of those data sets are often fast-moving targets, too. The whole environment is growing and churning at an accelerating pace, and that’s simultaneously exhilarating and frustrating.
It’s frustrating because data sets that evolve as we use them create a number of analytical problems that we don’t get from stable measurement tools. Most important, evolving data sets make it hard to compare observations across time, and longitudinal analysis is the crux of most social-scientific research. Paul Pierson explains why in his terrific 2004 book, Politics in Time:
Why do social scientists need to focus on how processes unfold over significant stretches of time? First, because many social processes are path dependent, in which case the key causes are temporally removed from their continuing effects… Second, because sequencing—the temporal order of events or processes—can be a crucial determinant of important social outcomes. Third, because many important social causes and outcomes are slow-moving—they take place over quite extended periods of time and are only likely to be adequately explained (or in some cases even observed in the first place) if analysts are specifically attending to that possibility.
When our measurement systems evolve as we use them, changes in the data we receive might reflect shifts in the underlying phenomenon. But they might also reflect changes in the methods and mechanisms by which we observe and record information about that phenomenon, and it’s often impossible to tease the one apart from the other.
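To see how complete that confounding can be, here’s a toy simulation in R with made-up numbers, not drawn from any real data set. Two very different worlds produce the same observed trend: in one the underlying phenomenon intensifies while coverage stays flat, and in the other the phenomenon holds steady while the measurement system gets better at catching it.

```r
# A toy simulation: two different worlds that produce the same observed series.
# All names and numbers here are illustrative, not from any real data set.
set.seed(42)
months <- 1:36

# World A: the underlying phenomenon intensifies while coverage stays flat.
events_a   <- 50 + 2 * months            # true monthly event counts, rising
coverage_a <- rep(0.5, length(months))   # constant chance an event gets recorded

# World B: the phenomenon is flat while the measurement system improves.
events_b   <- rep(86, length(months))    # true monthly event counts, constant
coverage_b <- (50 + 2 * months) / 172    # recording probability rising over time

observed_a <- rbinom(length(months), size = events_a, prob = coverage_a)
observed_b <- rbinom(length(months), size = events_b, prob = coverage_b)

# Both observed series drift upward in roughly the same way, so the data alone
# can't tell us which world we're in; only knowledge of the instrument can.
plot(months, observed_a, type = "l", xlab = "month", ylab = "recorded events")
lines(months, observed_b, lty = 2)
```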
A recent study by David Lazer, Gary King, Ryan Kennedy, and Alessandro Vespignani on what Google Flu Trends (GFT) teaches us about “traps in Big Data analysis” offers a nice case in point. Developed in the late 2000s by Google engineers and researchers at the Centers for Disease Control and Prevention, GFT uses data on Google search queries to help detect flu epidemics (see this paper). As Lazer and his co-authors describe, GFT initially showed great promise as a forecasting tool, and its success spurred excitement about the power of new data streams to shed light on important social processes. For the past few years, though, the tool has worked poorly on its own, and Lazer & co. believe changes in Google’s search software are the reason. The problem—for researchers, anyway—is that
The Google search algorithm is not a static entity—the company is constantly testing and improving search. For example, the official Google search blog reported 86 changes in June and July 2012 alone (SM). Search patterns are the result of thousands of decisions made by the company’s programmers in various sub-units and by millions of consumers worldwide.
Google keeps tinkering with its search software because that’s what its business entails, but we can expect to see more frequent changes in some data sets specific to social science, too. One of the developments about which I’m most excited is the recent formation of the Open Event Data Alliance (OEDA) and the initial release of the machine-coded political event data it plans to start producing soon, hopefully this summer. As its name implies, OEDA plans to make not just its data but also its code freely available to the public in order to grow a community of users who can help improve and expand the software. That crowdsourcing will surely accelerate the development of the scraping and coding machinery, but it also ensures that the data OEDA produces will be a moving target for a while in ways that will complicate attempts to analyze it.
If these accelerated changes are challenging for basic researchers, they’re even tougher on applied researchers, who have to show and use their work in real time. So what’s an applied researcher to do when the data-gathering instruments keep changing, often in opaque and unpredictable ways?
First, it seems prudent to build systems that are modular, so that a failure in one part of the system can be identified and corrected without having to rebuild the whole edifice. In the atrocities early-warning system I’m helping to build right now, we’re doing this by creating a few subsystems with some overlap in their functions. If one part doesn’t pan out or suddenly breaks, we can lean on the others while we repair or retool.
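To make that modularity a bit more concrete, here’s a minimal sketch in R. The stream names and fetch functions are hypothetical stand-ins rather than pieces of our actual system; the point is just that each data stream sits behind its own wrapper, so when one breaks, the failure gets caught and routed around instead of taking down the whole pipeline.

```r
# Hypothetical fetchers for two partially overlapping data streams.
# In a real pipeline these would wrap API calls or file downloads.
fetch_stream_a <- function() stop("stream A is down")         # simulate a failure
fetch_stream_b <- function() data.frame(date = Sys.Date(), count = 42)

# Wrap each module so a failure is logged and the system falls back
# to the overlapping stream instead of crashing the whole pipeline.
get_events <- function() {
  tryCatch(
    fetch_stream_a(),
    error = function(e) {
      message("stream A failed (", conditionMessage(e), "); falling back to stream B")
      fetch_stream_b()
    }
  )
}

events <- get_events()  # returns stream B's data while stream A gets repaired
```

Because each subsystem stands behind the same small interface, you can retool or replace one of them without rebuilding the rest of the edifice.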
Second, it’s also a good idea to embed those technical systems in organizational procedures that emphasize frequent checking and fast adaptation. One way to do this is to share your data and code and to discuss your work often with outsiders as you go, so you can catch mistakes, spot alternatives, and see these changes coming before you get too far down any one path. Using open-source statistical software like R is also helpful in this regard, because it lets you take advantage of new features and crowd fixes as they bubble up.
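One cheap way to make that frequent checking routine rather than heroic is to script a few sanity checks that run on every data pull, so a sudden shift in the instrument announces itself before it quietly warps your models. The R sketch below is a generic illustration with made-up numbers and an arbitrary three-standard-deviation threshold, not a piece of any particular system.

```r
# A simple "did the instrument change under us?" check: compare the newest
# pull to the recent past and flag anything far out of range. The threshold
# is an arbitrary choice for illustration, not a recommendation.
check_new_pull <- function(new_count, history, n_sd = 3) {
  mu    <- mean(history)
  sigma <- sd(history)
  if (abs(new_count - mu) > n_sd * sigma) {
    warning(sprintf("new count %s is more than %s SDs from the recent mean %.1f; check upstream sources before updating models",
                    new_count, n_sd, mu))
  }
  invisible(new_count)
}

history <- c(120, 135, 128, 140, 131, 126)  # made-up monthly counts
check_new_pull(310, history)                # triggers a warning worth investigating
```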
Last and fuzziest, I think it helps to embrace the idea that your work doesn’t really belong to you or your organization but is just one tiny part of a larger ecosystem that you’re hoping to see evolve in a particular direction. What worked one month might not work the next, and you’ll never know exactly what effect you’re having, but that’s okay if you recognize that it’s not really supposed to be about you. Just keep up as best you can, don’t get too heavily invested in any one approach or idea, and try to enjoy the ride.