A few years ago, Wired editor Chris Anderson trolled the scientific world with an essay called “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.” After talking about the fantastic growth in the scale and specificity of data that was occurring at the time—and that growth has only gotten a lot faster since—Anderson argued that
Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
In other words, with data this rich, theory becomes superfluous.
We may one day get to the point where sufficient quantities of big data can be harvested to answer all of the social questions that most concern us. I doubt it though. There will always be digital divides; always be uneven data shadows; and always be biases in how information and technology are used and produced.
And so we shouldn’t forget the important role of specialists to contextualise and offer insights into what our data do, and maybe more importantly, don’t tell us.
At the same time, I also worry that we’re overreacting to Anderson and his ilk by dismissing Big Data as nothing but marketing hype. From my low perch in one small corner of the social-science world, I get the sense that anyone who sounds excited about Big Data is widely seen as either a fool or a huckster. As Christopher Zorn wrote on Twitter this morning, “‘Big data is dead’ is the geek-hipster equivalent of ‘I stopped liking that band before you even heard of them.’”
Of course, I say that as one of those people who are really excited about the social-scientific potential these data represent. I think a lot of people who dismiss Big Data as marketing hype misunderstand the status quo in social science. If you don’t regularly try to use data to test and develop hypotheses about things like stasis and change in political institutions or the ebb and flow of political violence around the world, you might not realize how scarce and noisy the data we have now really are. On many things our mental models tell us to care about, we simply don’t have reliable measures.
Take, for example, the widely held belief that urban poverty and unemployment drive political unrest in poor countries. Is this true? Well, who knows? For most poor countries, the data we have on income are sparse and often unreliable, and we don’t have any data on unemployment, ever. And that’s at the national level. The micro-level data we’d need to link individuals’ income and employment status to their participation in political mobilization and violence? Apart from a few projects on specific cases, fuggeddaboudit.
Lacking the data we need to properly test our models, we fill the space with stories. As Daniel Kahneman describes on p. 201 of Thinking, Fast and Slow,
You cannot help dealing with the limited information you have as if it were all there is to know. You build the best possible story from the information available to you, and if it is a good story, you believe it. Paradoxically, it is easier to construct a coherent story when you know little, when there are fewer pieces to fit into the puzzle. Our comforting conviction that the world makes sense rests on a secure foundation: our almost unlimited ability to ignore our ignorance.
When that’s the state of the art, more data can only make things better. Sure, some researchers will poke around in these data sets until they find “statistically significant” associations and then pretend that’s what they expected to find the whole time. But, as Phil Schrodt points out, plenty of people are already doing that now.
Meanwhile, other researchers with important but unproven ideas about social phenomena will finally get a chance to test and refine those ideas in ways they’d never been able to do before. Barefoot empiricism will play a role, too, but science has always been an iterative process that way, bouncing around between induction and deduction until it hits on something that works. If the switch from data-poor to data-rich social science brings more of that, I feel lucky to be present for its arrival.