Big Data Won’t Kill the Theory Star

A few years ago, Wired editor Chris Anderson trolled the scientific world with an essay called “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.” After describing the fantastic growth in the scale and specificity of data that was occurring at the time (growth that has only accelerated since), Anderson argued that

Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

In other words, with data this rich, theory becomes superfluous.

Like many of my colleagues, I think Anderson is wrong about the increasing irrelevance of theory. Mark Graham explains why in a year-old post on the Guardian’s Datablog:

We may one day get to the point where sufficient quantities of big data can be harvested to answer all of the social questions that most concern us. I doubt it though. There will always be digital divides; always be uneven data shadows; and always be biases in how information and technology are used and produced.

And so we shouldn’t forget the important role of specialists to contextualise and offer insights into what our data do, and maybe more importantly, don’t tell us.

At the same time, I also worry that we’re overreacting to Anderson and his ilk by dismissing Big Data as nothing but marketing hype. From my low perch in one small corner of the social-science world, I get the sense that anyone who sounds excited about Big Data is widely seen as either a fool or a huckster. As Christopher Zorn wrote on Twitter this morning, “‘Big data is dead’ is the geek-hipster equivalent of ‘I stopped liking that band before you even heard of them.’”

Of course, I say that as one of those people who’s really excited about the social-scientific potential these data represent. I think a lot of people who dismiss Big Data as marketing hype misunderstand the status quo in social science. If you don’t regularly try to use data to test and develop hypotheses about things like stasis and change in political institutions or the ebb and flow of political violence around the world, you might not realize how scarce and noisy the data we have now really are. On many things our mental models tell us to care about, we simply don’t have reliable measures.

Take, for example, the widely held belief that urban poverty and unemployment drive political unrest in poor countries. Is this true? Well, who knows? For most poor countries, the data we have on income are sparse and often unreliable, and we don’t have any data on unemployment, ever. And that’s at the national level. The micro-level data we’d need to link individuals’ income and employment status to their participation in political mobilization and violence? Apart from a few projects on specific cases (e.g., here and here), fuggeddaboudit.

Lacking the data we need to properly test our models, we fill the space with stories. As Daniel Kahneman describes on p. 201 of Thinking, Fast and Slow,

You cannot help dealing with the limited information you have as if it were all there is to know. You build the best possible story from the information available to you, and if it is a good story, you believe it. Paradoxically, it is easier to construct a coherent story when you know little, when there are fewer pieces to fit into the puzzle. Our comforting conviction that the world makes sense rests on a secure foundation: our almost unlimited ability to ignore our ignorance.

When that’s the state of the art, more data can only make things better. Sure, some researchers will poke around in these data sets until they find “statistically significant” associations and then pretend that’s what they expected to find the whole time. But, as Phil Schrodt points out, plenty of people are already doing that now.

Meanwhile, other researchers with important but unproven ideas about social phenomena will finally get a chance to test and refine those ideas in ways they’d never been able to do before. Barefoot empiricism will play a role, too, but science has always been an iterative process that way, bouncing around between induction and deduction until it hits on something that works. If the switch from data-poor to data-rich social science brings more of that, I feel lucky to be present for its arrival.

by Jay Ulfelder on February 23, 2013

Posted in Misc.

Tagged Big Data, Christopher Zorn, Daniel Kahneman, narrative fallacy, Phil Schrodt, philosophy of science

https://dartthrowingchimp.wordpress.com/2013/02/23/big-data-wont-kill-the-theory-star/

16 Comments

Brett Bobley (@brettbobley)
/ February 23, 2013

Nice piece, Jay. Big Data definitely isn’t dead. In fact, if anything, we are just starting to figure out how to use it in the social sciences and humanities. To quote Peter Norvig (quoting Chris Anderson), “in the era of big data, more isn’t just more. More is different.” Norvig’s article is well worth reading: he critiques some of the more outrageous claims in the Anderson piece, but then goes on to note (by example) why “more” really is different. See: http://norvig.com/fact-check.html

If I may, I’d encourage any data-leaning social scientists or humanities researchers to consider participating in the Digging into Data Challenge. It is an international grant competition sponsored by ten research agencies that focuses on using big data for HSS research. See: http://www.diggingintodata.org/

- dartthrowingchimp
  / February 23, 2013
  
  Thanks, Brett. I’m looking forward to seeing what Digging into Data produces. As you say, we’re really just getting started here…
  
Grant
/ February 23, 2013

Social scientists might be replaced some day (if I knew how to use italics regularly I’d use that here). Might. That depends a lot on technological changes that we can’t even properly imagine. For the time being all that data seems more like a tool that requires a good amount of experience and skill to actually use.

Artem Kaznatcheev
/ April 22, 2014

“‘Big data is dead’ is the geek-hipster equivalent of ‘I stopped liking that band before you even heard of them.’”

Great line. However, I think it is still important to critique big data, and to do so loudly, even though in most cases we are just using the big data brand name to point out problems that were prevalent long before big data, such as overfitting and overtesting. These issues need to be addressed, and since nobody really wants to have honest methodological discussions, I think it is fine to hitch such a discussion to the hype of big data.

Of course, doing so risks equating ‘big data’ with ‘machine learning done wrong’, especially if you are optimistic about the potential usefulness of direct connections to data.

To come back to the main point of your post: I am not sure about the social sciences in particular, but in biology and machine learning, the equivalent of ‘big data’ has largely killed theory (or at least sidelined it). As I wrote earlier (and echoing some of the other comments I left around your blog):

If we can skip understanding and just get quick results, then why bother with theory and analytic treatments? It is tempting to say that more data and statistical analysis can’t possibly hurt (except maybe the opportunity cost of collecting it), they just give more playthings for theorists. But for me, the real problem is that a focus on data changes priorities. Scientists (due mostly to science funding) start to move away from trying to explain and understand phenomena and more to collecting data and data-mining it to make predictions without understanding. A focus on blind statistics on big data often provides great practical results in the short term (and that is why it wins funding) at the expense of the understanding needed for long term development of science. In particular I suspect that after a short success it will produce a field that doesn’t know how to ask new and interesting questions.

Philip Gerlee has a nice brief discussion on this in the biological context.
