Ken Benoit, Drew Conway, Benjamin Lauderdale, Michael Laver, and Slava Mikhaylov have an article forthcoming in the American Political Science Review that knocked my socks off when I read it this morning. Here is the abstract from the ungated version I saw:
Empirical social science often relies on data that are not observed in the field, but are transformed into quantitative variables by expert researchers who analyze and interpret qualitative raw sources. While generally considered the most valid way to produce data, this expert-driven process is inherently difficult to replicate or to assess on grounds of reliability. Using crowd-sourcing to distribute text for reading and interpretation by massive numbers of non-experts, we generate results comparable to those using experts to read and interpret the same texts, but do so far more quickly and flexibly. Crucially, the data we collect can be reproduced and extended transparently, making crowd-sourced datasets intrinsically reproducible. This focuses researchers’ attention on the fundamental scientific objective of specifying reliable and replicable methods for collecting the data needed, rather than on the content of any particular dataset. We also show that our approach works straightforwardly with different types of political text, written in different languages. While findings reported here concern text analysis, they have far-reaching implications for expert-generated data in the social sciences.
The data-making strategy they develop is really innovative, and the cost of implementing it is, I estimate from the relevant tidbits in the paper, 2–3 orders of magnitude lower than the cost of the traditional expert-centric approach. In other words, this is potentially a BIG DEAL for social-science data-making, which, as Sinan Aral reminds us, is a BIG DEAL for doing better social science.
That said, I do wonder how much structure is baked into the manifesto-coding task that isn’t there in most data-making problems, and that makes it especially well suited to the process the authors develop. In the exercise the paper describes:
- The relevant corpus (party manifestos) is self-evident, finite, and not too large;
- The concepts of interest (economic vs. social policy, left vs. right) are fairly intuitive; and
- The inferential task is naturally “fractal”; that is, the concepts of interest inhere in individual sentences (and maybe even words) as well as whole documents.
None of those attributes holds when it comes to coding latent socio-political structural features like de facto forms of government (a.k.a. regime type) or whether or not a country is in a state of civil war. These features are fundamental to analyses of international politics, but the high cost of producing them means that we sometimes don’t get them at all, and when we do, we usually don’t get them updated as quickly or as often as we would need in order to do more dynamic analysis and prediction. Maybe it’s my lack of imagination, but I can’t quite see how to extend the authors’ approach to those topics without stretching it past the breaking point. I can think of ways to keep the corpus manageable, but the concepts are not as intuitive, and the inferential task is not fractal. Ditto for coding event data, where I suspect that the second attribute from the list above (intuitive concepts) would mostly hold; the third (a fractal inferential task) would sometimes hold; but the first (a self-evident, finite corpus) absolutely would not.*
In short, I’m ga-ga about this paper and the directions in which it points us, but I’m not ready yet to declare imminent victory in the struggle to drag political science into a new and much healthier era of data-making. (Fool me once…)
* If you think I’m overlooking something here, please leave a comment explaining how you think it might be doable.