Another Tottering Step Toward a New Era of Data-Making

Ken Benoit, Drew Conway, Benjamin Lauderdale, Michael Laver, and Slava Mikhaylov have an article forthcoming in the American Political Science Review that knocked my socks off when I read it this morning. Here is the abstract from the ungated version I saw:

Empirical social science often relies on data that are not observed in the field, but are transformed into quantitative variables by expert researchers who analyze and interpret qualitative raw sources. While generally considered the most valid way to produce data, this expert-driven process is inherently difficult to replicate or to assess on grounds of reliability. Using crowd-sourcing to distribute text for reading and interpretation by massive numbers of non-experts, we generate results comparable to those using experts to read and interpret the same texts, but do so far more quickly and flexibly. Crucially, the data we collect can be reproduced and extended transparently, making crowd-sourced datasets intrinsically reproducible. This focuses researchers’ attention on the fundamental scientific objective of specifying reliable and replicable methods for collecting the data needed, rather than on the content of any particular dataset. We also show that our approach works straightforwardly with different types of political text, written in different languages. While findings reported here concern text analysis, they have far-reaching implications for expert-generated data in the social sciences.

The data-making strategy they develop is really innovative, and the cost of implementing it is, I estimate from the relevant tidbits in the paper, 2–3 orders of magnitude lower than the cost of the traditional expert-centric approach. In other words, this is potentially a BIG DEAL for social-science data-making, which, as Sinan Aral reminds us, is a BIG DEAL for doing better social science.

That said, I do wonder how much structure is baked into the manifesto-coding task that isn’t there in most data-making problems, and that makes it especially well suited to the process the authors develop. In the exercise the paper describes:

  1. The relevant corpus (party manifestos) is self-evident, finite, and not too large;
  2. The concepts of interest (economic vs. social policy, left vs. right) are fairly intuitive; and
  3. The inferential task is naturally “fractal”; that is, the concepts of interest inhere in individual sentences (and maybe even words) as well as whole documents.

None of those attributes holds when it comes to coding latent socio-political structural features like de facto forms of government (a.k.a. regime type) or whether or not a country is in a state of civil war. These features are fundamental to analyses of international politics, but the high cost of producing them means that we sometimes don’t get them at all, and when we do, we usually don’t get them updated as quickly or as often as we would need to do more dynamic analysis and prediction. Maybe it’s my lack of imagination, but I can’t quite see how to extend the authors’ approach to those topics without stretching it past the breaking point. I can think of ways to keep the corpus manageable, but the concepts are not as intuitive, and the inferential task is not fractal. Ditto for coding event data, where I suspect that 2 from the list above would mostly hold; 3 would sometimes hold; but 1 absolutely would not.*
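To make the "fractal" property in item 3 concrete, here is a minimal sketch of how sentence-level crowd judgments can roll up into a document-level score. Everything in it is an assumption for illustration: the manifesto names, worker IDs, the -2 (left) to +2 (right) scale, and the plain averaging are invented, and the averaging is a stand-in rather than the estimation procedure the paper uses.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical crowd judgments: (document, sentence_id, worker_id, score).
# The scale (-2 = left, +2 = right) and all values are invented for illustration.
judgments = [
    ("manifesto_A", 1, "w1", -2),
    ("manifesto_A", 1, "w2", -1),
    ("manifesto_A", 2, "w1", 0),
    ("manifesto_A", 2, "w3", -1),
    ("manifesto_B", 1, "w2", 2),
    ("manifesto_B", 1, "w3", 1),
    ("manifesto_B", 2, "w1", 1),
    ("manifesto_B", 2, "w2", 1),
]


def document_scores(rows):
    """Average worker scores within each sentence, then average sentences within each document."""
    by_sentence = defaultdict(list)
    for doc, sent, _worker, score in rows:
        by_sentence[(doc, sent)].append(score)

    by_doc = defaultdict(list)
    for (doc, _sent), scores in by_sentence.items():
        by_doc[doc].append(mean(scores))

    return {doc: mean(vals) for doc, vals in by_doc.items()}


scores = document_scores(judgments)
for doc, value in sorted(scores.items()):
    print(doc, value)
```

The roll-up only works because each sentence carries a readable signal on its own. A regime-type or civil-war code has no comparable small unit to hand to a worker, which is the sticking point described above.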

In short, I’m ga-ga about this paper and the directions in which it points us, but I’m not ready yet to declare imminent victory in the struggle to drag political science into a new and much healthier era of data-making. (Fool me once…)

* If you think I’m overlooking something here, please leave a comment explaining how you think it might be do-able.

Leave a comment


  1. I think this is a great paper and I definitely want to see more of this kind of work. Your point 2 is important, but I find it useful to think of crowdworkers as RAs rather than mysterious automatons: they can be trained, taught, instructed, or just recruited to do a wide variety of potentially high-level tasks. I’m not as familiar with the demographics of Crowdflower workers, but MTurk workers tend to be well-educated, so their ability to handle coding tasks should meet or exceed that of undergraduates. Of course, as task complexity increases, so do time and cost, but the inherently massive parallelism of crowdsourced approaches means that the time savings are substantial and, given current wage levels, the work is likely to be much cheaper than even the most affordable RA work. You rightly point out that the biggest challenge is, perhaps as it always is, the need to have a task that can be easily broken down into small pieces: if generating a single numeric value requires parsing an enormous body of contextual or historical information, it’s unlikely to be well suited to crowdsourcing.

    • Thanks, Thomas. All points well taken.

      Actually, I’m now thinking this approach might break down a couple of the barriers impeding progress on the Early Warning Project’s atrocities monitoring project, which I wrote a bit about in some earlier posts (here, here, and here). Instead of having a staff person serve as the human in the loop reviewing candidate events pulled from an automated event-data-making process like Phoenix, we could have a swarm of MTurk or CrowdFlower workers do that task, or even pieces of it. It sounds like that would be much cheaper, and it would solve the workflow problem of getting daily updates on weekends or holidays or days when a staffer is out sick. Stay tuned…

  2. Sam

     /  June 27, 2015

    Hi, just came across your site and really enjoying it so far. Just wondering why your link to the article on GDELT says ‘fool me once’? Have you changed your mind on it? Is it no longer publicly accessible?

    • Yes. I’m still very optimistic about the possibilities for machine-coded data, but I have found GDELT to be much less useful than I originally expected — especially for scholarship — because it is so noisy. By noisy, I don’t mean that it contains lots of records; I mean that so many of those records contain errors or are duplicates, often produced by repeated parsing of the same news story across different outlets. Whenever I poke around in the source texts, I find that the vast majority of the records I review are miscoded on one or more major fields, including the type of event, the actors involved, and where and when it occurred. I can’t give you statistics on those error rates, in part because there is no recorded “truth” to which GDELT can be compared. But the fact that I find the same thing again and again in samples on varied topics, times, and locations leads me to believe that this pattern is a basic feature of the data set, not a function of my interests.

