Peter Nardulli, Scott Althaus, and Matthew Hayes have a piece forthcoming in Sociological Methodology (PDF) that describes what I now see as the cutting edge in the production of political event data: machine-human hybrid systems.
If you have ever participated in the production of political event data, you know that having people find, read, and code data from news stories and other texts takes a tremendous amount of work. Even boutique data sets on narrowly defined topics for short time periods in single cases usually require hundreds or thousands of person-hours to create, and the results still aren’t as pristine as we’d like or often believe.
Contrary to my premature congratulation on GDELT a couple of years ago, however, fully automated systems are not quite ready to take over the task, either. Once a machine-coding system has been built, the data come fast and cheap, but those data are, inevitably, still pretty noisy. (On that point, see here for some of my own experiences with GDELT and here, here, here, here, and here for other relevant discussions.)
I’m now convinced that the best current solution is one that borrows strength from both approaches—in other words, a hybrid. As Nardulli, Althaus, and Hayes argue in their forthcoming article, “Machine coding is no simple substitute for human coding.”
Until fully automated approaches can match the flexibility and contextual richness of human coding, the best option for generating near-term advances in social science research lies in hybrid systems that rely on both machines and humans for extracting information from unstructured texts.
As you might expect, Nardulli & co. have built and are operating such a system—the Social, Political, and Economic Event Database (SPEED)—to code data on a bunch of interesting things, including coups and civil unrest. Their hybrid process goes beyond supervised learning, where an algorithm gets trained on a data set carefully constructed by human coders and then put in the traces to make new data from fresh material. Instead, adopt a “progressive supervised-learning system,” which basically means two things:
- They keep humans in the loop for all steps where the error rate from their machine-only process remains intolerably high, making the results as reliable as possible; and
- They use those humans’ coding decisions as new training sets to continually check and refine their algorithms, gradually shrinking the load borne by the humans and mitigating the substantial risk of concept drift that attaches to any attempt to automate the extraction of data from a constantly evolving news-media ecosystem.
I think SPEED exemplifies the state of the art in a couple of big ways. The first is the process itself. Machine-learning processes have made tremendous gains in the past several years (see here, h/t Steve Mills), but we still haven’t arrived at the point where we can write algorithms that reliably recognize and extract the information we want from the torrent of news stories coursing through the Internet. As long as that’s the case—and I expect it will be for at least another several years—we’re going to need to keep humans in the loop to get data sets we really trust and understand. (And, of course, even then the results will still suffer from biases that even a perfect coding process can’t avoid; see here for Will Moore’s thoughtful discussion of that point.)
The second way in which SPEED exemplifies the state of the art is what Nardulli, Althaus, and Hayes’ paper explicitly and implicitly tells us about the cost and data-sharing constraints that come with building and running a system of this kind on that scale. Nardulli & co. don’t report exactly how much money has been spent on SPEED so far and how much it costs to keep running it, but they do say this:
The Cline Center began assembling its news archive and developing SPEED’s workflow system in 2006, but lacked an operational cyberinfrastructure until 2009. Seven years and well over a million dollars later, the Cline Center released its first SPEED data set.
Partly because of those high costs and partly because of legal issues attached to data coded from many news stories, the data SPEED produces are not freely available to the public. The project shares some historical data sets on its web site, but the content of those sets is limited, and the near-real-time data coveted by applied researchers like me are not made public. Here’s how the authors describe their situation:
While agreements with commercial vendors and intellectual property rights prohibit the Center from distributing its news archive, efforts are being made to provide non-consumptive public access to the Center’s holdings. This access will allow researchers to evaluate the utility of the Center’s digital archive for their needs and construct a research design to realize those needs. Based on that design, researchers can utilize the Center’s various subcenters of expertise (document classification, training, coding, etc.) to implement it.
I’m not happy about those constraints, but as someone who has managed large and costly social-science research projects, I certainly understand them. I also don’t expect them to go away any time soon, for SPEED or for any similar undertaking.
So that’s the state of the art in the production of political event data: Thanks to the growth of the Internet and advances in computing hardware and software, we can now produce political event data on a scale and at a pace that would have had us drooling a decade ago, but the task still can’t be fully automated without making sacrifices in data quality that most social scientists should be uncomfortable making. The best systems we can build right now blend machine learning and automation with routine human involvement and oversight. Those systems are still expensive to build and run, and partly because of that, we should not expect their output to stream onto our virtual desktops for free, like manna raining down from digital heaven.