The State of the Art in the Production of Political Event Data

Peter Nardulli, Scott Althaus, and Matthew Hayes have a piece forthcoming in Sociological Methodology (PDF) that describes what I now see as the cutting edge in the production of political event data: machine-human hybrid systems.

If you have ever participated in the production of political event data, you know that having people find, read, and code data from news stories and other texts takes a tremendous amount of work. Even boutique data sets on narrowly defined topics for short time periods in single cases usually require hundreds or thousands of person-hours to create, and the results still aren’t as pristine as we’d like or often believe.

Contrary to my premature congratulation on GDELT a couple of years ago, however, fully automated systems are not quite ready to take over the task, either. Once a machine-coding system has been built, the data come fast and cheap, but those data are, inevitably, still pretty noisy. (On that point, see here for some of my own experiences with GDELT and here, here, here, here, and here for other relevant discussions.)

I’m now convinced that the best current solution is one that borrows strength from both approaches—in other words, a hybrid. As Nardulli, Althaus, and Hayes argue in their forthcoming article, “Machine coding is no simple substitute for human coding.”

Until fully automated approaches can match the flexibility and contextual richness of human coding, the best option for generating near-term advances in social science research lies in hybrid systems that rely on both machines and humans for extracting information from unstructured texts.

As you might expect, Nardulli & co. have built and are operating such a system—the Social, Political, and Economic Event Database (SPEED)—to code data on a bunch of interesting things, including coups and civil unrest. Their hybrid process goes beyond supervised learning, where an algorithm gets trained on a data set carefully constructed by human coders and then put in the traces to make new data from fresh material. Instead, they adopt a “progressive supervised-learning system,” which basically means two things (see the sketch after this list):

  1. They keep humans in the loop for all steps where the error rate from their machine-only process remains intolerably high, making the results as reliable as possible; and
  2. They use those humans’ coding decisions as new training sets to continually check and refine their algorithms, gradually shrinking the load borne by the humans and mitigating the substantial risk of concept drift that attaches to any attempt to automate the extraction of data from a constantly evolving news-media ecosystem.
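To make that idea a bit more concrete, here is a minimal sketch in R of what a progressive supervised-learning loop might look like. The helper functions named here are hypothetical placeholders for illustration, not pieces of SPEED's actual workflow.

```r
# Hypothetical sketch of a progressive supervised-learning loop; every helper
# function named here is a placeholder, not part of SPEED's actual system.
coded <- initial_human_training_set                      # data set carefully constructed by human coders
model <- train_classifier(coded)
while (documents_remain()) {
  batch <- next_batch_of_stories()
  machine_labels <- apply_classifier(model, batch)
  needs_review <- error_prone_steps(machine_labels)      # steps where machine error is still too high
  human_labels <- human_code(batch[needs_review, ])      # humans stay in the loop for those steps
  coded <- rbind(coded, human_labels)                    # human decisions become new training data
  model <- train_classifier(coded)                       # algorithms get rechecked and refined
}
```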

I think SPEED exemplifies the state of the art in a couple of big ways. The first is the process itself. Machine-learning processes have made tremendous gains in the past several years (see here, h/t Steve Mills), but we still haven’t arrived at the point where we can write algorithms that reliably recognize and extract the information we want from the torrent of news stories coursing through the Internet. As long as that’s the case—and I expect it will be for at least another several years—we’re going to need to keep humans in the loop to get data sets we really trust and understand. (And, of course, even then the results will still suffer from biases that even a perfect coding process can’t avoid; see here for Will Moore’s thoughtful discussion of that point.)

The second way in which SPEED exemplifies the state of the art is what Nardulli, Althaus, and Hayes’ paper explicitly and implicitly tells us about the cost and data-sharing constraints that come with building and running a system of this kind on that scale. Nardulli & co. don’t report exactly how much money has been spent on SPEED so far and how much it costs to keep running it, but they do say this:

The Cline Center began assembling its news archive and developing SPEED’s workflow system in 2006, but lacked an operational cyberinfrastructure until 2009. Seven years and well over a million dollars later, the Cline Center released its first SPEED data set.

Partly because of those high costs and partly because of legal issues attached to data coded from many news stories, the data SPEED produces are not freely available to the public. The project shares some historical data sets on its web site, but the content of those sets is limited, and the near-real-time data coveted by applied researchers like me are not made public. Here’s how the authors describe their situation:

While agreements with commercial vendors and intellectual property rights prohibit the Center from distributing its news archive, efforts are being made to provide non-consumptive public access to the Center’s holdings. This access will allow researchers to evaluate the utility of the Center’s digital archive for their needs and construct a research design to realize those needs. Based on that design, researchers can utilize the Center’s various subcenters of expertise (document classification, training, coding, etc.) to implement it.

I’m not happy about those constraints, but as someone who has managed large and costly social-science research projects, I certainly understand them. I also don’t expect them to go away any time soon, for SPEED or for any similar undertaking.

So that’s the state of the art in the production of political event data: Thanks to the growth of the Internet and advances in computing hardware and software, we can now produce political event data on a scale and at a pace that would have had us drooling a decade ago, but the task still can’t be fully automated without making sacrifices in data quality that most social scientists should be uncomfortable making. The best systems we can build right now blend machine learning and automation with routine human involvement and oversight. Those systems are still expensive to build and run, and partly because of that, we should not expect their output to stream onto our virtual desktops for free, like manna raining down from digital heaven.

Is Algorithmic Judgment Creepy or Wonderful?

For the Nieman Lab’s Predictions for Journalism 2015, Zeynep Tufekci writes that

We’re seeing the birth of a new era, the era of judging machines: machines that calculate not just how to quickly sort a database, or perform a mathematical calculation, but to decide what is “best,” “relevant,” “appropriate,” or “harmful.”

Tufekci believes we’re increasingly “creeped out” by this trend, and she thinks that’s appropriate. It’s not the algorithms themselves that bother her so much as the noiselessness of their presence. Decisions are constantly being made for us without our even realizing it, and those decisions could reshape our lives.

Or, in some cases, save them. At FiveThirtyEight, Andrew Flowers reports on the U.S. Army’s efforts to apply machine-learning techniques to large data sets to develop a predictive tool—an algorithm—that can accurately identify soldiers at highest risk of attempting suicide. The Army has a serious suicide problem, and an algorithm that can help clinicians decide which individuals require additional interventions could help mitigate that problem. The early results are promising:

The model’s predictive abilities were impressive. Those soldiers who were rated in the top 5 percent of risk were responsible for 52 percent of all suicides — they were the needles, and the Army was starting to find them.

So which is it? Are algorithmic interventions creepy or wonderful?

I’ve been designing and hawking algorithms to help people assess risks for more than 15 years, so it won’t surprise anyone to hear that I tilt toward the “wonderful” camp. Maybe it’s just the paychecks talking, but consciously, at least, my defense of algorithms starts from the fact that we humans consistently overestimate the power of our intuition. As researchers like Paul Meehl and Phil Tetlock keep showing, we’re not nearly as good at compiling and assessing information as we think we are. So, the baseline condition—unassisted human judgment—is often much worse than we recognize, and there’s lots of room to improve.

Flowers’ story on the Army’s suicide risk-detection efforts offers a case in point. As Flowers notes, “The Army is constructing a high-tech weapon to fight suicide because it’s losing the battle against it.” The status quo, in which clinicians make judgments about these risks without the benefit of explicit predictive modeling, is failing to stem the increase in soldiers’ suicide rates. Under those conditions, the risk-assessing algorithm doesn’t have to work perfectly to have some positive effect. Moving the needle even a little bit in the right direction could save dozens of soldiers’ lives each year.

Where I agree strongly with Tufekci is on the importance of transparency. I want to have algorithms helping me decide what’s most relevant and what the best course of action might be, but I also want to know where and when those algorithms are operating. As someone who builds these kinds of tools, I also want to be able to poke around under the hood. The latter won’t always be possible in the commercial world—algorithms are a form of trade knowledge, and I understand the need for corporations (and freelancers!) to protect their comparative advantages—but informed consent should be a given.

Machine learning our way to better early warning on mass atrocities

For the past couple of years, I’ve been helping build a system that uses statistics and expert crowds to assess and track risks of mass atrocities around the world. Recently dubbed the Early Warning Project (EWP), this effort already has a blog up and running (here), and the EWP should finally be able to launch a more extensive public website within the next several weeks.

One of the first things I did for the project, back in 2012, was to develop a set of statistical models that assess risks of onsets of state-led mass killing in countries worldwide, the type of mass atrocities for which we have the most theory and data. Consistent with the idea that the EWP will strive to keep improving on what it does as new data, methods, and ideas become available, that piece of the system has continued to evolve over the ensuing couple of years.

You can find the first two versions of that statistical tool here and here. The latest iteration—recently codified in new-and-improved replication materials—has performed pretty well, correctly identifying the few countries that have seen onsets of state-led mass killing in the past couple of years as relatively high-risk cases before those onsets occurred. It’s not nearly as precise as we’d like—I usually apply the phrase “relatively high-risk” to the Top 30, and we’ll only see one or two events in most years—but that level of imprecision is par for the course when forecasting rare and complex political crises like these.

Of course, a solid performance so far doesn’t mean that we can’t or shouldn’t try to do even better. Last week, I finally got around to applying a couple of widely used machine learning techniques to our data to see how those techniques might perform relative to the set of models we’re using now. Our statistical risk assessments come not from a single model but from a small collection of them—a “multi-model ensemble” in applied forecasting jargon—because these collections of models usually produce more accurate forecasts than any single one can. Our current ensemble mixes two logistic regression models, each representing a different line of thinking about the origins of mass killing, with one machine-learning algorithm—Random Forests—that gets applied to all of the variables used by those theory-specific models. In cross-validation, the Random Forests forecasts handily beat the two logistic regression models, but, as is often the case, the average of the forecasts from all three does even better.

Inspired by the success of Random Forests in our current risk assessments and by the power of machine learning in another project on which I’m working, I decided last week to apply two more machine learning methods to this task: support vector machines (SVM) and the k-nearest neighbors (KNN) algorithm. I won’t explain the two techniques in any detail here; you can find good explanations elsewhere on the internet (see here and here, for example), and, frankly, I don’t understand the methods deeply enough to explain them any better.

What I will happily report is that one of the two techniques, SVM, appears to perform our forecasting task about as well as Random Forests. In five-fold cross-validation, SVM and Random Forests both produced areas under the ROC curve (a.k.a. AUC scores) in the mid-0.80s. On a scale where 0.5 is no better than chance and 1 is perfect, a score in the mid-0.80s is pretty good for out-of-sample accuracy on this kind of forecasting problem. What’s more, when I averaged the estimates for each case from SVM and Random Forests, I got AUC scores in the mid- to upper 0.80s. That’s several points better than our current ensemble, which combines Random Forests with those logistic regression models.
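For the record, here is a tiny sketch of that blending step, scored with the pROC package; the vectors are hypothetical stand-ins for the observed outcomes and the cross-validated estimates.

```r
library(pROC)

# Illustrative only: 'y' is the observed onset indicator, and 'svm_probs' and
# 'rf_probs' are hypothetical vectors of out-of-fold predicted probabilities.
blend <- (svm_probs + rf_probs) / 2
auc(roc(y, blend))   # on the real data, this blend landed in the mid- to upper 0.80s
```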

By contrast, KNN did quite poorly, hovering close to the 0.5 mark that we would get with randomly generated probabilities. Still, success in one of the two experiments is pretty exciting. We don’t have a lot of forecasts to combine right now, so adding even a single high-quality model to the mix could produce real gains.

Mind you, this wasn’t a push-button operation. For one thing, I had to rework my code to handle missing data in a different way—not because SVM handles missing data differently from Random Forests, but because the functions I was using to implement the techniques do. (N.B. All of this work was done in R. I used ‘ksvm’ from the kernlab package for SVM and ‘knn3’ from the caret package for KNN.) I also got poor results from SVM in my initial implementation, which used the default settings for all of the relevant parameters. It took some iterating to discover that the Laplacian kernel significantly improved the algorithm’s performance, and that tinkering with the other flexible parameters (sigma and C for the Laplacian kernel in ksvm) had no effect or made things worse.
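For readers who want to try something similar, here is roughly what those calls look like. This is a sketch of the general usage rather than my exact script; the feature matrix x and two-level outcome factor y are hypothetical stand-ins for the project's training data.

```r
library(kernlab)
library(caret)

# Sketch of the general usage, not my exact script; 'x' is a complete-case
# feature matrix and 'y' a two-level factor marking onsets of mass killing.
svm_fit <- ksvm(x, y,
                kernel = "laplacedot",   # the Laplacian kernel, which improved on the default settings
                prob.model = TRUE)       # fit a probability model so we can extract predicted probabilities
svm_probs <- predict(svm_fit, newdata = x, type = "probabilities")[, 2]

knn_fit <- knn3(x, y, k = 5)             # k was locked in at 5 for this exercise
knn_probs <- predict(knn_fit, newdata = x, type = "prob")[, 2]
```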

I also suspect that the performance of KNN would improve with more effort. To keep the comparison simple, I gave all three algorithms the same set of features and observations. As it happens, though, Random Forests and SVMs are less prone to over-fitting than KNN, which has a harder time separating the signal from the noise when irrelevant features are included. The feature set I chose probably includes some things that don’t add any predictive power, and their inclusion may be obscuring the patterns that do lie in those data. In the next go-round, I would start the KNN algorithm with the small set of features in whose predictive power I’m most confident, see if that works better, and try expanding from there. I would also experiment with different values of k, which I locked in at 5 for this exercise.

It’s tempting to spin the story of this exercise as a human vs. machine parable in which newfangled software and Big Data outdo models hand-crafted by scholars wedded to overly simple stories about the origins of mass atrocities. It’s tempting, but it would also be wrong on a couple of crucial points.

First, this is still small data. Machine learning refers to a class of analytic methods, not the amount of data involved. Here, I am working with the same country-year data set covering the world from the 1940s to the present that I have used in previous iterations of this exercise. This data set contains fewer than 10,000 observations on scores of variables and takes up about as much space on my hard drive as a Beethoven symphony. In the future, I’d like to experiment with newer and larger data sets at different levels of aggregation, but that’s not what I’m doing now, mostly because those newer and larger data sets still don’t cover enough time and space to be useful in the analysis of such rare events.

Second and more important, theory still pervades this process. Scholars’ beliefs about what causes and presages mass killing have guided my decisions about what variables to include in this analysis and, in many cases, how those variables were originally measured and the fact that data even exist on them at all. Those data-generating and variable-selection processes, and all of the expertise they encapsulate, are essential to these models’ forecasting power. In principle, machine learning could be applied to a much wider set of features, and perhaps we’ll try that some time, too. With events as rare as onsets of state-led mass killing, however, I would not have much confidence that results from a theoretically agnostic search would add real forecasting power and not just result in over-fitting.

In any case, based on these results, I will probably incorporate SVM into the next iteration of the Early Warning Project’s statistical risk assessments. Those are due out early in the spring of 2015, when all of the requisite inputs will have been updated (we hope). I think we’ll also need to think carefully about whether or not to keep those logistic regression models in the mix, and what else we might borrow from the world of machine learning. In the meantime, I’ve enjoyed getting to try out some new techniques on data I know well, where it’s a lot easier to tell if things are going wonky, and it’s encouraging to see that we can continue to get better at this hard task if we keep trying.
