For the past couple of years, I’ve been helping build a system that uses statistics and expert crowds to assess and track risks of mass atrocities around the world. Recently dubbed the Early Warning Project (EWP), this effort already has a blog up and running (here), and the EWP should finally be able to launch a more extensive public website within the next several weeks.
One of the first things I did for the project, back in 2012, was to develop a set of statistical models that assess risks of onsets of state-led mass killing in countries worldwide, the type of mass atrocities for which we have the most theory and data. Consistent with the idea that the EWP will strive to keep improving on what it does as new data, methods, and ideas become available, that piece of the system has continued to evolve over the ensuing couple of years.
You can find the first two versions of that statistical tool here and here. The latest iteration—recently codified in new-and-improved replication materials—has performed pretty well, correctly identifying the few countries that have seen onsets of state-led mass killing in the past couple of years as relatively high-risk cases before those onsets occurred. It’s not nearly as precise as we’d like—I usually apply the phrase “relatively high-risk” to the Top 30, and we’ll only see one or two events in most years—but that level of imprecision is par for the course when forecasting rare and complex political crises like these.
Of course, a solid performance so far doesn’t mean that we can’t or shouldn’t try to do even better. Last week, I finally got around to applying a couple of widely used machine learning techniques to our data to see how those techniques might perform relative to the set of models we’re using now. Our statistical risk assessments come not from a single model but from a small collection of them—a “multi-model ensemble” in applied forecasting jargon—because these collections of models usually produce more accurate forecasts than any single one can. Our current ensemble mixes two logistic regression models, each representing a different line of thinking about the origins of mass killing, with one machine-learning algorithm—Random Forests—that gets applied to all of the variables used by those theory-specific models. In cross-validation, the Random Forests forecasts handily beat the two logistic regression models, but, as is often the case, the average of the forecasts from all three does even better.
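For readers who want to see the mechanics, averaging forecasts across models can be sketched in a few lines. This is a minimal illustration in Python, not the project's actual R code. It assumes an unweighted mean, and the three lists of probabilities are made-up numbers for three hypothetical country-years.

```python
from statistics import mean


def ensemble_average(*model_forecasts):
    """Return the unweighted mean of each case's probabilities across models."""
    n_cases = len(model_forecasts[0])
    if any(len(forecast) != n_cases for forecast in model_forecasts):
        raise ValueError("every model must forecast the same set of cases")
    # zip(*...) regroups the numbers so each tuple holds one case's forecasts.
    return [mean(case) for case in zip(*model_forecasts)]


# Hypothetical predicted probabilities for three country-years (not real output).
logit_a = [0.02, 0.10, 0.30]
logit_b = [0.04, 0.06, 0.20]
random_forest = [0.01, 0.14, 0.40]

ensemble = ensemble_average(logit_a, logit_b, random_forest)
```

Each case's ensemble forecast is simply the mean of what the individual models say about it, so no single model's quirks dominate the result.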
Inspired by the success of Random Forests in our current risk assessments and by the power of machine learning in another project on which I’m working, I decided last week to apply two more machine learning methods to this task: support vector machines (SVM) and the k-nearest neighbors (KNN) algorithm. I won’t explain the two techniques in any detail here; you can find good explanations elsewhere on the internet (see here and here, for example), and, frankly, I don’t understand the methods deeply enough to explain them any better.
What I will happily report is that one of the two techniques, SVM, appears to perform our forecasting task about as well as Random Forests. In five-fold cross-validation, SVM and Random Forests both produced areas under the ROC curve (a.k.a. AUC scores) in the mid-0.80s. AUC scores can technically fall anywhere from 0 to 1, but 0.5 is what random guessing earns and 1 is perfect, so useful models land between 0.5 and 1, and a score in the mid-0.80s is pretty good for out-of-sample accuracy on this kind of forecasting problem. What’s more, when I averaged the estimates for each case from SVM and Random Forests, I got AUC scores in the mid- to upper 0.80s. That’s several points better than our current ensemble, which combines Random Forests with those logistic regression models.
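To make the AUC idea concrete, here is a minimal Python sketch that computes it as the share of positive/negative pairs in which the positive case gets the higher score, with ties counting half. The labels and scores are contrived toy numbers, not our data. They are chosen so that two imperfect models make different mistakes and their average ranks every case correctly, which real data will not always do.

```python
def auc(labels, scores):
    """Chance that a random positive case outscores a random negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # A win counts 1, a tie counts 0.5, a loss counts 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 marks an onset, 0 marks no onset (hypothetical values).
labels = [0, 0, 1, 0, 1, 0]
svm_scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.05]
rf_scores = [0.20, 0.10, 0.60, 0.65, 0.70, 0.15]
avg_scores = [(a + b) / 2 for a, b in zip(svm_scores, rf_scores)]

for name, scores in [("SVM", svm_scores), ("RF", rf_scores), ("Average", avg_scores)]:
    print(name, auc(labels, scores))
```

In this toy case each model misranks one pair, but they misrank different pairs, so the averaged scores come out ahead. That is the intuition behind combining forecasts, though it is not a guarantee.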
By contrast, KNN did quite poorly, hovering close to the 0.5 mark that we would get with randomly generated probabilities. Still, success in one of the two experiments is pretty exciting. We don’t have a lot of forecasts to combine right now, so adding even a single high-quality model to the mix could produce real gains.
Mind you, this wasn’t a push-button operation. For one thing, I had to rework my code to handle missing data in a different way—not because SVM handles missing data differently from Random Forests, but because the functions I was using to implement the techniques do. (N.B. All of this work was done in R. I used ‘ksvm’ from the kernlab package for SVM and ‘knn3’ from the caret package for KNN.) I also got poor results from SVM in my initial implementation, which used the default settings for all of the relevant parameters. It took some iterating to discover that the Laplacian kernel significantly improved the algorithm’s performance, and that tinkering with the other flexible parameters (sigma and C for the Laplacian kernel in ksvm) had no effect or made things worse.
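For the curious, the difference between kernels is easy to see in code. The sketch below is plain Python, not kernlab itself. It follows the formulas as I understand them from kernlab's documentation: the Laplacian kernel is exp(-sigma * distance), and the Gaussian (RBF) kernel, which I believe is the ksvm default, is exp(-sigma * distance squared). The points and the sigma value are arbitrary illustrations.

```python
import math


def laplacian_kernel(x, y, sigma):
    """Similarity decays with the Euclidean distance between x and y."""
    return math.exp(-sigma * math.dist(x, y))


def rbf_kernel(x, y, sigma):
    """Similarity decays with the squared Euclidean distance between x and y."""
    return math.exp(-sigma * math.dist(x, y) ** 2)


# Two arbitrary points that sit 5 units apart.
a, b = (0.0, 0.0), (3.0, 4.0)
sigma = 0.1  # illustrative value, not a tuned setting

print(laplacian_kernel(a, b, sigma))  # exp(-0.5)
print(rbf_kernel(a, b, sigma))        # exp(-2.5)
```

Because the Laplacian kernel uses the unsquared distance, similarity falls off more slowly for cases that are far apart, so distant cases still carry some weight. Whether that is why it worked better on these data is a guess on my part.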
I also suspect that the performance of KNN would improve with more effort. To keep the comparison simple, I gave all three algorithms the same set of features and observations. As it happens, though, Random Forests and SVMs are less prone to over-fitting than KNN, which has a harder time separating the signal from the noise when irrelevant features are included. The feature set I chose probably includes some things that don’t add any predictive power, and their inclusion may be obscuring the patterns that do lie in those data. In the next go-round, I would start the KNN algorithm with the small set of features in whose predictive power I’m most confident, see if that works better, and try expanding from there. I would also experiment with different values of k, which I locked in at 5 for this exercise.
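The problem with irrelevant features is easy to demonstrate on toy data. The sketch below is a bare-bones KNN in Python, not caret's knn3, with k fixed at 5 as in this exercise. The data are invented: one feature cleanly separates the two classes, and a second feature is pure noise on a larger scale.

```python
import math


def knn_probability(train_x, train_y, query, k=5):
    """Share of positive labels among the k training cases nearest the query."""
    ranked = sorted(zip(train_x, train_y), key=lambda row: math.dist(row[0], query))
    return sum(label for _, label in ranked[:k]) / k


labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
signal = [0.9, 0.95, 1.0, 1.05, 1.1, 0.0, 0.1, 0.2, 0.3, 0.4]
noise = [0.1, 3.0, -3.0, 4.0, -4.0, 0.2, -0.1, 0.3, 0.0, -0.2]

signal_only = [(s,) for s in signal]
signal_plus_noise = list(zip(signal, noise))

# The query looks like a positive case on the informative feature.
print(knn_probability(signal_only, labels, (1.0,)))
print(knn_probability(signal_plus_noise, labels, (1.0, 0.0)))
```

With the informative feature alone, all five nearest neighbors are positives. Once the noise feature enters the distance calculation, four of the five nearest neighbors are negatives, and the estimated probability drops from 1.0 to 0.2 even though nothing about the real signal has changed.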
It’s tempting to spin the story of this exercise as a human vs. machine parable in which newfangled software and Big Data outdo models hand-crafted by scholars wedded to overly simple stories about the origins of mass atrocities. It’s tempting, but it would also be wrong on a couple of crucial points.
First, this is still small data. Machine learning refers to a class of analytic methods, not the amount of data involved. Here, I am working with the same country-year data set covering the world from the 1940s to the present that I have used in previous iterations of this exercise. This data set contains fewer than 10,000 observations on scores of variables and takes up about as much space on my hard drive as a Beethoven symphony. In the future, I’d like to experiment with newer and larger data sets at different levels of aggregation, but that’s not what I’m doing now, mostly because those newer and larger data sets still don’t cover enough time and space to be useful in the analysis of such rare events.
Second and more important, theory still pervades this process. Scholars’ beliefs about what causes and presages mass killing have guided my decisions about which variables to include in this analysis. In many cases, those beliefs also shaped how the variables were originally measured and whether data on them exist at all. Those data-generating and variable-selection processes, and all of the expertise they encapsulate, are essential to these models’ forecasting power. In principle, machine learning could be applied to a much wider set of features, and perhaps we’ll try that some time, too. With events as rare as onsets of state-led mass killing, however, I would not have much confidence that results from a theoretically agnostic search would add real forecasting power and not just result in over-fitting.
In any case, based on these results, I will probably incorporate SVM into the next iteration of the Early Warning Project’s statistical risk assessments. Those are due out early in the spring of 2015, when all of the requisite inputs will have been updated (we hope). We’ll also need to think carefully about whether to keep those logistic regression models in the mix, and about what else we might borrow from the world of machine learning. In the meantime, I’ve enjoyed getting to try out some new techniques on data I know well, where it’s a lot easier to tell if things are going wonky, and it’s encouraging to see that we can continue to get better at this hard task if we keep trying.