The second Annual Bank Conference on Africa happened in Berkeley, CA, earlier this week, and the World Bank’s Development Impact blog has an outstanding summary of the 50-odd papers presented there. If you have to pick between reading this post and that one, go there.
One paper on that roster that caught my eye revisits the choice of statistical models for the study of civil wars. As authors John Paul Dunne and Nan Tian describe, the default choice is logistic regression, although probit gets a little playing time, too. They argue, however, that a zero-inflated Poisson (ZIP) model matches the data-generating process better than either of these traditional picks, and they show that this choice affects what we learn about the causes of civil conflict.
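For readers who want the distinction spelled out, here is the standard textbook form of a zero-inflated Poisson model. The notation is generic and is an assumption of this sketch, not taken from Dunne and Tian's paper: \pi is the probability that an observation is a "structural" zero, and \lambda is the mean of the ordinary Poisson count that applies otherwise.

```latex
\Pr(Y = 0) = \pi + (1 - \pi)\, e^{-\lambda},
\qquad
\Pr(Y = k) = (1 - \pi)\, \frac{\lambda^{k} e^{-\lambda}}{k!}, \quad k = 1, 2, \dots
```

The point of the mixture is that zeros can arise in two ways: from cases where the event essentially cannot occur, and from cases where it could have occurred but did not. Logit and probit model a single binary outcome and do not separate those two sources of zeros.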
Having worked on statistical models of civil conflict for nearly 20 years, I have some opinions on that model-choice issue, but those aren’t what I want to discuss right now. Instead, I want to wonder aloud why more researchers don’t use prediction as the yardstick—or at least one of the yardsticks—for adjudicating these model comparisons.
In their paper, Dunne and Tian stake their claim about the superiority of ZIP to logit and probit on comparisons of Akaike information criteria (AIC) and Vuong tests. Okay, but if their goal is to see if ZIP fits the underlying data-generating process better than those other choices, what better way to find out than by comparing out-of-sample predictive power?
Prediction is fundamental to the accumulation of scientific knowledge. The better we understand why and how something happens, the more accurate our predictions of it should be. When we estimate models from observational data and only look at how well our models fit the data from which they were estimated, we learn some things about the structure of that data set, but we don’t learn how well those things generalize to other relevant data sets. If we believe that the world isn’t deterministic (that the observed data are just one of many possible realizations of the world), then we need to care about that ability to generalize, because generalization and the discovery of its current limits are the heart of the scientific enterprise.
From a scientific standpoint, the ideal world would be one in which we could estimate models representing rival theories, then compare the accuracy of the predictions they generate across a large number of relevant “trials” as they unfold in real time. That’s difficult for scholars studying big but rare events like civil wars and wars between states, though; a lot of time has to pass before we’ll see enough new examples to make a statistically powerful comparison across models.
But, hey, there’s an app for that: cross-validation! Instead of using all the data in the initial estimation, hold some out to use as a test set for the models we get from the rest. Better yet, split the data into several equally sized folds, then hold out each fold in turn as the test set while training on the others (k-fold cross-validation). Even better, repeat that process a bunch of times with different random splits and compare distributions of the resulting statistics.
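To make the procedure described above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data are synthetic, the two "rival models" are a base-rate guesser and a one-predictor logit (stand-ins, not the ZIP, logit, and probit models from the paper), the Brier score is just one reasonable accuracy measure, and all function names are made up.

```python
import math
import random

def k_fold_indices(n, k, rng):
    """Shuffle row indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_base_rate(xs, ys):
    """Hypothetical rival 1: ignore x, always predict the training event rate."""
    rate = sum(ys) / len(ys)
    return lambda x: rate

def fit_logit(xs, ys, steps=200, lr=1.0):
    """Hypothetical rival 2: one-predictor logit fit by plain gradient descent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += p - y          # gradient of the log loss w.r.t. b0
            g1 += (p - y) * x    # gradient of the log loss w.r.t. b1
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return lambda x: 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def brier(model, xs, ys):
    """Mean squared error of predicted probabilities (lower is better)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

def repeated_cv(xs, ys, fitters, k=5, repeats=5, seed=0):
    """Repeated k-fold CV: return every out-of-sample score for each model."""
    rng = random.Random(seed)
    scores = {name: [] for name in fitters}
    for _ in range(repeats):
        for held_out in k_fold_indices(len(xs), k, rng):
            test = set(held_out)
            train = [i for i in range(len(xs)) if i not in test]
            x_tr, y_tr = [xs[i] for i in train], [ys[i] for i in train]
            x_te, y_te = [xs[i] for i in held_out], [ys[i] for i in held_out]
            for name, fit in fitters.items():
                model = fit(x_tr, y_tr)  # estimated on training folds only
                scores[name].append(brier(model, x_te, y_te))
    return scores

# Synthetic, fairly rare binary events driven by a single predictor.
data_rng = random.Random(42)
xs = [data_rng.gauss(0, 1) for _ in range(300)]
ys = [1 if data_rng.random() < 1 / (1 + math.exp(-(-2 + 1.5 * x))) else 0
      for x in xs]

scores = repeated_cv(xs, ys, {"base_rate": fit_base_rate, "logit": fit_logit})
for name, vals in scores.items():
    print(name, "mean out-of-sample Brier:", round(sum(vals) / len(vals), 4))
```

Each model ends up with k × repeats out-of-sample scores, so the comparison is between two distributions rather than two single numbers. With real conflict data you would probably also want folds that respect the panel structure (for example, holding out whole countries or years), which this sketch ignores.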
Prediction is the gold standard in most scientific fields, and cross-validation is standard practice in many areas of applied forecasting, because they are more informative than in-sample tests. For some reason, political science still mostly eschews both.* Here’s hoping that changes soon.
* For some recent exceptions to this rule on topics in world politics, see Ward, Greenhill, and Bakke and Blair, Blattman, and Hartman on predicting civil conflict; Chadefaux on warning signs of interstate war; Hill and Jones on state repression; and Chenoweth and me on the onset of nonviolent campaigns.