On Evaluating and Presenting Forecasts

On Saturday afternoon, at the International Studies Association’s annual conference in New Orleans, I’m slated to participate in a round-table discussion with Patrick Brandt, Kristian Gleditsch, and Håvard Hegre on “assessing forecasts of (rare) international events.” Our chair, Andy Halterman, gave us three questions he’d like to discuss:

  1. What are the quantitative assessments of forecasting accuracy that people should use when they publish forecasts?
  2. What’s the process that should be used to determine whether a gain in a scoring metric represents an actual improvement in the model?
  3. How should model uncertainty and past model performance be conveyed to government or non-academic users of forecasts?

As I did for a Friday panel on alternative academic careers (here), I thought I’d use the blog to organize my thoughts ahead of the event and share them with folks who are interested but can’t attend the round table. So, here goes:

When assessing predictive power, we use perfection as the default benchmark. We ask, “Was she right?” or “How close to the true value did he get?”

In fields where predictive accuracy is already very good, this approach seems reasonable. When the objects of the forecasts are rare international events, however, I think this approach is a mistake, or at least misleading. It implies that perfection is attainable, and that distance from perfection is what we care about. In fact, approximating perfection is not a realistic goal in many fields, and what we really care about in those situations is distance from the available alternatives. In other words, I think we should always assess accuracy in comparative terms, not absolute ones. So, the question becomes: “Compared to what?”

I can think of two situations in which we’d want to forecast international events, and the ways we assess and describe the accuracy of the results will differ across the two. First, there is basic research, where the goal is to develop and test theory. This is what most scholars are doing most of the time, and here the benchmark should be other relevant theories. We want to compare predictive power across nested models or models representing competing hypotheses to see which version does a better job anticipating real-world behavior—and, by implication, explaining it.

Then, of course, there is applied research, where the goal is to support or improve some decision process. Policy-making and investing are probably the two most common ones. Here, the benchmark should be the status quo. What we want to know is: “How much does the proposed forecasting process improve on the one(s) used now?” If the status quo is unclear, that already tells you something important about the state of forecasting in that field—namely, that it probably isn’t great. Even in that case, though, I think it’s still important to pick a benchmark that’s more realistic than perfection. Depending on the rarity of the event in question, that will usually mean either random guessing (for frequent events) or base rates (for rare ones).
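
To make “compared to what” concrete, here’s a toy sketch in R that scores a model’s forecasts of a rare event against the naive alternative of always forecasting the base rate. Everything in it, the data and the “model” alike, is invented for illustration, and the skill ratio at the end is just one convenient way to summarize the comparison.

    # Hypothetical sketch: score a model against the base-rate benchmark.
    set.seed(42)
    n <- 5000
    risk <- plogis(-4 + rnorm(n))                         # latent risk of a rare event
    y <- rbinom(n, 1, risk)                               # observed outcomes, mostly zeros
    p_model <- plogis(qlogis(risk) + rnorm(n, sd = 0.5))  # noisy but informative forecasts
    p_naive <- rep(mean(y), n)                            # always forecast the base rate

    mse_model <- mean((p_model - y)^2)                    # squared-error loss for the model
    mse_naive <- mean((p_naive - y)^2)                    # same loss for the benchmark
    skill <- 1 - mse_model / mse_naive                    # > 0 means the model beats the base rate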

How we communicate our findings on predictive power will also differ across basic and applied research, or at least I think it should. This has less to do with the goals of the work than with the audiences at which it’s usually aimed. When the audience is other scholars, I think it’s reasonable to expect them to understand the relevant statistics, so we can simply report those. For frequent events, Brier or logarithmic scores are often best, whereas for rare events I find that AUC scores are usually more informative, and I know a lot of people like to use F-1 scores in this context, too.
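
For readers who want the mechanics, here’s a minimal sketch in base R of the Brier score, the logarithmic score, and an F-1 score at a threshold, computed on made-up forecasts and outcomes. Nothing here depends on a particular package, and the 0.5 cutoff for the F-1 calculation is an arbitrary choice for illustration.

    # Brier score, logarithmic score, and F-1 at a threshold, in base R.
    set.seed(1)
    p <- runif(1000)                           # hypothetical forecast probabilities
    y <- rbinom(1000, 1, p)                    # hypothetical observed outcomes (0/1)

    brier <- mean((p - y)^2)                   # lower is better
    logscore <- -mean(y * log(p) + (1 - y) * log(1 - p))  # lower is better

    yhat <- as.integer(p >= 0.5)               # arbitrary 0.5 classification cutoff
    precision <- sum(yhat == 1 & y == 1) / sum(yhat == 1)
    recall <- sum(yhat == 1 & y == 1) / sum(y == 1)
    f1 <- 2 * precision * recall / (precision + recall)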

In applied settings, though, we’re usually doing the work as a service to someone else who probably doesn’t know the mechanics of the relevant statistics and shouldn’t be expected to. In my experience, it’s a bad idea in these settings to try to educate your customer on the spot about things like Brier or AUC scores. They don’t need to know those statistics, and you’re liable to come across as aloof or even condescending if you presume to spend time teaching them. Instead, I’d recommend using the practical problem they’re asking you to help solve to frame how you present your predictive power. Propose a realistic decision process—or, if you can, take the one they’re already using—and describe the results you’d get if you plugged your forecasts into it.
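
Here’s one toy version of that idea in R, phrased as a simple watchlist rule (“watch the 20 highest-risk cases”) rather than as any real customer’s process. The data, the number of cases, and the size of the watchlist are all invented.

    # Toy decision rule: put the 20 highest-risk cases on a watchlist.
    set.seed(7)
    n <- 160                                   # imagine 160 countries
    risk <- plogis(-3.5 + rnorm(n))            # true (unobserved) risk
    onset <- rbinom(n, 1, risk)                # which cases actually had an event
    forecast <- plogis(qlogis(risk) + rnorm(n, sd = 0.75))  # imperfect forecasts

    watchlist <- order(forecast, decreasing = TRUE)[1:20]
    caught <- sum(onset[watchlist])            # events that made the watchlist
    missed <- sum(onset) - caught              # events the watchlist did not flag
    false_alarms <- 20 - caught                # watchlist slots with no event

Telling a customer “this list would have caught X of last year’s Y crises at the cost of Z false alarms” usually lands better than reporting a score they’ve never heard of.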

In applied contexts, people often will also want to know how your process performed on crucial cases and what events would have surprised it, so it’s good to be prepared to talk about those as well. These topics are germane to basic research, too, but crucial cases will be defined differently in the two contexts. For scholars, crucial cases are usually understood as most-likely and least-likely ones in relation to the theory being tested. For policy-makers and other applied audiences, the crucial cases are usually understood as the ones where surprise was or would have been costliest.

So that’s how I think about assessing and describing the accuracy of forecasts of the kinds of (rare) international events a lot of us study. Now, if you’ll indulge me, I’d like to close with a pedantic plea: Can we please reserve the terms “forecast” and “prediction” for statements about things that haven’t happened and not apply them to estimates we generate for cases with known outcomes?

This might seem like a petty concern, but it’s actually tied to the philosophy of knowledge that underpins science, or my understanding of it, anyway. Making predictions about things that haven’t already happened is a crucial part of the scientific method. To learn from prediction, we assume that a model’s forecasting power tells us something about its proximity to the “true” data-generating process. This assumption won’t always be right, but it’s proven pretty useful over the past few centuries, so I’m okay sticking with it for now. For obvious reasons, it’s much easier to make accurate “predictions” about cases with known outcomes than unknown ones, so the scientific value of the two endeavors is very different. In light of that fact, I think we should be as clear and honest with ourselves and our audiences as we can about which one we’re doing, and therefore how much we’re learning.

When we’re doing this stuff in practice, there are three basic modes: 1) in-sample fitting, 2) cross-validation (CV), and 3) forecasting. In-sample fitting is the least informative of the three and, in my opinion, really should only be used in exploratory analysis and should not be reported in finished work. It tells us a lot more about the method than the phenomenon of interest.

CV is usually more informative than in-sample fitting, but not always. Every round of model-tweaking you evaluate by CV on the same data set moves you a little closer to in-sample fitting, because you effectively start training to the idiosyncrasies of your chosen test sets. Repeating CV with freshly drawn folds may ameliorate this problem, but it doesn’t eliminate it. And on topics where the available data have already been worked to death—as they have on many problems of interest to scholars of international affairs—cross-validation really isn’t much more informative than in-sample fitting unless you’ve got a brand-new data series you can throw at the task and are focused on learning about it.
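
For concreteness, here’s a minimal sketch of 5-fold CV in base R, using a plain logistic regression on invented data and comparing the out-of-fold error with the rosier in-sample error. It is meant only to illustrate the mechanics, not any particular model of political events.

    # 5-fold cross-validation with a logistic regression on invented data.
    set.seed(3)
    n <- 2000
    x1 <- rnorm(n); x2 <- rnorm(n)
    dat <- data.frame(y = rbinom(n, 1, plogis(-3 + x1 + 0.5 * x2)), x1 = x1, x2 = x2)

    fold <- sample(rep(1:5, length.out = n))   # random fold assignments
    oof <- numeric(n)                          # out-of-fold predictions
    for (k in 1:5) {
      fit <- glm(y ~ x1 + x2, family = binomial, data = dat[fold != k, ])
      oof[fold == k] <- predict(fit, newdata = dat[fold == k, ], type = "response")
    }

    in_sample <- fitted(glm(y ~ x1 + x2, family = binomial, data = dat))
    brier_oof <- mean((oof - dat$y)^2)         # the more honest estimate
    brier_in <- mean((in_sample - dat$y)^2)    # the optimistic in-sample estimate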

True forecasting—making clear statements about things that haven’t happened yet and then seeing how they turn out—is uniquely informative in this regard, so I think it’s important to reserve that term for the situations where that’s actually what we’re doing. When we describe in-sample and cross-validation estimates as forecasts, we confuse our readers, and we risk confusing ourselves about how much we’re really learning.

Of course, that’s easier for some phenomena than it is for others. If your theory concerns the risk of interstate wars, for example, you’re probably (and thankfully) not going to get a lot of opportunities to test it through prediction. Rather than sweeping those issues under the rug, though, I think we should recognize them for what they are. They are not an excuse to elide the huge differences between prediction and fitting models to history. Instead, they are a big haymaker of a reminder that social science is especially hard—not because humans are uniquely unpredictable, but rather because we only have the one grand and always-running experiment to observe, and we and our work are part of it.

Forecasting Round-up No. 8

1. The latest Chronicle of Higher Education includes a piece on forecasting international affairs (here) by Beth McMurtrie, who asserts that

Forecasting is undergoing a revolution, driven by digitized data, government money, new ways to analyze information, and discoveries about how to get the best forecasts out of people.

The article covers terrain that is familiar to anyone working in this field, but I think it gives a solid overview of the current landscape. (Disclosure: I’m quoted in the piece, and it describes several research projects for which I have done or now do paid work.)

2. Yesterday, I discovered a new R package that looks to be very useful for evaluating and comparing forecasts. It’s called ‘scoring’, and it does just that, providing functions to implement an array of proper scoring rules for probabilistic predictions of binary and categorical outcomes. The rules themselves are nicely discussed in a 2013 publication co-authored by the package’s creator, Ed Merkle, and Mark Steyvers. Those rules and a number of others are also discussed in a paper by Patrick Brandt, John Freeman, and Phil Schrodt that appeared in the International Journal of Forecasting last year (earlier ungated version here).

I found the package because I was trying to break the habit of always using the area under the ROC curve, or AUC score, to evaluate and compare the accuracy of forecasts from statistical models of rare events. AUC is quite useful as far as it goes, but it doesn’t address all aspects of forecast accuracy we might care about. Mathematically, the AUC score represents the probability that a prediction selected at random from the set of cases that had an event of interest (e.g., a coup attempt or civil-war onset) will be larger than a prediction selected at random from the set of cases that didn’t. In other words, AUC deals strictly in relative ranking and tells us nothing about calibration.
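
That rank-based definition translates directly into a few lines of base R. This helper is my own sketch, not code from the ‘scoring’ package or any other.

    # AUC via ranks: the probability that a randomly chosen event case gets a
    # higher forecast than a randomly chosen non-event case (ties get average ranks).
    auc <- function(p, y) {
      r <- rank(p)
      n1 <- sum(y == 1)
      n0 <- sum(y == 0)
      (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
    }
    # Because it uses only ranks, auc(p, y) equals auc(p / 10, y): any monotone
    # rescaling of the forecasts leaves it untouched, which is exactly why it
    # says nothing about calibration.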

This came up in my work this week when I tried to compare out-of-sample estimates from three machine-learning algorithms—kernel-based regularized least squares (KRLS), Random Forests (RF), and support vector machines (SVM)—trained on and then applied to the same variables and data. In five-fold cross-validation, the three algorithms produced similar AUC scores, but histograms of the out-of-sample estimates showed much less variance for KRLS than RF and SVM. The mean out-of-sample “forecast” from all three was about 0.009, the base rate for the event, but the maximum for KRLS was only about 0.01, compared with maxes in the 0.4s and 0.7s for the others. It turned out that KRLS was doing about as well at rank ordering the cases as RF and SVM, but it was much more conservative in estimating the likelihood of an event. To consider that difference in my comparisons, I needed to apply scoring rules that were sensitive to forecast calibration and my particular concern with avoiding false negatives, and Merkle’s ‘scoring’ package gave me the functions I needed to do that. (More on the results some other time.)
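
Here’s a stripped-down, hypothetical version of that pattern in R: two sets of forecasts with exactly the same rank ordering, so any AUC comparison calls them equal, but with very different calibration, which a Brier-type score will flag. The numbers are invented and are not the estimates from my models.

    # Two hypothetical forecast sets with the same ordering but different calibration.
    set.seed(11)
    n <- 10000
    risk <- plogis(-5 + 1.5 * rnorm(n))        # a rare event
    y <- rbinom(n, 1, risk)

    p_spread <- risk                           # forecasts that range widely
    p_squeezed <- 0.005 + 0.01 * rank(risk) / n  # same ordering, squeezed into a narrow band

    # Identical ranks mean identical AUC, but a calibration-sensitive score differs:
    brier_spread <- mean((p_spread - y)^2)
    brier_squeezed <- mean((p_squeezed - y)^2)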

3. Last week, Andreas Beger wrote a great post for the WardLab blog, Predictive Heuristics, cogently explaining why event data is so important to improving forecasts of political crises:

To predict something that changes…you need predictors that change.

That sounds obvious, and in one sense it is. As Beger describes, though, most of the models political scientists have built so far have used slow-changing country-year data to try to anticipate not just where but also when crises like coup attempts or civil-war onsets will occur. Some of those models are very good at the “where” part, but, unsurprisingly, none of them is so hot on the “when” part. Beger explains why that’s true and how new data on political events can help us fix that.

4. Finally, Chris Blattman, Rob Blair, and Alexandra Hartman have posted a new working paper on predicting violence at the local level in “fragile” states. As they describe in their abstract,

We use forecasting models and new data from 242 Liberian communities to show that it is possible to predict outbreaks of local violence with high sensitivity and moderate accuracy, even with limited data. We train our models to predict communal and criminal violence in 2010 using risk factors measured in 2008. We compare predictions to actual violence in 2012 and find that up to 88% of all violence is correctly predicted. True positives come at the cost of many false positives, giving overall accuracy between 33% and 50%.

The patterns Blattman and Blair describe in that last sentence are related to what Beger was talking about with cross-national forecasting. Blattman, Blair, and Hartman’s models run on survey data and some other structural measures describing conditions in a sample of Liberian localities. Their predictive algorithms were derived from a single time step: inputs from 2008 and observations of violence from 2010. When those algorithms are applied to data from 2010 to predict violence in 2012, they do okay—not great, but “[similar] to some of the earliest prediction efforts at the cross-national level.” As the authors say, to do much better at this task, we’re going to need more and more dynamic data covering a wider range of cases.
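
To see how a high hit rate and modest overall accuracy can coexist, here’s a purely hypothetical confusion matrix worked through in R. The split is invented for illustration and is not Blattman, Blair, and Hartman’s.

    # Invented confusion matrix: 88% sensitivity alongside roughly 48% accuracy.
    tp <- 44; fn <- 6                          # 50 events, 44 correctly flagged
    fp <- 120; tn <- 72                        # but lots of false alarms
    sensitivity <- tp / (tp + fn)              # 0.88
    accuracy <- (tp + tn) / (tp + fn + fp + tn)  # about 0.48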

Whatever the results, I think it’s great that the authors are trying to forecast at all. Even better, they make explicit the connections they see between theory building, data collection, data exploration, and prediction. On that subject, the authors get the last word:

However important deductive hypothesis testing remains, there is much to gain from inductive, data-driven approaches as well. Conflict is a complex phenomenon with many potential risk factors, and it is rarely possible to adjudicate between them on ex ante theoretical grounds. As datasets on local violence proliferate, it may be more fruitful to (on occasion) let the data decide. Agnosticism may help focus attention on the dependent variable and illuminate substantively and statistically significant relationships that the analyst would not have otherwise detected. This does not mean running “kitchen sink” regressions, but rather seeking models that produce consistent, interpretable results in high dimensions and (at the same time) improve predictive power. Unexpected correlations, if robust, provide puzzles and stylized facts for future theories to explain, and thus generate important new avenues of research. Forecasting can be an important tool in inductive theory-building in an area as poorly understood as local violence.

Finally, testing the predictive power of exogenous, statistically significant causes of violence can tell us much about their substantive significance—a quantity too often ignored in the comparative politics and international relations literature. A causal model that cannot generate predictions with some reasonable degree of accuracy is not in fact a causal model at all.
