Forecasting Round-up No. 8

1. The latest Chronicle of Higher Education includes a piece on forecasting international affairs (here) by Beth McMurtrie, who asserts that

Forecasting is undergoing a revolution, driven by digitized data, government money, new ways to analyze information, and discoveries about how to get the best forecasts out of people.

The article covers terrain that is familiar to anyone working in this field, but I think it gives a solid overview of the current landscape. (Disclosure: I’m quoted in the piece, and it describes several research projects for which I have done or now do paid work.)

2. Yesterday, I discovered a new R package that looks to be very useful for evaluating and comparing forecasts. It’s called ‘scoring’, and it does just that, providing functions to implement an array of proper scoring rules for probabilistic predictions of binary and categorical outcomes. The rules themselves are nicely discussed in a 2013 publication co-authored by the package’s creator, Ed Merkle, and Mark Steyvers. Those rules and a number of others are also discussed in a paper by Patrick Brandt, John Freeman, and Phil Schrodt that appeared in the International Journal of Forecasting last year (earlier ungated version here).

I found the package because I was trying to break the habit of always using the area under the ROC curve, or AUC score, to evaluate and compare the accuracy of forecasts from statistical models of rare events. AUC is quite useful as far as it goes, but it doesn’t address all aspects of forecast accuracy we might care about. Mathematically, the AUC score represents the probability that a prediction selected at random from the set of cases that had an event of interest (e.g., a coup attempt or civil-war onset) will be larger than a prediction selected at random from the set of cases that didn’t. In other words, AUC deals strictly in relative ranking and tells us nothing about calibration.
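
To make the pairwise definition concrete, here is a minimal Python sketch (not the R code behind this post). The forecasts are made up for illustration, and the function name is hypothetical. It counts the share of event/non-event pairs in which the event case got the larger forecast, then shows that dividing every forecast by 100 leaves the AUC untouched, which is the sense in which AUC ignores calibration.

```python
from itertools import product


def auc_pairwise(scores_pos, scores_neg):
    """Share of (event, non-event) pairs where the event case has the
    higher forecast. Tied pairs count as half a win."""
    wins = 0.0
    for p, n in product(scores_pos, scores_neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical forecasts, invented for this example.
events = [0.30, 0.08, 0.02]                    # cases where the event occurred
non_events = [0.05, 0.01, 0.01, 0.03, 0.005]   # cases where it did not

auc = auc_pairwise(events, non_events)

# Dividing every forecast by 100 keeps the rank order, so AUC is unchanged
# even though the forecasts are now wildly underconfident.
auc_shrunk = auc_pairwise([p / 100 for p in events],
                          [p / 100 for p in non_events])

print(auc, auc_shrunk)
```

Any rule that rewards good probabilities, and not just good ordering, would score those two sets of forecasts differently.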

This came up in my work this week when I tried to compare out-of-sample estimates from three machine-learning algorithms—kernel-based regularized least squares (KRLS), Random Forests (RF), and support vector machines (SVM)—trained on and then applied to the same variables and data. In five-fold cross-validation, the three algorithms produced similar AUC scores, but histograms of the out-of-sample estimates showed much less variance for KRLS than RF and SVM. The mean out-of-sample “forecast” from all three was about 0.009, the base rate for the event, but the maximum for KRLS was only about 0.01, compared with maxes in the 0.4s and 0.7s for the others. It turned out that KRLS was doing about as well at rank ordering the cases as RF and SVM, but it was much more conservative in estimating the likelihood of an event. To consider that difference in my comparisons, I needed to apply scoring rules that were sensitive to forecast calibration and my particular concern with avoiding false negatives, and Merkle’s ‘scoring’ package gave me the functions I needed to do that. (More on the results some other time.)
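
A rough Python illustration of that point follows. It does not use the ‘scoring’ package or its API; the two scoring functions are hand-rolled, and the forecasts are invented. The "timid" forecasts are a squeezed copy of the "bold" ones, so the two sets rank the cases identically (same AUC), yet the Brier and logarithmic scores separate them.

```python
import math


def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast and outcome; lower is better."""
    return sum((f - y) ** 2 for f, y in zip(forecasts, outcomes)) / len(outcomes)


def log_score(forecasts, outcomes):
    """Mean negative log of the probability given to what happened; lower is better."""
    total = sum(math.log(f if y == 1 else 1 - f)
                for f, y in zip(forecasts, outcomes))
    return -total / len(outcomes)


# Hypothetical data: one event in ten cases.
outcomes = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
bold = [0.60, 0.05, 0.02, 0.02, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]

# Squeeze the bold forecasts into a narrow band near 0.01.
# The transformation is increasing, so the rank order is preserved.
timid = [0.009 + 0.002 * f for f in bold]

print("Brier:", brier_score(bold, outcomes), brier_score(timid, outcomes))
print("Log:  ", log_score(bold, outcomes), log_score(timid, outcomes))
```

In this toy case the bold forecaster happens to be right about the one event, so both rules favor it. Had the 0.60 landed on a non-event, the same rules would have punished the boldness.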

3. Last week, Andreas Beger wrote a great post for the WardLab blog, Predictive Heuristics, cogently explaining why event data is so important to improving forecasts of political crises:

To predict something that changes…you need predictors that change.

That sounds obvious, and in one sense it is. As Beger describes, though, most of the models political scientists have built so far have used slow-changing country-year data to try to anticipate not just where but also when crises like coup attempts or civil-war onsets will occur. Some of those models are very good at the “where” part, but, unsurprisingly, none of them does so hot on the “when” part. Beger explains why that’s true and how new data on political events can help us fix that.

4. Finally, Chris Blattman, Rob Blair, and Alexandra Hartman have posted a new working paper on predicting violence at the local level in “fragile” states. As they describe in their abstract,

We use forecasting models and new data from 242 Liberian communities to show that it is possible to predict outbreaks of local violence with high sensitivity and moderate accuracy, even with limited data. We train our models to predict communal and criminal violence in 2010 using risk factors measured in 2008. We compare predictions to actual violence in 2012 and find that up to 88% of all violence is correctly predicted. True positives come at the cost of many false positives, giving overall accuracy between 33% and 50%.

The patterns Blattman, Blair, and Hartman describe in that last sentence are related to what Beger was talking about with cross-national forecasting. Their models run on survey data and some other structural measures describing conditions in a sample of Liberian localities. Their predictive algorithms were derived from a single time step: inputs from 2008 and observations of violence from 2010. When those algorithms are applied to data from 2010 to predict violence in 2012, they do okay: not great, but “[similar] to some of the earliest prediction efforts at the cross-national level.” As the authors say, to do much better at this task, we’re going to need more and more dynamic data covering a wider range of cases.

Whatever the results, I think it’s great that the authors are trying to forecast at all. Even better, they make explicit the connections they see between theory building, data collection, data exploration, and prediction. On that subject, the authors get the last word:

However important deductive hypothesis testing remains, there is much to gain from inductive, data-driven approaches as well. Conflict is a complex phenomenon with many potential risk factors, and it is rarely possible to adjudicate between them on ex ante theoretical grounds. As datasets on local violence proliferate, it may be more fruitful to (on occasion) let the data decide. Agnosticism may help focus attention on the dependent variable and illuminate substantively and statistically significant relationships that the analyst would not have otherwise detected. This does not mean running “kitchen sink” regressions, but rather seeking models that produce consistent, interpretable results in high dimensions and (at the same time) improve predictive power. Unexpected correlations, if robust, provide puzzles and stylized facts for future theories to explain, and thus generate important new avenues of research. Forecasting can be an important tool in inductive theory-building in an area as poorly understood as local violence.

Finally, testing the predictive power of exogenous, statistically significant causes of violence can tell us much about their substantive significance—a quantity too often ignored in the comparative politics and international relations literature. A causal model that cannot generate predictions with some reasonable degree of accuracy is not in fact a causal model at all.

Why political scientists should predict

Last week, Hans Noel wrote a post for Mischiefs of Faction provocatively titled “Stop trying to predict the future”. I say provocatively because, if I read the post correctly, Noel’s argument deliberately refutes his own headline. Noel wasn’t making a case against forecasting. Rather, he was arguing in favor of forecasting, as long as it’s done in service of social-scientific objectives.

If that’s right, then I largely agree with Noel’s argument and would restate it as follows. Political scientists shouldn’t get sucked into bickering with their colleagues over small differences in forecast accuracy around single events, because those differences will rarely contain enough information for us to learn much from them. Instead, we should take prediction seriously as a means of testing competing theories by doing two things.

First, we should build forecasting models that clearly represent contrasting sets of beliefs about the causes and precursors of the things we’re trying to predict. In Noel’s example, U.S. election forecasts are only scientifically interesting in so far as they come from models that instantiate different beliefs about why Americans vote like they do. If, for example, a model that incorporates information about trends in unemployment consistently produces more accurate forecasts than a very similar model that doesn’t, then we can strengthen our confidence that trends in unemployment shape voter behavior. If all the predictive models use only the same inputs—polls, for example—we don’t leave ourselves much room to learn about theories from them.

In my work for the Early Warning Project, I have tried to follow this principle by organizing our multi-model ensemble around a pair of models that represent overlapping but distinct ideas about the origins of state-led mass killing. One model focuses on the characteristics of the political regimes that might perpetrate this kind of violence, while another focuses on the circumstances in which those regimes might find themselves. These models embody competing claims about why states kill, so a comparison of their predictive accuracy will give us a chance to learn something about the relative explanatory power of those competing claims. Most of the current work on forecasting U.S. elections follows this principle too, by the way, even if that’s not what gets emphasized in media coverage of their work.

Second, we should only really compare the predictive power of those models across multiple events or a longer time span, where we can be more confident that observed differences in accuracy are meaningful. This is basic statistics. The smaller the sample, the less confident we can be that it is representative of the underlying distribution(s) from which it was drawn. If we declare victory or failure in response to just one or a few bits of feedback, we risk “correcting” for an unlikely draw that dimly reflects the processes that really interest us. Instead, we should let the models run for a while before chucking or tweaking them, or at least leave the initial version running while trying out alternatives.
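
A small simulation can show how noisy short track records are. This Python sketch rests on assumptions that are entirely made up: an "informed" forecaster who knows each event's true probability (drawn between 0.3 and 0.7), a "naive" one who always says 0.5, and Brier scores as the yardstick. It counts how often the naive forecaster comes out ahead over a handful of events versus a few hundred.

```python
import random


def brier(forecasts, outcomes):
    """Mean squared error of probabilistic forecasts; lower is better."""
    return sum((f - y) ** 2 for f, y in zip(forecasts, outcomes)) / len(outcomes)


def share_of_upsets(n_events, n_trials=2000, seed=42):
    """Share of simulated track records in which the naive forecaster
    gets a better (lower) Brier score than the informed one."""
    rng = random.Random(seed)
    upsets = 0
    for _ in range(n_trials):
        # Assumed setup: true probabilities vary modestly around 0.5.
        true_probs = [rng.uniform(0.3, 0.7) for _ in range(n_events)]
        outcomes = [1 if rng.random() < p else 0 for p in true_probs]
        informed = brier(true_probs, outcomes)        # forecasts the truth
        naive = brier([0.5] * n_events, outcomes)     # always says 50/50
        if naive < informed:
            upsets += 1
    return upsets / n_trials


few = share_of_upsets(5)     # a handful of events
many = share_of_upsets(200)  # a long track record
print(few, many)
```

Under these assumptions the worse forecaster "wins" a large share of the five-event comparisons and only rarely wins over 200 events, which is the reason to let models run before declaring a victor.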

Admittedly, this can be hard to do in practice, especially when the events of interest are rare. All of the applied forecasters I know—myself included—are tinkerers by nature, so it’s difficult for us to find the patience that second step requires. With U.S. elections, forecasters also know that they only get one shot every two or four years, and that most people won’t hear anything about their work beyond a topline summary that reads like a racing form from the horse track. If you’re at all competitive—and anyone doing this work probably is—it’s hard not to respond to that incentive. With the Early Warning Project, I worry about having a salient “miss” early in the system’s lifespan that encourages doubters to dismiss the work before we’ve really had a chance to assess its reliability and value. We can be patient, but if our intended audiences aren’t too, then the system could fail to get the traction it deserves.

Difficult doesn’t mean impossible, however, and I’m optimistic that political scientists will increasingly use forecasting in service of their search for more useful and more powerful theories. Journal articles that take this idea seriously are still rare birds, especially on things other than U.S. elections, but you occasionally spot them (Exhibit A and B). As Drew Linzer tweeted in response to Noel’s post, “Arguing over [predictive] models is arguing over assumptions, which is arguing over theories. This is exactly what [political science] should be doing.”

Machine learning our way to better early warning on mass atrocities

For the past couple of years, I’ve been helping build a system that uses statistics and expert crowds to assess and track risks of mass atrocities around the world. Recently dubbed the Early Warning Project (EWP), this effort already has a blog up and running (here), and the EWP should finally be able to launch a more extensive public website within the next several weeks.

One of the first things I did for the project, back in 2012, was to develop a set of statistical models that assess risks of onsets of state-led mass killing in countries worldwide, the type of mass atrocities for which we have the most theory and data. Consistent with the idea that the EWP will strive to keep improving on what it does as new data, methods, and ideas become available, that piece of the system has continued to evolve over the ensuing couple of years.

You can find the first two versions of that statistical tool here and here. The latest iteration—recently codified in new-and-improved replication materials—has performed pretty well, correctly identifying the few countries that have seen onsets of state-led mass killing in the past couple of years as relatively high-risk cases before those onsets occurred. It’s not nearly as precise as we’d like—I usually apply the phrase “relatively high-risk” to the Top 30, and we’ll only see one or two events in most years—but that level of imprecision is par for the course when forecasting rare and complex political crises like these.

Of course, a solid performance so far doesn’t mean that we can’t or shouldn’t try to do even better. Last week, I finally got around to applying a couple of widely used machine learning techniques to our data to see how those techniques might perform relative to the set of models we’re using now. Our statistical risk assessments come not from a single model but from a small collection of them—a “multi-model ensemble” in applied forecasting jargon—because these collections of models usually produce more accurate forecasts than any single one can. Our current ensemble mixes two logistic regression models, each representing a different line of thinking about the origins of mass killing, with one machine-learning algorithm—Random Forests—that gets applied to all of the variables used by those theory-specific models. In cross-validation, the Random Forests forecasts handily beat the two logistic regression models, but, as is often the case, the average of the forecasts from all three does even better.
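
As a sketch of what averaging across models means mechanically, the Python below takes an unweighted mean of each case's forecasts. The three forecast vectors are invented and the function name is hypothetical; the post does not say whether the project's actual ensemble uses any weighting.

```python
def ensemble_mean(*model_forecasts):
    """Unweighted average of several models' forecasts, case by case.
    Each argument is one model's list of probabilities, in the same case order."""
    n_models = len(model_forecasts)
    return [sum(case) / n_models for case in zip(*model_forecasts)]


# Hypothetical forecasts for three country-years from three models.
logit_regime = [0.02, 0.10, 0.01]
logit_context = [0.04, 0.06, 0.03]
random_forest = [0.00, 0.20, 0.02]

ensemble = ensemble_mean(logit_regime, logit_context, random_forest)
print(ensemble)
```

Averaging tends to help most when the component models make different mistakes, which is one reason to build them around distinct ideas.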

Inspired by the success of Random Forests in our current risk assessments and by the power of machine learning in another project on which I’m working, I decided last week to apply two more machine learning methods to this task: support vector machines (SVM) and the k-nearest neighbors (KNN) algorithm. I won’t explain the two techniques in any detail here; you can find good explanations elsewhere on the internet (see here and here, for example), and, frankly, I don’t understand the methods deeply enough to explain them any better.

What I will happily report is that one of the two techniques, SVM, appears to perform our forecasting task about as well as Random Forests. In five-fold cross-validation, SVM and Random Forests both produced areas under the ROC curve (a.k.a. AUC scores) in the mid-0.80s. In practice, AUC scores run from 0.5 (no better than random guessing) to 1 (perfect rank ordering), and a score in the mid-0.80s is pretty good for out-of-sample accuracy on this kind of forecasting problem. What’s more, when I averaged the estimates for each case from SVM and Random Forests, I got AUC scores in the mid- to upper 0.80s. That’s several points better than our current ensemble, which combines Random Forests with those logistic regression models.

By contrast, KNN did quite poorly, hovering close to the 0.5 mark that we would get with randomly generated probabilities. Still, success in one of the two experiments is pretty exciting. We don’t have a lot of forecasts to combine right now, so adding even a single high-quality model to the mix could produce real gains.

Mind you, this wasn’t a push-button operation. For one thing, I had to rework my code to handle missing data in a different way, not because SVM handles missing data differently from Random Forests, but because the functions I was using to implement the techniques do. (N.B. All of this work was done in R. I used ‘ksvm’ from the kernlab package for SVM and ‘knn3’ from the caret package for KNN.) I also got poor results from SVM in my initial implementation, which used the default settings for all of the relevant parameters. It took some iterating to discover that the Laplacian kernel significantly improved the algorithm’s performance, and that tinkering with the other flexible parameters (sigma and C for the Laplacian kernel in ksvm) had no effect or made things worse.

I also suspect that the performance of KNN would improve with more effort. To keep the comparison simple, I gave all three algorithms the same set of features and observations. As it happens, though, Random Forests and SVMs are less prone to over-fitting than KNN, which has a harder time separating the signal from the noise when irrelevant features are included. The feature set I chose probably includes some things that don’t add any predictive power, and their inclusion may be obscuring the patterns that do lie in those data. In the next go-round, I would start the KNN algorithm with the small set of features in whose predictive power I’m most confident, see if that works better, and try expanding from there. I would also experiment with different values of k, which I locked in at 5 for this exercise.
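
The sensitivity of KNN to irrelevant features is easy to reproduce on synthetic data. The Python sketch below is a toy, not the analysis described here: it builds a bare-bones k-nearest-neighbors classifier (k = 5, as in the exercise), generates one informative feature plus a chosen number of pure-noise features, and reports out-of-sample classification accuracy. All data, names, and settings are assumptions for illustration.

```python
import random


def knn_predict(train_x, train_y, point, k=5):
    """Majority vote among the k training rows closest to `point`
    (squared Euclidean distance)."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, point)), label)
        for row, label in zip(train_x, train_y)
    )
    votes = sum(label for _, label in distances[:k])
    return 1 if votes > k / 2 else 0


def make_data(n, n_noise, rng):
    """One informative feature (its sign sets the label) plus noise features."""
    rows, labels = [], []
    for _ in range(n):
        signal = rng.gauss(0, 1)
        rows.append([signal] + [rng.gauss(0, 1) for _ in range(n_noise)])
        labels.append(1 if signal > 0 else 0)
    return rows, labels


def knn_accuracy(n_noise, n_train=150, n_test=100, k=5, seed=7):
    """Out-of-sample share of correct classifications."""
    rng = random.Random(seed)
    train_x, train_y = make_data(n_train, n_noise, rng)
    test_x, test_y = make_data(n_test, n_noise, rng)
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(test_x, test_y))
    return hits / n_test


clean = knn_accuracy(n_noise=0)    # informative feature only
noisy = knn_accuracy(n_noise=20)   # same signal buried among 20 noise features
print(clean, noisy)
```

Because every feature contributes equally to the distance, the noise columns swamp the one that matters, which is why starting from a small, trusted feature set is a sensible next step.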

It’s tempting to spin the story of this exercise as a human vs. machine parable in which newfangled software and Big Data outdo models hand-crafted by scholars wedded to overly simple stories about the origins of mass atrocities. It’s tempting, but it would also be wrong on a couple of crucial points.

First, this is still small data. Machine learning refers to a class of analytic methods, not the amount of data involved. Here, I am working with the same country-year data set covering the world from the 1940s to the present that I have used in previous iterations of this exercise. This data set contains fewer than 10,000 observations on scores of variables and takes up about as much space on my hard drive as a Beethoven symphony. In the future, I’d like to experiment with newer and larger data sets at different levels of aggregation, but that’s not what I’m doing now, mostly because those newer and larger data sets still don’t cover enough time and space to be useful in the analysis of such rare events.

Second and more important, theory still pervades this process. Scholars’ beliefs about what causes and presages mass killing have guided my decisions about what variables to include in this analysis and, in many cases, how those variables were originally measured and the fact that data even exist on them at all. Those data-generating and variable-selection processes, and all of the expertise they encapsulate, are essential to these models’ forecasting power. In principle, machine learning could be applied to a much wider set of features, and perhaps we’ll try that some time, too. With events as rare as onsets of state-led mass killing, however, I would not have much confidence that results from a theoretically agnostic search would add real forecasting power and not just result in over-fitting.

In any case, based on these results, I will probably incorporate SVM into the next iteration of the Early Warning Project’s statistical risk assessments. Those are due out early in the spring of 2015, when all of the requisite inputs will have been updated (we hope). I think we’ll also need to think carefully about whether or not to keep those logistic regression models in the mix, and what else we might borrow from the world of machine learning. In the meantime, I’ve enjoyed getting to try out some new techniques on data I know well, where it’s a lot easier to tell if things are going wonky, and it’s encouraging to see that we can continue to get better at this hard task if we keep trying.

2014 NFL Football Season Predictions

Professional (American) football season starts tonight when the Green Bay Packers visit last year’s champs, the Seattle Seahawks, for a Thursday-night opener thing that still seems weird to me. (SUNDAY, people. Pro football is played on Sunday.) So, who’s likely to win?

With the final preseason scores from our pairwise wiki survey in hand, we can generate a prediction for that game, along with all 255 other regular-season contests on the 2014 schedule. As I described in a recent post, this wiki survey offers a novel way to crowdsource the problem of estimating team strength before the season starts. We can use last year’s preseason survey data and game results to estimate a simple statistical model that accounts for two teams’ strength differential and home-field advantage. Then, we can apply that model to this year’s survey results to get game-level forecasts.

In the last post, I used the initial model estimates to generate predicted net scores (home minus visitor) and confidence intervals. This time, I thought I’d push it a little further and use predictive simulations. Following Gelman and Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models (2007), I generated 1,000 simulated net scores for each game and then summarized the distributions of those scores to get my statistics of interest.

The means of those simulated net scores for each game represent point estimates of the outcome, and the variance of those distributions gives us another way to compute confidence intervals. Those means and confidence intervals closely approximate the ones we’d get from a one-shot application of the predictive model to the 2014 survey results, however, so there’s no real new information there.

What we can do with those distributions that is new is compute win probabilities. The share of simulated net scores above 0 gives us an estimate of the probability of a home-team win, and 1 minus that estimate gives us the probability of a visiting-team win.
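
Here is a simplified Python sketch of that calculation, not the posted R script. It uses the rounded coefficients reported in the earlier post on this model (intercept 3.1, slope 0.1) and the survey scores cited for the Packers (77.5) and Seahawks (92.6). The residual standard deviation of 13 points is an assumption, and unlike a full Gelman-and-Hill-style simulation, this version draws only game-level noise and ignores uncertainty in the coefficients, so its numbers will not match the ones reported here.

```python
import random

INTERCEPT = 3.1      # rounded home-field advantage from the earlier post
SLOPE = 0.1          # rounded points per unit of rating differential
RESIDUAL_SD = 13.0   # ASSUMED spread of game outcomes around the prediction


def simulate_net_scores(home_rating, visitor_rating, n_sims=1000, seed=1):
    """Draw simulated net scores (home minus visitor) for one game."""
    rng = random.Random(seed)
    expected = INTERCEPT + SLOPE * (home_rating - visitor_rating)
    return [rng.gauss(expected, RESIDUAL_SD) for _ in range(n_sims)]


def home_win_probability(net_scores):
    """Share of simulated net scores above 0."""
    return sum(score > 0 for score in net_scores) / len(net_scores)


# Packers (77.5) visiting Seahawks (92.6).
sims = simulate_net_scores(home_rating=92.6, visitor_rating=77.5)
p_home = home_win_probability(sims)
p_visitor = 1 - p_home
print(p_home, p_visitor)
```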

A couple of pictures make this idea clearer. First, here’s a histogram of the simulated net scores for tonight’s Packers-Seahawks game. The Packers fared pretty well in the preseason wiki survey, ranking 5th overall with a score of 77.5 out of 100. The defending-champion Seahawks got the highest score in the survey, however (a whopping 92.6), and they have home-field advantage, which is worth about 3.1 points on average, according to my model. In my predictive simulations, 673 of the 1,000 games had a net score above 0, suggesting a win probability of 67%, or 2:1 odds, in favor of the Seahawks. The mean predicted net score is 5.8, which is pretty darn close to the current spread of -5.5.

Histogram of Simulated Net Scores: Green Bay Packers at Seattle Seahawks

Things look a little tighter for the Bengals-Ravens contest, which I’ll be attending with my younger son on Sunday in our once-annual pilgrimage to M&T Bank Stadium. The Ravens wound up 10th in the wiki survey with a score of 60.5, but the Bengals are just a few rungs down the ladder, in 13th, with a score of 54.7. Add in home-field advantage, though, and the simulations give the Ravens a win probability of 62%, or about 3:2 odds. Here, the mean net score is 3.6, noticeably higher than the current spread of -1.5 but on the same side of the win/loss line. (N.B. Because the two teams’ survey scores are so close, the tables turn when Cincinnati hosts in Week 8, and the predicted probability of a home win is 57%.)

Histogram of Simulated Net Scores: Cincinnati Bengals at Baltimore Ravens

Once we’ve got those win probabilities ginned up, we can use them to move from game-level to season-level forecasts. It’s tempting to think of the wiki survey results as season-level forecasts already, but what they don’t do is account for variation in strength of schedule. Other things being equal, a strong team with a really tough schedule might not be expected to do much better than a mediocre team with a relatively easy schedule. The model-based simulations refract those survey results through the 2014 schedule to give us a clearer picture of what we can expect to happen on the field this year.

The table below (made with the handy ‘textplot’ command in R’s gplots package) turns the predictive simulations into season-level forecasts for all 32 teams.* I calculated two versions of a season summary and juxtaposed them with the wiki survey scores and resulting rankings. Here’s what’s in the table:

  • WikiRank shows each team’s ranking in the final preseason wiki survey results.
  • WikiScore shows the score on which that ranking is based.
  • WinCount counts the number of games in which each team has a win probability above 0.5. This process gives us a familiar number, the first half of a predicted W-L record, but it also throws out a lot of information by treating forecasts close to 0.5 the same as ones where we’re more confident in our prediction of the winner.
  • WinSum is the sum of each team’s win probabilities across the 16 games. This expected number of wins is a better estimate of each team’s anticipated results than WinCount, but it’s also a less familiar one, so I thought I would show both.
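
The difference between the two summaries is easy to see in a few lines of Python. The win probabilities below are hypothetical, chosen to exaggerate the gap: a team that is a slight favorite in all 16 games gets a WinCount of 16 but an expected win total under 9.

```python
def season_summary(win_probs):
    """Return (WinCount, WinSum) for one team's game-level win probabilities."""
    win_count = sum(p > 0.5 for p in win_probs)  # games in which the team is favored
    win_sum = sum(win_probs)                     # expected number of wins
    return win_count, win_sum


# Hypothetical team: a 55% favorite in every one of its 16 games.
slight_favorite = [0.55] * 16
win_count, win_sum = season_summary(slight_favorite)
print(win_count, round(win_sum, 1))
```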

Teams appear in the table in descending order of WinSum, which I consider the single-best estimate in this table of a team’s 2014 performance. It’s interesting (to me, anyway) to see how the rank order changes from the survey to the win totals because of differences in strength of schedule. So, for example, the Patriots ranked 4th in the wiki survey, but they get the second-highest expected number of wins this year (9.8), just behind the Seahawks (9.9). Meanwhile, the Steelers scored 16th in the wiki survey, but they rank 11th in expected number of wins with an 8.4. That’s a smidgen better than the Cincinnati Bengals (8.3) and not much worse than the Baltimore Ravens (9.0), suggesting an even tighter battle for the AFC North division title than the wiki survey results alone suggest.

2014 NFL Season-Level Forecasts from 1,000 Predictive Simulations Using Preseason Wiki Survey Results and Home-Field Advantage

There are a lot of other interesting quantities we could extract from the results of the game-level simulations, but that’s all I’ve got time to do now. If you want to poke around in the original data and simulation results, you can find them all in a .csv on my Google Drive (here). I’ve also posted a version of the R script I used to generate the game-level and season-level forecasts on Github (here).

At this point, I don’t have plans to try to update the forecasts during the season, but I will be seeing how the preseason predictions fare and occasionally reporting the results here. Meanwhile, if you have suggestions on other ways to use these data or to improve these forecasts, please leave a comment here on the blog.

* The version of this table I initially posted had an error in the WikiRank column where 18 was skipped and the rankings ran to 33. This version corrects that error. Thanks to commenter C.P. Liberatore for pointing it out.

Turning Crowdsourced Preseason NFL Strength Ratings into Game-Level Forecasts

For the past week, nearly all of my mental energy has gone into the Early Warning Project and a paper for the upcoming APSA Annual Meeting here in Washington, DC. Over the weekend, though, I found some time for a toy project on forecasting pro-football games. Here are the results.

The starting point for this toy project is a pairwise wiki survey that turns a crowd’s beliefs about relative team strength into scalar ratings. Regular readers will recall that I first experimented with one of these before the 2013-2014 NFL season, and the predictive power wasn’t terrible, especially considering that the number of participants was small and the ratings were completed before the season started.

This year, to try to boost participation and attract a more knowledgeable crowd of respondents, I paired with Trey Causey to announce the survey on his pro-football analytics blog, The Spread. The response has been solid so far. Since the survey went up, the crowd—that’s you!—has cast nearly 3,400 votes in more than 100 unique user sessions (see the Data Visualizations section here).

The survey will stay open throughout the season, but that doesn’t mean it’s too early to start seeing what it’s telling us. One thing I’ve already noticed is that the crowd does seem to be updating in response to preseason action. For example, before the first round of games, I noticed that the Baltimore Ravens, my family’s favorites, were running mid-pack with a rating of about 50. After they trounced the defending NFC champion 49ers in their preseason opener, however, the Ravens jumped to the upper third with a rating of 59. (You can always see up-to-the-moment survey results here, and you can cast your own votes here.)

The wiki survey is a neat way to measure team strength. On their own, though, those ratings don’t tell us what we really want to know, which is how each game is likely to turn out, or how well our team might be expected to do this season. The relationship between relative strength and game outcomes should be pretty strong, but we might want to consider other factors, too, like home-field advantage. To turn a strength rating into a season-level forecast for a single team, we need to consider the specifics of its schedule. In game play, it’s relative strength that matters, and some teams will have tougher schedules than others.

A statistical model is the best way I can think to turn ratings into game forecasts. To get a model to apply to this season’s ratings, I estimated a simple linear one from last year’s preseason ratings and the results of all 256 regular-season games (found online in .csv format here). The model estimates net score (home minus visitor) from just one feature, the difference between the two teams’ preseason ratings (again, home minus visitor). Because the net scores are all ordered the same way and the model also includes an intercept, though, it implicitly accounts for home-field advantage as well.

The scatterplot below shows the raw data on those two dimensions from the 2013 season. The model estimated from these data has an intercept of 3.1 and a slope of 0.1 for the score differential. In other words, the model identifies a net home-field advantage of 3 points—consistent with the conventional wisdom—and it suggests that every point of advantage on the wiki-survey ratings translates into a net swing of one-tenth of a point on the field. I also tried a generalized additive model with smoothing splines to see if the association between the survey-score differential and net game score was nonlinear, but as the scatterplot suggests, it doesn’t seem to be.

2013 NFL Games Arranged by Net Game Score and Preseason Wiki Survey Rating Differentials
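
For readers who want to see the mechanics, here is a bare-bones least-squares fit in Python. It is a sketch of the general technique, not the R code used for this post, and the five data points are toy values constructed to lie exactly on a line with intercept 3.1 and slope 0.1 so the output is easy to check. Real game data would scatter widely around that line.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data: rating differentials (home minus visitor) and noise-free net scores.
rating_diff = [-30.0, -10.0, 0.0, 10.0, 30.0]
net_score = [3.1 + 0.1 * d for d in rating_diff]

intercept, slope = fit_line(rating_diff, net_score)


def predict_net_score(home_rating, visitor_rating):
    """Predicted net score (home minus visitor) from the fitted line."""
    return intercept + slope * (home_rating - visitor_rating)


print(intercept, slope, predict_net_score(60.0, 50.0))
```

With evenly matched teams the rating differential is zero, so the prediction collapses to the intercept, which is how the model picks up home-field advantage without a separate term for it.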

In sample, the linear model’s accuracy was good, not great. If we convert the net scores the model postdicts to binary outcomes and compare those postdictions to actual outcomes, we see that the model correctly classifies 60 percent of the games. That’s in sample, but it’s also based on nothing more than home-field advantage and a single preseason rating for each team from a survey with a small number of respondents. So, all things considered, it looks like a potentially useful starting point.
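
The conversion from net scores to a hit rate can be sketched in Python as follows. The five games are invented, and the choice to count an actual tie (net score of exactly 0) as "not a home win" is an assumption the post does not address.

```python
def share_correct(predicted_net, actual_net):
    """Share of games in which the predicted and actual net scores
    (home minus visitor) agree on whether the home team won.
    An actual tie counts as 'not a home win' here (an assumption)."""
    hits = sum((p > 0) == (a > 0) for p, a in zip(predicted_net, actual_net))
    return hits / len(actual_net)


# Hypothetical predictions and results for five games.
predicted = [4.6, 1.2, -0.8, 6.0, 2.5]
actual = [20, -3, -7, 10, -14]

accuracy = share_correct(predicted, actual)
print(accuracy)
```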

Whatever its limitations, that model gives us the tool we need to convert 2014 wiki survey results into game-level predictions. To do that, we also need a complete 2014 schedule. I couldn’t find one in .csv format, but I found something close (here) that I saved as text, manually cleaned in a minute or so (deleted extra header rows, fixed remaining header), and then loaded and merged with a .csv of the latest survey scores downloaded from the manager’s view of the survey page on All Our Ideas.

I’m not going to post forecasts for all 256 games—at least not now, with three more preseason games to learn from and, hopefully, lots of votes yet to be cast. To give you a feel for how the model is working, though, I’ll show a couple of cuts on those very preliminary results.

The first is a set of forecasts for all Week 1 games. The labels show Visitor-Home, and the net score is ordered the same way. So, a predicted net score greater than 0 means the home team (second in the paired label) is expected to win, while a predicted net score below 0 means the visitor (first in the paired label) is expected to win. The lines around the point predictions represent 90-percent confidence intervals, giving us a partial sense of the uncertainty around these estimates.

Week 1 Game Forecasts from Preseason Wiki Survey Results on 10 August 2014


Of course, as a fan of a particular team, I’m most interested in what the model says about how my guys are going to do this season. The next plot shows predictions for all 16 of Baltimore’s games. Unfortunately, the plotting command orders the data by label, and my R skills and available time aren’t sufficient to reorder them by week, but the information is all there. In this plot, the dots for the point predictions are colored red if they predict a Baltimore win and black for an expected loss. The good news for Ravens fans is that this plot suggests an 11-5 season, good enough for a playoff berth. The bad news is that an 8-8 season also lies within the 90-percent confidence intervals, so the playoffs don’t look like a lock.

2014 Game-Level Forecasts for the Baltimore Ravens from 10 August 2014 Wiki Survey Scores


So that’s where the toy project stands now. My intuition tells me that the predicted net scores aren’t as well calibrated as I’d like, and the estimated confidence intervals surely understate the true uncertainty around each game (“On any given Sunday…”). Still, I think this exercise demonstrates the potential of this forecasting process. If I were a betting man, I wouldn’t lay money on these estimates. As an applied forecaster, though, I can imagine using these predictions as priors in a more elaborate process that incorporates additional and, ideally, more dynamic information about each team and game situation over the course of the season. Maybe my doppelganger can take that up while I get back to my day job…

Postscript. After I published this post, Jeff Fogle suggested via Twitter that I compare the Week 1 forecasts to the current betting lines for those games. The plot below shows the median point spread from an NFL odds-aggregating site as blue dots on top of the statistical forecasts already shown above. As you can see, the statistical forecasts are tracking the betting lines pretty closely. There’s only one game—Carolina at Tampa Bay—where the predictions from the two series fall on different sides of the win/loss line, and it’s a game the statistical model essentially sees as a toss-up. It’s also reassuring that there isn’t a consistent direction to the differences, so the statistical process doesn’t seem to be biased in some fundamental way.

Week 1 Game-Level Forecasts Compared to Median Point Spread from Betting Sites on 11 August 2014
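
The two checks described in the postscript, whether the series land on the same side of the win/loss line and whether the gaps lean in one direction, can be sketched like this. It assumes both series have already been converted to the same home-minus-visitor scale; the function name and the four games are made up for illustration.

```python
from statistics import mean


def compare_forecasts(model_net, market_net):
    """Return (games on opposite sides of zero, mean model-minus-market gap).

    Both inputs are assumed to be expected home-minus-visitor margins.
    """
    opposite = sum((m > 0) != (s > 0) for m, s in zip(model_net, market_net))
    gap = mean(m - s for m, s in zip(model_net, market_net))
    return opposite, gap


# Made-up values for four games, not the actual Week 1 numbers.
model_net = [5.0, -1.0, 2.5, 0.5]
market_net = [4.0, -2.0, 3.5, -1.0]
print(compare_forecasts(model_net, market_net))
```

A mean gap near zero, with individual gaps of both signs, is the pattern you would expect if neither series is systematically tilted relative to the other.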


Forecasting Round-Up No. 7

1. I got excited when I heard on Twitter yesterday about a machine-learning process that turns out to be very good at predicting U.S. Supreme Court decisions (blog post here, paper here). I got even more excited when I saw that the guys who built that process have also been running a play-money prediction market on the same problem for the past several years, and that the most accurate forecasters in that market have done even better than that model (here). It sounds like they are now thinking about more rigorous ways to compare and cross-pollinate the two. That’s part of what we’re trying to do with the Early Warning Project, so I hope that they do and we can learn from their findings.

2. A paper in the current issue of the Journal of Personality and Social Psychology (here, but paywalled; hat-tip to James Igoe Walsh) adds to the growing pile of evidence on the forecasting power of crowds, with an interesting additional finding on the willingness of others to trust and use those forecasts:

We introduce the select-crowd strategy, which ranks judges based on a cue to ability (e.g., the accuracy of several recent judgments) and averages the opinions of the top judges, such as the top 5. Through both simulation and an analysis of 90 archival data sets, we show that select crowds of 5 knowledgeable judges yield very accurate judgments across a wide range of possible settings—the strategy is both accurate and robust. Following this, we examine how people prefer to use information from a crowd. Previous research suggests that people are distrustful of crowds and of mechanical processes such as averaging. We show in 3 experiments that, as expected, people are drawn to experts and dislike crowd averages—but, critically, they view the select-crowd strategy favorably and are willing to use it. The select-crowd strategy is thus accurate, robust, and appealing as a mechanism for helping individuals tap collective wisdom.
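
As a rough illustration of the strategy the abstract describes, the sketch below ranks judges by a cue to ability and averages the current judgments of the top few. The choice of mean absolute error as the cue, the function name, and all of the numbers are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean


def select_crowd_estimate(past_errors, current_judgments, k=5):
    """Average the current judgments of the k judges with the lowest past error.

    past_errors: dict of judge -> error on several recent judgments
    current_judgments: dict of judge -> current numeric judgment
    """
    ranked = sorted(current_judgments, key=lambda judge: past_errors[judge])
    return mean(current_judgments[judge] for judge in ranked[:k])


# Hypothetical judges and numbers.
past_errors = {"a": 2.0, "b": 9.0, "c": 1.0, "d": 4.0, "e": 7.0, "f": 3.0, "g": 5.0}
current = {"a": 10.0, "b": 30.0, "c": 12.0, "d": 14.0, "e": 25.0, "f": 11.0, "g": 13.0}
print(select_crowd_estimate(past_errors, current))  # average of the top 5
print(mean(current.values()))                       # average of the whole crowd
```

Setting k to 1 reduces this to trusting the single best-ranked judge, and setting it to the size of the crowd reduces it to a plain average, so the select crowd sits between those two familiar options.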

3. Adam Elkus recently spotlighted two interesting papers involving agent-based modeling (ABM) and forecasting.

  • The first (here) “presents a set of guidelines, imported from the field of forecasting, that can help social simulation and, more specifically, agent-based modelling practitioners to improve the predictive performance and the robustness of their models.”
  • The second (here), from 2009 but new to me, describes an experiment in deriving an agent-based model of political conflict from event data. The results were pretty good; a model built from event data and then tweaked by a subject-matter expert was as accurate as one built entirely by hand, and the hybrid model took much less time to construct.

4. Nautilus ran a great piece on Lewis Fry Richardson, a pioneer in weather forecasting who also applied his considerable intellect to predicting violent conflict. As the story notes,

At the turn of the last century, the notion that the laws of physics could be used to predict weather was a tantalizing new idea. The general idea—model the current state of the weather, then apply the laws of physics to calculate its future state—had been described by the pioneering Norwegian meteorologist Vilhelm Bjerknes. In principle, Bjerknes held, good data could be plugged into equations that described changes in air pressure, temperature, density, humidity, and wind velocity. In practice, however, the turbulence of the atmosphere made the relationships among these variables so shifty and complicated that the relevant equations could not be solved. The mathematics required to produce even an initial description of the atmosphere over a region (what Bjerknes called the “diagnostic” step) were massively difficult.

Richardson helped solve that problem in weather forecasting by breaking the task into many more manageable parts—atmospheric cells, in this case—and thinking carefully about how those parts fit together. I wonder if we will see similar advances in forecasts of social behavior in the next 100 years. I doubt it, but the trajectory of weather prediction over the past century should remind us to remain open to the possibility.

5. Last, a bit of fun: Please help Trey Causey and me forecast the relative strength of this year’s NFL teams by voting in this pairwise wiki survey! I did this exercise last year, and the results weren’t bad, even though the crowd was pretty small and probably not especially expert. Let’s see what happens if more people participate, shall we?

Uncertainty About How Best to Convey Uncertainty

NPR News ran a series of stories this week under the header Risk and Reason, on “how well we understand and act on probabilities.” I thought the series nicely represented how uncertain we are about how best to convey forecasts to people who might want to use them. There really is no clear standard here, even though it is clear that the choices we make in presenting forecasts and other statistics on risks to their intended consumers strongly shape what they hear.

This uncertainty about how best to convey forecasts was on full display in the piece on how CIA analysts convey predictive assessments (here). Ken Pollack, a former analyst who now teaches intelligence analysis, tells NPR that, at CIA, “There was a real injunction that no one should ever use numbers to explain probability.” Asked why, he says that,

Assigning numerical probability suggests a much greater degree of certainty than you ever want to convey to a policymaker. What we are doing is inherently difficult. Some might even say it’s impossible. We’re trying to predict the future. And, you know, saying to someone that there’s a 67 percent chance that this is going to happen, that sounds really precise. And that makes it seem like we really know what’s going to happen. And the truth is that we really don’t.

In that same segment, though, Dartmouth professor Jeff Friedman, who studies decision-making about national security issues, says we should provide a numeric point estimate of an event’s likelihood, along with some information about our confidence in that estimate and how malleable it may be. (See this paper by Friedman and Richard Zeckhauser for a fuller treatment of this argument.) The U.S. Food and Drug Administration apparently agrees; according to the same NPR story, the FDA “prefers numbers and urges drug companies to give numerical values for risk—and to avoid using vague terms such as ‘rare, infrequent and frequent.'”

Instead of numbers, Pollack advocates for using words: “Almost certainly or highly likely or likely or very unlikely,” he tells NPR. As noted by one of the other stories in the series (here), however—on the use of probabilities in medical decision-making—words and phrases are ambiguous, too, and that ambiguity can be just as problematic.

Doctors, including Leigh Simmons, typically prefer words. Simmons is an internist and part of a group practice that provides primary care at Mass General. “As doctors we tend to often use words like, ‘very small risk,’ ‘very unlikely,’ ‘very rare,’ ‘very likely,’ ‘high risk,’ ” she says.

But those words can be unclear to a patient.

“People may hear ‘small risk,’ and what they hear is very different from what I’ve got in my mind,” she says. “Or what’s a very small risk to me, it’s a very big deal to you if it’s happened to a family member.”

Intelligence analysts have sometimes tried to remove that ambiguity by standardizing the language they use to convey likelihoods, most famously in Sherman Kent’s “Words of Estimative Probability.” It’s not clear to me, though, how effective this approach is. For one thing, consumers are often lazy about trying to understand just what information they’re being given, and templates like Kent’s don’t automatically solve that problem. This laziness came across most clearly in NPR’s Risk and Reason segment on meteorology (here). Many of us routinely consume probabilistic forecasts of rainfall and make decisions in response to them, but it turns out that few of us understand what those forecasts actually mean. With Kent’s words of estimative probability, I suspect that many readers of the products that use them haven’t memorized the table that spells out their meaning and don’t bother to consult it when they come across those phrases, even when it’s reproduced in the same document.

Equally important, a template that works well for some situations won’t necessarily work for all. I’m thinking in particular of forecasts on the kinds of low-probability, high-impact events that I usually analyze and that are essential to the CIA’s work, too. Here, what look like small differences in probability can sometimes be very meaningful. For example, imagine that it’s August 2001 and you have three different assessments of the risk of a major terrorist attack on U.S. soil in the next few months. One pegs the risk at 1 in 1,000; another at 1 in 100; and another at 1 in 10. Using Kent’s table, all three of those assessments would get translated into a statement that the event is “almost certainly not” going to happen, but I imagine that most U.S. decision-makers would have felt very differently about risks of 0.1%, 1%, and 10% with a threat of that kind.
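
To make the collapse concrete, here is a toy lexicon that maps a probability to a phrase. The band edges are illustrative assumptions, loosely modeled on Kent’s table rather than copied from it, and the function name is hypothetical.

```python
# Upper edges of each band are illustrative assumptions; they are not
# Kent's published figures.
BANDS = [
    (0.15, "almost certainly not"),
    (0.40, "probably not"),
    (0.60, "chances about even"),
    (0.85, "probable"),
    (1.00, "almost certain"),
]


def estimative_phrase(probability):
    """Return the phrase for the first band whose upper edge covers the probability."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    for upper_edge, phrase in BANDS:
        if probability <= upper_edge:
            return phrase


for risk in (0.001, 0.01, 0.1):
    print(risk, "->", estimative_phrase(risk))
```

Under these assumed bands, a hundredfold difference in risk disappears into a single phrase, which is the problem with applying one general-purpose template to rare events.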

There are lots of rare but important events that inhabit this corner of the probability space: nuclear accidents, extreme weather events, medical treatments, and mass atrocities, to name a few. We could create a separate lexicon for assessments in these areas, as the European Medicines Agency has done for adverse reactions to medical therapies (here, via NPR). I worry, though, that we ask too much of consumers of these and other forecasts if we expect them to remember multiple lexicons and to correctly code-switch between them. We also know that the relevant scale will differ across audiences, even on the same topic. For example, an individual patient considering a medical treatment might not care much about the difference between a mortality risk of 1 in 1,000 and 1 in 10,000, but a drug company and the regulators charged with overseeing it hopefully do.

If there’s a general lesson here, it’s that producers of probabilistic forecasts should think carefully about how best to convey their estimates to specific audiences. In practice, that means thinking about the nature of the decision processes those forecasts are meant to inform and, if possible, trying different approaches and checking to find out how each is being understood. Ideally, consumers of those forecasts should also take time to try to educate themselves on what they’re getting. I’m not optimistic that many will do that, but we should at least make it as easy as possible for them to do so.

In Applied Forecasting, Keep It Simple

One of the lessons I think I’ve learned from the nearly 15 years I’ve spent developing statistical models to forecast rare political events is: keep it simple unless and until you’re compelled to do otherwise.

The fact that the events we want to forecast emerge from extremely complex systems doesn’t mean that the models we build to forecast them need to be extremely complex as well. In a sense, the unintelligible complexity of the causal processes relieves us from the imperative to follow that path. We know our models can’t even begin to capture the true data-generating process. So, we can and usually should think instead about looking for measures that capture relevant concepts in a coarse way and then use simple model forms to combine those measures.

A few experiences and readings have especially shaped my thinking on this issue.

  • When I worked on the Political Instability Task Force (PITF), my colleagues and I found that a logistic regression model with just four variables did a pretty good job assessing relative risks of a few forms of major political crisis in countries worldwide (see here, or ungated here). In fact, one of the four variables in that model—an indicator that four or more bordering countries have ongoing major armed conflicts—has almost no variance, so it’s effectively a three-variable model. We tried adding a lot of other things that were suggested by a lot of smart people, but none of them really improved the model’s predictive power. (There were also a lot of things we couldn’t even try because the requisite data don’t exist, but that’s a different story.)
  • Toward the end of my time with PITF, we ran a “tournament of methods” to compare the predictive power of several statistical techniques that varied in their complexity, from logistic regression to Bayesian hierarchical models with spatial measures (see here for the write-up). We found that the more complex approaches usually didn’t outperform the simpler ones, and when they did, it wasn’t by much. What mattered most for predictive accuracy was finding the inputs with the most forecasting power. Once we had those, the functional form and complexity of the model didn’t make much difference.
  • As Andreas Graefe describes (here), models that assign equal weights to all predictors often forecast at least as accurately as multiple regression models that estimate weights from historical data. “Such findings have led researchers to conclude that the weighting of variables is secondary for the accuracy of forecasts,” Graefe writes. “Once the relevant variables are included and their directional impact on the criterion is specified, the magnitudes of effects are not very important.”
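
The equal-weights idea in the last item can be sketched in a few lines: standardize each predictor, give it a sign for its expected direction, and add them up. This is a minimal illustration under those assumptions, not Graefe’s own procedure; the predictor names and values are made up.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a list to mean 0 and standard deviation 1."""
    center, spread = mean(values), pstdev(values)
    return [(v - center) / spread for v in values]


def unit_weighted_scores(predictors, directions):
    """Sum standardized predictors, each multiplied by +1 or -1.

    predictors: dict of name -> list of values, one per case
    directions: dict of name -> +1 if higher values raise the outcome, -1 if they lower it
    """
    columns = [
        [directions[name] * z for z in standardize(values)]
        for name, values in predictors.items()
    ]
    return [sum(case) for case in zip(*columns)]


# Hypothetical predictors for four cases.
predictors = {
    "predictor_a": [10.0, 20.0, 30.0, 40.0],
    "predictor_b": [8.0, 6.0, 4.0, 2.0],
}
directions = {"predictor_a": +1, "predictor_b": -1}
print(unit_weighted_scores(predictors, directions))
```

Nothing is estimated from outcome data here. The only modeling decisions are which variables to include and which way each one points, which is exactly the part Graefe says matters most.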

Of course, there will be some situations in which complexity adds value, so it’s worth exploring those ideas when we have a theoretical rationale and the coding skills, data, and time needed to pursue them. In general, though, I am convinced that we should always try simpler forms first and only abandon them if and when we discover that more complex forms significantly increase forecasting power.

Importantly, the evidence for that judgment should come from out-of-sample validation—ideally, from forecasts made about events that hadn’t yet happened. Models with more variables and more complex forms will often score better than simpler ones when applied to the data from which they were derived, but this will usually turn out to be a result of overfitting. If the more complex approach isn’t significantly better at real-time forecasting, it should probably be set aside until it is.
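
A contrived toy example shows the pattern. Below, a base-rate model is compared with a more flexible lookup model that memorizes the training outcome for each value of a feature. The feature is pure noise by construction, the data are invented, and the Brier score is just one convenient accuracy measure; all names are hypothetical.

```python
from collections import defaultdict


def brier(predictions, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(outcomes)


def fit_base_rate(train):
    """Simple model: predict the training base rate for every case."""
    rate = sum(y for _, y in train) / len(train)
    return lambda feature: rate


def fit_lookup(train):
    """Flexible model: predict the training mean outcome for each feature value."""
    by_value = defaultdict(list)
    for feature, y in train:
        by_value[feature].append(y)
    fallback = fit_base_rate(train)  # used for feature values never seen in training
    return lambda feature: (
        sum(by_value[feature]) / len(by_value[feature])
        if feature in by_value
        else fallback(feature)
    )


def score(model, cases):
    """Brier score of a fitted model on a list of (feature, outcome) cases."""
    return brier([model(f) for f, _ in cases], [y for _, y in cases])


# Invented data: the feature carries no real information about the outcome.
train = [("a", 1), ("b", 0), ("c", 0), ("d", 0), ("e", 1), ("f", 0), ("g", 0), ("h", 0)]
test = [("a", 0), ("b", 0), ("c", 1), ("d", 0), ("e", 0), ("f", 0), ("g", 1), ("h", 0)]

for name, fit in (("base rate", fit_base_rate), ("lookup", fit_lookup)):
    model = fit(train)
    print(name, "in-sample:", score(model, train), "out-of-sample:", score(model, test))
```

The lookup model fits the training data perfectly and then does worse than the base rate on the held-out cases, because all it learned was noise.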

Oh, and a corollary: if you have to choose between a) building more complex models, or even just applying lots of techniques to the same data, and b) testing other theoretically relevant variables for predictive power, do (b).

The Ethics of Political Science in Practice

As citizens and as engaged intellectuals, we all have the right—indeed, an obligation—to make moral judgments and act based on those convictions. As political scientists, however, we have a unique set of potential contributions and constraints. Political scientists do not typically have anything of distinctive value to add to a chorus of moral condemnation or declarations of normative solidarity. What we do have, hopefully, is the methodological training, empirical knowledge and comparative insight to offer informed assessments about alternative courses of action on contentious issues. Our primary ethical commitment as political scientists, therefore must be to get the theory and the empirical evidence right, and to clearly communicate those findings to relevant audiences—however unpalatable or inconclusive they might be.

That’s a manifesto of sorts, nested in a great post by Marc Lynch at the Monkey Cage. Marc’s post focuses on analysis of the Middle East, but everything he writes generalizes to the whole discipline.

I’ve written a couple of posts on this theme, too:

  • “This Is Not a Drill,” on the challenges of doing what Marc proposes in the midst of fast-moving and politically charged events with weighty consequences; and
  • “Advocascience,” on the ways that researchers’ political and moral commitments shape our analyses, sometimes but not always intentionally.

Putting all of those pieces together, I’d say that I wholeheartedly agree with Marc in principle, but I also believe this is extremely difficult to do in practice. We can—and, I think, should—aspire to this posture, but we can never quite achieve it.

That applies to forecasting, too, by the way. Coincidentally, I saw this great bit this morning in the Letter from the Editors for a new special issue of The Appendix, on “futures of the past”:

Prediction is a political act. Imagined futures can be powerful tools for social change, but they can also reproduce the injustices of the present.

Concern about this possibility played a role in my decision to leave my old job, helping to produce forecasts of political instability around the world for private consumption by the U.S. government. It is also part of what attracts me to my current work on a public early-warning system for mass atrocities. By making the same forecasts available to all comers, I hope that we can mitigate that downside risk in an area where the immorality of the acts being considered is unambiguous.

As a social scientist, though, I also understand that we’ll never know for sure what good or ill effects our individual and collective efforts had. We won’t know because we can’t observe the “control” worlds we would need to confidently establish cause and effect, and we won’t know because the world we seek to understand keeps changing, sometimes even in response to our own actions. This is the paradox at the core of applied, empirical social science, and it is inescapable.

Beware the Confident Counterfactual

Did you anticipate the Syrian uprising that began in 2011? What about the Tunisian, Egyptian, and Libyan uprisings that preceded and arguably shaped it? Did you anticipate that Assad would survive the first three years of civil war there, or that Iraq’s civil war would wax again as intensely as it has in the past few days?

All of these events or outcomes were difficult forecasting problems before they occurred, and many observers have been frank about their own surprise at many of them. At the same time, many of those same observers speak with confidence about the causes of those events. The invasion of Iraq in 2003 surely is or is not the cause of the now-raging civil war in that country. The absence of direct US or NATO military intervention in Syria is or is not to blame for continuation of that country’s civil war and the mass atrocities it has brought—and, by extension, the resurgence of civil war in Iraq.

But here’s the thing: strong causal claims require some confidence about how history would have unfolded in the absence of the cause of interest, and those counterfactual histories are no easier to get right than observed history was to anticipate.

Like all of the most interesting questions, what causality means and how we might demonstrate it will forever be matters for debate—see here on Daniel Little’s blog for an overview of that debate’s recent state—but most conceptions revolve around some idea of necessity. When we say X caused Y, we usually mean that had X not occurred, Y wouldn’t have happened, either. Subtler or less stringent versions might center on salience instead of necessity and insert a “probably” into the final phrase of the previous sentence, but the core idea is the same.

In nonexperimental social science, this logic implicitly obliges us to consider the various ways history might have unfolded in response to X’ rather than X. In a sense, then, both prediction and explanation are forecasting problems. They require us to imagine states of the world we have not seen and to connect them in plausible ways to ones we have. If anything, the counterfactual predictions required for explanation are more frustrating epistemological problems than the true forecasts, because we will never get to see the outcome(s) against which we could assess the accuracy of our guesses.

As Robert Jervis pointed out in his contribution to a 1996 edited volume on counterfactual thought experiments in world politics, counterfactuals are (or should be) especially hard to construct—and thus causal claims especially hard to make—when the causal processes of interest involve systems. For Jervis,

A system exists when elements or units are interconnected so that the system has emergent properties—i.e., its characteristics and behavior cannot be inferred from the characteristics and behavior of the units taken individually—and when changes in one unit or the relationship between any two of them produce ramifying alterations in other units or relationships.

As Jervis notes,

A great deal of thinking about causation…is based on comparing two situations that are the same in all ways except one. Any differences in the outcome, whether actual or expected…can be attributed to the difference in the state of the one element…

Under many circumstances, this method is powerful and appropriate. But it runs into serious problems when we are dealing with systems because other things simply cannot be held constant: as Garrett Hardin nicely puts it, in a system, ‘we can never do merely one thing.’

Jervis sketches a few thought experiments to drive this point home. He has a nice one about the effects of external interventions on civil wars that is topical here, but I think his New York traffic example is more resonant:

In everyday thought experiments we ask what would have happened if one element in our world had been different. Living in New York, I often hear people speculate that traffic would be unbearable (as opposed to merely terrible) had Robert Moses not built his highways, bridges, and tunnels. But to try to estimate what things would have been like, we cannot merely subtract these structures from today’s Manhattan landscape. The traffic patterns, the location of businesses and residences, and the number of private automobiles that are now on the streets are in significant measure the product of Moses’s road network. Had it not been built, or had it been built differently, many other things would have been different. Traffic might now be worse, but it is also possible that it would have been better because a more efficient public transportation system would have been developed or because the city would not have grown so large and prosperous without the highways.

Substitute “invade Iraq” or “fail to invade Syria” for Moses’s bridges and tunnels, and I hope you see what I mean.

In the end, it’s much harder to get beyond banal observations about influences to strong claims about causality than our story-telling minds and the popular media that cater to them would like. Of course the invasion of Iraq in 2003 or the absence of Western military intervention in Syria have shaped the histories that followed. But what would have happened in their absence—and, by implication, what would happen now if, for example, the US re-inserted its armed forces into Iraq or attempted to topple Assad? Those questions are far tougher to answer, and we should beware of anyone who speaks with great confidence about their answers. If you’re a social scientist who isn’t comfortable making and confident in the accuracy of your predictions, you shouldn’t be comfortable making and confident in the validity of your causal claims, either.
