Turning Crowdsourced Preseason NFL Strength Ratings into Game-Level Forecasts

For the past week, nearly all of my mental energy has gone into the Early Warning Project and a paper for the upcoming APSA Annual Meeting here in Washington, DC. Over the weekend, though, I found some time for a toy project on forecasting pro-football games. Here are the results.

The starting point for this toy project is a pairwise wiki survey that turns a crowd’s beliefs about relative team strength into scalar ratings. Regular readers will recall that I first experimented with one of these before the 2013-2014 NFL season, and the predictive power wasn’t terrible, especially considering that the number of participants was small and the ratings were completed before the season started.

This year, to try to boost participation and attract a more knowledgeable crowd of respondents, I partnered with Trey Causey to announce the survey on his pro-football analytics blog, The Spread. The response has been solid so far. Since the survey went up, the crowd—that’s you!—has cast nearly 3,400 votes in more than 100 unique user sessions (see the Data Visualizations section here).

The survey will stay open throughout the season, but that doesn’t mean it’s too early to start seeing what it’s telling us. One thing I’ve already noticed is that the crowd does seem to be updating in response to preseason action. For example, before the first round of games, I noticed that the Baltimore Ravens, my family’s favorites, were running mid-pack with a rating of about 50. After they trounced the San Francisco 49ers in their preseason opener, however, the Ravens jumped to the upper third with a rating of 59. (You can always see up-to-the-moment survey results here, and you can cast your own votes here.)

The wiki survey is a neat way to measure team strength. On their own, though, those ratings don’t tell us what we really want to know, which is how each game is likely to turn out, or how well our team might be expected to do this season. The relationship between relative strength and game outcomes should be pretty strong, but we might want to consider other factors, too, like home-field advantage. To turn a strength rating into a season-level forecast for a single team, we need to consider the specifics of its schedule. In game play, it’s relative strength that matters, and some teams will have tougher schedules than others.

A statistical model is the best way I can think to turn ratings into game forecasts. To get a model to apply to this season’s ratings, I estimated a simple linear one from last year’s preseason ratings and the results of all 256 regular-season games (found online in .csv format here). The model estimates net score (home minus visitor) from just one feature, the difference between the two teams’ preseason ratings (again, home minus visitor). Because the net scores are all ordered the same way and the model also includes an intercept, though, it implicitly accounts for home-field advantage as well.

The scatterplot below shows the raw data on those two dimensions from the 2013 season. The model estimated from these data has an intercept of 3.1 and a slope of 0.1 for the score differential. In other words, the model identifies a net home-field advantage of 3 points—consistent with the conventional wisdom—and it suggests that every point of advantage on the wiki-survey ratings translates into a net swing of one-tenth of a point on the field. I also tried a generalized additive model with smoothing splines to see if the association between the survey-score differential and net game score was nonlinear, but as the scatterplot suggests, it doesn’t seem to be.
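For readers who want the mechanics, here is a minimal sketch of that estimation step in R. The data frame and column names are mine, not the ones in the original files:

```r
# Minimal sketch of the model described above; 'games' and its columns are illustrative.
games$net.score   <- games$home.score - games$visitor.score
games$rating.diff <- games$home.rating - games$visitor.rating

# Linear model: net game score as a function of the rating differential.
# The intercept absorbs the average home-field advantage.
fit <- lm(net.score ~ rating.diff, data = games)
summary(fit)  # in the 2013 data, intercept near 3.1 and slope near 0.1

# Check for nonlinearity with a smoothing spline (mgcv package).
library(mgcv)
fit.gam <- gam(net.score ~ s(rating.diff), data = games)
plot(fit.gam)  # an essentially straight line suggests the linear form is adequate
```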

2013 NFL Games Arranged by Net Game Score and Preseason Wiki Survey Rating Differentials

In sample, the linear model’s accuracy was good, not great. If we convert the net scores the model postdicts to binary outcomes and compare those postdictions to actual outcomes, we see that the model correctly classifies 60 percent of the games. That’s in sample, but it’s also based on nothing more than home-field advantage and a single preseason rating for each team from a survey with a small number of respondents. So, all things considered, it looks like a potentially useful starting point.
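Continuing that sketch, the in-sample classification check is just a comparison of signs: treat a positive fitted net score as a predicted home win and see how often that matches what actually happened.

```r
# In-sample accuracy check: convert fitted and actual net scores to
# home-win indicators and compute the share that match.
predicted.home.win <- fitted(fit) > 0
actual.home.win    <- games$net.score > 0
mean(predicted.home.win == actual.home.win)  # about 0.60 in the 2013 data
```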

Whatever its limitations, that model gives us the tool we need to convert 2014 wiki survey results into game-level predictions. To do that, we also need a complete 2014 schedule. I couldn’t find one in .csv format, but I found something close (here) that I saved as text, manually cleaned in a minute or so (deleted extra header rows, fixed remaining header), and then loaded and merged with a .csv of the latest survey scores downloaded from the manager’s view of the survey page on All Our Ideas.
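For what it’s worth, that merge takes only a few lines of R. The file and column names below are placeholders; the real schedule file and the All Our Ideas export are labeled differently.

```r
# Hypothetical file and column names; adjust to match the actual downloads.
schedule <- read.csv("nfl_2014_schedule.csv", stringsAsFactors = FALSE)   # week, visitor, home
ratings  <- read.csv("wiki_survey_scores.csv", stringsAsFactors = FALSE)  # team, rating

# Attach each team's survey rating to the schedule, once for the home team
# and once for the visitor, then compute the differential the model uses.
schedule <- merge(schedule, setNames(ratings, c("home", "home.rating")), by = "home")
schedule <- merge(schedule, setNames(ratings, c("visitor", "visitor.rating")), by = "visitor")
schedule$rating.diff <- schedule$home.rating - schedule$visitor.rating
```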

I’m not going to post forecasts for all 256 games—at least not now, with three more preseason games to learn from and, hopefully, lots of votes yet to be cast. To give you a feel for how the model is working, though, I’ll show a couple of cuts on those very preliminary results.

The first is a set of forecasts for all Week 1 games. The labels show Visitor-Home, and the net scores are still home minus visitor, as in the model. So, a predicted net score greater than 0 means the home team (second in the paired label) is expected to win, while a predicted net score below 0 means the visitor (first in the paired label) is expected to win. The lines around the point predictions represent 90-percent confidence intervals, giving us a partial sense of the uncertainty around these estimates.
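Generating those point predictions and intervals is a single call to predict() on the model fit to the 2013 data. The sketch below assumes the merged schedule built above carries a week column and uses a 90-percent interval around the expected net score.

```r
# Week 1 forecasts from the 2013 model and the 2014 rating differentials.
week1 <- subset(schedule, week == 1)
forecasts <- predict(fit, newdata = week1, interval = "confidence", level = 0.90)
cbind(week1[, c("visitor", "home")], round(forecasts, 1))  # point forecast plus lower and upper bounds
```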

Week 1 Game Forecasts from Preseason Wiki Survey Results on 10 August 2014

Of course, as a fan of a particular team, I’m most interested in what the model says about how my guys are going to do this season. The next plot shows predictions for all 16 of Baltimore’s games. Unfortunately, the plotting command orders the data by label, and my R skills and available time aren’t sufficient to reorder them by week, but the information is all there. In this plot, the dots for the point predictions are colored red if they predict a Baltimore win and black for an expected loss. The good news for Ravens fans is that this plot suggests an 11-5 season, good enough for a playoff berth. The bad news is that an 8-8 season also lies within the 90-percent confidence intervals, so the playoffs don’t look like a lock.
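For anyone who wants to attempt the reordering I punted on, here is one hedged way to do it: make the game label a factor whose levels follow the schedule’s week order before plotting. The column names and team strings are assumptions.

```r
# Order Baltimore's games by week rather than alphabetically by label.
bal <- subset(schedule, home == "Baltimore Ravens" | visitor == "Baltimore Ravens")
bal$label <- paste(bal$visitor, bal$home, sep = "-")
bal$label <- factor(bal$label, levels = bal$label[order(bal$week)])
# Plotting functions that respect factor levels will now run Week 1 through Week 17.
```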

2014 Game-Level Forecasts for the Baltimore Ravens from 10 August 2014 Wiki Survey Scores

So that’s where the toy project stands now. My intuition tells me that the predicted net scores aren’t as well calibrated as I’d like, and the estimated confidence intervals surely understate the true uncertainty around each game (“On any given Sunday…”). Still, I think this exercise demonstrates the potential of this forecasting process. If I were a betting man, I wouldn’t lay money on these estimates. As an applied forecaster, though, I can imagine using these predictions as priors in a more elaborate process that incorporates additional and, ideally, more dynamic information about each team and game situation over the course of the season. Maybe my doppelganger can take that up while I get back to my day job…

Postscript. After I published this post, Jeff Fogle suggested via Twitter that I compare the Week 1 forecasts to the current betting lines for those games. The plot below shows the median point spread from an NFL odds-aggregating site as blue dots on top of the statistical forecasts already shown above. As you can see, the statistical forecasts are tracking the betting lines pretty closely. There’s only one game—Carolina at Tampa Bay—where the predictions from the two series fall on different sides of the win/loss line, and it’s a game the statistical model essentially sees as a toss-up. It’s also reassuring that there isn’t a consistent direction to the differences, so the statistical process doesn’t seem to be biased in some fundamental way.

Week 1 Game-Level Forecasts Compared to Median Point Spread from Betting Sites on 11 August 2014

Forecasting Round-Up No. 7

1. I got excited when I heard on Twitter yesterday about a machine-learning process that turns out to be very good at predicting U.S. Supreme Court decisions (blog post here, paper here). I got even more excited when I saw that the guys who built that process have also been running a play-money prediction market on the same problem for the past several years, and that the most accurate forecasters in that market have done even better than that model (here). It sounds like they are now thinking about more rigorous ways to compare and cross-pollinate the two. That’s part of what we’re trying to do with the Early Warning Project, so I hope that they do and we can learn from their findings.

2. A paper in the current issue of the Journal of Personality and Social Psychology (here, but paywalled; hat-tip to James Igoe Walsh) adds to the growing pile of evidence on the forecasting power of crowds, with an interesting additional finding on the willingness of others to trust and use those forecasts:

We introduce the select-crowd strategy, which ranks judges based on a cue to ability (e.g., the accuracy of several recent judgments) and averages the opinions of the top judges, such as the top 5. Through both simulation and an analysis of 90 archival data sets, we show that select crowds of 5 knowledgeable judges yield very accurate judgments across a wide range of possible settings—the strategy is both accurate and robust. Following this, we examine how people prefer to use information from a crowd. Previous research suggests that people are distrustful of crowds and of mechanical processes such as averaging. We show in 3 experiments that, as expected, people are drawn to experts and dislike crowd averages—but, critically, they view the select-crowd strategy favorably and are willing to use it. The select-crowd strategy is thus accurate, robust, and appealing as a mechanism for helping individuals tap collective wisdom.
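The core of that select-crowd strategy is simple enough to sketch in a few lines of R. The objects here are hypothetical: a matrix of judges’ absolute errors on recent questions and a vector of their answers to the question at hand.

```r
# Select-crowd sketch: rank judges by recent accuracy, then average the top 5.
skill.rank   <- rank(rowMeans(past.error))   # lower average error = better rank
select.crowd <- which(skill.rank <= 5)       # keep the five most accurate judges
mean(current.judgment[select.crowd])         # their averaged opinion is the forecast
```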

3. Adam Elkus recently spotlighted two interesting papers involving agent-based modeling (ABM) and forecasting.

  • The first (here) “presents a set of guidelines, imported from the field of forecasting, that can help social simulation and, more specifically, agent-based modelling practitioners to improve the predictive performance and the robustness of their models.”
  • The second (here), from 2009 but new to me, describes an experiment in deriving an agent-based model of political conflict from event data. The results were pretty good; a model built from event data and then tweaked by a subject-matter expert was as accurate as one built entirely by hand, and the hybrid model took much less time to construct.

4. Nautilus ran a great piece on Lewis Fry Richardson, a pioneer in weather forecasting who also applied his considerable intellect to predicting violent conflict. As the story notes,

At the turn of the last century, the notion that the laws of physics could be used to predict weather was a tantalizing new idea. The general idea—model the current state of the weather, then apply the laws of physics to calculate its future state—had been described by the pioneering Norwegian meteorologist Vilhelm Bjerknes. In principle, Bjerknes held, good data could be plugged into equations that described changes in air pressure, temperature, density, humidity, and wind velocity. In practice, however, the turbulence of the atmosphere made the relationships among these variables so shifty and complicated that the relevant equations could not be solved. The mathematics required to produce even an initial description of the atmosphere over a region (what Bjerknes called the “diagnostic” step) were massively difficult.

Richardson helped solve that problem in weather forecasting by breaking the task into many more manageable parts—atmospheric cells, in this case—and thinking carefully about how those parts fit together. I wonder if we will see similar advances in forecasts of social behavior in the next 100 years. I doubt it, but the trajectory of weather prediction over the past century should remind us to remain open to the possibility.

5. Last, a bit of fun: Please help Trey Causey and me forecast the relative strength of this year’s NFL teams by voting in this pairwise wiki survey! I did this exercise last year, and the results weren’t bad, even though the crowd was pretty small and probably not especially expert. Let’s see what happens if more people participate, shall we?

Uncertainty About How Best to Convey Uncertainty

NPR News ran a series of stories this week under the header Risk and Reason, on “how well we understand and act on probabilities.” I thought the series nicely represented how uncertain we are about how best to convey forecasts to people who might want to use them. There really is no clear standard here, even though it is clear that the choices we make in presenting forecasts and other statistics on risks to their intended consumers strongly shape what they hear.

This uncertainty about how best to convey forecasts was on full display in the piece on how CIA analysts convey predictive assessments (here). Ken Pollack, a former analyst who now teaches intelligence analysis, tells NPR that, at CIA, “There was a real injunction that no one should ever use numbers to explain probability.” Asked why, he says that,

Assigning numerical probability suggests a much greater degree of certainty than you ever want to convey to a policymaker. What we are doing is inherently difficult. Some might even say it’s impossible. We’re trying to predict the future. And, you know, saying to someone that there’s a 67 percent chance that this is going to happen, that sounds really precise. And that makes it seem like we really know what’s going to happen. And the truth is that we really don’t.

In that same segment, though, Dartmouth professor Jeff Friedman, who studies decision-making about national security issues, says we should provide a numeric point estimate of an event’s likelihood, along with some information about our confidence in that estimate and how malleable it may be. (See this paper by Friedman and Richard Zeckhauser for a fuller treatment of this argument.) The U.S. Food and Drug Administration apparently agrees; according to the same NPR story, the FDA “prefers numbers and urges drug companies to give numerical values for risk—and to avoid using vague terms such as ‘rare, infrequent and frequent.'”

Instead of numbers, Pollack advocates for using words: “Almost certainly or highly likely or likely or very unlikely,” he tells NPR. As noted by one of the other stories in the series (here), however—on the use of probabilities in medical decision-making—words and phrases are ambiguous, too, and that ambiguity can be just as problematic.

Doctors, including Leigh Simmons, typically prefer words. Simmons is an internist and part of a group practice that provides primary care at Mass General. “As doctors we tend to often use words like, ‘very small risk,’ ‘very unlikely,’ ‘very rare,’ ‘very likely,’ ‘high risk,’ ” she says.

But those words can be unclear to a patient.

“People may hear ‘small risk,’ and what they hear is very different from what I’ve got in my mind,” she says. “Or what’s a very small risk to me, it’s a very big deal to you if it’s happened to a family member.”

Intelligence analysts have sometimes tried to remove that ambiguity by standardizing the language they use to convey likelihoods, most famously in Sherman Kent’s “Words of Estimative Probability.” It’s not clear to me, though, how effective this approach is. For one thing, consumers are often lazy about trying to understand just what information they’re being given, and templates like Kent’s don’t automatically solve that problem. This laziness came across most clearly in NPR’s Risk and Reason segment on meteorology (here). Many of us routinely consume probabilistic forecasts of rainfall and make decisions in response to them, but it turns out that few of us understand what those forecasts actually mean. With Kent’s words of estimative probability, I suspect that many readers of the products that use them haven’t memorized the table that spells out their meaning and don’t bother to consult it when they come across those phrases, even when it’s reproduced in the same document.

Equally important, a template that works well for some situations won’t necessarily work for all. I’m thinking in particular of forecasts on the kinds of low-probability, high-impact events that I usually analyze and that are essential to the CIA’s work, too. Here, what look like small differences in probability can sometimes be very meaningful. For example, imagine that it’s August 2001 and you have three different assessments of the risk of a major terrorist attack on U.S. soil in the next few months. One pegs the risk at 1 in 1,000; another at 1 in 100; and another at 1 in 10. Using Kent’s table, all three of those assessments would get translated into a statement that the event is “almost certainly not” going to happen, but I imagine that most U.S. decision-makers would have felt very differently about risks of 0.1%, 1%, and 10% with a threat of that kind.

There are lots of rare but important events that inhabit this corner of the probability space: nuclear accidents, extreme weather events, medical treatments, and mass atrocities, to name a few. We could create a separate lexicon for assessments in these areas, as the European Medicines Agency has done for adverse reactions to medical therapies (here, via NPR). I worry, though, that we ask too much of consumers of these and other forecasts if we expect them to remember multiple lexicons and to correctly code-switch between them. We also know that the relevant scale will differ across audiences, even on the same topic. For example, an individual patient considering a medical treatment might not care much about the difference between a mortality risk of 1 in 1,000 and 1 in 10,000, but a drug company and the regulators charged with overseeing them hopefully do.

If there’s a general lesson here, it’s that producers of probabilistic forecasts should think carefully about how best to convey their estimates to specific audiences. In practice, that means thinking about the nature of the decision processes those forecasts are meant to inform and, if possible, trying different approaches and checking to find out how each is being understood. Ideally, consumers of those forecasts should also take time to try to educate themselves on what they’re getting. I’m not optimistic that many will do that, but we should at least make it as easy as possible for them to do so.

In Applied Forecasting, Keep It Simple

One of the lessons I think I’ve learned from the nearly 15 years I’ve spent developing statistical models to forecast rare political events is: keep it simple unless and until you’re compelled to do otherwise.

The fact that the events we want to forecast emerge from extremely complex systems doesn’t mean that the models we build to forecast them need to be extremely complex as well. In a sense, the unintelligible complexity of the causal processes relieves us from the imperative to follow that path. We know our models can’t even begin to capture the true data-generating process. So, we can and usually should think instead about looking for measures that capture relevant concepts in a coarse way and then use simple model forms to combine those measures.

A few experiences and readings have especially shaped my thinking on this issue.

  • When I worked on the Political Instability Task Force (PITF), my colleagues and I found that a logistic regression model with just four variables did a pretty good job assessing relative risks of a few forms of major political crisis in countries worldwide (see here, or ungated here). In fact, one of the four variables in that model—an indicator that four or more bordering countries have ongoing major armed conflicts—has almost no variance, so it’s effectively a three-variable model. We tried adding a lot of other things that were suggested by a lot of smart people, but none of them really improved the model’s predictive power. (There were also a lot of things we couldn’t even try because the requisite data don’t exist, but that’s a different story.)
  • Toward the end of my time with PITF, we ran a “tournament of methods” to compare the predictive power of several statistical techniques that varied in their complexity, from logistic regression to Bayesian hierarchical models with spatial measures (see here for the write-up). We found that the more complex approaches usually didn’t outperform the simpler ones, and when they did, it wasn’t by much. What mattered most for predictive accuracy was finding the inputs with the most forecasting power. Once we had those, the functional form and complexity of the model didn’t make much difference.
  • As Andreas Graefe describes (here), models that assign equal weights to all predictors often forecast at least as accurately as multiple regression models that estimate weights from historical data. “Such findings have led researchers to conclude that the weighting of variables is secondary for the accuracy of forecasts,” Graefe writes. “Once the relevant variables are included and their directional impact on the criterion is specified, the magnitudes of effects are not very important.”
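The simulation below is not a replication of Graefe’s analysis, just a toy illustration of the equal-weights idea: with modest training data, unit weights on standardized predictors often forecast about as well out of sample as weights estimated by regression.

```r
# Toy illustration of equal weights vs. estimated weights (simulated data).
set.seed(42)
n  <- 60
x1 <- rnorm(n); x2 <- 0.5 * x1 + rnorm(n); x3 <- rnorm(n)
y  <- x1 + x2 + x3 + rnorm(n, sd = 2)
dat <- data.frame(y, x1, x2, x3)
train <- 1:20; test <- 21:60

ols <- lm(y ~ x1 + x2 + x3, data = dat[train, ])              # weights estimated from data
pred.ols <- predict(ols, newdata = dat[test, ])
pred.ew  <- rowMeans(scale(dat[test, c("x1", "x2", "x3")]))   # equal weights on standardized inputs

cor(pred.ols, dat$y[test]); cor(pred.ew, dat$y[test])         # out-of-sample fit is usually similar
```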

Of course, there will be some situations in which complexity adds value, so it’s worth exploring those ideas when we have a theoretical rationale and the coding skills, data, and time needed to pursue them. In general, though, I am convinced that we should always try simpler forms first and only abandon them if and when we discover that more complex forms significantly increase forecasting power.

Importantly, the evidence for that judgment should come from out-of-sample validation—ideally, from forecasts made about events that hadn’t yet happened. Models with more variables and more complex forms will often score better than simpler ones when applied to the data from which they were derived, but this will usually turn out to be a result of overfitting. If the more complex approach isn’t significantly better at real-time forecasting, it should probably be set aside until it is.
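In practice, that check is usually a cross-validation loop. The sketch below is generic rather than tied to any model discussed here; it assumes a data frame dat with an outcome y and a few predictors, and compares a simple specification with a more elaborate one on held-out folds.

```r
# Generic k-fold cross-validation comparing a simple and a more complex model.
k <- 5
fold <- sample(rep(1:k, length.out = nrow(dat)))
err <- sapply(1:k, function(i) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  m.simple  <- lm(y ~ x1, data = train)
  m.complex <- lm(y ~ poly(x1, 4) + x2 * x3, data = train)
  c(simple  = mean((test$y - predict(m.simple,  newdata = test))^2),
    complex = mean((test$y - predict(m.complex, newdata = test))^2))
})
rowMeans(err)  # adopt the complex form only if its out-of-sample error is clearly lower
```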

Oh, and a corollary: if you have to choose between a) building more complex models, or even just applying lots of techniques to the same data, and b) testing other theoretically relevant variables for predictive power, do (b).

The Ethics of Political Science in Practice

As citizens and as engaged intellectuals, we all have the right—indeed, an obligation—to make moral judgments and act based on those convictions. As political scientists, however, we have a unique set of potential contributions and constraints. Political scientists do not typically have anything of distinctive value to add to a chorus of moral condemnation or declarations of normative solidarity. What we do have, hopefully, is the methodological training, empirical knowledge and comparative insight to offer informed assessments about alternative courses of action on contentious issues. Our primary ethical commitment as political scientists, therefore, must be to get the theory and the empirical evidence right, and to clearly communicate those findings to relevant audiences—however unpalatable or inconclusive they might be.

That’s a manifesto of sorts, nested in a great post by Marc Lynch at the Monkey Cage. Marc’s post focuses on analysis of the Middle East, but everything he writes generalizes to the whole discipline.

I’ve written a couple of posts on this theme, too:

  • “This Is Not a Drill,” on the challenges of doing what Marc proposes in the midst of fast-moving and politically charged events with weighty consequences; and
  • “Advocascience,” on the ways that researchers’ political and moral commitments shape our analyses, sometimes but not always intentionally.

Putting all of those pieces together, I’d say that I wholeheartedly agree with Marc in principle, but I also believe this is extremely difficult to do in practice. We can—and, I think, should—aspire to this posture, but we can never quite achieve it.

That applies to forecasting, too, by the way. Coincidentally, I saw this great bit this morning in the Letter from the Editors for a new special issue of The Appendix, on “futures of the past”:

Prediction is a political act. Imagined futures can be powerful tools for social change, but they can also reproduce the injustices of the present.

Concern about this possibility played a role in my decision to leave my old job, helping to produce forecasts of political instability around the world for private consumption by the U.S. government. It is also part of what attracts me to my current work on a public early-warning system for mass atrocities. By making the same forecasts available to all comers, I hope that we can mitigate that downside risk in an area where the immorality of the acts being considered is unambiguous.

As a social scientist, though, I also understand that we’ll never know for sure what good or ill effects our individual and collective efforts had. We won’t know because we can’t observe the “control” worlds we would need to confidently establish cause and effect, and we won’t know because the world we seek to understand keeps changing, sometimes even in response to our own actions. This is the paradox at the core of applied, empirical social science, and it is inescapable.

Beware the Confident Counterfactual

Did you anticipate the Syrian uprising that began in 2011? What about the Tunisian, Egyptian, and Libyan uprisings that preceded and arguably shaped it? Did you anticipate that Assad would survive the first three years of civil war there, or that Iraq’s civil war would wax again as intensely as it has in the past few days?

All of these events or outcomes were difficult forecasting problems before they occurred, and many observers have been frank about their own surprise at many of them. At the same time, many of those same observers speak with confidence about the causes of those events. The invasion of Iraq in 2003 surely is or is not the cause of the now-raging civil war in that country. The absence of direct US or NATO military intervention in Syria is or is not to blame for continuation of that country’s civil war and the mass atrocities it has brought—and, by extension, the resurgence of civil war in Iraq.

But here’s the thing: strong causal claims require some confidence about how history would have unfolded in the absence of the cause of interest, and those counterfactual histories are no easier to get right than observed history was to anticipate.

Like all of the most interesting questions, what causality means and how we might demonstrate it will forever be matters for debate—see here on Daniel Little’s blog for an overview of that debate’s recent state—but most conceptions revolve around some idea of necessity. When we say X caused Y, we usually mean that had X not occurred, Y wouldn’t have happened, either. Subtler or less stringent versions might center on salience instead of necessity and insert a “probably” into the final phrase of the previous sentence, but the core idea is the same.

In nonexperimental social science, this logic implicitly obliges us to consider the various ways history might have unfolded in response to X’ rather than X. In a sense, then, both prediction and explanation are forecasting problems. They require us to imagine states of the world we have not seen and to connect them in plausible ways to ones we have. If anything, the counterfactual predictions required for explanation are more frustrating epistemological problems than the true forecasts, because we will never get to see the outcome(s) against which we could assess the accuracy of our guesses.

As Robert Jervis pointed out in his contribution to a 1996 edited volume on counterfactual thought experiments in world politics, counterfactuals are (or should be) especially hard to construct—and thus causal claims especially hard to make—when the causal processes of interest involve systems. For Jervis,

A system exists when elements or units are interconnected so that the system has emergent properties—i.e., its characteristics and behavior cannot be inferred from the characteristics and behavior of the units taken individually—and when changes in one unit or the relationship between any two of them produce ramifying alterations in other units or relationships.

As Jervis notes,

A great deal of thinking about causation…is based on comparing two situations that are the same in all ways except one. Any differences in the outcome, whether actual or expected…can be attributed to the difference in the state of the one element…

Under many circumstances, this method is powerful and appropriate. But it runs into serious problems when we are dealing with systems because other things simply cannot be held constant: as Garrett Hardin nicely puts it, in a system, ‘we can never do merely one thing.’

Jervis sketches a few thought experiments to drive this point home. He has a nice one about the effects of external interventions on civil wars that is topical here, but I think his New York traffic example is more resonant:

In everyday thought experiments we ask what would have happened if one element in our world had been different. Living in New York, I often hear people speculate that traffic would be unbearable (as opposed to merely terrible) had Robert Moses not built his highways, bridges, and tunnels. But to try to estimate what things would have been like, we cannot merely subtract these structures from today’s Manhattan landscape. The traffic patterns, the location of businesses and residences, and the number of private automobiles that are now on the streets are in significant measure the product of Moses’s road network. Had it not been built, or had it been built differently, many other things would have been different. Traffic might now be worse, but it is also possible that it would have been better because a more efficient public transportation system would have been developed or because the city would not have grown so large and prosperous without the highways.

Substitute “invade Iraq” or “fail to invade Syria” for Moses’s bridges and tunnels, and I hope you see what I mean.

In the end, it’s much harder to get beyond banal observations about influences to strong claims about causality than our story-telling minds and the popular media that cater to them would like. Of course the invasion of Iraq in 2003 or the absence of Western military intervention in Syria have shaped the histories that followed. But what would have happened in their absence—and, by implication, what would happen now if, for example, the US re-inserted its armed forces into Iraq or attempted to topple Assad? Those questions are far tougher to answer, and we should beware of anyone who speaks with great confidence about their answers. If you’re a social scientist who isn’t comfortable making and confident in the accuracy of your predictions, you shouldn’t be comfortable making and confident in the validity of your causal claims, either.

Conflict Events, Coup Forecasts, and Data Prospecting

Last week, for an upcoming post to the interim blog of the atrocities early-warning project I direct, I got to digging around in ACLED’s conflict event data for the first time. Once I had the data processed, I started wondering if they might help improve forecasts of coup attempts, too. That train of thought led to the preliminary results I’ll describe here, and to a general reminder of the often-frustrating nature of applied statistical forecasting.

ACLED is the Armed Conflict Location & Event Data Project, a U.S. Department of Defense–funded, multi-year endeavor to capture information about instances of political violence in sub-Saharan Africa from 1997 to the present. ACLED’s coders scan an array of print and broadcast sources, identify relevant events from them, and then record those events’ date, location, and form (battle, violence against civilians, or riots/protests); the types of actors involved; whether or not territory changed hands; and the number of fatalities that occurred. Researchers can download all of the project’s data in various formats and structures from the Data page, one of the better ones I’ve seen in political science.

I came to ACLED last week because I wanted to see if violence against civilians in Somalia had waxed, waned, or held steady in recent months. Trying to answer that question with their data meant:

  • Downloading two Excel spreadsheets, Version 4 of the data for 1997-2013 and the Realtime Data file covering (so far) the first five months of this year;
  • Processing and merging those two files, which took a little work because my software had trouble reading the original spreadsheets and the labels and formats differed a bit across them; and
  • Subsetting and summarizing the data on violence against civilians in Somalia, which also took some care because there was an extra space at the end of the relevant label in some of the records.

Once I had done these things, it was easy to generalize it to the entire data set, producing tables with monthly counts of fatalities and events by type for all African countries over the past 13 years. And, once I had those country-month counts of conflict events, it was easy to imagine using them to try to help forecast coup attempts in the world’s most coup-prone region. Other things being equal, variations across countries and over time in the frequency of conflict events might tell us a little more about the state of politics in those countries, and therefore where and when coup attempts are more likely to happen.
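Roughly, the processing described above boils down to something like the sketch below. The column names are approximations of ACLED’s (and recall the stray trailing spaces), so treat this as a guide rather than working code for the actual files.

```r
# Build country-month counts of events by type and monthly fatality totals.
acled$EVENT_TYPE <- trimws(acled$EVENT_TYPE)                        # drop stray trailing spaces
acled$month <- format(as.Date(acled$EVENT_DATE, "%d/%m/%Y"), "%Y-%m")

event.counts <- as.data.frame(table(acled$COUNTRY, acled$month, acled$EVENT_TYPE))
names(event.counts) <- c("country", "month", "event.type", "events")

fatalities <- aggregate(FATALITIES ~ COUNTRY + month, data = acled, FUN = sum)
```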

Well, in this case, it turns out they don’t tell us much more. The plot below shows ROC curves and the areas under those curves for the out-of-sample predictions from a five-fold cross-validation exercise involving a few country-month models of coup attempts. The Base Model includes: national political regime type (the categorization scheme from PITF’s global instability model applied to Polity 3d, the spell-file version); time since last change in Polity score (in days, logged); infant mortality rate (relative to the annual global median, logged); and an indicator for any coup attempts in the previous 24 months (yes/no). The three other models add logged sums of counts of ACLED events by type—battles, violence against civilians, or riots/protests—in the same country over the previous three, six, or 12 months, respectively. These are all logistic regression models, and the dependent variable is a binary one indicating whether or not any coup attempts (successful or failed) occurred in that country during that month, according to Powell and Thyne.
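Here is roughly what that exercise looks like in R, with made-up variable names standing in for the measures just described and the pROC package as one way to compute AUC. The actual scripts are the ones linked on GitHub at the end of this post.

```r
# Sketch of the five-fold cross-validation comparison; variable names are stand-ins.
library(pROC)
set.seed(1)
folds <- sample(rep(1:5, length.out = nrow(coups)))

cv.preds <- function(formula) {
  p <- rep(NA, nrow(coups))
  for (i in 1:5) {
    m <- glm(formula, data = coups[folds != i, ], family = binomial)
    p[folds == i] <- predict(m, newdata = coups[folds == i, ], type = "response")
  }
  p
}

base       <- cv.preds(coup.attempt ~ regime.type + log.polity.duration +
                         log.infant.mortality + recent.coup)
with.acled <- cv.preds(coup.attempt ~ regime.type + log.polity.duration +
                         log.infant.mortality + recent.coup + log1p(battles.6mo))

auc(roc(coups$coup.attempt, base)); auc(roc(coups$coup.attempt, with.acled))
```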

ROC Curves and AUC Scores from Five-Fold Cross-Validation of Coup Models Without and With ACLED Event Counts

As the chart shows, adding the conflict event counts to the base model seems to buy us a smidgen more discriminatory power, but not enough to have confidence that they would routinely lead to more accurate forecasts. Intriguingly, the crossing of the ROC curves suggests that the base model, which emphasizes structural conditions, is actually a little better at identifying the most coup-prone countries. The addition of conflict event counts to the model leads to some under-prediction of coups in that high-risk set, but the balance tips the other way in countries with less structural vulnerability. In the aggregate, though, there is virtually no difference in discriminatory power between the base model and the ones that add the conflict event counts.

There are, of course, many other ways to group and slice ACLED’s data, but the rarity of coups leads me to believe that narrower cuts or alternative operationalizations aren’t likely to produce stronger predictive signals. In Africa since 1997, there are only 36 country-months with coup attempts, according to Powell and Thyne. When the events are this rare and complex and the examples this few, there’s really not much point in going beyond the most direct measures. Under these circumstances, we’re unlikely to discover finer patterns, and if we do, we probably shouldn’t have much confidence in them. There are also other models and techniques to try, but I’m dubious for the same reasons. (FWIW, I did try Random Forests and got virtually identical accuracy.)

So those are the preliminary results from this specific exercise. (The R scripts I used are on Github, here). I think those results are interesting in their own right, but the process involved in getting to them is also a great example of the often-frustrating nature of applied statistical forecasting. I spent a few hours each day for three days straight getting from the thought of exploring ACLED to the results described here. Nearly all of that time was spent processing data; only the last half-hour or so involved any modeling. As is often the case, a lot of that data-processing time was really just me staring at my monitor trying to think of another way to solve some problem I’d already tried and failed to solve.

In my experience, that kind of null result is where nearly all statistical forecasting ideas end. Even when you’re lucky enough to have the data to pursue them, few of your ideas pan out. But panning is the right metaphor, I think. Most of the work is repetitive and frustrating, but every so often you catch a nice nugget. Those nuggets tempt you to keep looking for more, and once in a great while, they can make you rich.

Ripple Effects from Thailand’s Coup

Thailand just had another coup, its first since 2006 but its twelfth since 1932. Here are a few things statistical analysis tells us about how that coup is likely to reverberate through Thailand’s economy and politics for the next few years.

1. Economic growth will probably suffer a bit more. Thailand’s economy was already struggling in 2014, thanks in part to the political instability to which the military leadership was reacting. Still, a statistical analysis I did a few years ago indicates that the coup itself will probably impose yet more drag on the economy. When we compare annual GDP growth rates from countries that suffered coups to similarly susceptible ones that didn’t, we see an average difference of about 2 percentage points in the year of the coup and another 1 percentage point the year after. (See this FiveThirtyEight post for a nice plot and discussion of those results.) Thailand might find its way to the “good” side of the distribution underlying those averages, but the central tendency suggests an additional knock on the country’s economy.

2. The risk of yet another coup will remain elevated for several years. The “coup trap” is real. Countries that have recently suffered successful or failed coup attempts are more likely to get hit again than ones that haven’t. This increase in risk seems to persist for several years, so Thailand will probably stick toward the top of the global watch list for these events until at least 2019.

3. Thailand’s risk of state-led mass killing has nearly tripled…but remains modest. The risk and occurrence of coups and the character of a country’s national political regime feature prominently in the multimodel ensemble we’re using in our atrocities early-warning project to assess risks of onsets of state-led mass killing. When I recently updated those assessments using data from year-end 2013—coming soon to a blog near you!—Thailand remained toward the bottom of the global distribution: 100th of 162 countries, with a predicted probability of just 0.3%. If I alter the inputs to that ensemble to capture the occurrence of this week’s coup and its effect on Thailand’s regime type, the predicted probability jumps to about 0.8%.

That’s a big change in relative risk, but it’s not enough of a change in absolute risk to push the country into the end of the global distribution where the vast majority of these events occur. In the latest assessments, a risk of 0.8% would have placed Thailand about 50th in the world, still essentially indistinguishable from the many other countries in that long, thin tail. Even with changes in these important risk factors and an ongoing insurgency in its southern provinces, Thailand remains in the vast bloc of countries where state-led mass killing is extremely unlikely, thanks (statistically speaking) to its relative wealth, the strength of its connection to the global economy, and the absence of certain markers of atrocities-prone regimes.

4. Democracy will probably be restored within the next few years… As Henk Goemans and Nikolay Marinov show in a paper published last year in the British Journal of Political Science, since the end of the Cold War, most coups have been followed within a few years by competitive elections. The pattern they observe is even stronger in countries that have at least seven years of democratic experience and have held at least two elections, as Thailand does and has. In a paper forthcoming in Foreign Policy Analysis that uses a different measure of coups, Jonathan Powell and Clayton Thyne see that same broad pattern. After the 2006 coup, it took Thailand a little over a year to get back to competitive elections for a civilian government under a new constitution. If anything, I would expect this junta to move a little faster, and I would be very surprised if the same junta was still ruling in 2016.

5. …but it could wind up right back here again after that. As implied by nos. 1 and 2 above, however, the resumption of democracy wouldn’t mean that Thailand won’t repeat the cycle again. Both statistical and game-theoretic models indicate that prospects for yet another democratic breakdown will stay relatively high as long as Thai politics remains sharply polarized. My knowledge of Thailand is shallow, but the people I read or follow who know the country much better skew pessimistic on the prospects for this polarization ending soon. From afar, I wonder if it’s ultimately a matter of generational change and suspect that Thailand will finally switch to a stable and less contentious equilibrium when today’s conservative leaders start retiring from their jobs in the military and bureaucracy and age out of street politics.

Military Coup in Thailand

This morning here but this afternoon in Thailand, the country’s military leadership sealed the deal on a coup d’etat when it announced via national television that it was taking control of government.

The declaration of martial law that came two days earlier didn’t quite qualify as a coup because it didn’t involve a seizure of power. Most academic definitions of coups involve (1) the use or threat of force (2) by political insiders, that is, people inside government or state security forces (3) to seize national political power. Some definitions also specify that the putschists’ actions must be illegal or extra-constitutional. The declaration of martial law certainly involved the use or threat of force by political insiders, but it did not entail a direct grab for power and technically was not even illegal.

Today’s announcement checks those last boxes. Frankly, I’m a bit surprised by this turn of events, but not shocked. In my statistical assessments of coup risk for 2014, Thailand ranked 10th, clearly putting it among the highest-risk countries in the world. In December, though, I judged from a distance that the country’s military leadership probably didn’t want to take ownership of this situation unless things got significantly worse:

The big question now is whether or not the military leadership will respond as desired [by anti-government forces angling for a coup]. They would be very likely to do so if they coveted power for themselves, but I think it’s pretty clear from their actions that many of them don’t. I suspect that’s partly because they saw after 2006 that seizing power didn’t really fix anything and carried all kinds of additional economic and reputational costs. If that’s right, then the military will only seize power again if the situation degenerates enough to make the costs of inaction even worse—say, into sustained fighting between rival factions, like we see in Bangladesh right now.

I guess the growing concerns about an impending civil war and economic recession were finally enough to tip military leaders’ thinking in favor of action. Here’s hoping the final forecast I offered in that December post comes true:

Whatever happens this time around, though, the good news is that within a decade or so, Thai politics will probably stabilize into a new normal in which the military no longer acts directly in politics and parts of what’s now Pheu Thai and its coalition compete against each other and the remnants of today’s conservative forces for power through the ballot box.

Galton’s Experiment Revisited

This is another cross-post from the blog of the Good Judgment Project.

One of my cousins, Steve Ulfelder, writes good mystery novels. He left a salaried writer’s job 13 years ago to freelance and make time to pen those books. In March, he posted this announcement on Facebook:

CONTEST! When I began freelancing, I decided to track the movies I saw to remind myself that this was a nice bennie you can’t have when you’re an employee (I like to see early-afternoon matinees in near-empty theaters). I don’t review them or anything; I simply keep a Word file with dates and titles.

Here’s the contest: How many movies have I watched in the theater since January 1, 2001? Type your answer as a comment. Entries close at 8pm tonight, east coast time. Closest guess gets a WOLVERINE BROS. T-shirt and a signed copy of the Conway Sax novel of your choice. The eminently trustworthy Martha Ruch Ulfelder is holding a slip of paper with the correct answer.

I read that post and thought: Now, that’s my bag. I haven’t seen Steve in a while and didn’t have a clear idea of how many movies he’s seen in the past 13 years, but I do know about Francis Galton and that ox at the county fair. Instead of just hazarding a guess of my own, I would give myself a serious shot at outwitting Steve’s Facebook crowd by averaging their guesses.

After a handful of Steve’s friends had submitted answers, I posted the average of them as a comment of my own, then updated it periodically as more guesses came in. I had to leave the house not long before the contest was set to close, so I couldn’t include the last few entrants in my final answer. Still, I had about 40 guesses in my tally at the time and was feeling pretty good about my chances of winning that t-shirt and book.

In the end, 45 entries got posted before Steve’s 8 PM deadline, and my unweighted average wasn’t even close. The histogram below shows the distribution of the crowd’s guesses and the actual answer. Most people guessed fewer than 300 movies, but a couple of extreme values on the high side pulled the average up to 346. Meanwhile, the correct answer was 607, nearly one standard deviation (286) above that mean. I hadn’t necessarily expected to win, but I was surprised to see that 12 of the 45 guesses—including the winner at 600—landed closer to the truth than the average did.

Histogram of the Crowd’s Guesses and the Actual Answer in the Movie-Count Contest

I read the results of my impromptu experiment as a reminder that crowds are often smart, but they aren’t magically clairvoyant. Retellings of Galton’s experiment sometimes make it seem like even pools of poorly informed guessers will automatically produce an accurate estimate, but, apparently, that’s not true.

As I thought about how I might have done better, I got to wondering if there was something about Galton’s crowd that made it particularly powerful for his task. Maybe we should expect a bunch of county fair–goers in early-twentieth-century England to be good at guessing the weight of farm animals. Still, the replication of Galton’s experiment under various conditions suggests that domain knowledge helps, but it isn’t essential. So maybe this was just an unusually hard problem. Steve has seen an average of nearly one movie in theaters each week for the past 13 years. In my experience, that’s pretty extreme, so even with the hint he dropped in his post about being a frequent moviegoer, it’s easy to see why the crowd would err on the low side. Or maybe this result was just a fluke, and if we could rerun the process with different or larger pools, the average would usually do much better.

Whatever the reason for this particular failure, though, the results of my experiment also got me thinking again about ways we might improve on the unweighted average as a method of gleaning intelligence from crowds. Unweighted averages are a reasonable strategy when we don’t have reliable information about variation in the quality of the individual guesses (see here), but that’s not always the case. For example, if Steve’s wife or kids had posted answers in this contest, it probably would have been wise to give their guesses more weight on the assumption that they knew better than acquaintances or distant relatives like me.
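In R terms, that difference is just mean() versus weighted.mean(), with the weights encoding whatever we know about each guesser; the numbers below are made up.

```r
# Unweighted vs. weighted crowd average (made-up guesses and weights).
guesses <- c(250, 300, 180, 600, 420)
weights <- c(1, 1, 1, 3, 3)        # e.g., heavier weights for household members
mean(guesses)                      # plain crowd average
weighted.mean(guesses, weights)    # leans toward the presumably better-informed
```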

Figuring out smarter ways to aggregate forecasts is also an area of active experimentation for the Good Judgment Project (GJP), and the results so far are encouraging. The project’s core strategy involves discovering who the most accurate forecasters are and leaning more heavily on them. I couldn’t do this in Steve’s single-shot contest, but GJP gets to see forecasters’ track records on large numbers of questions and has been using them to great effect. In the recently-ended Season 3, GJP’s “super forecasters” were grouped into teams and encouraged to collaborate, and this approach has worked quite well. In a paper published this spring, GJP has also shown that they can do well with nonlinear aggregations derived from a simple statistical model that adjusts for systematic bias in forecasters’ judgments. Team GJP’s bias-correction model beats not only the unweighted average but also a number of widely-used and more complex nonlinear algorithms.

Those are just a couple of the possibilities that are already being explored, and I’m sure people will keep coming up with new and occasionally better ones. After all, there’s a lot of money to be made and bragging rights to be claimed in those margins. In the meantime, we can use Steve’s movie-counting contest to remind ourselves that crowds aren’t automatically as clairvoyant as we might hope, so we should keep thinking about ways to do better.
