On Evaluating and Presenting Forecasts

On Saturday afternoon, at the International Studies Association’s annual conference in New Orleans, I’m slated to participate in a round-table discussion with Patrick Brandt, Kristian Gleditsch, and Håvard Hegre on “assessing forecasts of (rare) international events.” Our chair, Andy Halterman, gave us three questions he’d like to discuss:

  1. What are the quantitative assessments of forecasting accuracy that people should use when they publish forecasts?
  2. What’s the process that should be used to determine whether a gain in a scoring metric represents an actual improvement in the model?
  3. How should model uncertainty and past model performance be conveyed to government or non-academic users of forecasts?

As I did for a Friday panel on alternative academic careers (here), I thought I’d use the blog to organize my thoughts ahead of the event and share them with folks who are interested but can’t attend the round table. So, here goes:

When assessing predictive power, we use perfection as the default benchmark. We ask, “Was she right?” or “How close to the true value did he get?”

In fields where predictive accuracy is already very good, this approach seems reasonable. When the objects of the forecasts are rare international events, however, I think this is a mistake, or at least misleading. It implies that perfection is attainable, and that distance from perfection is what we care about. In fact, approximating perfection is not a realistic goal in many fields, and what we really care about in those situations is distance from the available alternatives. In other words, I think we should always assess accuracy in comparative terms, not absolute ones. So, the question becomes: “Compared to what?”

I can think of two situations in which we’d want to forecast international events, and the ways we assess and describe the accuracy of the results will differ across the two. First, there is basic research, where the goal is to develop and test theory. This is what most scholars are doing most of the time, and here the benchmark should be other relevant theories. We want to compare predictive power across nested models or models representing competing hypotheses to see which version does a better job anticipating real-world behavior—and, by implication, explaining it.

Then, of course, there is applied research, where the goal is to support or improve some decision process. Policy-making and investing are probably the two most common ones. Here, the benchmark should be the status quo. What we want to know is: “How much does the proposed forecasting process improve on the one(s) used now?” If the status quo is unclear, that already tells you something important about the state of forecasting in that field—namely, that it probably isn’t great. Even in that case, though, I think it’s still important to pick a benchmark that’s more realistic than perfection. Depending on the rarity of the event in question, that will usually mean either random guessing (for frequent events) or base rates (for rare ones).

How we communicate our findings on predictive power will also differ across basic and applied research, or at least I think it should. This has less to do with the goals of the work than it does with the audiences at which they’re usually aimed. When the audience is other scholars, I think it’s reasonable to expect them to understand the statistics and, so, to use those. For frequent events, Brier or logarithmic scores are often best, whereas for rare events I find that AUC scores are usually more informative, and I know a lot of people like to use F-1 scores in this context, too.
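For concreteness, here is a minimal sketch in R of how those scores can be computed, using simulated data; roc.area comes from the verification package:

set.seed(1)
obs <- rbinom(500, 1, 0.05)    # rare 0/1 outcomes
pred <- runif(500)             # stand-in forecast probabilities
brier <- mean((pred - obs)^2)  # 0-1 Brier score; lower is better
library(verification)
auc <- roc.area(obs, pred)$A   # AUC; 0.5 = random guessing, 1 = perfect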

In applied settings, though, we’re usually doing the work as a service to someone else who probably doesn’t know the mechanics of the relevant statistics and shouldn’t be expected to. In my experience, it’s a bad idea in these settings to try to educate your customer on the spot about things like Brier or AUC scores. They don’t need to know those statistics, and you’re liable to come across as aloof or even condescending if you presume to spend time teaching them. Instead, I’d recommend using the practical problem they’re asking you to help solve to frame your representation of your predictive power. Propose a realistic decision process—or, if you can, take the one they’re already using—and describe the results you’d get if you plugged your forecasts into it.

In applied contexts, people often will also want to know how your process performed on crucial cases and what events would have surprised it, so it’s good to be prepared to talk about those as well. These topics are germane to basic research, too, but crucial cases will be defined differently in the two contexts. For scholars, crucial cases are usually understood as most-likely and least-likely ones in relation to the theory being tested. For policy-makers and other applied audiences, the crucial cases are usually understood as the ones where surprise was or would have been costliest.

So that’s how I think about assessing and describing the accuracy of forecasts of the kinds of (rare) international events a lot of us study. Now, if you’ll indulge me, I’d like to close with a pedantic plea: Can we please reserve the terms “forecast” and “prediction” for statements about things that haven’t happened and not apply them to estimates we generate for cases with known outcomes?

This might seem like a petty concern, but it’s actually tied to the philosophy of knowledge that underpins science, or my understanding of it, anyway. Making predictions about things that haven’t already happened is a crucial part of the scientific method. To learn from prediction, we assume that a model’s forecasting power tells us something about its proximity to the “true” data-generating process. This assumption won’t always be right, but it’s proven pretty useful over the past few centuries, so I’m okay sticking with it for now. For obvious reasons, it’s much easier to make accurate “predictions” about cases with known outcomes than unknown ones, so the scientific value of the two endeavors is very different. In light of that fact, I think we should be as clear and honest with ourselves and our audiences as we can about which one we’re doing, and therefore how much we’re learning.

When we’re doing this stuff in practice, there are three basic modes: 1) in-sample fitting, 2) cross-validation (CV), and 3) forecasting. In-sample fitting is the least informative of the three and, in my opinion, really should only be used in exploratory analysis and should not be reported in finished work. It tells us a lot more about the method than the phenomenon of interest.

CV is usually more informative than in-sample fitting, but not always. Each iteration of CV on the same data set moves you a little closer to in-sample fitting, because you effectively train to the idiosyncrasies of your chosen test set. Using multiple iterations of CV may ameliorate this problem, but it doesn’t always eliminate it. And on topics where the available data have already been worked to death—as they have on many problems of interest to scholars of international affairs—cross-validation really isn’t much more informative than in-sample fitting unless you’ve got a brand-new data series you can throw at the task and are focused on learning about it.
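To make that contrast concrete, here is a small simulated example that compares in-sample AUC to five-fold cross-validated AUC for the same logistic regression. The data are made up; createFolds comes from the caret package:

library(caret)         # createFolds()
library(verification)  # roc.area()
set.seed(42)
dat <- data.frame(x = rnorm(1000))
dat$y <- rbinom(1000, 1, plogis(-3 + dat$x))  # rare-ish binary outcome
fit.all <- glm(y ~ x, data = dat, family = binomial)
auc.in <- roc.area(dat$y, fitted(fit.all))$A  # in-sample AUC
folds <- createFolds(dat$y, k = 5)            # held-out indices per fold
auc.cv <- mean(sapply(folds, function(test) {
  fit <- glm(y ~ x, data = dat[-test, ], family = binomial)
  p <- predict(fit, newdata = dat[test, ], type = "response")
  roc.area(dat$y[test], p)$A
}))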

True forecasting—making clear statements about things that haven’t happened yet and then seeing how they turn out—is uniquely informative in this regard, so I think it’s important to reserve that term for the situations where that’s actually what we’re doing. When we describe in-sample and cross-validation estimates as forecasts, we confuse our readers, and we risk confusing ourselves about how much we’re really learning.

Of course, that’s easier for some phenomena than it is for others. If your theory concerns the risk of interstate wars, for example, you’re probably (and thankfully) not going to get a lot of opportunities to test it through prediction. Rather than sweeping those issues under the rug, though, I think we should recognize them for what they are. They are not an excuse to elide the huge differences between prediction and fitting models to history. Instead, they are a big haymaker of a reminder that social science is especially hard—not because humans are uniquely unpredictable, but rather because we only have the one grand and always-running experiment to observe, and we and our work are part of it.

Demography and Democracy Revisited

Last spring on this blog, I used Richard Cincotta’s work on age structure to take another look at the relationship between democracy and “development” (here). In his predictive models of democratization, Rich uses variation in median age as a proxy for a syndrome of socioeconomic changes we sometimes call “modernization” and argues that “a country’s chances for meaningful democracy increase as its population ages.” Rich’s models have produced some unconventional predictions that have turned out well, and if you buy the scientific method, this apparent predictive power implies that the underlying theory holds some water.

Over the weekend, Rich sent me a spreadsheet with his annual estimates of median age for all countries from 1972 to 2015, so I decided to take my own look at the relationship between those estimates and the occurrence of democratic transitions. For the latter, I used a data set I constructed for PITF (here) that covers 1955–2010, giving me a period of observation running from 1972 to 2010. In this initial exploration, I focused specifically on switches from authoritarian rule to democracy, which are observed with a binary variable that covers all country-years where an autocracy was in place on January 1. That variable (rgjtdem) is coded 1 if a democratic regime came into being at some point during that calendar year and 0 otherwise. Between 1972 and 2010, 94 of those switches occurred worldwide. The data set also includes, among other things, a “clock” counting consecutive years of authoritarian rule and an indicator for whether or not the country has ever had a democratic regime before.

To assess the predictive power of median age and compare it to other measures of socioeconomic development, I used the base and caret packages in R to run 10 iterations of five-fold cross-validation on the following series of discrete-time hazard (logistic regression) models:

  • Base model. Any prior democracy (0/1), duration of autocracy (logged), and the product of the two.
  • GDP per capita. Base model plus the Maddison Project’s estimates of GDP per capita in 1990 Geary-Khamis dollars (here), logged.
  • Infant mortality. Base model plus the U.S. Census Bureau’s estimates of deaths under age 1 per 1,000 live births (here), logged.
  • Median age. Base model plus Cincotta’s estimates of median age, untransformed.
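Here is a minimal sketch of that setup. Except for rgjtdem, the variable names are stand-ins I made up for illustration, and the data frame is simulated; the real names and data are in the GitHub repo linked at the end of this post:

library(caret)         # createMultiFolds()
library(verification)  # roc.area()
set.seed(20150309)
# simulated stand-in for the real autocracy country-year data
dd <- data.frame(prior.dem = rbinom(3000, 1, 0.3),
                 autocracy.years = sample(1:60, 3000, replace = TRUE),
                 median.age = runif(3000, 15, 40),
                 year = sample(1972:2010, 3000, replace = TRUE))
dd$rgjtdem <- rbinom(3000, 1, 0.03)
f.base <- rgjtdem ~ prior.dem * log(autocracy.years)      # base model
f.age  <- update(f.base, . ~ . + median.age)              # + median age
folds <- createMultiFolds(dd$rgjtdem, k = 5, times = 10)  # 10 x 5-fold
auc.age <- sapply(folds, function(train) {
  fit <- glm(f.age, data = dd[train, ], family = binomial)
  p <- predict(fit, newdata = dd[-train, ], type = "response")
  roc.area(dd$rgjtdem[-train], p)$A
})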

The chart below shows density plots and averages of the AUC scores (computed with ‘roc.area’ from the verification package) for each of those models across the 10 iterations of five-fold CV. Contrary to the conventional assumption that GDP per capita is a useful predictor of democratic transitions—How many papers have you read that tossed this measure into the model as a matter of course?—I find that the model with the Maddison Project measure actually makes slightly less accurate predictions than the one with duration and prior democracy alone. More relevant to this post, though, the two demographic measures clearly improve the predictions of democratic transitions relative to the base model, and median age adds a smidgen more predictive signal than infant mortality.

[Figure: density plots and averages of AUC scores for each model across the 10 iterations of five-fold CV]

Of course, all of these things—national wealth, infant mortality rates, and age structures—have also been changing pretty steadily in a single direction for decades, so it’s hard to untangle the effects of the covariates from other features of the world system that are also trending over time. To try to address that issue and to check for nonlinearity in the relationship, I used Simon Wood’s mgcv package in R to estimate a semiparametric logistic regression model with smoothing splines for year and median age alongside the indicator of prior democracy and regime duration. Plots of the marginal effects of year and median age estimated from that model are shown below. As the left-hand plot shows, the time effect is really a hump in risk that started in the late 1980s and peaked sharply in the early 1990s; it is not the across-the-board post–Cold War increase that we often see captured in models with a dummy variable for years after 1991. More germane to this post, though, we still see a marginal effect from median age, even when accounting for those generic effects of time. Consistent with Cincotta’s argument and other things being equal, countries with higher median age are more likely to transition to democracy than countries with younger populations.

[Figure: estimated marginal effects of year and median age from the semiparametric model]
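For the curious, here is a sketch of how a model like that can be fit with mgcv, reusing the simulated stand-in data from the earlier sketch; s() fits penalized smoothing splines:

library(mgcv)
gam1 <- gam(rgjtdem ~ prior.dem + log(autocracy.years) +
              s(year) + s(median.age),
            data = dd, family = binomial)
plot(gam1, pages = 1)  # marginal smooths for year and median age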

I read these results as a partial affirmation of modernization theory—not the whole teleological and normative package, but the narrower empirical conjecture about a bundle of socioeconomic transformations that often co-occur and are associated with a higher likelihood of attempting and sustaining democratic government. Statistical studies of this idea (including my own) have produced varied results, but the analysis I’m describing here suggests that some of the null results may stem from the authors’ choice of measures. GDP per capita is actually a poor proxy for modernization; there are a number of ways countries can get richer, and not all of them foster (or are fostered by) the socioeconomic transformations that form the kernel of modernization theory (cf. Equatorial Guinea). By contrast, demographic measures like infant mortality rates and median age are more tightly coupled to those broader changes about which Seymour Martin Lipset originally wrote. And, according to my analysis, those demographic measures are also associated with a country’s propensity for democratic transition.

Shifting to the applied forecasting side, I think these results confirm that median age is a useful addition to models of regime transitions, and it seems to capture more information about those propensities than GDP (by a lot) and infant mortality (by a little). Like all slow-changing structural indicators, though, median age is a blunt instrument. Annual forecasts based on it alone would be pretty clunky, and longer-term forecasts would do well to consider other domestic and international forces that also shape (and are shaped by) these changes.

PS. If you aren’t already familiar with modernization theory and want more background, this ungated piece by Sheri Berman for Foreign Affairs is pretty good: “What to Read on Modernization Theory.”

PPS. The code I used for this analysis is now on GitHub, here. It includes a link to the folder on my Google Drive with all of the required data sets.

Why My Coup Risk Models Don’t Include Any Measures of National Militaries

For the past several years (here, here, here, and here), I’ve used statistical models estimated from country-year data to produce assessments of coup risk in countries worldwide. I rejigger the models a bit each time, but none of the models I’ve used so far has included specific features of countries’ militaries.

That omission strikes a lot of people as a curious one. When I shared this year’s assessments with the Conflict Research Group on Facebook, one group member posted this comment:

Why do none of the covariates feature any data on militaries? Seeing as militaries are the ones who stage the coups, any sort of predictive model that doesn’t account for the militaries themselves would seem incomplete.

I agree in principle. It’s the practical side that gets in the way. I don’t include features of national militaries in the models because I don’t have reliable measures of them with the coverage I need for this task.

To train and then apply these predictive models, I need fairly complete time series for all or nearly all countries of the world that extend back to at least the 1980s and have been updated recently enough to give me a useful input for the current assessment (see here for more on why that’s true). I looked again early this month and still can’t find anything like that on even the big stuff, like military budgets, size, and force structures. There are some series on this topic in the World Bank’s World Development Indicators (WDI) data set, but those series have a lot of gaps, and the presence of those gaps is correlated with other features of the models (e.g., regime type). Ditto for SIPRI. And, of course, those aren’t even the most interesting features for coup risk, like whether or not military promotions favor certain groups over others, or if there is a capable and purportedly loyal presidential guard.

But don’t take my word for it. Here’s what the Correlates of War Project says in the documentation for Version 4.0 of its widely-used data set (PDF) about its measure of military expenditures, one of two features of national militaries it tries to cover (the other is total personnel):

It was often difficult to identify and exclude civil expenditures from reported budgets of less developed nations. For many countries, including some major powers, published military budgets are a catch-all category for a variety of developmental and administrative expenses—public works, colonial administration, development of the merchant marine, construction, and improvement of harbor and navigational facilities, transportation of civilian personnel, and the delivery of mail—of dubious military relevance. Except when we were able to obtain finance ministry reports, it is impossible to make detailed breakdowns. Even when such reports were available, it proved difficult to delineate “purely” military outlays. For example, consider the case in which the military builds a road that facilitates troop movements, but which is used primarily by civilians. A related problem concerns those instances in which the reported military budget does not reflect all of the resources devoted to that sector. This usually happens when a nation tries to hide such expenditures from scrutiny; for instance, most Western scholars and military experts agree that officially reported post-1945 Soviet-bloc totals are unrealistically low, although they disagree on the appropriate adjustments.

And that’s just the part of the “Problems and Possible Errors” section about observing the numerator in a calculation that also requires a complicated denominator. And that’s for what is—in principle, at least—one of the most observable features of a country’s civil-military relations.

Okay, now let’s assume that problem magically disappears, and COW has nearly complete and reliable data on military expenditures. Now we want to use models trained on those data to estimate coup risk for 2015. Whoops: COW only runs through 2010! The World Bank and SIPRI get closer to the current year—observations through 2013 are available now—but there are missing values for lots of countries, and that missingness is caused by other predictors of coup risk, such as national wealth, armed conflict, and political regime type. For example, WDI has no data on military expenditures for Eritrea and North Korea ever, and the series for Central African Republic is patchy throughout and ends in 2010. If I wanted to include military expenditures in my predictive models, I could use multiple imputation to deal with these gaps in the training phase, but then how would I generate current forecasts for these important cases? I could make guesses, but how accurate could those guesses be for a case like Eritrea or North Korea, and then am I adding signal or noise to the resulting forecasts?
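To illustrate the training-phase half of that idea, here is a sketch of what multiple imputation could look like with the mice package (one option among several). The data and variable names are simulated stand-ins, and the sketch does nothing to solve the harder problem of generating current-year inputs for countries that never report:

library(mice)
set.seed(7)
# simulated training frame with gaps in military spending
train.dd <- data.frame(milex.gdp = exp(rnorm(500)),
                       gdp.growth = rnorm(500, 2, 3))
train.dd$any.coup <- rbinom(500, 1, 0.05)
train.dd$milex.gdp[sample(500, 100)] <- NA
imp <- mice(train.dd, m = 5, printFlag = FALSE)  # 5 imputed data sets
fits <- with(imp, glm(any.coup ~ log(milex.gdp) + gdp.growth,
                      family = binomial))
# average predicted probabilities across the five fitted models
p.hat <- rowMeans(sapply(fits$analyses, predict,
                         newdata = complete(imp, 1), type = "response"))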

Of course, one of the luxuries of applied forecasting is that the models we use can lack important features and still “work.” I don’t need the model to be complete and its parameters to be true for the forecasts to be accurate enough to be useful. Still, I’ll admit that, as a social scientist by training, I find it frustrating to have to set aside so many intriguing ideas because we simply don’t have the data to try them.

Estimating NFL Team-Specific Home-Field Advantage

This morning, I tinkered a bit with my pro-football preseason team strength survey data from 2013 and 2014 to see what other simple things I might do to improve the accuracy of forecasts derived from future versions of them.

My first idea is to go beyond a generic estimate of home-field advantage—about 3 points, according to my and everyone else’s estimates—with team-specific versions of that quantity. The intuition is that some venues confer a bigger advantage than others. For example, I would guess that Denver enjoys a bigger home-field edge than most teams because their stadium is at high altitude. The Broncos live there, so they’re used to it, but visiting teams have to adapt, and that process supposedly takes about a day for every 1,000 feet over 3,000. Some venues are louder than others, and that noise is often dialed up when visiting teams would prefer some quiet. And so on.

To explore this idea, I’m using a simple hierarchical linear model to estimate team-specific intercepts after taking preseason estimates of relative team strength into account. The line of R code used to estimate the model requires the lme4 package and looks like this:

library(lme4)
mod1 <- lmer(score.raw ~ wiki.diff + (1 | home_team), data = results)

Where

score.raw = home_score - visitor_score
wiki.diff = home_wiki - visitor_wiki

Those wiki vectors are the team strength scores estimated from preseason pairwise wiki surveys. The ‘results’ data frame includes scores for all regular and postseason games from those two years so far, courtesy of devstopfix’s NFL results repository on GitHub (here). Because the net game and strength scores are both ordered home to visitor, we can read those random intercepts for each home team as estimates of team-specific home advantage. There are probably other sources of team-specific bias in my data, so those estimates are going to be pretty noisy, but I think it’s a reasonable starting point.

My initial results are shown in the plot below, which I get with these two lines of code, the second of which requires the lattice package:

library(lattice)
ha1 <- ranef(mod1, condVar = TRUE)  # random intercepts with their variances
dotplot(ha1)                        # dot plot of the team-specific effects

Bear in mind that the generic (fixed) intercept is 2.7, so the estimated home-field advantage for each team is what’s shown in the plot plus that number. For example, these estimates imply that my Ravens enjoy a net advantage of about 3 points when they play in Baltimore, while their division-rival Bengals are closer to 6.

[Figure: estimated team-specific home-field advantages (random intercepts with intervals)]
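To recover a team’s total estimated edge rather than its deviation from the average, add the fixed intercept back to each random intercept. Given mod1 from above:

re <- ranef(mod1)$home_team
home.edge <- setNames(fixef(mod1)["(Intercept)"] + re[, "(Intercept)"],
                      rownames(re))
sort(home.edge, decreasing = TRUE)  # full home-field estimates by team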

In light of DeflateGate, I guess I shouldn’t be surprised to see the Pats at the top of the chart, almost a whole point higher than the second-highest team. Maybe their insanely low home fumble rate has something to do with it.* I’m also heartened to see relatively high estimates for Denver, given the intuition that started this exercise, and Seattle, which I’ve heard said enjoys an unusually large home-field edge. At the same time, I honestly don’t know what to make of the exceptionally low estimates for DC and Jacksonville, who appear from these estimates to suffer a net home-field disadvantage. That strikes me as odd and undercuts my confidence in the results.

In any case, that’s how far my tinkering took me today. If I get really motivated, I might try re-estimating the model with just the 2013 data and then running the 2014 preseason survey scores through that model to generate “forecasts” that I can compare to the ones I got from the simple linear model with just the generic intercept (here). The point of the exercise was to try to get more accurate forecasts from simple models, and the only way to test that is to do it. I’m also trying to decide if I need to cross these team-specific effects with season-specific effects to try to control for differences across years in the biases in the wiki survey results when estimating these team-specific intercepts. But I’m not there yet.

* After I published this post, Michael Lopez helpfully pointed me toward a better take on the Patriots’ fumble rate (here), and Mo Patel observed that teams manage their own footballs on the road, too, so that particular tweak—if it really happened—wouldn’t have a home-field-specific effect.

Statistical Assessments of Coup Risk for 2015

Which countries around the world are more likely to see coup attempts in 2015?

For the fourth year in a row, I’ve used statistical models to generate one answer to that question, where a coup is defined more or less as a forceful seizure of national political authority by military or political insiders. (I say “more or less” because I’m blending data from two sources with slightly different definitions; see below for details.) A coup doesn’t need to succeed to count as an attempt, but it does need to involve public action; alleged plots and rumors of plots don’t qualify. Neither do insurgencies or foreign invasions, which by definition involve military or political outsiders. The heat map below shows variation in estimated coup risk for 2015, with countries colored by quintiles (fifths).

[Figure: heat map of estimated coup risk for 2015, countries colored by quintile]

The dot plot below shows the estimates and their 90-percent confidence intervals (CIs) for the 40 countries with the highest estimated risk. The estimates are the unweighted average of forecasts from two logistic regression models; more on those in a sec. To get CIs for estimates from those two models, I took a cue from a forthcoming article by Lyon, Wintle, and Burgman (fourth publication listed here; the version I downloaded last year has apparently been taken down, and I can’t find another) and just averaged the CIs from the two models.

[Figure: dot plot of 2015 coup risk estimates with 90-percent confidence intervals for the 40 highest-risk countries]

I’ve consistently used simple two- or three-model ensembles to generate these coup forecasts, usually pairing a logistic regression model with an implementation of Random Forests on the same or similar data. This year, I decided to use only a pair of logistic regression models representing somewhat different ideas about coup risk. Consistent with findings from other work in which I’ve been involved (here), k-fold cross-validation told me that Random Forests wasn’t really boosting forecast accuracy, and sticking to logistic regression makes it possible to get and average those CIs (a sketch of the averaging step follows the second covariate list below). The first model matches one I used last year, and it includes the following covariates:

  • Infant mortality rate. Deaths of children under age 1 per 1,000 live births, relative to the annual global median, logged. This measure primarily reflects national wealth but is also sensitive to variations in quality of life produced by things like corruption and inequality. (Source: U.S. Census Bureau)
  • Recent coup activity. A yes/no indicator of whether or not there have been any coup attempts in that country in the past five years. I’ve tried logged event counts and longer windows, but this simple version contains as much predictive signal as any other. (Sources: Center for Systemic Peace and Powell and Thyne)
  • Political regime type. Following Fearon and Laitin (here), a categorical measure differentiating between autocracies, anocracies, democracies, and other forms. (Source: Center for Systemic Peace, with hard-coded updates for 2014)
  • Regime durability. The “number of years since the last substantive change in authority characteristics (defined as a 3-point change in the POLITY score).” (Source: Center for Systemic Peace, with hard-coded updates for 2014)
  • Election year. A yes/no indicator for whether or not any national elections (executive, legislative, or general) are scheduled to take place during the forecast year. (Source: NELDA, with hard-coded updates for 2011–2015)
  • Economic growth. The previous year’s annual GDP growth rate. To dampen the effects of extreme values on the model estimates, I take the square root of the absolute value and then multiply that by -1 for cases where the raw value is less than 0 (see the short sketch after this list). (Source: IMF)
  • Political salience of elite ethnicity. A yes/no indicator for whether or not the ethnic identity of national leaders is politically salient. (Source: PITF, with hard-coded updates for 2014)
  • Violent civil conflict. A yes/no indicator for whether or not any major armed civil or ethnic conflict is occurring in the country. (Source: Center for Systemic Peace, with hard-coded updates for 2014)
  • Country age. Years since country creation or independence, logged. (Source: me)
  • Coup-tagion. Two variables representing (logged) counts of coup attempts during the previous year in other countries around the world and in the same geographic region. (Source: me)
  • Post–Cold War period. A binary variable marking years after the disintegration of the USSR in 1991.
  • Colonial heritage. Three separate binary indicators identifying countries that were last colonized by Great Britain, France, or Spain. (Source: me)
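As an aside, the growth transformation described in that list is just a signed square root, which takes one line in R:

signed.sqrt <- function(x) sign(x) * sqrt(abs(x))  # dampens extreme values
signed.sqrt(c(-9, -1, 0, 4, 25))                   # -3 -1 0 2 5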

The second model takes advantage of new data from Geddes, Wright, and Frantz on autocratic regime types (here) to consider how qualitative differences in political authority structures and leadership might shape coup risk—both directly, and indirectly by mitigating or amplifying the effects of other things. Here’s the full list of covariates in this one:

  • Infant mortality rate. Deaths of children under age 1 per 1,000 live births, relative to the annual global median, logged. This measure primarily reflects national wealth but is also sensitive to variations in quality of life produced by things like corruption and inequality. (Source: U.S. Census Bureau)
  • Recent coup activity. A yes/no indicator of whether or not there have been any coup attempts in that country in the past five years. I’ve tried logged event counts and longer windows, but this simple version contains as much predictive signal as any other. (Sources: Center for Systemic Peace and Powell and Thyne)
  • Regime type. Using the binary indicators included in the aforementioned data from Geddes, Wright, and Frantz with hard-coded updates for the period 2011–2014, a series of variables differentiating between the following:
    • Democracies
    • Military autocracies
    • One-party autocracies
    • Personalist autocracies
    • Monarchies
  • Regime duration. Number of years since the last change in political regime type, logged. (Source: Geddes, Wright, and Frantz, with hard-coded updates for the period 2011–2014)
  • Regime type * regime duration. Interactions to condition the effect of regime duration on regime type.
  • Leader’s tenure. Number of years the current chief executive has held that office, logged. (Source: PITF, with hard-coded updates for 2014)
  • Regime type * leader’s tenure. Interactions to condition the effect of leader’s tenure on regime type.
  • Election year. A yes/no indicator for whether or not any national elections (executive, legislative, or general) are scheduled to take place during the forecast year. (Source: NELDA, with hard-coded updates for 2011–2015)
  • Regime type * election year. Interactions to condition the effect of election years on regime type.
  • Economic growth. The previous year’s annual GDP growth rate. To dampen the effects of extreme values on the model estimates, I take the square root of the absolute value and then multiply that by -1 for cases where the raw value is less than 0. (Source: IMF)
  • Regime type * economic growth. Interactions to condition the effect of economic growth on regime type.
  • Post–Cold War period. A binary variable marking years after the disintegration of the USSR in 1991.
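The ensemble step itself is just averaging. Here is a sketch, assuming fitted glm objects mod1 and mod2 and a data frame newdata holding the most recent year of observations; the confidence intervals are built on the link scale, which is one conventional way to do it and may differ in detail from what I actually ran:

p1 <- predict(mod1, newdata, type = "response")
p2 <- predict(mod2, newdata, type = "response")
forecast <- (p1 + p2) / 2         # unweighted average of the two models
ci90 <- function(mod, newdata) {  # 90-percent CI for one model
  lp <- predict(mod, newdata, type = "link", se.fit = TRUE)
  z <- qnorm(0.95)
  cbind(lo = plogis(lp$fit - z * lp$se.fit),
        hi = plogis(lp$fit + z * lp$se.fit))
}
bounds <- (ci90(mod1, newdata) + ci90(mod2, newdata)) / 2  # averaged CIs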

As I’ve done for the past couple of years, I used event lists from two sources—the Center for Systemic Peace (about halfway down the page here) and Jonathan Powell and Clayton Thyne (Dataset 3 here)—to generate the historical data on which those models were trained. Country-years are the unit of observation in this analysis, so a country-year is scored 1 if either CSP or P&T saw any coup attempts there during those 12 months and 0 otherwise. The plot below shows annual counts of successful and failed coup attempts in countries worldwide from 1946 through 2014 according to the two data sources. There is a fair amount of variance in the annual counts and the specific events that comprise them, but the basic trend over time is the same. The incidence of coup attempts rose in the 1950s; spiked in the early 1960s; remained relatively high throughout the rest of the Cold War; declined in the 1990s, after the Cold War ended; and has remained relatively low throughout the 2000s and 2010s.

[Figure: annual counts of coup events worldwide from two data sources, 1946-2014]
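In code, that scoring rule amounts to a single line. A sketch, assuming a country-year data frame with 0/1 columns csp.coup and pt.coup (my names here, for illustration):

dd$any.coup <- as.integer(dd$csp.coup == 1 | dd$pt.coup == 1)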

I’ve been posting annual statistical assessments of coup risk on this blog since early 2012; see here, here, and here for the previous three iterations. I have rejiggered the modeling a bit each year, but the basic process (and the person designing and implementing it) has remained the same. So, how accurate have these forecasts been?

The table below reports areas under the ROC curve (AUC) and Brier scores (the 0–1 version) for the forecasts from each of those years and their averages, using the two coup event data sources alone and together as different versions of the observed ground truth. Focusing on the “either” columns, because that’s what I’m usually using when estimating the models, we can see that the average accuracy—AUC in the low 0.80s and Brier score of about 0.03—is comparable to what we see in many other country-year forecasts of rare political events using a variety of modeling techniques (see here). With the AUC, we can also see a downward trend over time. With so few events involved, though, three years is too few to confidently deduce a trend, and those averages are consistent with what I typically see in k-fold cross-validation. So, at this point, I suspect those swings are just normal variation.

[Table: AUC and Brier scores for coup forecasts posted on Dart-Throwing Chimp, 2012-2014, by coup event data source]

The separation plot designed by Greenhill, Ward, and Sacks (here) offers a nice way to visualize the accuracy of these forecasts. The ones below show the three annual slices using the “either” version of the outcome, and they reinforce the story told in the table: the forecasts have correctly identified most of the countries that saw coup attempts in the past three years as relatively high-risk cases, but the accuracy has declined over time. Let’s define a surprise as a case that fell outside the top 30 of the ordered forecasts but still saw a coup attempt. In 2012, just one of four countries that saw coup attempts was a surprise: Papua New Guinea, ranked 48. In 2013, that number increased to two of five (Eritrea at 51 and Egypt at 58), and in 2014 it rose to three of five (Burkina Faso at 42, Ukraine at 57, and the Gambia at 68). Again, though, the average accuracy across the three years is consistent with what I typically see in k-fold cross-validation of these kinds of models in the historical data, so I don’t think we should make too much of that apparent time trend just yet.

[Figures: separation plots for the 2012, 2013, and 2014 coup forecasts]
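If you want to make your own, the separationplot R package by the same authors draws these. A sketch, assuming a vector of forecast probabilities pred and a 0/1 outcome vector obs:

library(separationplot)
separationplot(pred = pred, actual = obs)  # observed events appear as dark vertical lines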

This year, for the first time, I am also running an experiment in crowdsourcing coup risk assessments by way of a pairwise wiki survey (survey here, blog post explaining it here, and preliminary results discussed here). My long-term goal is to repeat this process numerous times on this topic and some others (for example, onsets of state-led mass killing episodes) to see how the accuracy of the two approaches compares and how their output might be combined. Statistical forecasts are usually much more accurate than human judgment, but that advantage may be reduced or eliminated when we aggregate judgments from large and diverse crowds, or when we don’t have data on important features to use in those statistical models. Models that use annual data also suffer in comparison to crowdsourcing processes that can update continuously, as that wiki survey does (albeit with a lot of inertia).

We can’t incorporate the output from that wiki survey into the statistical ensemble, because the survey doesn’t generate predicted probabilities; it only assesses relative risk. We can, however, compare the rank orderings the two methods produce. The plot below juxtaposes the rankings produced by the statistical models (left) with the ones from the wiki survey (right). About 500 votes have been cast since I wrote up the preliminary results, but I’m going to keep things simple for now and use the preliminary survey results I already wrote up. The colored arrows identify cases ranked at least 10 spots higher (red) or lower (blue) by the crowd than the statistical models. As the plot shows, there are many differences between the two, even toward the top of the rankings where the differences in statistical estimates are bigger and therefore more meaningful. For example, the crowd sees Nigeria, Libya, and Venezuela as top 10 risks while the statistical models do not; of those three, only Nigeria ranks in the top 30 on the statistical forecasts. Meanwhile, the crowd pushes Niger and Guinea-Bissau out of the top 10 down to the 20s, and it sees Madagascar, Afghanistan, Egypt, and Ivory Coast as much lower risks than the models do. Come 2016, it will be interesting to see which version was more accurate.

[Figure: 2015 coup risk rankings from the statistical models (left) and the wiki survey (right)]

If you are interested in getting hold of the data or R scripts used to produce these forecasts and figures, please send me an email at ulfelder at gmail dot com.

A Crowd’s-Eye View of Coup Risk in 2015

A couple of weeks ago (here), I used the blog to launch an experiment in crowdsourcing assessments of coup risk for 2015 by way of a pairwise wiki survey. The survey is still open and will stay that way until the end of the year, but with nearly 2,700 pairwise votes already cast, I thought it was a good time to take stock of the results so far.

Before discussing those results, though, let me say thank you to all the people who voted in the survey or shared the link. These data don’t materialize from thin air. They only exist because busy people contributed their knowledge and time, and I really appreciate all of those contributions.

Okay, so, what does that self-assembled crowd think about relative risks of coup attempts in 2015? The figure below maps the country scores produced from the votes cast so far. Darker grey indicates higher risk. PLEASE NOTE: Those scores fall on a 0–100 scale, but they are not estimated probabilities of a coup attempt. Instead, they are only measures of relative risk, because that’s all we can get from a pairwise wiki survey. Coup attempts are rare events—in most recent years, we’ve seen fewer than a handful of them worldwide—so the safe bet for nearly every country every year is that there won’t be any coup attempts this year.

[Figure: map of wiki survey scores for 2015 coup risk; darker grey indicates higher risk]

Smaller countries can be hard to find on that map, and small differences in scores can be hard to discern, so I also like to have a list of the results to peruse. Here’s a dot plot with countries in descending order by survey score. (It’d be nice to make this table sortable so you could also look for countries alphabetically, but my Internet fu is not up to that task.)

[Figure: dot plot of wiki survey scores in descending order]

This survey is open to the public, and participants may cast as many votes as they like in as many sessions as they like. The scores summarized above come from nearly 2,700 votes cast between the morning of January 3, when I published the blog post about the survey, and the morning of January 14, when I downloaded a report on the current results. At present, this blog has a few thousand followers on WordPress and a few hundred email subscribers. I also publicized the survey twice on Twitter, where I have approximately 6,000 followers: once when I published the initial blog post, and again on January 13. As the plot below shows, participation spiked around both of those pushes and was low otherwise.

[Figure: survey votes cast over time, January 3–14]

The survey instrument does not collect identifying information about participants, so it is impossible to describe the make-up of the crowd. What we do know is that those votes came from about 100 unique user sessions. Some people probably participated more than once—I know that I cast a dozen or so votes on a few occasions—so 100 unique sessions probably works out to something like 80 or 90 individuals. But that’s a guess.

[Figure: votes cast per unique user session]

We also know that those votes came from lots of different parts of the world. As the map below shows, most of the votes came from the U.S., Europe, and Australia, but there were also pockets of activity in the Middle East (especially Israel), Latin America (Brazil and Argentina), Africa (Cote d’Ivoire and Rwanda), and Asia (Thailand and Bangladesh).

[Figure: map of participants’ locations by votes cast]

I’ll talk a little more about the substance of these results when I publish my statistical assessments of coup risk for 2015, hopefully in the next week or so. Meanwhile, number-crunchers can get a .csv with the data used to generate the map and table in this post from my Google Drive (here) and the R script from GitHub (here). If you’re interested in seeing the raw vote-level data from which those scores were generated, drop me a line.

A Forecast of Global Democratization Trends Through 2025

A couple of months ago, I was asked to write up my thoughts on global trends in democratization over the next five to 10 years. I said at the time that, in coarse terms, I see three plausible alternative futures: 1) big net gains, 2) big net losses, and 3) little net change.

  • By big net gains, I mean a rise in the prevalence of democratic regimes above 65 percent or, because of its size and geopolitical importance, democratization in China absent a sharp decline in the global prevalence of democracy. For big net gains to happen, we would need to see a) one or more clusters of authoritarian breakdown and subsequent democratization in the regions where such clusters are still possible, i.e., Asia, the former Soviet Union, and the Middle East and North Africa (or the aforementioned transition in China); and b) no sharp losses in the regions where democracy is now prevalent, i.e., Europe, the Americas, and sub-Saharan Africa. I consider (a) unlikely but possible (see here) and (b) highly likely. The scenario requires both conditions, so it is unlikely.
  • By big net losses, I mean a drop in the global prevalence of democracy below 55 percent. For that to happen, we would need to see the opposite of big net gains—that is, a) no new clusters of democratization and no democratization in China and b) sharp net losses in one or more of the predominantly democratic regions. In my judgment, (a) is likely but (b) is very unlikely. This outcome depends on the conjunction of (a) and (b), so the low probability of (b) means this outcome is highly unlikely. A reversion to autocracy somewhere in Western Europe or North America would also push us into “big net loss” territory, but I consider that event extremely unlikely (see here and here for why).
  • In the absence of either of these larger shifts, we will probably see little net change in the pattern of the past decade or so: a regular trickle of transitions to and from democracy at rates that are largely offsetting, leaving the global prevalence hovering between 55 and 65 percent. Of course, we could also wind up with little net change in the global prevalence of democracy under a scenario in which some longstanding or otherwise significant authoritarian regimes—for example, China, Russia, Iran, or Saudi Arabia—break down, and those breakdowns spread to interdependent regimes, but most of those breakdowns lead to new authoritarian regimes or short-lived attempts at democracy. This is what we saw in the Arab Spring, and base rates from the past several decades suggest that it is the most likely outcome of any regional clusters of authoritarian breakdown in the next decade or so as well. I consider this version of the little-net-change outcome to be more likely than the other one (offsetting trickles of transitions to and from democracy with no new clusters of regime breakdown). Technically, we could also get to an outcome of little net change through a combination of big net losses in predominantly democratic regions and big gains in predominantly authoritarian regions, but I consider this scenario so unlikely in the next five to 10 years that it’s not worth considering in depth.

I believe the probabilities of big net gains and persistence of current levels are both much greater than the probability of big net losses. In other words, I am generally bullish. For the sake of clarity, I would quantify those guesses as follows:

  • Probability of big net gains: 20 percent
  • Probability of little net change: 75 percent
    • With regime breakdown in one or more critical autocracies: 60 percent
    • Without regime breakdown in any critical autocracies: 15 percent
  • Probability of big net losses: 5 percent

That outlook is informed by a few theoretical and empirical observations.

First, when I talk about democratization, I have in mind expansions of the breadth, depth, and protection of consultation between national political regimes and their citizens. As Charles Tilly argues on p. 24 of his 2007 book, Democracy, “A regime is democratic to the degree that political relations between the state and its citizens feature broad, equal, protected, and mutually binding consultation.” Fair and competitive elections are the most obvious and in some ways the most important form this consultation can take, but they are not the only one. Still, for purposes of observing broad trends and coarsely comparing cases, we can define a democracy as a regime in which officials who actually rule are chosen through fair and competitive elections in which nearly all adult citizens can vote. The fairness of elections depends on the existence of numerous civil liberties, including freedoms of speech, assembly, and association, and the presence of a reasonably free press, so this is not a low bar. Freedom House’s list of electoral democracies is a useful proxy for this set of conditions.

Second, we do not understand the causal processes driving democratization well, and we certainly don’t understand them well enough to know how to manipulate them in order to reliably produce desired outcomes. The global political economy, and the political economies of the states that comprise one layer of it, are parts of a complex adaptive system. This system is too complex for us to model and understand in ways that are more than superficial, partly because it continues to evolve as we try to understand and manipulate it. That said, we have seen some regularities in this system over the past half-century or so:

  • States are more likely to try and then to sustain democratic regimes as their economies grow, their economies become more complex, and their societies transform in ways associated with those trends (e.g., live longer, urbanize, and become more literate). These changes don’t produce transitions, but they do create structural conditions that are more conducive to them.
  • Oil-rich countries have been the exceptions to this pattern, but even they are not impervious (e.g., Mexico, Indonesia). Specifically, they are more susceptible to pressures to democratize when their oil income diminishes, and variation over time in that income depends, in part, on forces beyond their control (e.g., oil prices).
  • Consolidated single-party regimes are the most resilient form of authoritarian rule. Personalist dictatorships are also hard to topple as long as the leader survives but often crumble when that changes. Military-led regimes that don’t evolve into personalist or single-party autocracies rarely last more than a few years, especially since the end of the Cold War.
  • Most authoritarian breakdowns occur in the face of popular protests, and those protests are more likely to happen when the economy is slumping, when food or fuel prices are spiking, when protests are occurring in nearby or similar countries, and around elections. Signs that elites are fighting amongst themselves may also help to spur protests, but elite splits are common in autocracies and often emerge in reaction to protests, not ahead of them.
  • Most attempts at democracy end with a reversion to authoritarian rule, but the chances that countries will try again and then that democracy will stick improve as countries get richer and have tried more times before. The origins of the latter pattern are unclear, but they probably have something to do with the creation of new forms of social and political organization and the subsequent selection and adaptation of those organizations into “fitter” competitors under harsh pressures.

Third, whatever its causes, there is a strong empirical trend toward democratization around the world. Since the middle of the twentieth century, both the share of regimes worldwide that are democratic and the share of the global population living in democratic regimes have expanded dramatically. These expansions have not come steadily, and there is always some churn in the system, but the broader trend persists in spite of those dips and churn.

The strength and, so far, persistence of this trend lead me to believe that the global system would have to experience a profound collapse or transformation for that trend to be disrupted. Under the conditions that have prevailed for the past century or so, selection pressures in the global system seem to be running strongly in favor of democratic political regimes with market-based economies.

Crucially, this long-term trend has also proved resilient to the global financial crisis that began in 2007-2008 and has persisted to some degree ever since. This crisis was as sharp a stress test of many national political regimes as we have seen in a while, perhaps since World War II. Democracy has survived this test in all of the world’s wealthy countries, and there was no stampede away from democracy in less wealthy countries with younger regimes. Freedom House and many other activists lament the occurrence of a “democratic recession” over the past several years, but global data just don’t support the claim that one is occurring. What we have seen instead is a slight decline in the prevalence of democratic regimes accompanied by a deepening of authoritarian rule in many of the autocracies that survived the last flurry of democratic transitions.

Meanwhile, some authoritarian regimes in the Middle East and North Africa broke down in the face of uprisings demanding greater popular accountability, and some of those breakdowns led to attempts at democratization—in Tunisia, Egypt, and Libya in particular. Most of those attempts at democratization have since failed, but not all of them have, Tunisia being the notable exception. What’s more, the popular pressure in favor of democratization has not dissipated in all of the cases where authoritarian breakdown didn’t happen. Bahrain, Kuwait, and, to a lesser extent, Saudi Arabia are notable in this regard.

Rising pressures on China and Russia suggest that similar clusters of regime instability are increasingly likely in their respective neighborhoods, even if they remain unlikely in any given year. China faces significant challenges on numerous fronts, including a slowing economy, a looming real-estate debt crisis, swelling popular frustration over industrial pollution, an uptick in labor activism, an anti-corruption campaign that could alienate some political and military insiders, and a separatist insurgency in Xinjiang. No one of those challenges is necessarily likely to topple the regime, but the presence of so many of them at once adds up to a significant risk (or opportunity, depending on one’s perspective). A regime crisis in China could ripple through its region with strongest effect on the most dependent regimes—on North Korea in particular, but also perhaps Vietnam, Laos, and Myanmar. Even if a crisis there didn’t reverberate, China’s population size and rising international influence imply that any movement toward democracy would have a significant impact on the global balance sheet.

The Russian regime is also under increased pressure, albeit for different reasons. Russia is already in recession, and falling oil prices and capital flight are making things much worse without much promise of near-term relief. U.S. and E.U. sanctions deserve significant credit (or blame) for the acceleration of capital flight, and prosecution of the war in Ukraine is also imposing additional direct costs on Russia’s power resources. The extant regime has survived domestic threats before, but 10 more years is a long time for a regime that stands on feet of socioeconomic clay.

Above all else, these last two points—about 1) the resilience of existing democracies to the stress of the past several years and 2) the persistence and even deepening of pressures on many surviving authoritarian regimes—are what make me bullish about the prospects for democracy in the next five to 10 years. In light of current trends in China and Russia, I have a hard time imagining both of those regimes surviving to 2025. Democratization might not follow, and if it does, it won’t necessarily stick, at least not right away. Neither regime can really get a whole lot more authoritarian than it is now, however, so the possibilities for change on this dimension are nearly all on the upside. (The emergence of a new authoritarian regime that is more aggressive abroad is also possible in both cases, but that topic is beyond the scope of this memo.)

Talk about the possibility of a wave of democratic reversals usually centers on the role China or Russia might play as either an agent of de-democratization or example of an alternative future. As noted above, though, both of these systems are currently facing substantial stresses at home. These stresses both limit their ability to act as agents of de-democratization and take the shine off any example they might set.

In short, I think that talk of Russia and China’s negative influence on the global democratization trend is overblown. Apart from the (highly unlikely) invasion and successful occupation of other countries, I don’t think either of these governments has the ability to undo democratization elsewhere. Both can and do help some other authoritarian regimes survive, however, and this is why regime crisis or breakdown in either one of them has the potential to catalyze new clusters of regime instability in their respective neighborhoods.

What do you think? If you made it this far and have any (polite) reactions you’d like to share, please leave a comment.

An Experiment in Crowdsourced Coup Forecasting

Later this month, I hope to have the data I need to generate and post new statistical assessments of coup risk for 2015. Meanwhile, I thought it would be interesting and useful to experiment with applying a crowdsourcing tool to this task. So, if you think you know something about coup risk and want to help with this experiment, please cast as many votes as you like here:

2015 Coup Risk Wiki Survey

For this exercise, let’s use Naunihal Singh’s (2014, p. 51) definition of a coup attempt: “An explicit action, involving some portion of the state military, police, or security forces, undertaken with intent to overthrow the government.” As Naunihal notes,

This definition retains most of the aspects commonly found in definitions of coup attempts [Ed.: including the ones I use in my statistical modeling] while excluding a wide range of similar activities, such as conspiracies, mercenary attacks, popular protests, revolutions, civil wars, actions by lone assassins, and mutinies whose goals explicitly excluded taking power (e.g., over unpaid wages). Unlike a civil war, there is no minimum casualty threshold necessary for an event to be considered a coup, and many coups take place bloodlessly.

By this definition, last week’s putsch in the Gambia and November’s power grab by a lieutenant colonel in Burkina Faso would qualify, but last February’s change of government by parliamentary action in Ukraine after President Yanukovich’s flight in the face of popular unrest would not. Nor would state collapses in Libya and Central African Republic, which occurred under pressure from rebels rather than state security forces. And, of course, Gen. Sisi’s seizure of power in Egypt in July 2013 clearly would qualify as a successful coup on these terms.

In a guest post here yesterday, Maggie Dwyer identified one factor—divisions and tensions within the military—that probably increases coup risk in some cases, but that we can’t fold into global statistical modeling because, as often happens, we don’t have the time-series cross-sectional data we would need to do that. Surely there are other such factors and forces. My hope is that this crowdsourcing approach will help spotlight some cases overlooked by the statistical forecasts because their fragility is being driven by things those models can’t consider.

Wiki surveys weren’t designed specifically for forecasting, but I have adapted them to this purpose on two other topics, and in both cases the results have been pretty good. As part of my work for the Early Warning Project, we have run wiki surveys on risks of state-led mass killing onset for 2014 and now 2015. That project’s data-makers didn’t see any such onsets in 2014, but the two countries that came closest—Iraq and Myanmar—ranked fifth and twelfth, respectively, in the wiki survey we ran in December 2013. On pro football, I’ve run surveys ahead of the 2013 and 2014 seasons. The results haven’t been clairvoyant, but they haven’t been shabby, either (see here and here for details).

I will summarize the results of this survey on coup risk in a blog post in mid-January and will make the country- and vote-level data freely available to other researchers when I do.

I don’t necessarily plan to close the survey at that point, though. In fact, I’m really hoping to get a chance to tinker with using it more dynamically. Ideally, we would leave the survey running throughout the year so that participants could factor new information—credible rumors of an impending coup, for example, or a successful post-election transfer of power without military intervention—into their voting decisions, and the survey results would update quickly in response to those more recent votes.

Doing that would require modifying the modeling process that converts the pairwise votes into scores, however, and I’m not sure that I’m up to the task. As developed, the wiki survey effectively weights all votes the same, regardless of when they were cast. To make the survey more sensitive to fresher information, we would need to tweak that process so that recent votes are weighted more heavily—maybe with a time-decaying weighting function, or just a sliding window that closes on older votes after some point. If we wanted to get really fancy, we might find a way to use the statistical forecasts as priors in this process, too, letting the time-sensitive survey results pull cases up or push them down as the year passes.
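To make the decay idea a bit more concrete, here's a toy sketch of exponential time-decay weighting on pairwise votes. Everything in it—the half-life, the vote format, the tallying—is a hypothetical stand-in, not the wiki survey's actual scoring model:

```python
from datetime import datetime, timedelta

# Hypothetical half-life: a vote's influence halves every 30 days.
HALF_LIFE_DAYS = 30.0

def vote_weight(cast_at, now):
    """Exponential time-decay weight for one pairwise vote."""
    age_days = (now - cast_at).total_seconds() / 86400.0
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def weighted_win_totals(votes, now):
    """Tally decay-weighted 'wins' per country from (winner, loser, cast_at) votes."""
    totals = {}
    for winner, loser, cast_at in votes:
        totals[winner] = totals.get(winner, 0.0) + vote_weight(cast_at, now)
        totals[loser] = totals.get(loser, 0.0)  # make sure losers appear, too
    return totals

now = datetime(2015, 1, 15)
votes = [
    ("Burkina Faso", "Ghana", now - timedelta(days=2)),    # fresh vote, near-full weight
    ("Ghana", "Burkina Faso", now - timedelta(days=180)),  # stale vote, heavily discounted
]
print(weighted_win_totals(votes, now))
```

A sliding window would be the blunt version of the same thing: a weight of one inside the window and zero outside it.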

I can imagine these modifications—the toy sketch above gestures at the weighting part—but I don't think I can code the real thing. If you're reading this and you might like to collaborate on that work (or fund it!) or just have thoughts on how to do it, please drop me a line at ulfelder at gmail dot com.

A Failed Coup Attempt (and Forecast) in the Gambia

This is a guest post by Maggie Dwyer (@MagDwyer). She teaches courses related to security and politics in Africa at the University of St Andrews and the University of Edinburgh.

Things are often not the way they seem in the Gambia, and this may help explain why this week’s coup attempt in Banjul was not anticipated by Jay’s statistical forecasting.

Nicknamed “the smiling coast,” the Gambia has long been known for its beach resorts, which are particularly popular with European tourists. The country has never experienced a civil war and is generally considered peaceful. Its president, Yahya Jammeh, came to power in a coup in 1994 and has won four elections since.

A deeper look at the Gambia, however, shows that the appearance of stability comes at a high cost for the population. The repressive style of the Jammeh government has led to a growing list of human rights abuses. Critics of the regime are often met with harassment, arrest, detention, and disappearance. The country is shrouded in secrecy due to a lack of press freedoms. With no presidential term limits and no significant opposition, many see no end in sight for Jammeh’s regime.

The military is often viewed as the strong arm of the Jammeh regime and is responsible for many of the abuses. Yet, the military is also kept on edge. An endless series of promotions, demotions, firings, and re-hirings leaves military personnel in a constant state of uncertainty. The sense of fear within the military is exacerbated by severe punishment for those deemed disloyal. Several military members were officially executed for alleged involvement in coup plots in 2013, and there are many more suspected cases of execution, disappearance, and torture of military personnel on which the government has never commented.

There have also been reports of growing discontent within the military over preferential treatment of Jammeh's minority ethnic group, the Jola. There are claims that the Jola have received a disproportionate share of promotions, top positions, and opportunities (e.g., training and participation in peacekeeping), and that this favoritism has created divisions and spurred resentment in the ranks.

Despite the fate of past coup plotters in the Gambia, military personnel have continued to try to oust Jammeh. He has endured at least eight alleged coup attempts during his 20 years in office. Many of the accused plotters had served in the highest military positions, including Army Chief of Staff and Director of the National Intelligence Agency, suggesting divisions at the most senior levels. It should be noted, though, that there is speculation as to whether some of these attempts were real or simply pretexts to purge members of the military. The ambiguity of these events is another source of uncertainty and fear within the military.

These tensions, divisions, and dissatisfaction within the Gambian military probably contributed to the most recent coup attempt against Jammeh, as well as to past ones. Unfortunately, such internal tensions are difficult to observe and quantify, especially in repressive states like the Gambia, so they rarely factor into larger statistical forecasts, even though we have good reason to believe they contribute significantly to coup risk.

The recent coup attempt in the Gambia will switch the "domestic coup activity" variable used in Jay's models from 'no' to 'yes' and will thereby push the country up the rankings in the next iteration of those assessments. At the same time, the climate of fear in the Gambia will intensify with the crackdown that has followed this week's attempt, and that may deter copycats in the near future.

Post Mortem on 2014 Preseason NFL Forecasts

Let’s end the year with a whimper, shall we?

Back in September (here), I used a wiki survey to generate a preseason measure of pro-football team strength and then ran that measure through a statistical model and some simulations to gin up forecasts for all 256 games of the 2014 regular season. That season ended on Sunday, so now we can see how those forecasts turned out.

The short answer: not awful, but not so great, either.

To assess the data and model’s predictive power, I’m going to focus on predicted win totals. Based on my game-level forecasts, how many contests was each team expected to win? Those totals nicely summarize the game-level predictions, and they are the focus of StatsbyLopez’s excellent post-season predictive review, here, against which I can compare my results.

StatsbyLopez used two statistics to assess predictive accuracy: mean absolute error (MAE) and mean squared error (MSE). The first is the average of the absolute differences between each team's projected and observed win totals. The second is the average of the squares of those differences. MAE is a little easier to interpret—on average, how far off was each team's projected win total?—while MSE punishes larger errors more heavily, which is useful if you care about how noisy your predictions are. StatsbyLopez used those stats to compare five sets of statistical predictions to the preseason betting line (Vegas) and a couple of simple benchmarks: last year's win totals and a naive forecast of eight wins for everyone, which is what you'd expect to get if you just flipped a coin to pick winners.
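For concreteness, here's how those two stats fall out of a list of projected and observed win totals. The numbers are made up for illustration:

```python
# Toy calculation of MAE and MSE from projected vs. observed win totals.
predicted = [9.5, 7.2, 11.0, 6.3]  # hypothetical projected win totals
observed = [11, 6, 8, 7]           # hypothetical observed win totals

errors = [p - o for p, o in zip(predicted, observed)]
mae = sum(abs(e) for e in errors) / len(errors)  # average miss, in wins
mse = sum(e ** 2 for e in errors) / len(errors)  # squaring punishes big misses
print(round(mae, 2), round(mse, 2))
```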

Lopez’s post includes some nice column charts comparing those stats across sources, but it doesn’t include the stats themselves, so I’m going to have to eyeball his numbers and do the comparison in prose.

I summarized my forecasts two ways: 1) counts of the games each team had a better-than-even chance of winning, and 2) sums of each team’s predicted probabilities of winning.
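In code, the two summaries differ by a single step—either threshold each game at 0.5 first, or just sum the raw probabilities. A toy version for one team, with made-up game-level probabilities:

```python
# Two ways to turn one team's game-level win probabilities into a
# projected season win total. The probabilities are hypothetical.
win_probs = [0.71, 0.55, 0.48, 0.62, 0.30]

# 1) Whole-game counts: call every game with p > 0.5 a win.
whole_game_count = sum(1 for p in win_probs if p > 0.5)

# 2) Summed probabilities: the expected number of wins, uncertainty intact.
summed_probs = sum(win_probs)

print(whole_game_count, round(summed_probs, 2))  # 3 vs. 2.66
```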

  • The MAE for my whole-game counts was 2.48—only a little bit better than the ultra-naive eight-wins-for-everyone prediction and worse than everything else, including just using last year’s win totals. The MSE for those counts was 8.89, still worse than everything except the simple eights. For comparison, the MAE and MSE for the Vegas predictions were roughly 2.0 and 6.0, respectively.
  • The MAE for my sums was 2.31—about as good as the worst of the five “statsheads” Lopez considered, but still a shade worse than just carrying forward the 2013 win totals. The MSE for those summed win probabilities, however, was 7.05. That’s better than one of the sources Lopez considered and pretty close to two others, and it handily beats the two naive benchmarks.

To get a better sense of how large the errors in my forecasts were and how they were distributed, I also plotted the predicted and observed win totals by team. In the charts below, the black dots are the predictions, and the red dots are the observed results. The first plot uses the whole-game counts; the second, the summed win probabilities. Teams are ordered from left to right according to their rank in the preseason wiki survey.

[Figure: Predicted (black) and observed (red) 2014 regular-season win totals by team, using whole-game counts]

[Figure: Predicted (black) and observed (red) 2014 regular-season win totals by team, using summed win probabilities]
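For anyone who wants to draw that kind of chart, a bare-bones version looks something like this. The team labels and totals are placeholders, not my actual forecasts:

```python
import matplotlib.pyplot as plt

# Placeholder data for three teams, ordered by preseason survey rank.
teams = ["SEA", "DEN", "NE"]
predicted = [11.3, 10.8, 10.1]  # hypothetical projected win totals
observed = [12, 12, 12]         # hypothetical observed win totals

x = range(len(teams))
plt.scatter(x, predicted, color="black", label="predicted")
plt.scatter(x, observed, color="red", label="observed")
plt.xticks(x, teams)
plt.ylabel("2014 regular-season wins")
plt.legend()
plt.show()
```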

Substantively, those charts spotlight some things most football fans could already tell you: Dallas and Arizona were the biggest positive surprises of the 2014 regular season, while San Francisco, New Orleans, and Chicago were probably the biggest disappointments. Detroit and Buffalo also exceeded expectations, although only one of them (Detroit) made it to the postseason, while Tampa Bay, Tennessee, the NY Giants, and the Washington football team also underperformed.

Statistically, it’s interesting but not surprising that the summed win probabilities do markedly better than the whole-game counts. Pro football is a noisy game, and we throw out a lot of information about the uncertainty of each contest’s outcome when we convert those probabilities into binary win/lose calls. In essence, those binary calls are inherently overconfident, so the win counts they produce are, predictably, much noisier than the ones we get by summing the underlying probabilities.

In spite of its modest performance in 2014, I plan to repeat this exercise next year. The linear regression model I use to convert the survey results into game-level forecasts has home-field advantage and the survey scores as its only inputs. The 2014 version of that model was estimated from just a single prior season’s data, so doubling the size of the historical sample to 512 games will probably help a little. Like all survey results, my team-strength score depends on the pool of respondents, and I keep hoping to get a bigger and better-informed crowd to participate in that part of the exercise. And, most important, it’s fun!
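For the curious, the general shape of that kind of model is easy to sketch: regress the home team's point differential on the two teams' survey scores plus an intercept that absorbs home-field advantage, then squash the predicted differential into a win probability. Everything below—the data, the coefficients, the logistic conversion and its scale—is a placeholder, not the fitted model behind these forecasts:

```python
import numpy as np

# Placeholder training data: one row per game.
home_score = np.array([78.0, 55.0, 62.0, 70.0])  # home team's survey score
away_score = np.array([60.0, 71.0, 58.0, 49.0])  # away team's survey score
point_diff = np.array([10.0, -3.0, 7.0, 14.0])   # home points minus away points

# The intercept captures home-field advantage; the survey scores do the rest.
X = np.column_stack([np.ones_like(home_score), home_score, away_score])
beta, *_ = np.linalg.lstsq(X, point_diff, rcond=None)

# Forecast a new matchup, then convert the predicted differential into a
# win probability with a logistic curve (the 7-point scale is arbitrary).
new_game = np.array([1.0, 66.0, 64.0])
pred_diff = float(new_game @ beta)
p_home_win = 1.0 / (1.0 + np.exp(-pred_diff / 7.0))
print(round(pred_diff, 1), round(p_home_win, 2))
```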
