Which NFL Teams Are the Biggest Surprises of 2015 So Far?

We’re now 4.0625 weeks into the NFL’s 2015 regular season. (If you don’t know what the NFL is, you should probably stop reading now.) That’s about one-quarter of the whole 256-game shebang, enough to start taking stock of preseason predictions. So I got to wondering: Which teams have been the biggest surprises so far?

To get one answer to this question, I downloaded game results from Pro-Football-Reference.com (here) and compared them to the central tendencies of my preseason predictive simulations (here). The mean error of the predictions for each team so far is plotted below. The error in this case is the difference between the number of points by which the team was expected to win or lose each game and the number of points by which it actually won or lost. For example, my simulations had the Colts, on average, winning this week’s Thursday-night game against the Texans by 4, but they actually won by 7. That’s an error of +3 for the Colts and -3 for Houston. The mean error is the average of those errors across all games played so far. So, a positive mean error (blue dots) means the team is over-performing relative to the preseason predictions, while a negative mean error (red dots) means it’s under-performing.


Most of those results won’t surprise regular NFL watchers. The New York Football Jets finished 4–12 last year and ranked near the bottom in my preseason wiki survey, but they’re off to a 3–1 start this year. The Falcons, who went 6–10 in 2014 and garnered a low-middle score in the wiki survey, are undefeated after four weeks. At the other end of the scale, the Dolphins got a high-middle score in the preseason survey, but they have stumbled to a 1–3 start.

It’s also interesting (to me, anyway) to note how the team-specific errors are only loosely correlated with differences between predicted and observed records. For example, the Giants are only 2–2 so far this year, but they show up as one of the biggest over-performers of the first four weeks. That’s partly because both of those two losses were close games that could easily have flipped the other way. The Giants were expected to be on the bad side of mediocre, but they’ve been competitive in every game so far. Ditto for the Ravens, who only show up as mild under-performers but have a 1–3 record (sob). At least three of those four games were expected to be close, and all of them turned on late scores; unfortunately, only one of those four late turns broke in Baltimore’s favor.

This exercise is only interesting if the preseason predictions on which we’re basing the calls about over– or under-performance are sound. So far, they look pretty solid. After four weeks, the root mean squared error for the predicted net scores is 12.8, and the mean squared error is 165. Those look large, but I think they’re in line with other preseason score forecasts. If we convert the predicted net scores to binary predicted outcomes, the model is 40–23 after four weeks, or 41–23 if we include last night’s Colts-Texans game. That’s not exactly clairvoyant, but it beats eight of ESPN’s 13 experts and matches one more, and they make their predictions each week with updated information.

Military Coup in Burkina Faso

Yesterday, Burkina Faso suffered its second military coup in less than a year. Just a few weeks before scheduled national elections, members of the presidential guard (or RSP, per its French initials) arrested the interim president and prime minister and dissolved the government those men led. According to Reuters:

“The patriotic forces, grouped together in the National Council for Democracy, have decided today to put an end to the deviant transitional regime,” the military official said on RTB state television.

“The transition has progressively distanced itself from the objectives of refounding our democracy,” he said, adding that a revision of the electoral law that blocked supporters of Compaore from running in the planned Oct. 11 had “created divisions and frustrations amongst the people.”

My knowledge of politics in Burkina Faso is shallow, but if I had to guess why this coup happened now, this, also from Reuters, is what I would spotlight:

Burkina Faso’s powerful presidential guard should be dismantled, according to a commission charged with proposing reforms…

In a report submitted to Prime Minister Yacouba Isaac Zida, himself a former commander in the RSP, the national reconciliation and reform commission on Monday described the 1,200 troop strong unit as “an army within an army”.

It called for the regiment to be broken up and its members redeployed within the framework of a broader reform of the military.

In a July post, I spotlighted regional experts’ concerns about another coup by Burkina Faso’s presidential guard, observing how those concerns encapsulated the dilemma that confronts civilian politicians who wish to deepen democracy—or, more cynically, their own power—by strengthening their control over the military. Stronger civilian control means fewer military prerogatives, and as a general rule, political actors prefer not to cede power. I wonder if the RSP saw that reform commission’s report as a harbinger of its fate under the next batch of elected civilian leaders and decided to act now, against the shallow-rooted interim government.

In this year’s statistical assessments of coup risk, Burkina Faso ranked fifth in the world, in no small part because of the coup it suffered last year. As I discussed in a blog post a few years ago, when Mali got hit by its second coup in a 10-month span, coup attempts amplify uncertainty in ways that can keep a country on edge for years. Whether or not the latest coup attempt sticks and without touching the forecasting algorithm, I can tell you that Burkina Faso will land near the top of the global list in next year’s statistical assessments of coup risk, too.

Big Data Doesn’t Automatically Produce Better Predictions

At FiveThirtyEight, Neil Payne and Rob Arthur report on an intriguing puzzle:

In an age of unprecedented baseball data, we somehow appear to be getting worse at knowing which teams are — and will be — good.

Player-level predictions are as good if not better than they used to be, but team-level predictions of performance are getting worse. Payne and Arthur aren’t sure why, but they rank a couple of trends in the industry — significant changes in the age structure of the league’s players and, ironically, the increased use of predictive analytics in team management — among the likely culprits.

This story nicely illustrates a fact that breathless discussions of the power of “Big Data” often elide: more and better data don’t automatically lead to more accurate predictions. Observation and prediction are interrelated, but the latter does not move in lock step with the former. At least two things can weaken the link between those two steps in the analytical process.

First, some phenomena are just inherently difficult or impossible to predict with much accuracy. That’s not entirely true of baseball; as Payne and Arthur show, team-level performance predictions have been pretty good in the past. It is true of many other phenomena or systems, however. Take earthquakes; we can now detect and record these events with tremendous precision, but we’re still pretty lousy at anticipating when they’ll occur and how strong they will be. So far, better observation hasn’t led to big gains in prediction.

Second, the systems we’re observing sometimes change, even as we get better at observing them. This is what Payne and Arthur imply is occurring in baseball when they identify trends in the industry as likely explanations for a decline in the predictive power of models derived from historical data. It’s like trying to develop a cure for a disease that’s evolving rapidly as you work on it; the cure you develop in the lab might work great on the last version you captured, but by the time you deploy it, the disease has evolved further, and the treatment doesn’t have much effect.

I wonder if this is also the trajectory social science will follow over the next few decades. Right now, we’re getting hit by the leading edge of what will probably be a large and sustained flood tide of new data on human behavior.  That inflow is producing some rather optimistic statements about how predictable human behavior in general, and sometimes politics in particular, will become as we discover deeper patterns in those data.

I don’t share that confidence. A lot of human behavior is predictably routine, and a torrent of new behavioral data will almost certainly make us even better at predicting these actions and events. For better or for worse, though, those routines are not especially interesting or important to most political scientists. Political scientists are more inclined to focus on “high” politics, which remains fairly opaque, or on system-level outcomes like wars and revolutions that emerge from changes in individual-level behavior in non-obvious ways. I suspect we’ll get a little better at predicting these things as we accumulate richer data on various parts of those systems, but I am pretty sure we won’t ever get great at it. The processes are too complex, and the systems themselves are constantly evolving, maybe even at an accelerating rate.

Interactive 2015 NFL Forecasts

As promised in my last post, I’ve now built and deployed a web app that lets you poke through my preseason forecasts for the 2015 NFL regular season:

2015 NFL Forecasts

I learned several new tricks in the course of generating these forecasts and building this app, so the exercise served its didactic purpose. (You can find the code for the app here, on GitHub.) I also got lucky with the release of a new R package that solved a crucial problem I was having when I started to work on this project a couple of weeks ago. Open source software can be a wonderful thing.

The forecasts posted right now are based on results of the pairwise wiki survey through the morning of Monday, August 17. At that point, the survey had already logged upwards of 12,000 votes, triple the number cast in last year’s edition. This time around, I posted a link to the survey on the r/nfl subreddit, and that post produced a brief torrent of activity from what I hope was a relatively well-informed crowd.

The regular season doesn’t start until September, and I will update these forecasts at least once more before that happens. With so many votes already cast, though, the results will only change significantly if a) a large number of new votes are cast and b) those votes differ substantially from the ones already cast, and those conditions are highly unlikely to intersect.

One thing these forecasts help to illustrate is how noisy a game professional football is. By noisy, I mean hard to predict with precision. Even in games where one team is much stronger than the other, we still see tremendous variance in the simulated net scores and the associated outcomes. Heavy underdogs will win big every once in a while, and games we’d consider close when watching can produce a wide range of net scores.

Take, for example, the week 1 match-up between the Bears and Packers. Even though Chicago’s the home team, the simulation results (below) favor Green Bay by more than eight points. At the same time, those simulations also include a smattering of outcomes in which the Bears win by multiple touchdowns, and the peak of the distribution of simulations is pretty broad and flat. Some of that variance results from the many imperfections of the model and survey scores, but a lot of it is baked into the game, and plots of the predictive simulations nicely illustrate that noisiness.


The big thing that’s still missing from these forecasts is updating during the season. The statistical model that generates the predictive simulations takes just two inputs for each game — the difference between the two teams’ strength scores and the name of the home team — and, barring catastrophe, only one of those inputs can change as the season passes. I could leave the wiki survey running throughout the season, but the model that turns survey votes into scores doesn’t differentiate between recent and older votes, so updating the forecasts with the latest survey scores is unlikely to move the needle by much.*

I’m now hoping to use this problem as an entry point to learning about Bayesian updating and how to program it in R. Instead of updating the actual survey scores, we could treat the preseason scores as priors and then use observed game scores or outcomes to sequentially update estimates of them. I haven’t figured out how to implement this idea yet, but I’m working on it and will report back if I do.

* The pairwise wiki survey runs on open source software, and I can imagine modifying the instrument to give more weight to recent votes than older ones. Right now, I don’t have the programming skills to make those modifications, but I’m still hoping to find someone who might want to work with me, or just take it upon himself or herself, to do this.

Yes, By Golly, I Am Ready for Some Football

The NFL’s 2015 season sort of got underway last night with the Hall of Fame Game. Real preseason play doesn’t start until this weekend, and the first kickoff of the regular season is still a month away.

No matter, though — I’m taking the Hall of Fame Game as my cue to launch this year’s football forecasting effort. As it has for the past two years (see here and here), the process starts with me asking you to help assess the strength of this year’s teams by voting in a pairwise wiki survey:

In the 2015 NFL season, which team will be better?

That survey produces scores on a scale of 0–100. Those scores will become the crucial inputs into simulations based on a simple statistical model estimated from the past two years’ worth of survey data and game results. Using an R function I wrote, I’ve determined that I should be able to improve the accuracy of my forecasts a bit this year by basing them on a mixed-effects model with random intercepts to account for variation in home-team advantages across the league. Having another season’s worth of predicted and actual outcomes should help, too; with two years on the books, my model-training sample has doubled.

An improvement in accuracy would be great, but I’m also excited about using R Studio’s Shiny to build a web page that will let you explore the forecasts at a few levels: by game, by team, and by week. Here’s a screenshot of the game-level tab from a working version using the 2014 data. It plots the distribution of the net scores (home – visitor) from the 1,000 simulations, and it reports win probabilities for both teams and a line (the median of the simulated scores).


The “By team” tab lets you pick a team to see a plot of the forecasts for all 16 of their games, along with their predicted wins (count of games with win probabilities over 0.5) and expected wins (sum of win probabilities for all games) for the year. The “By week” tab (shown below) lets you pick a week to see the forecasts for all the games happening in that slice of the season. Before, I plan to add annotation to the plot, reporting the lines those forecasts imply (e.g., Texans by 7).


Of course, the quality of the forecasts displayed in that app will depend heavily on participation in the wiki survey. Without a diverse and well-informed set of voters, it will be hard to do much better than guessing that each team will do as well this year as it did last year. So, please vote here; please share this post or the survey link with friends and family who know something about pro football; and please check back in a few weeks for the results.

2015 Tour de France Predictions

I like to ride bikes, I like to watch the pros race their bikes, and I make forecasts for a living, so I thought it would be fun to try to predict the outcome of this year’s Tour de France, which starts this Saturday and ends on July 26. I’m also interested in continuing to explore the predictive power of pairwise wiki surveys, a crowdsourcing tool that I’ve previously used to try to forecast mass-killing onsets, coup attempts, and pro football games, and that ESPN recently used to rank NBA draft prospects.

So, a couple of weeks ago, I used All Our Ideas to create a survey that asks, “Which rider is more likely to win the 2015 Tour de France?” I seeded the survey with the names of 11 riders—the 10 seen by bookmakers at Paddy Power as the most likely winners, plus Peter Sagan because he’s fun to watchposted a link to the survey on Tumblr, and trolled for respondents on Twitter and Facebook. The survey got off to a slow start, but then someone posted a link to it in the r/cycling subreddit, and the votes came pouring in. As of this afternoon, the survey had garnered more than 4,000 votes in 181 unique user sessions that came from five continents (see the map below). The crowd also added a handful of other riders to the set under consideration, bringing the list up to 16.


So how does that self-selected crowd handicap the race? The dot plot below shows the riders in descending order by their survey scores, which range from 0 to 100 and indicate the probability that that rider would beat a randomly chosen other rider for a randomly chosen respondent. In contrast to Paddy Power, which currently shows Chris Froome as the clear favorite and gives Nairo Quintana a slight edge over Alberto Contador, this survey sees Contador as the most likely winner (survey score of 90), followed closely by Froome (87) and a little further by Quintana (80). Both sources put Vincenzo Nibali as fourth likeliest (73) and Tejay van Garderen (65) and Thibaut Pinot (51) in the next two spots, although Paddy Power has them in the opposite order. Below that, the distances between riders’ chances get smaller, but the wiki survey’s results still approximate the handicapping of the real-money markets pretty well.


There are at least a couple of ways to try to squeeze some meaning out those scores. One is to read the chart as a predicted finishing order for the 16 riders listed. That’s useful for something like a bike race, where we—well, some of us, anyway—care not only who wins, but also where other will riders finish, too.

We can also try to convert those scores to predicted probabilities of winning. The chart below shows what happens when we do that by dividing each rider’s score by the sum of all scores and then multiplying the result by 100. The probabilities this produces are all pretty low and more tightly bunched than seems reasonable, but I’m not sure how else to do this conversion. I tried squaring and cubing the scores; the results came closer to what the betting-market odds suggest are the “right” values, but I couldn’t think of a principled reason to do that, so I’m not showing those here. If you know a better way to get from those model scores to well-calibrated win probabilities, please let me know in the comments.


So that’s what the survey says. After the Tour concludes in a few weeks, I’ll report back on how the survey’s predictions fared. Meanwhile, here’s wishing the athletes a crash–, injury–, and drug–free tour. Judging by the other big races I’ve seen so far this year, it should be a great one to watch.

A Plea for More Prediction

The second Annual Bank Conference on Africa happened in Berkeley, CA, earlier this week, and the World Bank’s Development Impact blog has an outstanding summary of the 50-odd papers presented there. If you have to pick between reading this post and that one, go there.

One paper on that roster that caught my eye revisits the choice of statistical models for the study of civil wars. As authors John Paul Dunne and Nan Tian describe, the default choice is logistic regression, although probit gets a little playing time, too. They argue, however, that a zero-inflated Poisson (ZIP) model matches the data-generating process better than either of these traditional picks, and they show that this choice affects what we learn about the causes of civil conflict.

Having worked on statistical models of civil conflict for nearly 20 years, I have some opinions on that model-choice issue, but those aren’t what I want to discuss right now. Instead, I want to wonder aloud why more researchers don’t use prediction as the yardstick—or at least one of the yardsticks—for adjudicating these model comparisons.

In their paper, Dunne and Tian stake their claim about the superiority of ZIP to logit and probit on comparisons of Akaike information criteria (AIC) and Vuong tests. Okay, but if their goal is to see if ZIP fits the underlying data-generating process better than those other choices, what better way to find out than by comparing out-of-sample predictive power?

Prediction is fundamental to the accumulation of scientific knowledge. The better we understand why and how something happens, the more accurate our predictions of it should be. When we estimate models from observational data and only look at how well our models fit the data from which they were estimated, we learn some things about the structure of that data set, but we don’t learn how well those things generalize to other relevant data sets. If we believe that the world isn’t deterministic—that the observed data are just one of many possible realizations of the world—then we need to care about that ability to generalize, because that generalization and the discovery of its current limits is the heart of the scientific enterprise.

From a scientific standpoint, the ideal world would be one in which we could estimate models representing rival theories, then compare the accuracy of the predictions they generate across a large number of relevant “trials” as they unfold in real time. That’s difficult for scholars studying big but rare events like civil wars and wars between states; though; a lot of time has to pass before we’ll see enough new examples to make a statistically powerful comparison across models.

But, hey, there’s an app for that—cross-validation! Instead of using all the data in the initial estimation, hold some out to use as a test set for the models we get from the rest. Better yet, split the data into several equally-sized folds and then iterate the training and testing across all possible groupings of them (k-fold cross-validation). Even better, repeat that process a bunch of times and compare distributions of the resulting statistics.

Prediction is the gold standard in most scientific fields, and cross-validation is standard practice in many areas of applied forecasting, because they are more informative than in-sample tests. For some reason, political science still mostly eschews both.* Here’s hoping that changes soon.

* For some recent exceptions to this rule on topics in world politics, see Ward, Greenhill, and Bakke and Blair, Blattman, and Hartman on predicting civil conflict; Chadefaux on warning signs of interstate war; Hill and Jones on state repression; and Chenoweth and me on the onset of nonviolent campaigns.

An Applied Forecaster’s Bad Dream

This is the sort of thing that freaks me out every time I’m getting ready to deliver or post a new set of forecasts:

In its 2015 States of Fragility report, the Organization for Economic Co-operation and Development (OECD) decided to complicate its usual one-dimensional list of fragile states by assessing five dimensions of fragility: Violence, Justice, Institutions, Economic Foundations and Resilience…

Unfortunately, something went wrong during the calculations. In my attempts to replicate the assessment, I found that the OECD misclassified a large number of states.

That’s from a Monkey Cage post by Thomas Leo Scherer, published today. Here, per Scherer, is why those errors matter:

Recent research by Judith Kelley and Beth Simmons shows that international indicators are an influential policy tool. Indicators focus international attention on low performers to positive and negative effect. They cause governments in poorly ranked countries to take action to raise their scores when they realize they are being monitored or as domestic actors mobilize and demand change after learning how they rate versus other countries. Given their potential reach, indicators should be handled with care.

For individuals or organizations involved in scientific or public endeavors, the best way to mitigate that risk is transparency. We can and should argue about concepts, measures, and model choices, but given a particular set of those elements, we should all get essentially the same results. When one or more of those elements is hidden, we can’t fully understand what the reported results represent, and researchers who want to improve the design by critiquing and perhaps extending it are forced to box shadows. Also, individuals and organizations can double– and triple-check their own work, but errors are almost inevitable. When getting the best possible answers matters more than the risk of being seen making mistakes, then transparency is the way to go. This is why the Early Warning Project shares the data and code used to produce its statistical risk assessments in a public repository, and why Reinhart and Rogoff probably (hopefully?) wish they’d done something similar.

Of course, even though transparency improves the probability of catching errors and improving on our designs, it doesn’t automatically produce those goods. What’s more, we can know that we’re doing the right thing and still dread the public discovery of an error. Add to that risk the near-certainty of other researchers scoffing at your terrible code, and it’s easy see why even the best practices won’t keep you from breaking out in a cold sweat each time you hit “Send” or “Publish” on a new piece of work.


The Myth of Comprehensive Data

“What about using Twitter sentiment?”

That suggestion came to me from someone at a recent Data Science DC meetup, after I’d given a short talk on assessing risks of mass atrocities for the Early Warning Project, and as the next speaker started his presentation on predicting social unrest. I had devoted the first half of my presentation to a digression of sorts, talking about how the persistent scarcity of relevant public data still makes it impossible to produce global forecasts of rare political crises—things like coups, insurgencies, regime breakdowns, and mass atrocities—that are as sharp and dynamic as we would like.

The meetup wasn’t the first time I’d heard that suggestion, and I think all of the well-intentioned people who have made it to me have believed that data derived from Twitter would escape or overcome those constraints. In fact, the Twitter stream embodies them. Over the past two decades, technological, economic, and political changes have produced an astonishing surge in the amount of information available from and about the world, but that surge has not occurred evenly around the globe.

Think of the availability of data as plant life in a rugged landscape, where dry peaks are places of data scarcity and fertile valleys represent data-rich environments. The technological developments of the past 20 years are like a weather pattern that keeps dumping more and more rain on that topography. That rain falls unevenly across the landscape, however, and it doesn’t have the same effect everywhere it lands. As a result, plants still struggle to grow on many of those rocky peaks, and much of the new growth occurs where water already collected and flora were already flourishing.

The Twitter stream exemplifies this uneven distribution of data in a couple of important ways. Take a look at the map below, a screenshot I took after letting Tweetping run for about 16 hours spanning May 6–7, 2015. The brighter the glow, the more Twitter activity Tweetping saw.

tweetping 1530 20150506 to 0805 20150507

Some of the spatial variation in that map reflects differences in the distribution of human populations, but not all of it. Here’s a map of population density, produced by Daysleeper using data from CEISIN (source). If you compare this one to the map of Twitter usage, you’ll see that they align pretty well in Europe, the Americas, and some parts of Asia. In Africa and other parts of Asia, though, not so much. If it were just a matter of population density, then India and eastern China should burn brightest, but they—and especially China—are relatively dark compared to “the West.” Meanwhile, in Africa, we see pockets of activity, but there are whole swathes of the continent that are populated as or more densely than the brighter parts of South America, but from which we see virtually no Twitter activity.

world population density map

So why are some pockets of human settlement less visible than others? Two forces stand out: wealth and politics.

First and most obvious, access to Twitter depends on electricity and telecommunications infrastructure and gadgets and literacy and health and time, all of which are much scarcer in poorer parts of the world than they are in richer places. The map below shows lights at night, as seen from space by U.S. satellites 20 years ago and then mapped by NASA (source). These light patterns are sometimes used as a proxy for economic development (e.g., here).


This view of the world helps explain some of the holes in our map of Twitter activity, but not all of it. For example, many of the densely populated parts of Africa don’t light up much at night, just as they don’t on Tweetping, because they lack the relevant infrastructure and power production. Even 20 years ago, though, India and China looked much brighter through this lens than they do on our Twitter usage map.

So what else is going on? The intensity and character of Twitter usage also depends on freedoms of information and speech—the ability and desire to access the platform and to speak openly on it—and this political layer keeps other areas in the dark in that Tweetping map. China, North Korea, Cuba, Ethiopia, Eritrea—if you’re trying to anticipate important political crises, these are all countries you would want to track closely, but Twitter is barely used or unavailable in all of them as a direct or indirect consequence of public policy. And, of course, there are also many places where Twitter is accessible and used but censorship distorts the content of the stream. For example, Saudi Arabia lights up pretty well on the Twitter-usage map, but it’s hard to imagine people speaking freely on it when a tweet can land you in prison.

Clearly, wealth and political constraints still strongly shape the view of the world we can get from new data sources like Twitter. Contrary to the heavily-marketed myth of “comprehensive data,” poverty and repression continue to hide large swathes of the world out of our digital sight, or to distort the glimpses we get of them.

Unfortunate for efforts to forecast rare political crises, those two structural features that so strongly shape the production and quality of data also correlate with the risks we want to anticipate. The map below shows the Early Warning Project‘s most recent statistical assessments of the risk of onsets of state-led mass-killing episodes. Now flash back to the visualization of Twitter usage above, and you’ll see that many of the countries colored most brightly on this map are among the darkest on that one. Even in 2015, the places about which we most need more information to sharpen our forecasts of rare political crises are the ones that are still hardest to see.


Statistically, this is the second-worst of all possible worlds, the worst one being the total absence of information. Data are missing not at random, and the processes producing those gaps are the same ones that put places at greater risk of mass atrocities and other political calamities. This association means that models we estimate with those data will often be misleading. There are ways to mitigate these problems, but they aren’t necessarily simple, cheap, or effective, and that’s before we even start in on the challenges of extracting useful measures from something as heterogeneous and complex as the Twitter stream.

So that’s what I see when I hear people suggest that social media or Google Trends or other forms of “digital exhaust” have mooted the data problems about which I so often complain. Lots of organizations are spending a lot of money trying to overcome these problems, but the political and economic topography producing them does not readily yield. The Internet is part of this complex adaptive system, not a space outside it, and its power to transform that system is neither as strong nor as fast-acting as many of us—especially in the richer and freer parts of the world—presume.

To Realize the QDDR’s Early-Warning Goal, Invest in Data-Making

The U.S. Department of State dropped its second Quadrennial Diplomacy and Development Review, or QDDR, last week (here). Modeled on the Defense Department’s Quadrennial Defense Review, the QDDR lays out the department’s big-picture concerns and objectives so that—in theory—they can guide planning and shape day-to-day decision-making.

The new QDDR establishes four main goals, one of which is to “strengthen our ability to prevent and respond to internal conflict, atrocities, and fragility.” To help do that, the State Department plans to “increase [its] use of early warning analysis to drive early action on fragility and conflict.” Specifically, State says it will:

  1. Improve our use of tools for analyzing, tracking, and forecasting fragility and conflict, leveraging improvements in analytical capabilities;
  2. Provide more timely and accurate assessments to chiefs of mission and senior decision-makers;
  3. Increase use of early warning data and conflict and fragility assessments in our strategic planning and programming;
  4. Ensure that significant early warning shifts trigger senior-level review of the mission’s strategy and, if necessary, adjustments; and
  5. Train and deploy conflict-specific diplomatic expertise to support countries at risk of conflict or atrocities, including conflict negotiation and mediation expertise for use at posts.

Unsurprisingly, that plan sounds great to me. We can’t now and never will be able to predict precisely where and when violent conflict and atrocities will occur, but we can assess risks with enough accuracy and lead time to enable better strategic planning and programming. These forecasts don’t have to be perfect to be earlier, clearer, and more reliable than the traditional practices of deferring to individual country or regional analysts or just reacting to the news.

Of course, quite a bit of well-designed conflict forecasting is already happening, much of it paid for by the U.S. government. To name a few of the relevant efforts: The Political Instability Task Force (PITF) and the Worldwide Integrated Conflict Early Warning System (W-ICEWS) routinely update forecasts of various forms of political crisis for U.S. government customers. IARPA’s Open Source Indicators (OSI) and Aggregative Contingent Estimation (ACE) programs are simultaneously producing forecasts now and discovering ways to make future forecasts even better. Meanwhile, outside the U.S. government, the European Union has recently developed its own Global Conflict Risk Index (GCRI), and the Early Warning Project now assesses risks of mass atrocities in countries worldwide.

That so much thoughtful risk assessment is being done now doesn’t mean it’s a bad idea to start new projects. If there are any iron laws of forecasting hard-to-predict processes like political violence, one of them is that combinations of forecasts from numerous sources should be more accurate than forecasts from a single model or person or framework. Some of the existing projects already do this kind of combining themselves, but combinations of combinations will often be even better.

Still, if I had to channel the intention expressed in this part of the QDDR into a single activity, it would not be the construction of new models, at least not initially. Instead, it would be data-making. Social science is not Newtonian physics, but it’s not astrology, either. Smart people have been studying politics for a long time, and collectively they have developed a fair number of useful ideas about what causes or precedes violent conflict. But, if you can’t track the things those theorists tell you to track, then your forecasts are going to suffer. To improve significantly on the predictive models of political violence we have now, I think we need better inputs most of all.

When I say “better” inputs, I have a few things in mind. In some cases, we need to build data sets from scratch. When I was updating my coup forecasts earlier this year, a number of people wondered why I didn’t include measures of civil-military relations, which are obviously relevant to this particular risk. The answer was simple: because global data on that topic don’t exist. If we aren’t measuring it, we can’t use it in our forecasts, and the list of relevant features that falls into this set is surprisingly long.

In other cases, we need to revive them. Social scientists often build “boutique” data sets for specific research projects, run the tests they want to run on them, and then move on to the next project. Sometimes, the tests they or others run suggest that some features captured in those data sets would make useful predictors. Those discoveries are great in principle, but if those data sets aren’t being updated, then applied forecasters can’t use that knowledge. To get better forecasts, we need to invest in picking up where those boutique data sets left off so we can incorporate their insights into our applications.

Finally and in almost all cases, we need to observe things more frequently. Most of the data available now to most conflict forecasters is only updated once each year, often on a several-month delay and sometimes as much as two years later (e.g., data describing 2014 becomes available in 2016). That schedule is fine for basic research, but it is crummy for applied forecasting. If we want to be able to give assessments and warnings that as current as possible to those “chiefs of mission and senior decision-makers” mentioned in the QDDR, then we need to build models with data that are updated as frequently as possible. Daily or weekly are ideal, but monthly updates would suffice in many cases and would mark a huge improvement over the status quo.

As I said at the start, we’re never going to get models that reliably tell us far in advance exactly where and when violent conflicts and mass atrocities will erupt. I am confident, however, that we can assess these risks even more accurately than we do now, but only if we start making more, and better versions, of the data our theories tell us we need.

I’ll end with a final plea to any public servants who might be reading this: if you do invest in developing better inputs, please make the results freely available to the public. When you share your data, you give the crowd a chance to help you spot and fix your mistakes, to experiment with various techniques, and to think about what else you might consider, all at no additional cost to you. What’s not to like about that?


Get every new post delivered to your Inbox.

Join 13,562 other followers

%d bloggers like this: