Interactive 2015 NFL Forecasts

As promised in my last post, I’ve now built and deployed a web app that lets you poke through my preseason forecasts for the 2015 NFL regular season:

2015 NFL Forecasts

I learned several new tricks in the course of generating these forecasts and building this app, so the exercise served its didactic purpose. (You can find the code for the app here, on GitHub.) I also got lucky with the release of a new R package that solved a crucial problem I was having when I started to work on this project a couple of weeks ago. Open source software can be a wonderful thing.

The forecasts posted right now are based on results of the pairwise wiki survey through the morning of Monday, August 17. At that point, the survey had already logged upwards of 12,000 votes, triple the number cast in last year’s edition. This time around, I posted a link to the survey on the r/nfl subreddit, and that post produced a brief torrent of activity from what I hope was a relatively well-informed crowd.

The regular season doesn’t start until September, and I will update these forecasts at least once more before that happens. With so many votes already cast, though, the results will only change significantly if (a) a large number of new votes are cast and (b) those new votes differ substantially from the ones already logged, and it’s unlikely that both of those things will happen.

One thing these forecasts help to illustrate is how noisy a game professional football is. By noisy, I mean hard to predict with precision. Even in games where one team is much stronger than the other, we still see tremendous variance in the simulated net scores and the associated outcomes. Heavy underdogs will win big every once in a while, and games we’d consider close when watching can produce a wide range of net scores.

Take, for example, the week 1 match-up between the Bears and Packers. Even though Chicago’s the home team, the simulation results (below) favor Green Bay by more than eight points. At the same time, those simulations also include a smattering of outcomes in which the Bears win by multiple touchdowns, and the peak of the distribution of simulations is pretty broad and flat. Some of that variance results from the many imperfections of the model and survey scores, but a lot of it is baked into the game, and plots of the predictive simulations nicely illustrate that noisiness.


The big thing that’s still missing from these forecasts is updating during the season. The statistical model that generates the predictive simulations takes just two inputs for each game — the difference between the two teams’ strength scores and the name of the home team — and, barring catastrophe, only one of those inputs can change as the season passes. I could leave the wiki survey running throughout the season, but the model that turns survey votes into scores doesn’t differentiate between recent and older votes, so updating the forecasts with the latest survey scores is unlikely to move the needle by much.*

I’m now hoping to use this problem as an entry point to learning about Bayesian updating and how to program it in R. Instead of updating the actual survey scores, we could treat the preseason scores as priors and then use observed game scores or outcomes to sequentially update estimates of them. I haven’t figured out how to implement this idea yet, but I’m working on it and will report back if I do.
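As a rough sketch of the idea (in Python rather than R, and with made-up numbers), a conjugate-normal update would treat the preseason survey score as a Normal prior on team strength and each game result as a noisy observation:

```python
# Toy conjugate-normal update for one team's strength score.
# All numbers are illustrative, not actual survey values.
def update_strength(prior_mean, prior_var, observed, obs_var):
    """Combine a Normal prior with one Normal observation."""
    precision = 1 / prior_var + 1 / obs_var
    post_var = 1 / precision
    post_mean = post_var * (prior_mean / prior_var + observed / obs_var)
    return post_mean, post_var

# Preseason survey score as the prior; a noisy game-implied score as data.
mean, var = 59.0, 100.0
mean, var = update_strength(mean, var, observed=70.0, obs_var=200.0)
# The posterior mean shifts toward the data, and the variance shrinks.
```

Repeating that update after each week's games would let the strength estimates drift as the season unfolds, which is exactly what the static survey scores can't do.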

* The pairwise wiki survey runs on open source software, and I can imagine modifying the instrument to give more weight to recent votes than older ones. Right now, I don’t have the programming skills to make those modifications, but I’m still hoping to find someone who might want to work with me, or just take it upon himself or herself, to do this.


Yes, By Golly, I Am Ready for Some Football

The NFL’s 2015 season sort of got underway last night with the Hall of Fame Game. Real preseason play doesn’t start until this weekend, and the first kickoff of the regular season is still a month away.

No matter, though — I’m taking the Hall of Fame Game as my cue to launch this year’s football forecasting effort. As it has for the past two years (see here and here), the process starts with me asking you to help assess the strength of this year’s teams by voting in a pairwise wiki survey:

In the 2015 NFL season, which team will be better?

That survey produces scores on a scale of 0–100. Those scores will become the crucial inputs into simulations based on a simple statistical model estimated from the past two years’ worth of survey data and game results. Using an R function I wrote, I’ve determined that I should be able to improve the accuracy of my forecasts a bit this year by basing them on a mixed-effects model with random intercepts to account for variation in home-team advantages across the league. Having another season’s worth of predicted and actual outcomes should help, too; with two years on the books, my model-training sample has doubled.
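To illustrate what those random intercepts buy us, here's a toy partial-pooling calculation (in Python, with made-up numbers; the real model is estimated in R): each team's raw home-field advantage gets shrunk toward the league average, with more shrinkage when the estimate rests on fewer games.

```python
# Partial pooling sketch: shrink each team's estimated home advantage
# toward the league-wide mean. The shrink_weight is arbitrary here;
# a mixed-effects model estimates the equivalent quantity from data.
def shrink(team_mean, n_games, league_mean, shrink_weight=8.0):
    w = n_games / (n_games + shrink_weight)
    return w * team_mean + (1 - w) * league_mean

league_avg = 3.0  # hypothetical league-wide home advantage, in points
teams = {"A": (6.0, 8), "B": (-1.0, 8), "C": (3.5, 8)}  # (raw mean, games)
pooled = {t: shrink(m, n, league_avg) for t, (m, n) in teams.items()}
```

The pooled estimates sit between each team's raw number and the league average, which is why they tend to predict better than either extreme.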

An improvement in accuracy would be great, but I’m also excited about using RStudio’s Shiny to build a web page that will let you explore the forecasts at a few levels: by game, by team, and by week. Here’s a screenshot of the game-level tab from a working version using the 2014 data. It plots the distribution of the net scores (home – visitor) from the 1,000 simulations, and it reports win probabilities for both teams and a line (the median of the simulated scores).

The “By team” tab lets you pick a team to see a plot of the forecasts for all 16 of their games, along with their predicted wins (count of games with win probabilities over 0.5) and expected wins (sum of win probabilities for all games) for the year. The “By week” tab (shown below) lets you pick a week to see the forecasts for all the games happening in that slice of the season. Before the season starts, I plan to add annotations to the plot reporting the lines those forecasts imply (e.g., Texans by 7).
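Those two season-level summaries differ only in whether the per-game win probabilities are thresholded or summed. With hypothetical probabilities, in Python:

```python
# Predicted wins vs. expected wins for one team's season,
# using made-up win probabilities (one entry per game).
probs = [0.9, 0.7, 0.55, 0.45, 0.2]

predicted_wins = sum(p > 0.5 for p in probs)  # count of likely wins
expected_wins = sum(probs)                    # sum of win probabilities
```

Expected wins retains the uncertainty in each game, which is why it usually tracks actual win totals more closely than the thresholded count.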

Of course, the quality of the forecasts displayed in that app will depend heavily on participation in the wiki survey. Without a diverse and well-informed set of voters, it will be hard to do much better than guessing that each team will do as well this year as it did last year. So, please vote here; please share this post or the survey link with friends and family who know something about pro football; and please check back in a few weeks for the results.

How Likely Is (Nuclear) War Between the United States and Russia?

Last week, Vox ran a long piece by Max Fisher claiming that “the prospect of a major war, even a nuclear war, in Europe has become thinkable, [experts] warn, even plausible.” Without ever clarifying what “thinkable” or “plausible” mean in this context, Fisher seems to be arguing that, while still unlikely, the probability of a nuclear war between the United States and Russia is no longer small and is rising.

I finished Fisher’s piece and wondered: Is that true? As someone who’s worked on a couple of projects (here and here) that use “wisdom of crowds” methods to make educated guesses about how likely various geopolitical events are, I know that one way to try to answer that question is to ask a bunch of informed people for their best estimates and then average them.

So, on Thursday morning, I went to SurveyMonkey and set up a two-question survey that asks respondents to assess the likelihood of war between the United States and Russia before 2020 and, if war were to happen, the likelihood that one or both sides would use nuclear weapons. To elicit responses, I tweeted the link once and posted it to the Conflict Research Group on Facebook and the IRstudies subreddit. The survey is still running [UPDATE: It’s now closed, because SurveyMonkey won’t show me more than the first 100 responses without a paid subscription], but 100 people have taken it so far, and here are the results—first, on the risk of war:


And then on the risk that one or both sides would use nuclear weapons, conditional on the occurrence of war:


These results come from a convenience sample, so we shouldn’t put too much stock in them. Still, my confidence in their reliability got a boost when I learned yesterday that a recent survey of international-relations experts around the world asked an almost-identical question about the risk of a war and obtained similar results. In its 2014 survey, the TRIP project asked: “How likely is war between the United States and Russia over the next decade? Please use the 0–10 scale with 10 indicating that war will definitely occur.” They got 2,040 valid responses to that question, and here’s how they were distributed:


Those results are centered a little further to the right than the ones from my survey, but TRIP asked about a longer time period (“next decade” vs. “before 2020”), and those additional five years could explain the difference. It’s also important to note that the scales aren’t directly comparable; where the TRIP survey’s bins implicitly lie on a linear scale, mine were labeled to give respondents more options toward the extremes (e.g., “Certainly not” and “Almost certainly not”).

In light of that corroborating evidence, let’s assume for the moment that the responses to my survey are not junk. So then, how likely is a US/Russia war in the next several years, and how likely is it that such a war would go nuclear if it happened? To get to estimated probabilities of those events, I did two things:

  1. Assuming that the likelihoods implicit in my survey’s labels follow a logistic curve, I converted them to predicted probabilities as follows: p(war) = exp(response – 5)/(1 + exp(response – 5)). That rule produces the following sequence for the 0–10 bins: 0.007, 0.018, 0.047, 0.119, 0.269, 0.500, 0.731, 0.881, 0.953, 0.982, 0.993.

  2. I calculated the unweighted average of those predicted probabilities.

Here are the estimates that process produced, rounded up to the nearest whole percentage point:

  • Probability of war: 11%
  • Probability that one or both sides will use nuclear weapons, conditional on war: 18%

To translate those figures into a single number representing the crowd’s estimate of the probability of nuclear war between the US and Russia before 2020, we take their product: 2%.
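For anyone who wants to check the arithmetic, here's the whole conversion in Python (the averaging step is omitted because the vote-level data aren't reproduced here):

```python
import math

# Map each 0-10 response to a probability with a logistic curve
# centered at 5, as described in step 1 above.
def to_prob(response):
    return math.exp(response - 5) / (1 + math.exp(response - 5))

bins = [round(to_prob(r), 3) for r in range(11)]
# bins reproduces the sequence in the text: 0.007, 0.018, ..., 0.993.

# With the two survey averages in hand, the headline number is their product.
p_war, p_nuclear_given_war = 0.11, 0.18
p_nuclear_war = p_war * p_nuclear_given_war  # about 0.02
```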

Is that number different from what Max Fisher had in mind when he wrote that a nuclear war between the US and Russia is now “thinkable,” “plausible,” and “more likely than you think”? I don’t know. To me, “thinkable” and “plausible” seem about as specific as “possible,” a descriptor that applies to almost any geopolitical event you can imagine. I think Max’s chief concern in writing that piece was to draw attention to a risk that he believes to be dangerously under-appreciated, but it would be nice if he had asked his sources to be more specific about just how likely they think this calamity is.

More important, is that estimate “true”? As Ralph Atkins argued in a recent Financial Times piece about estimating the odds of Grexit, it’s impossible to say. For unprecedented and at least partially unique events like these—an exit from the euro zone, or a nuclear war between major powers—we can never know the event-generating process well enough to estimate their probabilities with high confidence. What we get instead are summaries of people’s current beliefs about those events’ likelihood. That’s highly imperfect, but it’s still informative in its own way.

2015 Tour de France Predictions

I like to ride bikes, I like to watch the pros race their bikes, and I make forecasts for a living, so I thought it would be fun to try to predict the outcome of this year’s Tour de France, which starts this Saturday and ends on July 26. I’m also interested in continuing to explore the predictive power of pairwise wiki surveys, a crowdsourcing tool that I’ve previously used to try to forecast mass-killing onsets, coup attempts, and pro football games, and that ESPN recently used to rank NBA draft prospects.

So, a couple of weeks ago, I used All Our Ideas to create a survey that asks, “Which rider is more likely to win the 2015 Tour de France?” I seeded the survey with the names of 11 riders (the 10 seen by bookmakers at Paddy Power as the most likely winners, plus Peter Sagan, because he’s fun to watch), posted a link to the survey on Tumblr, and trolled for respondents on Twitter and Facebook. The survey got off to a slow start, but then someone posted a link to it in the r/cycling subreddit, and the votes came pouring in. As of this afternoon, the survey had garnered more than 4,000 votes in 181 unique user sessions that came from five continents (see the map below). The crowd also added a handful of other riders to the set under consideration, bringing the list up to 16.


So how does that self-selected crowd handicap the race? The dot plot below shows the riders in descending order by their survey scores, which range from 0 to 100 and indicate the probability that that rider would beat a randomly chosen other rider for a randomly chosen respondent. In contrast to Paddy Power, which currently shows Chris Froome as the clear favorite and gives Nairo Quintana a slight edge over Alberto Contador, this survey sees Contador as the most likely winner (survey score of 90), followed closely by Froome (87) and a little further by Quintana (80). Both sources put Vincenzo Nibali as fourth likeliest (73) and Tejay van Garderen (65) and Thibaut Pinot (51) in the next two spots, although Paddy Power has them in the opposite order. Below that, the distances between riders’ chances get smaller, but the wiki survey’s results still approximate the handicapping of the real-money markets pretty well.


There are at least a couple of ways to try to squeeze some meaning out of those scores. One is to read the chart as a predicted finishing order for the 16 riders listed. That’s useful for something like a bike race, where we—well, some of us, anyway—care not only about who wins, but also about where the other riders finish.

We can also try to convert those scores to predicted probabilities of winning. The chart below shows what happens when we do that by dividing each rider’s score by the sum of all scores and then multiplying the result by 100. The probabilities this produces are all pretty low and more tightly bunched than seems reasonable, but I’m not sure how else to do this conversion. I tried squaring and cubing the scores; the results came closer to what the betting-market odds suggest are the “right” values, but I couldn’t think of a principled reason to do that, so I’m not showing those here. If you know a better way to get from those model scores to well-calibrated win probabilities, please let me know in the comments.
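For the curious, here's that normalization in Python, using the scores quoted above for the top six riders (the real calculation runs over all 16):

```python
# Convert survey scores to rough win probabilities by dividing each
# score by the sum of all scores, as described in the text.
scores = {"Contador": 90, "Froome": 87, "Quintana": 80,
          "Nibali": 73, "van Garderen": 65, "Pinot": 51}

total = sum(scores.values())
win_prob = {rider: 100 * s / total for rider, s in scores.items()}
# Replacing s with s**2 or s**3 above is the squaring/cubing
# experiment mentioned in the text; it spreads the field out more.
```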


So that’s what the survey says. After the Tour concludes in a few weeks, I’ll report back on how the survey’s predictions fared. Meanwhile, here’s wishing the athletes a crash-, injury-, and drug-free Tour. Judging by the other big races I’ve seen so far this year, it should be a great one to watch.

A Crowd’s-Eye View of Coup Risk in 2015

A couple of weeks ago (here), I used the blog to launch an experiment in crowdsourcing assessments of coup risk for 2015 by way of a pairwise wiki survey. The survey is still open and will stay that way until the end of the year, but with nearly 2,700 pairwise votes already cast, I thought it was a good time to take stock of the results so far.

Before discussing those results, though, let me say thank you to all the people who voted in the survey or shared the link. These data don’t materialize from thin air. They only exist because busy people contributed their knowledge and time, and I really appreciate all of those contributions.

Okay, so, what does that self-assembled crowd think about relative risks of coup attempts in 2015? The figure below maps the country scores produced from the votes cast so far. Darker grey indicates higher risk. PLEASE NOTE: Those scores fall on a 0–100 scale, but they are not estimated probabilities of a coup attempt. Instead, they are only measures of relative risk, because that’s all we can get from a pairwise wiki survey. Coup attempts are rare events—in most recent years, we’ve seen fewer than a handful of them worldwide—so the safe bet for nearly every country every year is that there won’t be any coup attempts this year.


Smaller countries can be hard to find on that map, and small differences in scores can be hard to discern, so I also like to have a list of the results to peruse. Here’s a dot plot with countries in descending order by model score. (It’d be nice to make this table sortable so you could also look for countries alphabetically, but my Internet fu is not up to that task.)


This survey is open to the public, and participants may cast as many votes as they like in as many sessions as they like. The scores summarized above come from nearly 2,700 votes cast between the morning of January 3, when I published the blog post about the survey, and the morning of January 14, when I downloaded a report on the current results. At present, this blog has a few thousand followers on WordPress and a few hundred email subscribers. I also publicized the survey twice on Twitter, where I have approximately 6,000 followers: once when I published the initial blog post, and again on January 13. As the plot below shows, participation spiked around both of those pushes and was low otherwise.


The survey instrument does not collect identifying information about participants, so it is impossible to describe the make-up of the crowd. What we do know is that those votes came from about 100 unique user sessions. Some people probably participated more than once—I know that I cast a dozen or so votes on a few occasions—so 100 unique sessions probably works out to something like 80 or 90 individuals. But that’s a guess.


We also know that those votes came from lots of different parts of the world. As the map below shows, most of the votes came from the U.S., Europe, and Australia, but there were also pockets of activity in the Middle East (especially Israel), Latin America (Brazil and Argentina), Africa (Cote d’Ivoire and Rwanda), and Asia (Thailand and Bangladesh).


I’ll talk a little more about the substance of these results when I publish my statistical assessments of coup risk for 2015, hopefully in the next week or so. Meanwhile, number-crunchers can get a .csv with the data used to generate the map and table in this post from my Google Drive (here) and the R script from GitHub (here). If you’re interested in seeing the raw vote-level data from which those scores were generated, drop me a line.

An Experiment in Crowdsourced Coup Forecasting

Later this month, I hope to have the data I need to generate and post new statistical assessments of coup risk for 2015. Meanwhile, I thought it would be interesting and useful to experiment with applying a crowdsourcing tool to this task. So, if you think you know something about coup risk and want to help with this experiment, please cast as many votes as you like here:

2015 Coup Risk Wiki Survey

For this exercise, let’s use Naunihal Singh’s (2014, p. 51) definition of a coup attempt: “An explicit action, involving some portion of the state military, police, or security forces, undertaken with intent to overthrow the government.” As Naunihal notes,

This definition retains most of the aspects commonly found in definitions of coup attempts [Ed.: including the ones I use in my statistical modeling] while excluding a wide range of similar activities, such as conspiracies, mercenary attacks, popular protests, revolutions, civil wars, actions by lone assassins, and mutinies whose goals explicitly excluded taking power (e.g., over unpaid wages). Unlike a civil war, there is no minimum casualty threshold necessary for an event to be considered a coup, and many coups take place bloodlessly.

By this definition, last week’s putsch in the Gambia and November’s power grab by a lieutenant colonel in Burkina Faso would qualify, but last February’s change of government by parliamentary action in Ukraine after President Yanukovich’s flight in the face of popular unrest would not. Nor would state collapses in Libya and Central African Republic, which occurred under pressure from rebels rather than state security forces. And, of course, Gen. Sisi’s seizure of power in Egypt in July 2013 clearly would qualify as a successful coup on these terms.

In a guest post here yesterday, Maggie Dwyer identified one factor—divisions and tensions within the military—that probably increases coup risk in some cases, but that we can’t fold into global statistical modeling because, as often happens, we don’t have the time-series cross-sectional data we would need to do that. Surely there are other such factors and forces. My hope is that this crowdsourcing approach will help spotlight some cases overlooked by the statistical forecasts because their fragility is being driven by things those models can’t consider.

Wiki surveys weren’t designed specifically for forecasting, but I have adapted them to this purpose on two other topics, and in both cases the results have been pretty good. As part of my work for the Early Warning Project, we have run wiki surveys on risks of state-led mass killing onset for 2014 and now 2015. That project’s data-makers didn’t see any such onsets in 2014, but the two countries that came closest—Iraq and Myanmar—ranked fifth and twelfth, respectively, in the wiki survey we ran in December 2013. On pro football, I’ve run surveys ahead of the 2013 and 2014 seasons. The results haven’t been clairvoyant, but they haven’t been shabby, either (see here and here for details).

I will summarize the results of this survey on coup risk in a blog post in mid-January and will make the country- and vote-level data freely available to other researchers when I do.

I don’t necessarily plan to close the survey at that point, though. In fact, I’m really hoping to get a chance to tinker with using it more dynamically. Ideally, we would leave the survey running throughout the year so that participants could factor new information—credible rumors of an impending coup, for example, or a successful post-election transfer of power without military intervention—into their voting decisions, and the survey results would update quickly in response to those more recent votes.

Doing that would require modifying the modeling process that converts the pairwise votes into scores, however, and I’m not sure that I’m up to the task. As developed, the wiki survey effectively weights all votes the same, regardless of when they were cast. To make the survey more sensitive to fresher information, we would need to tweak that process so that recent votes are weighted more heavily—maybe with a time-decaying weighting function, or just a sliding window that closes on older votes after some point. If we wanted to get really fancy, we might find a way to use the statistical forecasts as priors in this process, too, letting the time-sensitive survey results pull cases up or push them down as the year passes.
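A time-decaying weight is easy to sketch on its own; the hard part is plumbing it into the wiki survey's estimation, which is what I can't yet do. In Python, with an arbitrary half-life:

```python
# Exponential decay weighting for votes: a vote's weight halves every
# half_life_days. The half-life here is an arbitrary illustration.
def vote_weight(age_days, half_life_days=30.0):
    return 0.5 ** (age_days / half_life_days)

# Hypothetical votes: (winner, loser, age of vote in days).
votes = [("A", "B", 1), ("B", "A", 90)]
tally = {}
for winner, loser, age in votes:
    tally[winner] = tally.get(winner, 0.0) + vote_weight(age)
# The fresh vote for A counts nearly in full; the stale vote for B
# counts for only an eighth as much.
```

A sliding window is the degenerate version of the same idea, with the weight dropping from one to zero at a cutoff age.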

I can imagine these modifications, but I don’t think I can code them. If you’re reading this and you might like to collaborate on that work (or fund it!) or just have thoughts on how to do it, please drop me a line at ulfelder at gmail dot com.

Post Mortem on 2014 Preseason NFL Forecasts

Let’s end the year with a whimper, shall we?

Back in September (here), I used a wiki survey to generate a preseason measure of pro-football team strength and then ran that measure through a statistical model and some simulations to gin up forecasts for all 256 games of the 2014 regular season. That season ended on Sunday, so now we can see how those forecasts turned out.

The short answer: not awful, but not so great, either.

To assess the data and model’s predictive power, I’m going to focus on predicted win totals. Based on my game-level forecasts, how many contests was each team expected to win? Those totals nicely summarize the game-level predictions, and they are the focus of StatsbyLopez’s excellent post-season predictive review, here, against which I can compare my results.

StatsbyLopez used two statistics to assess predictive accuracy: mean absolute error (MAE) and mean squared error (MSE). The first is the average of the distance between each team’s projected and observed win totals. The second is the average of the square of those distances. MAE is a little easier to interpret—on average, how far off was each team’s projected win total?—while MSE punishes larger errors more heavily, which is useful if you care about how noisy your predictions are. StatsbyLopez used those stats to compare five sets of statistical predictions to the preseason betting line (Vegas) and a couple of simple benchmarks: last year’s win totals and a naive forecast of eight wins for everyone, which is what you’d expect to get if you just flipped a coin to pick winners.
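Spelled out on toy numbers (in Python; the projected and observed totals here are made up), the two measures are:

```python
# MAE and MSE for projected vs. observed win totals,
# computed on hypothetical values for four teams.
projected = [9.5, 7.0, 11.0, 6.5]
observed = [11, 8, 7, 6]

errors = [p - o for p, o in zip(projected, observed)]
mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e * e for e in errors) / len(errors)    # mean squared error
```

Note how the single four-win miss dominates the MSE but not the MAE; that's the "punishes larger errors" property at work.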

Lopez’s post includes some nice column charts comparing those stats across sources, but it doesn’t include the stats themselves, so I’m going to have to eyeball his numbers and do the comparison in prose.

I summarized my forecasts two ways: 1) counts of the games each team had a better-than-even chance of winning, and 2) sums of each team’s predicted probabilities of winning.

  • The MAE for my whole-game counts was 2.48—only a little bit better than the ultra-naive eight-wins-for-everyone prediction and worse than everything else, including just using last year’s win totals. The MSE for those counts was 8.89, still worse than everything except the simple eights. For comparison, the MAE and MSE for the Vegas predictions were roughly 2.0 and 6.0, respectively.
  • The MAE for my sums was 2.31—about as good as the worst of the five “statsheads” Lopez considered, but still a shade worse than just carrying forward the 2013 win totals. The MSE for those summed win probabilities, however, was 7.05. That’s better than one of the sources Lopez considered and pretty close to two others, and it handily beats the two naive benchmarks.

To get a better sense of how large the errors in my forecasts were and how they were distributed, I also plotted the predicted and observed win totals by team. In the charts below, the black dots are the predictions, and the red dots are the observed results. The first plot uses the whole-game counts; the second the summed win probabilities. Teams are ordered from left to right according to their rank in the preseason wiki survey.

Predicted (black) and observed (red) 2014 regular-season win totals by team using whole-game counts


Predicted (black) and observed (red) 2014 regular-season win totals by team using summed win probabilities


Substantively, those charts spotlight some things most football fans could already tell you: Dallas and Arizona were the biggest positive surprises of the 2014 regular season, while San Francisco, New Orleans, and Chicago were probably the biggest disappointments. Detroit and Buffalo also exceeded expectations, although only one of them made it to the postseason, while Tampa Bay, Tennessee, the NY Giants, and the Washington football team under-performed.

Statistically, it’s interesting but not surprising that the summed win probabilities do markedly better than the whole-game counts. Pro football is a noisy game, and we throw out a lot of information about the uncertainty of each contest’s outcome when we convert those probabilities into binary win/lose calls. In essence, those binary calls are inherently overconfident, so the win counts they produce are, predictably, much noisier than the ones we get by summing the underlying probabilities.

In spite of its modest performance in 2014, I plan to repeat this exercise next year. The linear regression model I use to convert the survey results into game-level forecasts has home-field advantage and the survey scores as its only inputs. The 2014 version of that model was estimated from just a single prior season’s data, so doubling the size of the historical sample to 512 games will probably help a little. Like all survey results, my team-strength score depends on the pool of respondents, and I keep hoping to get a bigger and better-informed crowd to participate in that part of the exercise. And, most important, it’s fun!

Turning Crowdsourced Preseason NFL Strength Ratings into Game-Level Forecasts

For the past week, nearly all of my mental energy has gone into the Early Warning Project and a paper for the upcoming APSA Annual Meeting here in Washington, DC. Over the weekend, though, I found some time for a toy project on forecasting pro-football games. Here are the results.

The starting point for this toy project is a pairwise wiki survey that turns a crowd’s beliefs about relative team strength into scalar ratings. Regular readers will recall that I first experimented with one of these before the 2013-2014 NFL season, and the predictive power wasn’t terrible, especially considering that the number of participants was small and the ratings were completed before the season started.

This year, to try to boost participation and attract a more knowledgeable crowd of respondents, I paired with Trey Causey to announce the survey on his pro-football analytics blog, The Spread. The response has been solid so far. Since the survey went up, the crowd—that’s you!—has cast nearly 3,400 votes in more than 100 unique user sessions (see the Data Visualizations section here).

The survey will stay open throughout the season, but that doesn’t mean it’s too early to start seeing what it’s telling us. One thing I’ve already noticed is that the crowd does seem to be updating in response to preseason action. For example, before the first round of games, I noticed that the Baltimore Ravens, my family’s favorites, were running mid-pack with a rating of about 50. After they trounced the defending NFC champion 49ers in their preseason opener, however, the Ravens jumped to the upper third with a rating of 59. (You can always see up-to-the-moment survey results here, and you can cast your own votes here.)

The wiki survey is a neat way to measure team strength. On their own, though, those ratings don’t tell us what we really want to know, which is how each game is likely to turn out, or how well our team might be expected to do this season. The relationship between relative strength and game outcomes should be pretty strong, but we might want to consider other factors, too, like home-field advantage. To turn a strength rating into a season-level forecast for a single team, we need to consider the specifics of its schedule. In game play, it’s relative strength that matters, and some teams will have tougher schedules than others.

A statistical model is the best way I can think to turn ratings into game forecasts. To get a model to apply to this season’s ratings, I estimated a simple linear one from last year’s preseason ratings and the results of all 256 regular-season games (found online in .csv format here). The model estimates net score (home minus visitor) from just one feature, the difference between the two teams’ preseason ratings (again, home minus visitor). Because the net scores are all ordered the same way and the model also includes an intercept, though, it implicitly accounts for home-field advantage as well.

The scatterplot below shows the raw data on those two dimensions from the 2013 season. The model estimated from these data has an intercept of 3.1 and a slope of 0.1 for the score differential. In other words, the model identifies a net home-field advantage of 3 points—consistent with the conventional wisdom—and it suggests that every point of advantage on the wiki-survey ratings translates into a net swing of one-tenth of a point on the field. I also tried a generalized additive model with smoothing splines to see if the association between the survey-score differential and net game score was nonlinear, but as the scatterplot suggests, it doesn’t seem to be.
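Applying the fitted line is trivial. In Python, with hypothetical ratings (the actual model lives in R):

```python
# Predicted net score (home minus visitor) from the intercept and
# slope reported above for the model fit to the 2013 data.
def predict_net_score(home_rating, visitor_rating,
                      intercept=3.1, slope=0.1):
    return intercept + slope * (home_rating - visitor_rating)

# e.g., a home team rated 59 hosting a visitor rated 50:
net = predict_net_score(59, 50)  # home team favored by about 4 points
```

With equal ratings, the prediction collapses to the intercept, i.e., the bare home-field advantage of about 3 points.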

2013 NFL Games Arranged by Net Game Score and Preseason Wiki Survey Rating Differentials


In sample, the linear model’s accuracy was good, not great. If we convert the net scores the model postdicts to binary outcomes and compare those postdictions to actual outcomes, we see that the model correctly classifies 60 percent of the games. That’s in sample, but it’s also based on nothing more than home-field advantage and a single preseason rating for each team from a survey with a small number of respondents. So, all things considered, it looks like a potentially useful starting point.
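That 60-percent figure comes from a simple sign comparison: a game counts as correctly classified when the postdicted and actual net scores pick the same winner. A sketch with hypothetical numbers:

```python
# Hypothetical postdicted and actual net scores (home minus visitor).
predicted = [4.2, -1.5, 6.8, -0.3, 2.0]
actual    = [7,   -3,   -4,    6,  3]

# A game is classified correctly when the predicted and actual net
# scores fall on the same side of zero, i.e., the same team is picked.
correct = sum((p > 0) == (a > 0) for p, a in zip(predicted, actual))
accuracy = correct / len(actual)
print(accuracy)  # → 0.6
```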

Whatever its limitations, that model gives us the tool we need to convert 2014 wiki survey results into game-level predictions. To do that, we also need a complete 2014 schedule. I couldn’t find one in .csv format, but I found something close (here) that I saved as text, manually cleaned in a minute or so (deleted extra header rows, fixed remaining header), and then loaded and merged with a .csv of the latest survey scores downloaded from the manager’s view of the survey page on All Our Ideas.
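The merge itself is mechanical. Here is a sketch in Python with hypothetical file contents and column names; the real schedule and score files differ:

```python
import csv, io

# Hypothetical cleaned schedule and survey-score files (the real
# column names, teams, and scores differ).
schedule_csv = """week,visitor,home
1,Green Bay Packers,Seattle Seahawks
1,Cincinnati Bengals,Baltimore Ravens
"""
scores_csv = """team,rating
Green Bay Packers,77
Seattle Seahawks,88
Cincinnati Bengals,60
Baltimore Ravens,59
"""

# Build a lookup of ratings, then attach a home-minus-visitor rating
# differential to each scheduled game.
ratings = {row["team"]: float(row["rating"])
           for row in csv.DictReader(io.StringIO(scores_csv))}

games = []
for row in csv.DictReader(io.StringIO(schedule_csv)):
    row["rating_diff"] = ratings[row["home"]] - ratings[row["visitor"]]
    games.append(row)

print(games[0]["rating_diff"])  # → 11.0
```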

I’m not going to post forecasts for all 256 games—at least not now, with three more preseason games to learn from and, hopefully, lots of votes yet to be cast. To give you a feel for how the model is working, though, I’ll show a couple of cuts on those very preliminary results.

The first is a set of forecasts for all Week 1 games. The labels show Visitor-Home, while the net score remains oriented home minus visitor, as in the model. So, a predicted net score greater than 0 means the home team (second in the paired label) is expected to win, while a predicted net score below 0 means the visitor (first in the paired label) is expected to win. The lines around the point predictions represent 90-percent confidence intervals, giving us a partial sense of the uncertainty around these estimates.
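For a sense of how such intervals can be produced, here is a rough sketch under a normal-error assumption. The intercept and slope echo the 2013 fit; the residual standard deviation is entirely hypothetical:

```python
# Hypothetical model coefficients and residual standard deviation.
intercept, slope, resid_sd = 3.1, 0.1, 13.0

def forecast(rating_diff, z90=1.645):
    """Point prediction and a rough 90% interval (normal approximation)."""
    point = intercept + slope * rating_diff
    return point - z90 * resid_sd, point, point + z90 * resid_sd

# Example: home team rated 20 survey points higher than the visitor.
lo, point, hi = forecast(20.0)
print(round(point, 1))  # → 5.1
```

With a residual SD that large, even a clear favorite's interval straddles zero, which is consistent with the noisiness discussed above.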

Week 1 Game Forecasts from Preseason Wiki Survey Results on 10 August 2014


Of course, as a fan of a particular team, I'm most interested in what the model says about how my guys are going to do this season. The next plot shows predictions for all 16 of Baltimore's games. Unfortunately, the plotting command orders the data by label, and my R skills and available time aren't sufficient to reorder them by week, but the information is all there. In this plot, the dots for the point predictions are colored red if they predict a Baltimore win and black for an expected loss. The good news for Ravens fans is that this plot suggests an 11-5 season, good enough for a playoff berth. The bad news is that an 8-8 season also lies within the 90-percent confidence intervals, so the playoffs don't look like a lock.
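The 11-5 tally is just a count of point predictions that favor Baltimore. A sketch with hypothetical predicted net scores, oriented so that a positive value is a predicted Ravens win:

```python
# Hypothetical predicted net scores for one team's 16 games, oriented
# so that a positive value means a predicted win for that team.
predicted_net = [5.1, -2.0, 7.3, 1.1, 3.0, 4.4, -0.8, 2.2,
                 6.0, 1.5, -3.1, 2.8, -0.9, 4.1, -1.2, 3.3]

wins = sum(p > 0 for p in predicted_net)
print(f"{wins}-{16 - wins}")  # → 11-5
```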

2014 Game-Level Forecasts for the Baltimore Ravens from 10 August 2014 Wiki Survey Scores


So that’s where the toy project stands now. My intuition tells me that the predicted net scores aren’t as well calibrated as I’d like, and the estimated confidence intervals surely understate the true uncertainty around each game (“On any given Sunday…”). Still, I think this exercise demonstrates the potential of this forecasting process. If I were a betting man, I wouldn’t lay money on these estimates. As an applied forecaster, though, I can imagine using these predictions as priors in a more elaborate process that incorporates additional and, ideally, more dynamic information about each team and game situation over the course of the season. Maybe my doppelganger can take that up while I get back to my day job…

Postscript. After I published this post, Jeff Fogle suggested via Twitter that I compare the Week 1 forecasts to the current betting lines for those games. The plot below shows the median point spread from an NFL odds-aggregating site as blue dots on top of the statistical forecasts already shown above. As you can see, the statistical forecasts are tracking the betting lines pretty closely. There’s only one game—Carolina at Tampa Bay—where the predictions from the two series fall on different sides of the win/loss line, and it’s a game the statistical model essentially sees as a toss-up. It’s also reassuring that there isn’t a consistent direction to the differences, so the statistical process doesn’t seem to be biased in some fundamental way.
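That comparison boils down to checking whether the model and the betting line pick the same winner in each game. A sketch with hypothetical numbers; only the Carolina-at-Tampa Bay disagreement mirrors the text, and the rest are invented:

```python
# Hypothetical forecasts, both oriented home minus visitor (point
# spreads flipped to match). Labels are Visitor-Home, as in the plots.
model_net = {"CAR-TB": 0.4, "GB-SEA": -6.5, "CIN-BAL": 1.0}
vegas_net = {"CAR-TB": -1.0, "GB-SEA": -5.5, "CIN-BAL": 1.5}

# Games where the model and the betting line pick different winners.
disagree = [g for g in model_net
            if (model_net[g] > 0) != (vegas_net[g] > 0)]
print(disagree)  # → ['CAR-TB']
```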

Week 1 Game-Level Forecasts Compared to Median Point Spread from Betting Sites on 11 August 2014


Forecasting Round-Up No. 7

1. I got excited when I heard on Twitter yesterday about a machine-learning process that turns out to be very good at predicting U.S. Supreme Court decisions (blog post here, paper here). I got even more excited when I saw that the guys who built that process have also been running a play-money prediction market on the same problem for the past several years, and that the most accurate forecasters in that market have done even better than that model (here). It sounds like they are now thinking about more rigorous ways to compare and cross-pollinate the two. That’s part of what we’re trying to do with the Early Warning Project, so I hope that they do and we can learn from their findings.

2. A paper in the current issue of the Journal of Personality and Social Psychology (here, but paywalled; hat-tip to James Igoe Walsh) adds to the growing pile of evidence on the forecasting power of crowds, with an interesting additional finding on the willingness of others to trust and use those forecasts:

We introduce the select-crowd strategy, which ranks judges based on a cue to ability (e.g., the accuracy of several recent judgments) and averages the opinions of the top judges, such as the top 5. Through both simulation and an analysis of 90 archival data sets, we show that select crowds of 5 knowledgeable judges yield very accurate judgments across a wide range of possible settings—the strategy is both accurate and robust. Following this, we examine how people prefer to use information from a crowd. Previous research suggests that people are distrustful of crowds and of mechanical processes such as averaging. We show in 3 experiments that, as expected, people are drawn to experts and dislike crowd averages—but, critically, they view the select-crowd strategy favorably and are willing to use it. The select-crowd strategy is thus accurate, robust, and appealing as a mechanism for helping individuals tap collective wisdom.
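The select-crowd strategy is easy to sketch: rank judges on a cue to ability, such as recent accuracy, and average the estimates of the top five. A toy Python illustration with invented judges:

```python
# Invented judges: (name, recent_error, current_estimate). A lower
# recent_error means a better track record on recent judgments.
judges = [
    ("a", 2.1, 10.0), ("b", 0.5, 12.0), ("c", 3.0, 30.0),
    ("d", 0.9, 11.0), ("e", 1.2, 13.0), ("f", 0.7, 12.5),
    ("g", 2.8, 25.0),
]

k = 5
top_k = sorted(judges, key=lambda j: j[1])[:k]   # rank on the accuracy cue
select_crowd = sum(j[2] for j in top_k) / k      # average the top-k opinions
print(select_crowd)  # → 11.7
```

Note how the two least accurate judges, whose estimates are far from the rest, are screened out before averaging.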

3. Adam Elkus recently spotlighted two interesting papers involving agent-based modeling (ABM) and forecasting.

  • The first (here) “presents a set of guidelines, imported from the field of forecasting, that can help social simulation and, more specifically, agent-based modelling practitioners to improve the predictive performance and the robustness of their models.”
  • The second (here), from 2009 but new to me, describes an experiment in deriving an agent-based model of political conflict from event data. The results were pretty good; a model built from event data and then tweaked by a subject-matter expert was as accurate as one built entirely by hand, and the hybrid model took much less time to construct.

4. Nautilus ran a great piece on Lewis Fry Richardson, a pioneer in weather forecasting who also applied his considerable intellect to predicting violent conflict. As the story notes,

At the turn of the last century, the notion that the laws of physics could be used to predict weather was a tantalizing new idea. The general idea—model the current state of the weather, then apply the laws of physics to calculate its future state—had been described by the pioneering Norwegian meteorologist Vilhelm Bjerknes. In principle, Bjerknes held, good data could be plugged into equations that described changes in air pressure, temperature, density, humidity, and wind velocity. In practice, however, the turbulence of the atmosphere made the relationships among these variables so shifty and complicated that the relevant equations could not be solved. The mathematics required to produce even an initial description of the atmosphere over a region (what Bjerknes called the “diagnostic” step) were massively difficult.

Richardson helped solve that problem in weather forecasting by breaking the task into many more manageable parts—atmospheric cells, in this case—and thinking carefully about how those parts fit together. I wonder if we will see similar advances in forecasts of social behavior in the next 100 years. I doubt it, but the trajectory of weather prediction over the past century should remind us to remain open to the possibility.

5. Last, a bit of fun: Please help Trey Causey and me forecast the relative strength of this year’s NFL teams by voting in this pairwise wiki survey! I did this exercise last year, and the results weren’t bad, even though the crowd was pretty small and probably not especially expert. Let’s see what happens if more people participate, shall we?

Relative Risks of State-Led Mass Killing Onset in 2014: Results from a Wiki Survey

In early December, as part of our ongoing work for the Holocaust Museum’s Center for the Prevention of Genocide, Ben Valentino and I launched a wiki survey to help assess risks of state-led mass killing onsets in 2014 (here).

The survey is now closed and the results are in. Here, according to our self-selected crowd on five continents and the nearly 5,000 pairwise votes it cast, is a map of how the world looks right now on this score. The darker the shade of gray, the greater the relative risk that in 2014 we will see the start of an episode of mass killing in which the deliberate actions of state agents or other groups acting at their behest result in the deaths of at least 1,000 noncombatant civilians from a discrete group over a period of a year or less.

Smaller countries are hard to find on that map, and it’s difficult to compare colors across regions, so here is a dot plot of the same data in rank order. Countries with red dots are ones that had ongoing episodes of state-led mass killing at the end of 2013: DRC, Egypt, Myanmar, Nigeria, North Korea, Sudan, and Syria. It’s possible that these countries will experience additional onsets in 2014, but we wonder if some of our respondents didn’t also conflate the risk of a new onset with the presence or intensity of an ongoing one. Also, there’s an ongoing episode in CAR that was arguably state-led for a time in 2013, but the Séléka militias no longer appear to be acting at the behest of the supposed government, so we didn’t color that dot. And, of course, there are at least a few ongoing episodes of mass killing being perpetrated by non-state actors (see this recent post for some ideas), but that’s not what we asked our crowd to consider in this survey.


It is very important to understand that the scores being mapped and plotted here are not probabilities of mass-killing onset. Instead, they are model-based estimates of the probability that the country in question is at greater risk than another country chosen at random. In other words, these scores tell us which countries our crowd thinks we should worry about more, not how likely our crowd thinks a mass-killing onset is.
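Concretely, that means the scores support ranking and pairwise comparison, nothing more. A toy sketch with invented scores on the 0-100 All Our Ideas scale:

```python
# Invented scores on the 0-100 All Our Ideas scale. Score / 100 is read
# as the estimated probability the country would be judged at greater
# risk than a randomly chosen country from the survey, NOT as the
# probability of a mass-killing onset.
scores = {"Syria": 92, "Sudan": 88, "Norway": 4}

ranked = sorted(scores, key=scores.get, reverse=True)
for country in ranked:
    print(country, scores[country] / 100)
```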

We think the results of this survey are useful in their own right, but we also plan to compare them to, and maybe even combine them with, other forecasts of mass killing onsets as part of the public early-warning system we expect to launch later this year.

In the meantime, if you’re interested in tinkering with the scores and our plots of them, you can find the code I used to make the map and dot plot on GitHub (here) and the data in .csv format on my Google Drive (here). If you have better ideas on how to visualize this information, please let us know and share your code.

UPDATE: Bad social scientist! With a tweet, Alex Hanna reminded me that I really need to say more about the survey method and respondents. So:

We used All Our Ideas to conduct this survey, and we embedded that survey in a blog post that defined our terms and explained the process. The blog post was published on December 1, and we publicized it through a few channels, including: a note to participants in a password-protected opinion pool we’re running to forecast various mass atrocities-related events; a posting to a Conflict Research group on Facebook; an email to the president of the American Association of Genocide Scholars asking him to announce it on their listserv; and a few tweets from my Twitter account at the beginning and end of the month. Some of those tweets were retweeted, and I saw a few other people post or tweet their own links to the blog post or survey as well.

As for Alex’s specific question about who comprised our crowd, the short answer is that we don’t and can’t know. Participation in All Our Ideas surveys is anonymous, and our blog post was not private. From the vote-level data (here), I can see that we ended the month with 4,929 valid votes from 147 unique voting sessions. I know for a fact that some people voted in more than one session—I cast a small number of votes on a few occasions, and I know at least one colleague voted more than once—so the number of people who participated was some unknown number smaller than 147 who found their way to the survey through those postings and tweets.
