Let’s end the year with a whimper, shall we?
Back in September (here), I used a wiki survey to generate a preseason measure of pro-football team strength and then ran that measure through a statistical model and some simulations to gin up forecasts for all 256 games of the 2014 regular season. That season ended on Sunday, so now we can see how those forecasts turned out.
The short answer: not awful, but not so great, either.
To assess the data and model’s predictive power, I’m going to focus on predicted win totals. Based on my game-level forecasts, how many contests was each team expected to win? Those totals nicely summarize the game-level predictions, and they are the focus of StatsbyLopez’s excellent post-season predictive review, here, against which I can compare my results.
StatsbyLopez used two statistics to assess predictive accuracy: mean absolute error (MAE) and mean squared error (MSE). The first is the average of the absolute differences between each team’s projected and observed win totals. The second is the average of the squares of those differences. MAE is a little easier to interpret (on average, how far off was each team’s projected win total?), while MSE punishes large errors more heavily than small ones, which is nice if you care about how noisy your predictions are. StatsbyLopez used those stats to compare five sets of statistical predictions to the preseason betting line (Vegas) and a couple of simple benchmarks: last year’s win totals and a naive forecast of eight wins for everyone, which is what you’d expect to get if you just flipped a coin to pick winners.
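For anyone who wants to compute those two statistics themselves, here is a minimal sketch in Python. The function names and the four-team win totals are made up for illustration; they are not the actual 2014 projections or results.

```python
def mean_absolute_error(projected, observed):
    """Average absolute gap between projected and observed win totals."""
    return sum(abs(p - o) for p, o in zip(projected, observed)) / len(projected)


def mean_squared_error(projected, observed):
    """Average squared gap; large misses count for more than small ones."""
    return sum((p - o) ** 2 for p, o in zip(projected, observed)) / len(projected)


# Hypothetical win totals for four made-up teams (not real 2014 data).
projected = [10.0, 8.5, 6.0, 11.5]
observed = [12, 8, 2, 11]

# The naive benchmark: eight wins for everyone.
naive = [8.0] * len(observed)

print("forecast MAE:", mean_absolute_error(projected, observed))
print("forecast MSE:", mean_squared_error(projected, observed))
print("naive MAE:   ", mean_absolute_error(naive, observed))
print("naive MSE:   ", mean_squared_error(naive, observed))
```

Notice how the one four-win miss in the toy data dominates the MSE but counts for much less in the MAE.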
Lopez’s post includes some nice column charts comparing those stats across sources, but it doesn’t include the stats themselves, so I’m going to have to eyeball his numbers and do the comparison in prose.
I summarized my forecasts two ways: 1) counts of the games each team had a better-than-even chance of winning, and 2) sums of each team’s predicted probabilities of winning.
- The MAE for my whole-game counts was 2.48—only a little bit better than the ultra-naive eight-wins-for-everyone prediction and worse than everything else, including just using last year’s win totals. The MSE for those counts was 8.89, still worse than everything except the simple eights. For comparison, the MAE and MSE for the Vegas predictions were roughly 2.0 and 6.0, respectively.
- The MAE for my sums was 2.31—about as good as the worst of the five “statsheads” Lopez considered, but still a shade worse than just carrying forward the 2013 win totals. The MSE for those summed win probabilities, however, was 7.05. That’s better than one of the sources Lopez considered and pretty close to two others, and it handily beats the two naive benchmarks.
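To make the two summaries concrete, here is a rough sketch of how each one turns a team's 16 game-level win probabilities into a projected win total. The function names and the probabilities are hypothetical, invented for the example rather than pulled from my forecasts.

```python
def whole_game_count(win_probs):
    """Count the games the team has a better-than-even chance of winning."""
    return sum(1 for p in win_probs if p > 0.5)


def summed_probability(win_probs):
    """Add up the win probabilities to get an expected win total."""
    return sum(win_probs)


# Hypothetical win probabilities for one team's 16 games (illustrative only).
win_probs = [0.62, 0.55, 0.48, 0.71, 0.51, 0.35, 0.58, 0.44,
             0.66, 0.53, 0.49, 0.60, 0.39, 0.57, 0.52, 0.45]

print("whole-game count:  ", whole_game_count(win_probs))
print("summed probability:", round(summed_probability(win_probs), 2))
```

With these made-up numbers, the team is favored in 10 games, but the summed probabilities come to only about 8.45 expected wins, because so many of those 10 games are close to toss-ups.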
To get a better sense of how large the errors in my forecasts were and how they were distributed, I also plotted the predicted and observed win totals by team. In the charts below, the black dots are the predictions, and the red dots are the observed results. The first plot uses the whole-game counts; the second the summed win probabilities. Teams are ordered from left to right according to their rank in the preseason wiki survey.
Substantively, those charts spotlight some things most football fans could already tell you: Dallas and Arizona were the biggest positive surprises of the 2014 regular season, while San Francisco, New Orleans, and Chicago were probably the biggest disappointments. Detroit and Buffalo also exceeded expectations, although only one of them made it to the postseason. Tampa Bay, Tennessee, the NY Giants, and the Washington football team under-performed as well.
Statistically, it’s interesting but not surprising that the summed win probabilities do markedly better than the whole-game counts. Pro football is a noisy game, and we throw out a lot of information about the uncertainty of each contest’s outcome when we convert those probabilities into binary win/lose calls. In essence, those binary calls are inherently overconfident, so the win counts they produce are, predictably, much noisier than the ones we get by summing the underlying probabilities.
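A toy calculation shows how much the binary calls can cost. The sketch below assumes a hypothetical team that is a slight favorite (win probability 0.55) in every one of its 16 games, and it treats the games as independent; both are simplifying assumptions for illustration, not features of my model. Under those assumptions, the season win total follows a binomial distribution, so we can compute the expected squared error of each kind of prediction exactly.

```python
from math import comb


def expected_squared_error(prediction, p, n_games=16):
    """Expected squared gap between a predicted win total and the actual one,
    assuming each game is won independently with the same probability p."""
    return sum(
        comb(n_games, k) * p**k * (1 - p) ** (n_games - k) * (prediction - k) ** 2
        for k in range(n_games + 1)
    )


p = 0.55       # hypothetical per-game win probability
n_games = 16

# Binary calls: the team is favored in every game, so it is "predicted" to win all 16.
count_prediction = n_games if p > 0.5 else 0

# Summed probabilities: the expected win total.
sum_prediction = n_games * p

print("whole-game count:", count_prediction,
      "expected squared error:", round(expected_squared_error(count_prediction, p), 2))
print("summed probability:", round(sum_prediction, 2),
      "expected squared error:", round(expected_squared_error(sum_prediction, p), 2))
```

In this deliberately extreme case, the whole-game count projects a 16-0 season for a team that should win about nine games, and its expected squared error is more than ten times that of the summed probabilities.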
Despite the modest performance of these forecasts in 2014, I plan to repeat this exercise next year. The linear regression model I use to convert the survey results into game-level forecasts has home-field advantage and the survey scores as its only inputs. The 2014 version of that model was estimated from just a single prior season’s data, so doubling the size of the historical sample to 512 games will probably help a little. And, like all survey results, my team-strength score depends on the pool of respondents, so I keep hoping to get a bigger and better-informed crowd to participate in that part of the exercise. Most important, it’s fun!