How’d Those Football Forecasts Turn Out?

Yes, it’s February, and yes, the Winter Olympics are on, but it’s a cold Sunday so I’ve got football on the brain. Here’s where that led today:

Last August, I used a crowdsourcing technique called a wiki survey to generate a set of preseason predictions on who would win Super Bowl 48 (see here). I did this fun project to get a better feel for how wiki surveys work so I could start applying them to more serious things, but I’m also a pro football fan who wanted to know what the season portended.

Now that Super Bowl 48’s in the books, I thought I would see how those forecasts fared. One way to do that is to take the question and results at face value and see if the crowd picked the right winner. The short answer is “no,” but it didn’t miss by a lot. The dot plot below shows teams in descending order by their final score on the preseason survey. My crowd picked New England to win, but Seattle was second by just a whisker, and the four teams that made the conference championship games occupied the top four slots.

nflpostmortem.dotplot So the survey did great, right? Well, maybe not if you look a little further down the list. The Atlanta Falcons, who finished the season 4-12, ranked fifth in the wiki survey, and the Houston Texans—widely regarded as the worst team in the league this year—also landed in the top 10. Meanwhile, the 12-4 Carolina Panthers and 11-5 KC Chiefs got stuck in the basement. Poke around a bit more, and I’m sure you can find a few other chuckles.

Still, the results didn’t look crazy, and I was intrigued enough to want to push it further. To get a fuller picture of how well this survey worked as a forecasting tool, I decided to treat the results as power rankings and compare them across the board to postseason rankings. In other words, instead of treating this as a classification problem (find the Super Bowl winner), I thought I’d treat it as a calibration problem, where the latent variable I was trying to observe before and after is relative team strength.

That turned out to be surprisingly difficult—not because it’s hard to compare preseason and postseason scores, but because it’s hard to measure team strength, even after the season’s over. I asked Trey Causey and Sean J. Taylor, a couple of professional acquaintances who know football analytics, to point me toward an off-the-shelf “ground truth,” and neither one could. Lots of people publish ordered lists, but those lists don’t give us any information about the distance between rungs on the ladder, a critical piece of any calibration question. (Sean later produced and emailed me a set of postseason Bradley-Terry rankings that look excellent, but I’m going to leave the presentation of that work to him.)

About ready to give up on the task, it occurred to me that I could use the same instrument, a wiki survey, to convert those ordered lists into a set of scores that would meet my criteria. Instead of pinging the crowd, I would put myself in the shoes of those lists’ authors for a while, using their rankings to guide my answers to the pairwise comparisons the wiki survey requires. Basically, I would kluge my way to a set of rankings that amalgamated the postseason judgments of several supposed experts. The results would have the added advantage of being on the same scale as my preseason assessments, so the two series could be directly compared.

To get started, I Googled “nfl postseason power rankings” and found four lists that showed up high in the search results and had been updated since the Super Bowl (here, here, here, and here). Then I set up a wiki survey and started voting as List Author #1. My initial thought was to give each list 100 votes, but when I got to 100, the results of the survey in progress didn’t look as much like the original list as I’d expected. Things were a little better at 200 but still not terrific. In the end, I decided to give each survey 320 votes, or the equivalent of 10 votes for each item (team) on the list. When I got to 320 with List 1, the survey results were nearly identical to the original, so I declared victory and stuck with that strategy. That meant 1,280 votes in all, with equal weight for each of the four list-makers.

The plot below compares my preseason wiki survey’s ratings with the results of this Mechanical Turk-style amalgamation of postseason rankings. Teams in blue scored higher than the preseason survey anticipated (i.e., over-performed), while teams in red scored lower (i.e., under-performed).

nflpostmortemplot

Looking at the data this way, it’s even clearer that the preseason survey did well at the extremes and less well in the messy middle. The only stinkers the survey badly overlooked were Houston and Atlanta, and I think it’s fair to say that a lot of people were surprised by how dismal their seasons were. Ditto the Washington [bleep]s and Minnesota Vikings, albeit to a lesser extent. On the flip side, Carolina stands out as a big miss, and KC, Philly, Arizona, and the Colts can also thumb their noses at me and my crowd. Statistically minded readers might want to know that the root mean squared error (RMSE) here is about 27, where the observations are on a 0-100 scale. That 27 is better than random guessing, but it’s certainly not stellar.

A single season doesn’t offer a robust test of a forecasting technique. Still, as a proof of concept, I think this exercise was a success. My survey only drew about 1,800 votes from a few hundred respondents whom I recruited casually through my blog and Twitter feed, which focuses on international affairs and features very little sports talk. When that crowd was voting, the only information they really had was the previous season’s performance and whatever they knew about off-season injuries and personnel changes. Under the circumstances, I’d say a RMSE of 27 ain’t terrible.

It’d be fun to try this again in August 2014 with a bigger crowd and see how that turns out. Before and during the season, it would also be neat to routinely rerun that Mechanical Turk exercise to produce up-to-date “wisdom of the (expert) crowd” power rankings and see if they can help improve predictions about the coming week’s games. Better yet, we could write some code to automate the ingestion of those lists, simulate their pairwise voting, and apply All Our Ideas‘ hierarchical model to the output. In theory, this approach could scale to incorporate as many published lists as we can find, culling the purported wisdom of our hand-selected crowd without the hassle of all that recruiting and voting.

Unfortunately, that crystal palace was a bit too much for me to build on this dim and chilly Sunday. And now, back to our regularly scheduled programming…

PS If you’d like to tinker with the data, you can find it here.