Turning Crowdsourced Preseason NFL Strength Ratings into Game-Level Forecasts

For the past week, nearly all of my mental energy has gone into the Early Warning Project and a paper for the upcoming APSA Annual Meeting here in Washington, DC. Over the weekend, though, I found some time for a toy project on forecasting pro-football games. Here are the results.

The starting point for this toy project is a pairwise wiki survey that turns a crowd’s beliefs about relative team strength into scalar ratings. Regular readers will recall that I first experimented with one of these before the 2013-2014 NFL season, and the predictive power wasn’t terrible, especially considering that the number of participants was small and the ratings were completed before the season started.

This year, to try to boost participation and attract a more knowledgeable crowd of respondents, I partnered with Trey Causey to announce the survey on his pro-football analytics blog, The Spread. The response has been solid so far. Since the survey went up, the crowd—that’s you!—has cast nearly 3,400 votes in more than 100 unique user sessions (see the Data Visualizations section here).

The survey will stay open throughout the season, but that doesn’t mean it’s too early to start seeing what it’s telling us. One thing I’ve already noticed is that the crowd does seem to be updating in response to preseason action. For example, before the first round of games, I noticed that the Baltimore Ravens, my family’s favorites, were running mid-pack with a rating of about 50. After they trounced the 49ers, last season’s NFC runners-up, in their preseason opener, however, the Ravens jumped to the upper third with a rating of 59. (You can always see up-to-the-moment survey results here, and you can cast your own votes here.)

The wiki survey is a neat way to measure team strength. On their own, though, those ratings don’t tell us what we really want to know, which is how each game is likely to turn out, or how well our team might be expected to do this season. The relationship between relative strength and game outcomes should be pretty strong, but we might want to consider other factors, too, like home-field advantage. To turn a strength rating into a season-level forecast for a single team, we need to consider the specifics of its schedule. In game play, it’s relative strength that matters, and some teams will have tougher schedules than others.

A statistical model is the best way I can think to turn ratings into game forecasts. To get a model to apply to this season’s ratings, I estimated a simple linear one from last year’s preseason ratings and the results of all 256 regular-season games (found online in .csv format here). The model estimates net score (home minus visitor) from just one feature, the difference between the two teams’ preseason ratings (again, home minus visitor). Because the net scores are all ordered the same way and the model also includes an intercept, though, it implicitly accounts for home-field advantage as well.
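In R, that amounts to a one-line call to lm(). Here is a minimal sketch, assuming the merged 2013 data sit in a data frame called games2013; all of the column names are made up for illustration:

```r
# Assumed data frame 'games2013': one row per 2013 regular-season game, with
# final scores and each team's preseason wiki-survey rating already attached.
# All column names here are hypothetical.
games2013$net.score   <- games2013$home.score  - games2013$visitor.score
games2013$rating.diff <- games2013$home.rating - games2013$visitor.rating

# Net score as a function of the rating differential; the intercept soaks up
# the average home-field advantage.
mod.lm <- lm(net.score ~ rating.diff, data = games2013)
summary(mod.lm)
```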

The scatterplot below shows the raw data on those two dimensions from the 2013 season. The model estimated from these data has an intercept of 3.1 and a slope of 0.1 for the score differential. In other words, the model identifies a net home-field advantage of 3 points—consistent with the conventional wisdom—and it suggests that every point of advantage on the wiki-survey ratings translates into a net swing of one-tenth of a point on the field. I also tried a generalized additive model with smoothing splines to see if the association between the survey-score differential and net game score was nonlinear, but as the scatterplot suggests, it doesn’t seem to be.
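That nonlinearity check is easy to rerun with the mgcv package. A sketch, reusing the hypothetical games2013 data frame and mod.lm object from above:

```r
library(mgcv)

# Replace the linear term with a smoothing spline. If the relationship were
# meaningfully nonlinear, the smooth's effective degrees of freedom would rise
# well above 1 and the GAM would clearly beat the linear fit.
mod.gam <- gam(net.score ~ s(rating.diff), data = games2013)
summary(mod.gam)
AIC(mod.lm, mod.gam)
```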

2013 NFL Games Arranged by Net Game Score and Preseason Wiki Survey Rating Differentials

In sample, the linear model’s accuracy was good, not great. If we convert the net scores the model postdicts to binary outcomes and compare those postdictions to actual outcomes, we see that the model correctly classifies 60 percent of the games. That’s in sample, but it’s also based on nothing more than home-field advantage and a single preseason rating for each team from a survey with a small number of respondents. So, all things considered, it looks like a potentially useful starting point.
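That comparison boils down to checking whether the fitted net score and the actual net score share a sign. A sketch, again using the hypothetical objects above:

```r
# In-sample postdictions from the linear model, converted to home-win calls
# and compared with what actually happened.
home.win.postdicted <- predict(mod.lm) > 0
home.win.actual     <- games2013$net.score > 0
mean(home.win.postdicted == home.win.actual)  # roughly 0.60 here
```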

Whatever its limitations, that model gives us the tool we need to convert 2014 wiki survey results into game-level predictions. To do that, we also need a complete 2014 schedule. I couldn’t find one in .csv format, but I found something close (here) that I saved as text, manually cleaned in a minute or so (deleted extra header rows, fixed remaining header), and then loaded and merged with a .csv of the latest survey scores downloaded from the manager’s view of the survey page on All Our Ideas.
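For the record, the load-and-merge step looks something like the sketch below. The file names and column names are all invented; the only real ingredient is the idea of joining each team’s current rating to the schedule twice, once for the home side and once for the visitor.

```r
# Hand-cleaned 2014 schedule and the latest survey scores (file and column
# names are hypothetical).
schedule <- read.delim("nfl2014_schedule.txt", stringsAsFactors = FALSE)
ratings  <- read.csv("wikisurvey_2014.csv", stringsAsFactors = FALSE)

# Attach each team's rating to its games, once as home and once as visitor.
schedule <- merge(schedule,
                  setNames(ratings[, c("team", "rating")], c("home", "home.rating")),
                  by = "home")
schedule <- merge(schedule,
                  setNames(ratings[, c("team", "rating")], c("visitor", "visitor.rating")),
                  by = "visitor")
schedule$rating.diff <- schedule$home.rating - schedule$visitor.rating
```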

I’m not going to post forecasts for all 256 games—at least not now, with three more preseason games to learn from and, hopefully, lots of votes yet to be cast. To give you a feel for how the model is working, though, I’ll show a couple of cuts on those very preliminary results.

The first is a set of forecasts for all Week 1 games. The labels show Visitor-Home, but the net score is still home minus visitor, as in the model. So, a predicted net score greater than 0 means the home team (second in the paired label) is expected to win, while a predicted net score below 0 means the visitor (first in the paired label) is expected to win. The lines around the point predictions represent 90-percent confidence intervals, giving us a partial sense of the uncertainty around these estimates.
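Generating those forecasts is a single call to predict() on the 2013 model. A sketch, continuing with the assumed schedule data frame and mod.lm from earlier (the week column is another assumption about how the schedule file is laid out):

```r
# Predicted net score (home minus visitor) for every 2014 game, with
# 90-percent intervals around the fitted values. interval = "prediction"
# would give wider, per-game intervals instead.
forecasts <- cbind(schedule,
                   predict(mod.lm, newdata = schedule,
                           interval = "confidence", level = 0.90))

# Week 1 games only, ordered by predicted net score.
week1 <- forecasts[forecasts$week == 1, c("visitor", "home", "fit", "lwr", "upr")]
week1[order(week1$fit), ]
```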

Week 1 Game Forecasts from Preseason Wiki Survey Results on 10 August 2014

Of course, as a fan of a particular team, I’m most interested in what the model says about how my guys are going to do this season. The next plot shows predictions for all 16 of Baltimore’s games. Unfortunately, the plotting command orders the data by label, and my R skills and available time aren’t sufficient to reorder them by week, but the information is all there. In this plot, the dots for the point predictions are colored red if they predict a Baltimore win and black for an expected loss. The good news for Ravens fans is that this plot suggests an 11-5 season, good enough for a playoff berth. The bad news is that an 8-8 season also lies within the 90-percent confidence intervals, so the playoffs don’t look like a lock.
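The projected record is just a matter of counting which side of zero each point prediction lands on. A sketch, assuming the forecasts data frame from above and guessing that the Ravens appear as "Baltimore" in the team columns:

```r
# All 16 games involving Baltimore.
ravens <- forecasts[forecasts$home == "Baltimore" | forecasts$visitor == "Baltimore", ]

# A predicted Ravens win is a positive net score when they're at home and a
# negative one when they're on the road.
ravens.win <- ifelse(ravens$home == "Baltimore", ravens$fit > 0, ravens$fit < 0)
table(ravens.win)  # comes out to 11 wins and 5 losses in the run described above
```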

2014 Game-Level Forecasts for the Baltimore Ravens from 10 August 2014 Wiki Survey Scores

So that’s where the toy project stands now. My intuition tells me that the predicted net scores aren’t as well calibrated as I’d like, and the estimated confidence intervals surely understate the true uncertainty around each game (“On any given Sunday…”). Still, I think this exercise demonstrates the potential of this forecasting process. If I were a betting man, I wouldn’t lay money on these estimates. As an applied forecaster, though, I can imagine using these predictions as priors in a more elaborate process that incorporates additional and, ideally, more dynamic information about each team and game situation over the course of the season. Maybe my doppelganger can take that up while I get back to my day job…

Postscript. After I published this post, Jeff Fogle suggested via Twitter that I compare the Week 1 forecasts to the current betting lines for those games. The plot below shows the median point spread from an NFL odds-aggregating site as blue dots on top of the statistical forecasts already shown above. As you can see, the statistical forecasts are tracking the betting lines pretty closely. There’s only one game—Carolina at Tampa Bay—where the predictions from the two series fall on different sides of the win/loss line, and it’s a game the statistical model essentially sees as a toss-up. It’s also reassuring that there isn’t a consistent direction to the differences, so the statistical process doesn’t seem to be biased in some fundamental way.

Week 1 Game-Level Forecasts Compared to Median Point Spread from Betting Sites on 11 August 2014

How’d Those Football Forecasts Turn Out?

Yes, it’s February, and yes, the Winter Olympics are on, but it’s a cold Sunday so I’ve got football on the brain. Here’s where that led today:

Last August, I used a crowdsourcing technique called a wiki survey to generate a set of preseason predictions on who would win Super Bowl 48 (see here). I did this fun project to get a better feel for how wiki surveys work so I could start applying them to more serious things, but I’m also a pro football fan who wanted to know what the season portended.

Now that Super Bowl 48’s in the books, I thought I would see how those forecasts fared. One way to do that is to take the question and results at face value and see if the crowd picked the right winner. The short answer is “no,” but it didn’t miss by a lot. The dot plot below shows teams in descending order by their final score on the preseason survey. My crowd picked New England to win, but Seattle was second by just a whisker, and the four teams that made the conference championship games occupied the top four slots.

Teams in Descending Order by Final Preseason Wiki Survey Score

So the survey did great, right? Well, maybe not if you look a little further down the list. The Atlanta Falcons, who finished the season 4-12, ranked fifth in the wiki survey, and the Houston Texans—widely regarded as the worst team in the league this year—also landed in the top 10. Meanwhile, the 12-4 Carolina Panthers and 11-5 KC Chiefs got stuck in the basement. Poke around a bit more, and I’m sure you can find a few other chuckles.

Still, the results didn’t look crazy, and I was intrigued enough to want to push it further. To get a fuller picture of how well this survey worked as a forecasting tool, I decided to treat the results as power rankings and compare them across the board to postseason rankings. In other words, instead of treating this as a classification problem (find the Super Bowl winner), I thought I’d treat it as a calibration problem, where the latent variable I was trying to observe before and after is relative team strength.

That turned out to be surprisingly difficult—not because it’s hard to compare preseason and postseason scores, but because it’s hard to measure team strength, even after the season’s over. I asked Trey Causey and Sean J. Taylor, a couple of professional acquaintances who know football analytics, to point me toward an off-the-shelf “ground truth,” and neither one could. Lots of people publish ordered lists, but those lists don’t give us any information about the distance between rungs on the ladder, a critical piece of any calibration question. (Sean later produced and emailed me a set of postseason Bradley-Terry rankings that look excellent, but I’m going to leave the presentation of that work to him.)

Just about ready to give up on the task, I realized that I could use the same instrument, a wiki survey, to convert those ordered lists into a set of scores that would meet my criteria. Instead of pinging the crowd, I would put myself in the shoes of those lists’ authors for a while, using their rankings to guide my answers to the pairwise comparisons the wiki survey requires. Basically, I would kluge my way to a set of rankings that amalgamated the postseason judgments of several supposed experts. The results would have the added advantage of being on the same scale as my preseason assessments, so the two series could be directly compared.

To get started, I Googled “nfl postseason power rankings” and found four lists that showed up high in the search results and had been updated since the Super Bowl (here, here, here, and here). Then I set up a wiki survey and started voting as List Author #1. My initial thought was to give each list 100 votes, but when I got to 100, the results of the survey in progress didn’t look as much like the original list as I’d expected. Things were a little better at 200 but still not terrific. In the end, I decided to give each list 320 votes, or the equivalent of 10 votes for each item (team) on the list. When I got to 320 with List 1, the survey results were nearly identical to the original, so I declared victory and stuck with that strategy. That meant 1,280 votes in all, with equal weight for each of the four list-makers.

The plot below compares my preseason wiki survey’s ratings with the results of this Mechanical Turk-style amalgamation of postseason rankings. Teams in blue scored higher than the preseason survey anticipated (i.e., over-performed), while teams in red scored lower (i.e., under-performed).

Preseason Wiki Survey Ratings Compared to Amalgamated Postseason Rankings

Looking at the data this way, it’s even clearer that the preseason survey did well at the extremes and less well in the messy middle. The only stinkers the survey badly overlooked were Houston and Atlanta, and I think it’s fair to say that a lot of people were surprised by how dismal their seasons were. Ditto the Washington [bleep]s and Minnesota Vikings, albeit to a lesser extent. On the flip side, Carolina stands out as a big miss, and KC, Philly, Arizona, and the Colts can also thumb their noses at me and my crowd. Statistically minded readers might want to know that the root mean squared error (RMSE) here is about 27, where the observations are on a 0-100 scale. That 27 is better than random guessing, but it’s certainly not stellar.
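For anyone checking the math, the RMSE comes directly from the two sets of scores. A sketch, with a hypothetical scores data frame holding one row per team:

```r
# 'preseason' and 'postseason' are the survey-based ratings, both on a 0-100 scale.
rmse <- sqrt(mean((scores$preseason - scores$postseason)^2))
rmse  # about 27 here
```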

A single season doesn’t offer a robust test of a forecasting technique. Still, as a proof of concept, I think this exercise was a success. My survey only drew about 1,800 votes from a few hundred respondents whom I recruited casually through my blog and Twitter feed, which focuses on international affairs and features very little sports talk. When that crowd was voting, the only information they really had was the previous season’s performance and whatever they knew about off-season injuries and personnel changes. Under the circumstances, I’d say an RMSE of 27 ain’t terrible.

It’d be fun to try this again in August 2014 with a bigger crowd and see how that turns out. Before and during the season, it would also be neat to routinely rerun that Mechanical Turk exercise to produce up-to-date “wisdom of the (expert) crowd” power rankings and see if they can help improve predictions about the coming week’s games. Better yet, we could write some code to automate the ingestion of those lists, simulate their pairwise voting, and apply All Our Ideas’ hierarchical model to the output. In theory, this approach could scale to incorporate as many published lists as we can find, culling the purported wisdom of our hand-selected crowd without the hassle of all that recruiting and voting.
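The vote-simulation piece of that pipeline is the easy part, and a crude sketch appears below. Everything here is invented for illustration: rankings is assumed to be a list of character vectors, each giving one author’s teams from strongest to weakest, and the resulting votes would still need to be run through All Our Ideas’ model (or a local stand-in) to become scores.

```r
# Simulate the pairwise votes implied by one ranked list: draw random pairs of
# teams and record a win for whichever one the list ranks higher.
simulate.votes <- function(ranking, n.votes = 10 * length(ranking)) {
  pairs  <- t(replicate(n.votes, sample(ranking, 2)))
  winner <- ifelse(match(pairs[, 1], ranking) < match(pairs[, 2], ranking),
                   pairs[, 1], pairs[, 2])
  data.frame(option.a = pairs[, 1], option.b = pairs[, 2], winner = winner,
             stringsAsFactors = FALSE)
}

# Ten simulated votes per team per list, as in the manual exercise described above.
votes <- do.call(rbind, lapply(rankings, simulate.votes))
```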

Unfortunately, that crystal palace was a bit too much for me to build on this dim and chilly Sunday. And now, back to our regularly scheduled programming…

PS If you’d like to tinker with the data, you can find it here.

In Praise of Fun Projects

Over the past year, I’ve watched a few people I know in digital life sink a fair amount of time into statistical modeling projects that other people might see as “just for fun,” if not downright frivolous. Last April, for example, public-health grad student Brett Keller delivered an epic blog post that used event history models to explore why some competitors survive longer than others in the fictional Hunger Games. More recently, sociology Ph.D. student Alex Hanna has been using the same event history techniques to predict who’ll get booted each week from the reality TV show RuPaul’s Drag Race (see here and here so far). And then there’s Against the Spread, a nascent pro-football forecasting project from sociology Ph.D. candidate Trey Causey, whose dissertation uses natural language processing and agent-based modeling to examine information ecology in authoritarian regimes.

I happen to think these kinds of projects are a great idea, if you can find the time to do them–and if you’re reading this blog post, you probably can. Based on personal experience, I’m a big believer in learning by doing. Concepts don’t stick in my brain when I only read about them; I’ve got to see the concepts in action and attach them to familiar contexts and examples to really see what’s going on. Blog posts like Brett’s and Alex’s are a terrific way to teach yourself new methods by applying them to toy problems where the data sets are small, the domain is familiar and interesting, and the costs of being wrong are negligible.

A bigger project like Trey’s requires you to solve a lot of complex procedural and methodological problems, but all the skills you develop along the way transfer to other domains. If you can build and run a decent forecasting system from scratch for something as complex as pro football, you can do the same for “seriouser” problems, too. I think that demonstrated skill on fun tasks says as much about someone’s ability to execute complex research in the real world as any job talk or publication in a peer-reviewed journal. Done well, these hobby projects can even evolve into rewarding enterprises of their own. Just ask Nate Silver, who kickstarted his now-prodigious career as a statistical forecaster with PECOTA, a baseball forecasting system that he ginned up for fun while working for pay as a consultant.

I suspect that a lot of people in the private sector already get this. Academia, not so much, but then academics are the ones who wind up poorer for it.

Why Is Academic Writing So Bad? A Brief Response to Stephen Walt

On his Foreign Policy blog, Stephen Walt picks up on a Daily Dish thread and asks, “Why is academic writing so bad?” He suggests a few reasons but concludes that, for the most part, scholars write poorly on purpose. In his view, bad writing is “a form of academic camouflage designed to shield the author from criticism.”

Is this really such a mystery, though? Writing well is hard to do, and it depends in no small part on talent. Like all talents, the ability to write well is probably distributed normally across the population. Most people are mediocre at it, some are really bad, and some are really good. Scholars just happen to work in a profession where writing is the preferred form of communication. Map that normal distribution onto a profession that churns out a ton of writing, and you’ll get the result we see.

Walt’s argument implies that most scholars could write well but choose not to. I find that hard to believe. I think the kind of dense, jargony writing Walt sees as camouflage is actually easier for most people to produce than the concise writing he rightly prefers. Skill and good editing are what get you from the former to the latter. Skill varies widely, and anyone who’s ever written for an academic journal or press knows that peer reviewers and editors usually give you zero help with your prose.

What’s more interesting, I think, is why academia doesn’t select for writing skill, given how much writing scholars are expected to do. You don’t see a lot of terrible writing in top newspapers and magazines because editors don’t want to hire and retain journalists who make their jobs that much harder. Orchestras don’t hire musicians who have great ideas about melody and harmony but can’t play.

Of course, it’s possible that academia would reward excellent writing if it got the chance, but the best writers are simply choosing to take their skills elsewhere. I suspect this self-sorting process does play a role. Writing for a living doesn’t make very many people rich, but neither does scholarship, and writers have a lot more room to be playful in their work outside academia.

Still, as a social scientist, I have to think that incentives within the profession have some effect, too. When reading each other’s work, scholars (ahem) tend to skim. Readers of quantitative papers often jump to the charts and tables summarizing the results and only selectively scan the other bits. The intended audiences for most academic writing are colleagues who speak the same jargon. Peer reviewers care a lot more about the novelty of one’s findings than the quality of the language used to convey them. In this environment, scholars can’t expect to be rewarded for time spent making marginal improvements to their prose, and they behave accordingly. As Trey Causey put it on Twitter this morning, “Everyone admires work that’s important and well-written. No one cares about unimportant but well-written work.”

Forecasting Round-Up No. 2

N.B. This is the second in an occasional series of posts I’m expecting to do on forecasting miscellany. You can find the first one here.

1. Over at Bad Hessian a few days ago, Trey Causey asked, “Where are the predictions in sociology?” After observing how the accuracy of some well-publicized forecasts of this year’s U.S. elections has produced “growing public recognition that quantitative forecasting models can produce valid results,” Trey wonders:

If the success of these models in forecasting the election results is seen as a victory for social science, why don’t sociologists emphasize the value of prediction and forecasting more? As far as I can tell, political scientists are outpacing sociologists in this area.

I gather that Trey intended his post to stimulate discussion among sociologists about the value of forecasting as an element of theory-building, and I’m all for that. As a political scientist, though, I found myself focusing on the comparison Trey drew between the two disciplines, and that got me thinking again about the state of forecasting in political science. On that topic, I had two brief thoughts.

First, my simple answer to why forecasting is getting more attention from political scientists than it used to is: money! In the past 20 years, arms of the U.S. government dealing with defense and intelligence seem to have taken a keener interest in using tools of social science to try to anticipate various calamities around the world. The research program I used to help manage, the Political Instability Task Force (PITF), got its start in the mid-1990s for that reason, and it’s still alive and kicking. PITF draws from several disciplines, but there’s no question that it’s dominated by political scientists, in large part because the events it tries to forecast—civil wars, mass killings, state collapses, and such—are traditionally the purview of political science.

I don’t have hard data to back this up, but I get the sense that the number and size of government contracts funding similar work has grown substantially since the mid-1990s, especially in the past several years. Things like the Department of Defense’s Minerva Initiative; IARPA’s ACE Program; the ICEWS program that started under DARPA and is now funded by the Office of Naval Research; and Homeland Security’s START consortium come to mind. Like PITF, all of these programs are interdisciplinary by design, but many of the topics they cover have their theoretical centers of gravity in political science.

In other words, through programs like these, the U.S. government is now spending millions of dollars each year to generate forecasts of things political scientists like to think about. Some of that money goes to private-sector contractors, but some of it is also flowing to research centers at universities. I don’t think any political scientists are getting rich off these contracts, but I gather there are bureaucratic and career incentives (as well as intellectual ones) that make the contracts rewarding to pursue. If that’s right, it’s not hard to understand why we’d be seeing more forecasting come out of political science than we used to.

My second reaction to Trey’s question is to point out that there actually isn’t a whole lot of forecasting happening in political science, either. That might seem like it contradicts the first, but it really doesn’t. The fact is that forecasting has long been pooh-poohed in academic social sciences, and even if that’s changing at the margins in some corners of the discipline, it’s still a peripheral endeavor.

The best evidence I have for this assertion is the brief history of the American Political Science Association’s Political Forecasting Group. To my knowledge—which comes from my participation in the group since its establishment—the Political Forecasting Group was only formed several years ago, and its membership is still too small to bump it up to the “organized section” status that groups representing more established subfields enjoy. What’s more, almost all of the panels the group has sponsored so far have focused on forecasts of U.S. elections. That’s partly because those papers are popular draws in election years, but it’s also because the group’s leadership has had a really hard time finding enough scholars doing forecasting on other topics to assemble panels.

If the discipline’s flagship association in one of the countries most culturally disposed to doing this kind of work has trouble cobbling together occasional panels on forecasts of things other than elections, then I think it’s fair to say that forecasting still isn’t a mainstream pursuit in political science, either.

2. Speaking of U.S. election forecasting, Drew Linzer recently blogged a clinic in how statistical forecasts should be evaluated. Via his web site, Votamatic, Drew:

1) began publishing forecasts about the 2012 elections well in advance of Election Day (so there couldn’t be any post hoc hemming and hawing about what his forecasts really were);

2) described in detail how his forecasting model works;

3) laid out a set of criteria he would use to judge those forecasts after the election; and then

4) walked us through his evaluations soon after the results were (mostly) in.

Oh, and in case you’re wondering: Drew’s model performed very well, thank you.

3. But you know what worked a little better than Drew’s election-forecasting model, and pretty much everyone else’s, too? An average of the forecasts from several of them. As it happens, this pattern is pretty robust. A well-designed statistical model is great for forecasting, but an average of forecasts from a number of them is usually going to be even better. Just ask the weather guys.
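The mechanics of that kind of ensemble could not be simpler; here is a toy sketch with made-up numbers for a single quantity of interest:

```r
# Hypothetical point forecasts of the same quantity from three different models.
forecasts <- c(model.a = 51.2, model.b = 52.4, model.c = 51.8)

# The unweighted average is the simplest ensemble and is surprisingly hard to beat.
mean(forecasts)
```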

4. Finally, for those of you—like me—who want to keep holding pundits’ feet to the fire long after the election’s over, rejoice that Pundit Tracker is now up and running, and they even have a stream devoted specifically to politics. Among other things, they’ve got John McLaughlin on the record predicting that Hillary Clinton will win the presidency in 2016, and that President Obama will not nominate Susan Rice to be Secretary of State. McLaughlin’s hit rate so far is a rather mediocre 49 percent (18 of 37 graded calls correct), so make of those predictions what you will.
