Halloween, Quantified

Some parents dress up for Halloween. Some throw parties. In our house, we—well, I, really; my wife was bemused, my younger son vaguely interested, and my elder son embarrassed—I collect and chart the data.

First, the flow of trick-or-treaters. The figure below shows counts in 15-minute bins of kids who came to our door for candy. The first arrival, a little girl in a fairy/princess costume, showed up around 5:50 PM, well before sunset. The deluge came an hour later, when a mob from a party next door blended with an uptick in other arrivals. The other peak came almost an hour after that and probably had a much higher median age than the earlier one. The final handful strolled through around 8:40, right when we were shutting down so we could fetch and drop off our own teenage boys from other parts of town.

[Figure: trick-or-treaters per 15-minute bin, Halloween 2015]

This year, I also tallied which candy the trick-or-treaters chose. The figure below plots the resulting data. If the line ends early, it means we ran out of that kind of candy. As my wife predicted, the kids’ taste is basically the inverse of ours, which, as one costumed adult chaperoning his child pointed out, is “perfect.”

[Figure: candy choices by trick-or-treaters over the evening, Halloween 2015]

To collect the data, I sat on my front porch in a beach chair with a notepad, greeted the arriving kids, asked them to pick one, and then jotted tick marks as they left. Colleague Ali Necamp suggested that I put the candies in separate containers to make it easier to track who took what; I did, and she was right. Only a couple of people asked me why the candies were laid out in bins, and I clearly heard one kid approaching the house ask, “Mommy, why is that man sitting on the porch?”

A Fictional but Telling Take on Policy Relevance

I finally read and really enjoyed Todd Moss’s first novel, The Golden Hour. It’s a thriller starring Judd Ryker, a political scientist who gets pulled into service at the State Department to help apply a theory he developed on how to nip coups and civil wars in the bud. Before he’s offered that government job, Ryker comes to Washington to brief a small group at State on his ideas. At that point, Ryker has written about his theory but not really tested it. Here’s how the briefing ends:

“What is driving the results on coups? How can you explain what’s so special about timing? I understand the idea of a Golden Hour, but why does it exist?”

“We don’t really know. We can theorize that it probably has something to do with the dynamics of consolidating power after seizure. The coup makers must line up the rest of the security forces and maybe buy off parliament and other local political leaders before those loyal to the deposed president are able to react and countermove. It’s a race for influence. But these are just hypotheses.”

“What about external intervention? Does it matter if an external force gets involved diplomatically?” asked one staffer.

“Or militarily?” interjected another.

“We don’t have classifications for intervention, so it’s not in there,” replied Judd. “The numbers can’t tell us. So we don’t know. I guess we would—”

Parker interrupted abruptly. “But in your expert opinion, Ryker, does it matter? Would it make a difference? Does the United States need to find ways to intervene more rapidly in emerging crises in the developing world? Can we prevent more wars and coups by reacting more quickly?”

Judd looked around the room at all the eyes locked on him. My numbers don’t answer that question. Isn’t that what you guys are here for?

But instead he stood up straight, turned to look Landon Parker directly in the eyes, and said simply, “Yes.”

I think that passage says more about the true nature of the “policy relevance” dance than most of the blog posts I’ve read on that subject. It’s fiction, of course, but it’s written by someone who knows well both sides of that exchange, and it rang true to me.

As we learn later in the novel, the people Ryker was briefing already had a plan, and Ryker’s theory of a Golden Hour—a short window when emerging crises might still be averted—aligned nicely with their existing agenda. This is true, in part, because Ryker’s theory supports the view that U.S. policy makers can and should play an active role in defusing those crises. If Ryker’s theory had implied that U.S. involvement would only make things worse, he would never have been invited to give that briefing.

Scholars who spend time talking to policy makers joke about how much those audiences don’t like to hear “I don’t know” as an answer to questions about why something is happening. That’s real, but I think those audiences might get even more frustrated at hearing “There’s nothing you can do about it” or “Your efforts will only make things worse” in response to questions about what they should do. I suspect that many of those people pursued or accepted government jobs to try to effect change in the world—to “make a difference”—and they don’t want to sit idly while their short windows of opportunity pop open and slam shut.

Then there is Ryker’s decision to submit to his audience’s agenda. Ryker doesn’t know the answer to Parker’s question, and he knows he doesn’t know. Yet, in the moment, he chooses to feign confidence and say “yes” anyway.

The novel hints that this performance owes something to Ryker’s desire to please a mentor who has encouraged him to go into public service. That feels plausible to me, but I would also suspect a deeper and more generic motive: a desire to be wanted by powerful people, to “matter.” If my own experience is any guide, I’d say that we are flattered by attention, and we are eager to stand out. Having government officials ask for your advice feeds both of those cravings.

In short, selection effects abound. The subset of scholars who choose to pursue policy relevance is not a random sample of all academics, and the subset of that subset whose work resonates with policy audiences is not a random sample, either. Both partners in this dance have emotional agendas that draw them to each other and then shape their behavior in ways that don’t always align with their ostensible professional ideals: to advance national interests, and to be true to the evidence.

I won’t spoil the novel by telling you how things turn out in Ryker’s case. Instead, I’ll just invite those of you who ever find yourselves on one side or the other of these exchanges—or hope to land there—to consider why you’re doing what you’re doing, and to consider the alternatives before acting.

Interactive 2015 NFL Forecasts

As promised in my last post, I’ve now built and deployed a web app that lets you poke through my preseason forecasts for the 2015 NFL regular season:

2015 NFL Forecasts

I learned several new tricks in the course of generating these forecasts and building this app, so the exercise served its didactic purpose. (You can find the code for the app here, on GitHub.) I also got lucky with the release of a new R package that solved a crucial problem I was having when I started to work on this project a couple of weeks ago. Open source software can be a wonderful thing.

The forecasts posted right now are based on results of the pairwise wiki survey through the morning of Monday, August 17. At that point, the survey had already logged upwards of 12,000 votes, triple the number cast in last year’s edition. This time around, I posted a link to the survey on the r/nfl subreddit, and that post produced a brief torrent of activity from what I hope was a relatively well-informed crowd.

The regular season doesn’t start until September, and I will update these forecasts at least once more before that happens. With so many votes already cast, though, the results will only change significantly if a) a large number of new votes are cast and b) those votes differ substantially from the ones already cast, and those two conditions are unlikely to occur together.

One thing these forecasts help to illustrate is how noisy a game professional football is. By noisy, I mean hard to predict with precision. Even in games where one team is much stronger than the other, we still see tremendous variance in the simulated net scores and the associated outcomes. Heavy underdogs will win big every once in a while, and games we’d consider close when watching can produce a wide range of net scores.

Take, for example, the week 1 match-up between the Bears and Packers. Even though Chicago’s the home team, the simulation results (below) favor Green Bay by more than eight points. At the same time, those simulations also include a smattering of outcomes in which the Bears win by multiple touchdowns, and the peak of the distribution of simulations is pretty broad and flat. Some of that variance results from the many imperfections of the model and survey scores, but a lot of it is baked into the game, and plots of the predictive simulations nicely illustrate that noisiness.

[Figure: simulated net scores for the Week 1 Packers-Bears game]
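
For what it’s worth, here is a minimal sketch of how predictive simulations like these can be generated from a fitted linear model. The object and column names (mod, wiki.diff) and the example margin are stand-ins, not the app’s actual code, which lives in the GitHub repo linked above:

# Hypothetical sketch: simulate net scores (home minus visitor) for one game
# from a fitted linear model of net score on the strength differential
sim_net_scores <- function(mod, strength_diff, n_sims = 1000) {
  mu <- predict(mod, newdata = data.frame(wiki.diff = strength_diff))
  sigma <- summary(mod)$sigma  # residual standard deviation
  rnorm(n_sims, mean = mu, sd = sigma)
}

# e.g., Packers at Bears, with the home team weaker by a made-up 10 points
sims <- sim_net_scores(mod, strength_diff = -10)
mean(sims > 0)  # share of simulations in which the home team wins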

The big thing that’s still missing from these forecasts is updating during the season. The statistical model that generates the predictive simulations takes just two inputs for each game — the difference between the two teams’ strength scores and the name of the home team — and, barring catastrophe, only one of those inputs can change as the season passes. I could leave the wiki survey running throughout the season, but the model that turns survey votes into scores doesn’t differentiate between recent and older votes, so updating the forecasts with the latest survey scores is unlikely to move the needle by much.*

I’m now hoping to use this problem as an entry point to learning about Bayesian updating and how to program it in R. Instead of updating the actual survey scores, we could treat the preseason scores as priors and then use observed game scores or outcomes to sequentially update estimates of them. I haven’t figured out how to implement this idea yet, but I’m working on it and will report back if I do.
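
For the record, here is the flavor of normal-normal conjugate updating I have in mind, as a sketch with made-up variance parameters rather than a finished implementation:

# Hypothetical sketch: treat a team's preseason survey score as a normal prior
# and each observed game result (mapped onto the survey scale) as a noisy
# measurement of true strength, then update sequentially after every game
update_strength <- function(prior_mean, prior_var, obs, obs_var) {
  post_var <- 1 / (1 / prior_var + 1 / obs_var)
  post_mean <- post_var * (prior_mean / prior_var + obs / obs_var)
  c(mean = post_mean, var = post_var)
}

# made-up numbers: preseason score of 60, one impressive win
update_strength(prior_mean = 60, prior_var = 100, obs = 75, obs_var = 400)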

* The pairwise wiki survey runs on open source software, and I can imagine modifying the instrument to give more weight to recent votes than older ones. Right now, I don’t have the programming skills to make those modifications, but I’m still hoping to find someone who might want to work with me, or just take it upon himself or herself, to do this.

2015 Tour de France Predictions

I like to ride bikes, I like to watch the pros race their bikes, and I make forecasts for a living, so I thought it would be fun to try to predict the outcome of this year’s Tour de France, which starts this Saturday and ends on July 26. I’m also interested in continuing to explore the predictive power of pairwise wiki surveys, a crowdsourcing tool that I’ve previously used to try to forecast mass-killing onsets, coup attempts, and pro football games, and that ESPN recently used to rank NBA draft prospects.

So, a couple of weeks ago, I used All Our Ideas to create a survey that asks, “Which rider is more likely to win the 2015 Tour de France?” I seeded the survey with the names of 11 riders—the 10 seen by bookmakers at Paddy Power as the most likely winners, plus Peter Sagan because he’s fun to watch—posted a link to the survey on Tumblr, and trolled for respondents on Twitter and Facebook. The survey got off to a slow start, but then someone posted a link to it in the r/cycling subreddit, and the votes came pouring in. As of this afternoon, the survey had garnered more than 4,000 votes in 181 unique user sessions from five continents (see the map below). The crowd also added a handful of other riders to the set under consideration, bringing the list up to 16.

[Figure: map of 2015 Tour de France survey vote locations]

So how does that self-selected crowd handicap the race? The dot plot below shows the riders in descending order by their survey scores, which range from 0 to 100 and indicate the probability that that rider would beat a randomly chosen other rider for a randomly chosen respondent. In contrast to Paddy Power, which currently shows Chris Froome as the clear favorite and gives Nairo Quintana a slight edge over Alberto Contador, this survey sees Contador as the most likely winner (survey score of 90), followed closely by Froome (87), with Quintana (80) a little further back. Both sources put Vincenzo Nibali as fourth likeliest (73) and have Tejay van Garderen (65) and Thibaut Pinot (51) in the next two spots, although Paddy Power ranks those two in the opposite order. Below that, the distances between riders’ chances get smaller, but the wiki survey’s results still approximate the handicapping of the real-money markets pretty well.

[Figure: riders in descending order by 2015 Tour de France survey score]

There are at least a couple of ways to try to squeeze some meaning out of those scores. One is to read the chart as a predicted finishing order for the 16 riders listed. That’s useful for something like a bike race, where we—well, some of us, anyway—care not only who wins, but also where the other riders finish.

We can also try to convert those scores to predicted probabilities of winning. The chart below shows what happens when we do that by dividing each rider’s score by the sum of all scores and then multiplying the result by 100. The probabilities this produces are all pretty low and more tightly bunched than seems reasonable, but I’m not sure how else to do this conversion. I tried squaring and cubing the scores; the results came closer to what the betting-market odds suggest are the “right” values, but I couldn’t think of a principled reason to do that, so I’m not showing those here. If you know a better way to get from those model scores to well-calibrated win probabilities, please let me know in the comments.

[Figure: predicted win probabilities for the 2015 Tour de France, derived from survey scores]
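
In R terms, both the straight normalization and the power transformations I tried are one-liners. The scores below come from the survey results reported above; only the subset of riders is chosen for illustration:

# survey scores for a few riders, from the chart above (illustration only;
# the real conversion uses all 16 riders)
scores <- c(Contador = 90, Froome = 87, Quintana = 80, Nibali = 73)

# crude win probabilities: normalize the scores so they sum to 100
win_prob <- 100 * scores / sum(scores)

# one transformation I tried: cubing before normalizing, which spreads the
# probabilities out but lacks a principled justification
win_prob_cubed <- 100 * scores^3 / sum(scores^3)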

So that’s what the survey says. After the Tour concludes in a few weeks, I’ll report back on how the survey’s predictions fared. Meanwhile, here’s wishing the athletes a crash-, injury-, and drug-free tour. Judging by the other big races I’ve seen so far this year, it should be a great one to watch.

A Good Dream

The novel Station Eleven—an immediate addition to my short list of favorite books—imagines the world after the global political economy has disintegrated. A flu pandemic has killed almost all humans, and the ones who remain inhabit the kinds of traveling bands or small encampments that are only vaguely familiar to most of us. There is no gasoline, no Internet, no electricity.

“I dreamt last night I saw an airplane,” Dieter whispered. They were lying a few feet apart in the dark of the tent. They had only ever been friends—in a hazy way Kirsten thought of him as family—but her thirty-year-old tent had finally fallen apart a year ago and she hadn’t yet managed to find a new one. For obvious reasons she was no longer sharing a tent with Sayid, so Dieter, who had one of the largest tents in the Symphony, had been hosting her. Kirsten heard soft voices outside, the tuba and the first violin on watch. The restless movements of the horses, penned between the three caravans for safety.

“I haven’t thought of an airplane in so long.”

“That’s because you’re so young.” A slight edge to his voice. “You don’t remember anything.”

“I do remember things. Of course I do. I was eight.”

Dieter had been twenty years old when the world ended. The main difference between Dieter and Kirsten was that Dieter remembered everything. She listened to him breathe.

“I used to watch for it,” he said. “I used to think about the countries on the other side of the ocean, wonder if any of them had somehow been spared. If I ever saw an airplane, that meant that somewhere planes still took off. For a whole decade after the pandemic, I kept looking at the sky.”

“Was it a good dream?”

“In the dream I was so happy,” he whispered. “I looked up and there it was, the plane had finally come. There was still a civilization somewhere. I fell to my knees. I started weeping and laughing, and then I woke up.”

Leaving New Orleans by jet yesterday morning only a couple of weeks after reading that book, flying—with wifi on a tablet!—felt miraculous again. As we lifted away from the airport an hour after sunrise on a clear day, I could see a dozen freighters lined up on the Mississippi, a vast industrial plant of some kind billowing steam on the adjacent shore, a railway spreading like capillaries as it ran out of the plant.

As we inhabit that world, it feels inevitable, but it was not. Our political economy is as natural as a termite mound, but it did not have to arise and cohere, to turn out like this—to turn out at all.

Nor does it have to persist. The first and only other time I visited New Orleans was in 2010, for the same conference in the same part of town—the Warehouse District, next to the river. Back then, a little closer to Katrina, visual reminders of the flood that had already happened gave that part of the city an eerie feel. I stayed in a hotel a half-mile south of the conference venue, and the walk to the Hilton led me past whole blocks that were still mostly empty, fresh coats of bright paint covering the facades that water had submerged five years before.

Now, with pictures in the news of tunnels scratched out of huge snow banks in Boston and Manhattan ringed by ice, it’s the future flood that haunts New Orleans in my mind as I walk back from an excursion to the French Quarter to get the best possible version of a drink made from boiled water and beans grown thousands of miles away, scores of Mardi Gras bead strings still hanging from some gutters. Climate change is “weirding” our weather, rendering the models we use to anticipate events like Katrina less and less reliable. A flood will happen again, probably sooner than we expect, and yet here everybody is, returning and rebuilding and cavorting right where all that water will want to go.

Self Points

For the second year in a row, Dart-Throwing Chimp won Best Blog (Individual) at the Online Achievement in International Studies awards, a.k.a. the Duckies (see below). Thank you for continuing to read and, apparently, for voting.

[Image: the 2015 Duckie award]

Before the awards, Eva Brittin-Snell, a student of IR at the University of Sussex, interviewed a few of last year’s winners, including me, about blogging on international affairs. You can read her post on the SAGE Connection blog, here.

Estimating NFL Team-Specific Home-Field Advantage

This morning, I tinkered a bit with my pro-football preseason team strength survey data from 2013 and 2014 to see what other simple things I might do to improve the accuracy of forecasts derived from future versions of them.

My first idea is to go beyond a generic estimate of home-field advantage—about 3 points, according to my and everyone else’s estimates—with team-specific versions of that quantity. The intuition is that some venues confer a bigger advantage than others. For example, I would guess that Denver enjoys a bigger home-field edge than most teams because their stadium is at high altitude. The Broncos live there, so they’re used to it, but visiting teams have to adapt, and that process supposedly takes about a day for every 1,000 feet over 3,000. Some venues are louder than others, and that noise is often dialed up when visiting teams would prefer some quiet. And so on.

To explore this idea, I’m using a simple hierarchical linear model to estimate team-specific intercepts after taking preseason estimates of relative team strength into account. The R code used to estimate the model requires the lme4 package and looks like this:

library(lme4)

# one random intercept per home team captures team-specific home advantage
mod1 <- lmer(score.raw ~ wiki.diff + (1 | home_team), data = results)

Where

score.raw = home_score - visitor_score
wiki.diff = home_wiki - visitor_wiki

Those wiki vectors are the team strength scores estimated from preseason pairwise wiki surveys. The ‘results’ data frame includes scores for all regular and postseason games from those two years so far, courtesy of devstopfix’s NFL results repository on GitHub (here). Because the net game and strength scores are both ordered home to visitor, we can read those random intercepts for each home team as estimates of team-specific home advantage. There are probably other sources of team-specific bias in my data, so those estimates are going to be pretty noisy, but I think they’re a reasonable starting point.

My initial results are shown in the plot below, which I get with the following lines of code; the dotplot function comes from the lattice package:

library(lattice)

ha1 <- ranef(mod1, condVar = TRUE)  # random intercepts with conditional variances
dotplot(ha1)  # dot plot of those intercepts with interval estimates

Bear in mind that the generic (fixed) intercept is 2.7, so the estimated home-field advantage for each team is what’s shown in the plot plus that number. For example, these estimates imply that my Ravens enjoy a net advantage of about 3 points when they play in Baltimore, while their division-rival Bengals are closer to 6.

[Figure: estimated team-specific home-field advantages (random intercepts)]
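
Incidentally, if you want each team’s total estimated home advantage in a single step, lme4’s coef() adds the fixed intercept to each group’s random intercept; a quick sketch:

# coef() combines the generic (fixed) intercept with each team's random
# intercept, yielding total estimated home advantage by team
team_advantage <- coef(mod1)$home_team[, "(Intercept)"]
names(team_advantage) <- rownames(coef(mod1)$home_team)
sort(team_advantage, decreasing = TRUE)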

In light of DeflateGate, I guess I shouldn’t be surprised to see the Pats at the top of the chart, almost a whole point higher than the second-highest team. Maybe their insanely low home fumble rate has something to do with it.* I’m also heartened to see relatively high estimates for Denver, given the intuition that started this exercise, and Seattle, which I’ve heard said enjoys an unusually large home-field edge. At the same time, I honestly don’t know what to make of the exceptionally low estimates for DC and Jacksonville, who appear from these estimates to suffer a net home-field disadvantage. That strikes me as odd and undercuts my confidence in the results.

In any case, that’s how far my tinkering took me today. If I get really bored, er, motivated, I might try re-estimating the model with just the 2013 data and then running the 2014 preseason survey scores through that model to generate “forecasts” that I can compare to the ones I got from the simple linear model with just the generic intercept (here). The point of the exercise was to try to get more accurate forecasts from simple models, and the only way to test that is to do it. I’m also trying to decide whether I need to cross these team-specific effects with season-specific effects, to control for year-to-year differences in the biases of the wiki survey results when estimating the team-specific intercepts. But I’m not there yet.

* After I published this post, Michael Lopez helpfully pointed me toward a better take on the Patriots’ fumble rate (here), and Mo Patel observed that teams manage their own footballs on the road, too, so that particular tweak—if it really happened—wouldn’t have a home-field-specific effect.

Post Mortem on 2014 Preseason NFL Forecasts

Let’s end the year with a whimper, shall we?

Back in September (here), I used a wiki survey to generate a preseason measure of pro-football team strength and then ran that measure through a statistical model and some simulations to gin up forecasts for all 256 games of the 2014 regular season. That season ended on Sunday, so now we can see how those forecasts turned out.

The short answer: not awful, but not so great, either.

To assess the data and model’s predictive power, I’m going to focus on predicted win totals. Based on my game-level forecasts, how many contests was each team expected to win? Those totals nicely summarize the game-level predictions, and they are the focus of StatsbyLopez’s excellent post-season predictive review, here, against which I can compare my results.

StatsbyLopez used two statistics to assess predictive accuracy: mean absolute error (MAE) and mean squared error (MSE). The first is the average of the distance between each team’s projected and observed win totals. The second is the average of the square of those distances. MAE is a little easier to interpret—on average, how far off was each team’s projected win total?—while MSE punishes larger errors more than the first, which is nice if you care about how noisy your predictions are. StatsbyLopez used those stats to compare five sets of statistical predictions to the preseason betting line (Vegas) and a couple of simple benchmarks: last year’s win totals and a naive forecast of eight wins for everyone, which is what you’d expect to get if you just flipped a coin to pick winners.

Lopez’s post includes some nice column charts comparing those stats across sources, but it doesn’t include the stats themselves, so I’m going to have to eyeball his numbers and do the comparison in prose.

I summarized my forecasts two ways: 1) counts of the games each team had a better-than-even chance of winning, and 2) sums of each team’s predicted probabilities of winning.

  • The MAE for my whole-game counts was 2.48—only a little bit better than the ultra-naive eight-wins-for-everyone prediction and worse than everything else, including just using last year’s win totals. The MSE for those counts was 8.89, still worse than everything except the simple eights. For comparison, the MAE and MSE for the Vegas predictions were roughly 2.0 and 6.0, respectively.
  • The MAE for my sums was 2.31—about as good as the worst of the five “statsheads” Lopez considered, but still a shade worse than just carrying forward the 2013 win totals. The MSE for those summed win probabilities, however, was 7.05. That’s better than one of the sources Lopez considered and pretty close to two others, and it handily beats the two naive benchmarks.
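
For concreteness, here is a sketch of how those two summaries and their error statistics can be computed. The names are stand-ins: ‘games’ for a data frame with one row per game-team pair and columns team and win_prob, and ‘observed’ for a named vector of actual 2014 win totals:

# two ways to turn game-level win probabilities into team win totals
counts <- tapply(games$win_prob > 0.5, games$team, sum)  # better-than-even calls
sums   <- tapply(games$win_prob, games$team, sum)        # summed probabilities

# mean absolute error and mean squared error against observed win totals
mae <- function(pred, obs) mean(abs(pred - obs))
mse <- function(pred, obs) mean((pred - obs)^2)
mae(counts, observed[names(counts)])
mse(sums, observed[names(sums)])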

To get a better sense of how large the errors in my forecasts were and how they were distributed, I also plotted the predicted and observed win totals by team. In the charts below, the black dots are the predictions, and the red dots are the observed results. The first plot uses the whole-game counts; the second the summed win probabilities. Teams are ordered from left to right according to their rank in the preseason wiki survey.

[Figure: predicted (black) and observed (red) 2014 regular-season win totals by team, using whole-game counts]

[Figure: predicted (black) and observed (red) 2014 regular-season win totals by team, using summed win probabilities]

Substantively, those charts spotlight some things most football fans could already tell you: Dallas and Arizona were the biggest positive surprises of the 2014 regular season, while San Francisco, New Orleans, and Chicago were probably the biggest disappointments. Detroit and Buffalo also exceeded expectations, although only one of them made it to the postseason, while Tampa Bay, Tennessee, the NY Giants, and the Washington football team under-performed.

Statistically, it’s interesting but not surprising that the summed win probabilities do markedly better than the whole-game counts. Pro football is a noisy game, and we throw out a lot of information about the uncertainty of each contest’s outcome when we convert those probabilities into binary win/lose calls. In essence, those binary calls are inherently overconfident, so the win counts they produce are, predictably, much noisier than the ones we get by summing the underlying probabilities.

In spite of its modest performance in 2014, I plan to repeat this exercise next year. The linear regression model I use to convert the survey results into game-level forecasts has home-field advantage and the survey scores as its only inputs. The 2014 version of that model was estimated from just a single prior season’s data, so doubling the size of the historical sample to 512 games will probably help a little. Like all survey results, my team-strength score depends on the pool of respondents, and I keep hoping to get a bigger and better-informed crowd to participate in that part of the exercise. And, most important, it’s fun!

The Political Power of Inertia

Political scientists devote a lot of energy to theorizing about dramatic changes—things like revolutions, coups, popular uprisings, transitions to democracy, and the outbreak of wars within and between states. These changes are fascinating and consequential, but they are also extremely rare. In politics, as in physics, inertia is a powerful force. Our imagination is drawn to change, but if we want to understand the world as it is, then we have to explain the prevalence of continuity as well.

Examples of inertia in politics are easy to find. War is justifiably a central concern for political science, but for many decades now, almost none of the thousands of potential wars within and between states have actually happened. Once a war does start, though, it often persists for years in spite of the tremendous costs involved. The international financial system suffers frequent and sometimes severe shocks and has no sovereign to defend it, and yet the basic structure of that system has persisted for decades. Whole journals are devoted to popular uprisings and other social movements, but they very rarely happen, and when they do, they often fail to produce lasting institutional change. For an array of important phenomena in the social sciences, by far the best predictor of the status of the system at time (t + 1) is the status of the system at time (t).

One field in which inertia gets its due is organization theory. A central theme in that neck of the intellectual woods is the failure of firms and agencies to adapt to changes in their environment and the search for patterns that might explain those failures. Some theories of institutional design at the level of whole political systems also emphasize stasis over change. Institutions are sometimes said to be “sticky,” meaning that they often persist in spite of evident flaws and available alternatives. As Paul Pierson observes, “Once established, patterns of political mobilization, the institutional ‘rules of the game,’ and even citizens’ basic ways of thinking about the political world will often generate self-reinforcing dynamics.”

In international relations and comparative politics, we see lots of situations in which actions that might improve the lot of one or more parties are not taken. These are situations in which inertia is evident, even though it appears to be counterproductive. We often explain failures to act in these situations as the result of collective action problems. As Mancur Olson famously observed, people, organizations, and other agents have diverse interests; action to try to produce change is costly; and the benefits of those costly actions are often diffuse. Under these circumstances, a tally of expected costs and benefits will often discourage agents from taking action, tempting them to forgo those costs and free ride on the contributions of others instead.

Collective action problems are real and influential. Still, I wonder if our theories put too much emphasis on those system-level sources of inertia and too little on causes at the level of the individual. We like to think of ourselves as free and unpredictable, but humans really are creatures of habit. For example, a study published in 2010 in Science (here) used data sampled from millions of mobile-phone users to show that there is “a potential 93% average predictability” in where users go and when, “an exceptionally high value rooted in the inherent regularity of human behavior.” The authors conclude:

Despite our deep-rooted desire for change and spontaneity, our daily mobility is, in fact, characterized by a deep-rooted regularity.

A related study (here) used mobility and survey data from Kenya and found essentially the same thing. Its authors reported that “mobility estimates are surprisingly robust to the substantial biases in phone ownership across different geographical and socioeconomic groups.” Apparently, this regularity is not unique to rich countries.

The microfoundations of our devotion to routine may be evident in neurobiology. Behavioral routines are physically expressed and reinforced in the development of neural pathways related to specific memories and actions, and in the thickening of the myelin sheaths that facilitate conduction along those pathways. The result is a virtuous or vicious circle, depending on the behavior and context. Athletes and musicians take advantage of this process through practice, but practice is mostly repetition, and repetition is a form of routine. Repetition begets habituation begets repetition.

This innate attachment to routine may contribute to political inertia. Norms and institutions are often regarded as clever solutions to collective action problems that would otherwise thwart our interests and aspirations. At least in part, those norms and institutions may also be social manifestations of an inborn and profound preference for routine and regularity.

In our theoretical imaginations, we privilege change over stasis. As alternative futures, however, the two are functionally equivalent, and stasis is vastly more common than change. In principle, our theories should cover both alternatives. In practice, that is very hard to do, and many of us choose to emphasize the dramatic over the routine. I wonder if we have chosen wrong.

For now, I’ll give the last word on this topic to Frank Rich. He wrote a nice essay for the October 20, 2014, issue of New York Magazine about an exercise in which he read his way back through the daily news from 1964 to compare it to the supposedly momentous changes afoot in 2014. His conclusion:

Even as we recognize that the calendar makes for a crude and arbitrary marker, we like to think that history visibly marches on, on a schedule we can codify.

The more I dove back into the weeds of 1964, the more I realized that this is both wishful thinking and an optical illusion. I came away with a new appreciation of how selective our collective memory is, and of just how glacially history moves.

Turning Crowdsourced Preseason NFL Strength Ratings into Game-Level Forecasts

For the past week, nearly all of my mental energy has gone into the Early Warning Project and a paper for the upcoming APSA Annual Meeting here in Washington, DC. Over the weekend, though, I found some time for a toy project on forecasting pro-football games. Here are the results.

The starting point for this toy project is a pairwise wiki survey that turns a crowd’s beliefs about relative team strength into scalar ratings. Regular readers will recall that I first experimented with one of these before the 2013-2014 NFL season, and the predictive power wasn’t terrible, especially considering that the number of participants was small and the ratings were completed before the season started.

This year, to try to boost participation and attract a more knowledgeable crowd of respondents, I paired with Trey Causey to announce the survey on his pro-football analytics blog, The Spread. The response has been solid so far. Since the survey went up, the crowd—that’s you!—has cast nearly 3,400 votes in more than 100 unique user sessions (see the Data Visualizations section here).

The survey will stay open throughout the season, but that doesn’t mean it’s too early to start seeing what it’s telling us. One thing I’ve already noticed is that the crowd does seem to be updating in response to preseason action. For example, before the first round of games, I noticed that the Baltimore Ravens, my family’s favorites, were running mid-pack with a rating of about 50. After they trounced the 49ers, last season’s NFC runners-up, in their preseason opener, however, the Ravens jumped to the upper third with a rating of 59. (You can always see up-to-the-moment survey results here, and you can cast your own votes here.)

The wiki survey is a neat way to measure team strength. On their own, though, those ratings don’t tell us what we really want to know, which is how each game is likely to turn out, or how well our team might be expected to do this season. The relationship between relative strength and game outcomes should be pretty strong, but we might want to consider other factors, too, like home-field advantage. To turn a strength rating into a season-level forecast for a single team, we need to consider the specifics of its schedule. In game play, it’s relative strength that matters, and some teams will have tougher schedules than others.

A statistical model is the best way I can think to turn ratings into game forecasts. To get a model to apply to this season’s ratings, I estimated a simple linear one from last year’s preseason ratings and the results of all 256 regular-season games (found online in .csv format here). The model estimates net score (home minus visitor) from just one feature, the difference between the two teams’ preseason ratings (again, home minus visitor). Because the net scores are all ordered the same way and the model also includes an intercept, though, it implicitly accounts for home-field advantage as well.

The scatterplot below shows the raw data on those two dimensions from the 2013 season. The model estimated from these data has an intercept of 3.1 and a slope of 0.1 for the score differential. In other words, the model identifies a net home-field advantage of 3 points—consistent with the conventional wisdom—and it suggests that every point of advantage on the wiki-survey ratings translates into a net swing of one-tenth of a point on the field. I also tried a generalized additive model with smoothing splines to see if the association between the survey-score differential and net game score was nonlinear, but as the scatterplot suggests, it doesn’t seem to be.

[Figure: 2013 NFL games arranged by net game score and preseason wiki survey rating differentials]
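
The estimation itself is just a couple of lines. In the sketch below, ‘results13’ stands in for the 2013 data, with net score and rating differential both ordered home minus visitor:

# simple linear model: net game score on preseason rating differential;
# the intercept captures home-field advantage because both variables are
# ordered home minus visitor
mod <- lm(net.score ~ rating.diff, data = results13)

# nonlinearity check with a smoothing spline (mgcv assumed here; the
# original check may have used a different GAM implementation)
library(mgcv)
mod.gam <- gam(net.score ~ s(rating.diff), data = results13)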

In sample, the linear model’s accuracy was good, not great. If we convert the net scores the model postdicts to binary outcomes and compare those postdictions to actual outcomes, we see that the model correctly classifies 60 percent of the games. That’s in sample, but it’s also based on nothing more than home-field advantage and a single preseason rating for each team from a survey with a small number of respondents. So, all things considered, it looks like a potentially useful starting point.
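
Under the same assumed names, the in-sample classification check is a one-liner:

# convert postdicted net scores to binary home-win calls, compare to outcomes
mean((fitted(mod) > 0) == (results13$net.score > 0))  # ~0.60 in sample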

Whatever its limitations, that model gives us the tool we need to convert 2014 wiki survey results into game-level predictions. To do that, we also need a complete 2014 schedule. I couldn’t find one in .csv format, but I found something close (here) that I saved as text, manually cleaned in a minute or so (deleted extra header rows, fixed remaining header), and then loaded and merged with a .csv of the latest survey scores downloaded from the manager’s view of the survey page on All Our Ideas.

I’m not going to post forecasts for all 256 games—at least not now, with three more preseason games to learn from and, hopefully, lots of votes yet to be cast. To give you a feel for how the model is working, though, I’ll show a couple of cuts on those very preliminary results.

The first is a set of forecasts for all Week 1 games. The labels show Visitor-Home, and the net score is ordered the same way. So, a predicted net score greater than 0 means the home team (second in the paired label) is expected to win, while a predicted net score below 0 means the visitor (first in the paired label) is expected to win. The lines around the point predictions represent 90-percent confidence intervals, giving us a partial sense of the uncertainty around these estimates.

[Figure: Week 1 game forecasts from preseason wiki survey results, 10 August 2014]
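
With a plain linear model, those point predictions and intervals can come straight from predict(). Another sketch, with ‘sched14’ standing in for the merged 2014 schedule and survey scores, assumed to carry the same rating.diff column the model was fit on:

# point predictions plus 90-percent prediction intervals for each game
fc <- predict(mod, newdata = sched14, interval = "prediction", level = 0.90)
week1 <- cbind(sched14, fc)[sched14$week == 1, ]
week1[, c("visitor", "home", "lwr", "fit", "upr")]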

Of course, as a fan of a particular team, I’m most interested in what the model says about how my guys are going to do this season. The next plot shows predictions for all 16 of Baltimore’s games. Unfortunately, the plotting command orders the data by label, and my R skills and available time aren’t sufficient to reorder them by week, but the information is all there. In this plot, the dots for the point predictions are colored red if they predict a Baltimore win and black for an expected loss. The good news for Ravens fans is that this plot suggests an 11-5 season, good enough for a playoff berth. The bad news is that an 8-8 season also lies within the 90-percent confidence intervals, so the playoffs don’t look like a lock.

[Figure: 2014 game-level forecasts for the Baltimore Ravens, from 10 August 2014 wiki survey scores]

So that’s where the toy project stands now. My intuition tells me that the predicted net scores aren’t as well calibrated as I’d like, and the estimated confidence intervals surely understate the true uncertainty around each game (“On any given Sunday…”). Still, I think this exercise demonstrates the potential of this forecasting process. If I were a betting man, I wouldn’t lay money on these estimates. As an applied forecaster, though, I can imagine using these predictions as priors in a more elaborate process that incorporates additional and, ideally, more dynamic information about each team and game situation over the course of the season. Maybe my doppelganger can take that up while I get back to my day job…

Postscript. After I published this post, Jeff Fogle suggested via Twitter that I compare the Week 1 forecasts to the current betting lines for those games. The plot below shows the median point spread from an NFL odds-aggregating site as blue dots on top of the statistical forecasts already shown above. As you can see, the statistical forecasts are tracking the betting lines pretty closely. There’s only one game—Carolina at Tampa Bay—where the predictions from the two series fall on different sides of the win/loss line, and it’s a game the statistical model essentially sees as a toss-up. It’s also reassuring that there isn’t a consistent direction to the differences, so the statistical process doesn’t seem to be biased in some fundamental way.

[Figure: Week 1 game-level forecasts compared to the median point spread from betting sites, 11 August 2014]
