Pro football starts back up for real in just a few weeks. Who’s going to win the next Super Bowl?

This is a hard forecasting problem. The real action hasn’t even started yet, so most of the information we have about how a team will play this season is just extrapolated from other recent seasons, and even without big personnel changes, team performance can vary quite a bit from year to year. Also, luck plays a big role in pro football; one bad injury to a key player or one unlucky break in a playoff game can visibly bend (or end) the arc of a team’s season. I suspect that most fans could do a pretty good job sorting teams now by their expected strength, but correctly guessing exactly who will win the championship is a much tougher nut to crack.

Of course, that doesn’t mean people aren’t trying. PredictWise borrows from online betting site Betfair to give us one well-informed set of forecasts about that question. Here’s a screenshot from this morning (August 11, 2013) of PredictWise’s ten most likely winners of Super Bowl XLVIII:

I’m not trying to make the leap into sports bookmaking, but I am interested in hard forecasting questions, and I’m currently using this Super Bowl-champs question as a toy problem to explore how we might apply a relatively new crowdsourced survey technology to forecasting rare events. So far, I think the results look promising.

The technology is a pairwise wiki survey. It’s being developed by a Princeton-based research project called All Our Ideas, and according to its creators, here’s how it works:

[A pairwise wiki survey] consists of a single question with many possible answer items. Respondents can participate in a pairwise wiki survey in two ways: first, they can make pairwise comparisons between items (i.e., respondents vote between item A and item B); and second, they can add new items that are then presented to future respondents.

The resulting pairwise votes are converted into aggregate ratings using a Bayesian hierarchical model that estimates collective preferences most consistent with the observed data.

Pairwise wiki surveys weren’t designed specifically for forecasting, but I think we can readily adapt them to versions of that task that involve comparisons of risk. Instead of asking which item in a pair respondents prefer, we can ask them which item in a pair is more likely. The resulting scores won’t quite be the forecasts we’re looking for, but they’ll contain a lot of useful information about rank ordering and relative likelihoods.

My Super Bowl survey has only been running for a few days now, but I’ve already got more than 1,300 votes, and the results it’s produced so far look pretty credible to me. Here’s a screenshot of the top 10 from my wiki survey on the next NFL champs, as of 8 AM on August 11, 2013. As you can see, the top 10 is nearly identical to the top 10 at Betfair—they’ve got the Bengals where my survey has the NY Giants—and even within the top 10, the teams sort pretty similarly.

The 0-100 scores in that chart aren’t estimates of the probability that a team will win the Super Bowl. Because only one team can win, those estimates would have to sum to 100 across all 32 teams in the league. Instead, the scores shown here are the model-estimated chances that each team will win if pitted against another team chosen at random. As such, they’re better thought of as estimates of relative strength with an as-yet unspecified relationship to the probability of winning the championship.
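To make that definition concrete, here is a rough Python sketch of a much cruder score than the one All Our Ideas reports. It is not their Bayesian hierarchical model; it simply computes each item's share of the votes it appeared in, which approximates the chance of beating a randomly chosen opponent only if pairs are presented more or less uniformly. The function name `naive_scores` and the three-team vote list are made up for illustration.

```python
from collections import Counter


def naive_scores(votes):
    """Return 100 * wins / appearances for every item seen in the votes.

    `votes` is a list of (winner, loser) pairs. This is a simple stand-in,
    not the Bayesian hierarchical model used by All Our Ideas.
    """
    wins = Counter()
    appearances = Counter()
    for winner, loser in votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return {item: 100.0 * wins[item] / appearances[item] for item in appearances}


# Made-up votes among three hypothetical teams, listed as (winner, loser)
toy_votes = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
toy_scores = naive_scores(toy_votes)

for team, score in sorted(toy_scores.items()):
    print(f"{team}: {score:.1f}")
```

With so few votes, a raw win share like this is very noisy, which is one reason a model-based estimate is preferable in practice.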

This scalar measure of relative strength will often be interesting on its own, but for forecasting applications, we’d usually prefer to have these values expressed as probabilities. Following PredictWise, we can get there by summing all the scores and dividing each team’s score by that sum, which gives a set of shares that add up to 1 and so behave like probabilities. For example, when that screenshot of my wiki survey was taken, the scores across all 32 NFL teams summed to 1,607, so the estimated probability of the Atlanta Falcons winning Super Bowl XLVIII was 5.4% (87/1,607), while the chances that my younger son’s Ravens will repeat were pegged at about 4.9% (79/1,607).
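The arithmetic is simple enough to sketch in a few lines of Python. The function name `scores_to_probabilities` is my own, and because only two team scores and the league total are quoted here, the other 30 teams are lumped into a single placeholder entry worth the remainder (1,607 - 87 - 79).

```python
def scores_to_probabilities(scores):
    """Divide each score by the sum of all scores so the results sum to 1."""
    total = sum(scores.values())
    return {name: value / total for name, value in scores.items()}


# Two scores and the league total come from the post; the third entry is a
# placeholder that stands in for the other 30 teams combined.
scores = {
    "Falcons": 87,
    "Ravens": 79,
    "other 30 teams combined": 1607 - 87 - 79,
}
probs = scores_to_probabilities(scores)

print(f"Falcons: {probs['Falcons']:.1%}")  # Falcons: 5.4%
print(f"Ravens: {probs['Ravens']:.1%}")    # Ravens: 4.9%
```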

For problems with a unique outcome, this conversion is easy to do, because the contours of the event space are known in advance. As the immortals in *Highlander* would have it, “There can be only one.”

Things get tougher if we want to apply this technique to problems where there isn’t a fixed number of events—say, trying to anticipate where coups or insurgencies or mass atrocities are likely to happen in the coming year. One way to extend the method to these kinds of problems would be to use historical data to identify the base rate of relevant events and then use that base rate as a multiplier in the transformation math as follows:

predicted probability = base rate * [ score / (sum of scores) ]

When a rare-events model is well calibrated, the sum of the predicted probabilities it produces should roughly equal the number of events that actually occur. The approach I’m proposing just works that logic in reverse, starting with the (reasonable) assumption that the base rate is a good forecast of the number of events that will occur and then inflating or deflating the estimated probabilities accordingly.

For example, say I’m interested in forecasting onsets of state-sponsored mass killing around the world, and I know the annual base rate of these events over the past few decades has only been about 1.2. I could use a pairwise wiki survey to ask respondents “Which country is more likely to experience an onset of state-sponsored mass killing in 2014?” and get scores like the ones in the All Our Ideas chart above. To convert the score for Country X to a predicted probability, I could sum the resulting scores for all countries, divide Country X’s score by that sum, and then multiply the result by 1.2.
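Here is a minimal Python sketch of that transformation. The four countries and their scores are invented purely for illustration, and `base_rate_probabilities` is a name I made up; only the 1.2 base rate comes from the example above.

```python
def base_rate_probabilities(scores, base_rate):
    """Scale each item's share of the total score by the expected event count."""
    total = sum(scores.values())
    return {name: base_rate * value / total for name, value in scores.items()}


# Hypothetical wiki-survey scores; a real survey would cover many more countries.
hypothetical_scores = {
    "Country W": 80,
    "Country X": 50,
    "Country Y": 40,
    "Country Z": 30,
}
predicted = base_rate_probabilities(hypothetical_scores, base_rate=1.2)

# By construction, the predicted probabilities sum to the base rate.
expected_events = sum(predicted.values())
for name, p in predicted.items():
    print(f"{name}: {p:.2f}")
print(f"expected number of onsets: {expected_events:.1f}")
```

With only four made-up countries, the individual values come out far larger than they would in a survey covering every country in the world. Note also that nothing in this formula caps a value at 1, so a very lopsided set of scores would need some extra handling.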

This process might seem a bit ad hoc, but I think it’s one reasonable solution to a tough problem. In fact, this is basically the same thing that happens in a logistic regression model, which statisticians (and wannabes like me) often use to forecast discrete events. In the equation we get from a logistic regression model, the intercept captures information about the base rate and uses that as a starting point for all of the responses, which are initially expressed as log odds. The vector of predictors and the associated weights just slide the log odds up or down from that benchmark, and a final operation converts those log odds to a familiar 0-1 probability.
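The analogy can be sketched in Python using only the standard library. The 1% base rate, the single predictor, and its weight of 0.8 are all hypothetical numbers chosen for illustration, and the intercept equals the log odds of the base rate only in the simple case where every predictor is zero.

```python
import math


def logit(p):
    """Convert a probability to log odds."""
    return math.log(p / (1.0 - p))


def inverse_logit(log_odds):
    """Convert log odds back to a 0-1 probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def predict(intercept, weights, predictors):
    """Start from the intercept, slide the log odds by the weighted predictors,
    then convert the result to a probability."""
    log_odds = intercept + sum(w * x for w, x in zip(weights, predictors))
    return inverse_logit(log_odds)


# Hypothetical base rate of 1%, expressed as log odds for the intercept.
intercept = logit(0.01)

baseline = predict(intercept, [0.0], [1.0])  # zero weight: stays at the base rate
shifted = predict(intercept, [0.8], [1.0])   # positive weight: risk moves up

print(f"baseline: {baseline:.4f}")
print(f"shifted:  {shifted:.4f}")
```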

On the football problem, I would expect Betfair to be more accurate than my wiki survey, because Betfair’s odds are based on the behavior of people who are putting money on the line. For rare events in international politics, though, there is no Betfair equivalent. In situations where statistical modeling is inefficient or impossible—or we just want to know what some pool of respondents believe the risks are—I think this wiki-survey approach could be a useful tool.

## Tom Parris

/ August 11, 2013

Jay,

This is really interesting. I have two comments.

1. Your method for converting relative rankings to probabilities is ignoring a major known factor: the rules of the game. In this case we know the schedules in advance of the season and the rules for who advances to the playoffs, the Super Bowl, etc. So, what you really want to do is use your relative strength pairings in a Monte Carlo simulation of the season. This would be analogous to what Nate Silver did with the electoral college. My only question is how you would deal with “home field advantage.” You might need respondents to answer two questions about relative strength based on where the game is played. Note that you can also trim the set of possible pairings by using the schedule information. Without having fully worked out the math, I believe that your current method basically assumes all teams play each other in a round robin tournament and the team with the best overall record wins.

2. There is another application. Imagine a situation in which you want to subjectively rank countries on some metric (e.g., level of corruption). No individual respondent would have the knowledge required to effectively do that for all countries; at best, they would only have such knowledge for a small subset of all the countries in the world (hopefully >=2 countries per respondent). A mild adaptation of the pairwise approach would be to have respondents rank the subset of countries about which they have some level of expertise upon which to base their judgement. Assuming a rich enough set of respondents, one could then construct a global ranking from the small subsets provided by individual respondents. I believe this would be much more effective than current surveys that either ask people to respond for a single country (and therefore have no way to calibrate against responses from other countries), or rank countries about which they have very little knowledge.

— Tom

## dartthrowingchimp

/ August 11, 2013

Thanks, Tom. Great comments as always.

On your first point, that would certainly be a more sophisticated way to use the resulting data for football forecasting. Unfortunately, all of the problems that really interest me professionally have data-generating processes that aren’t so well defined. So for my toy problem, I’m treating football as if it were also fuzzier. In any case, it’d be interesting to see how much added power you’d get from that additional complexity, especially this far away from the playoffs.

On your second point, I agree that this method could be really useful for eliciting rankings on many topics of interest to observers of international politics. In fact, part of what turned me on to this idea was an attempt by political scientist Xavier Marquez to use a pairwise wiki survey to produce global data on degrees of democracy (see here). As implemented by All Our Ideas, the process actually allows for people to restrict their responses to cases about which they have more knowledge by giving them the option of saying “I can’t tell” and then picking from a menu of reasons why, including “I don’t know enough about X” or Y or both. So I think their off-the-shelf (and open source!) version is ready now to handle the kind of self-limited responses you’re proposing. So, hey, why not give it a whirl—maybe for a new version of the State Capacity Index?

## Tom Parris

/ August 11, 2013

p.s. You only need to deal with the “home field advantage” issue for the playoffs in the Monte Carlo simulation. That information is known for the regular season. One could use historical statistics to develop a weighting factor for home field advantage rather than making the survey instrument more complex.

## Sean J. Taylor (@seanjtaylor)

/ August 11, 2013

Hey Jay,

We’ve played around with a UI like this for Creds. The idea is to not show the users a probability for either statement, then ask them which has a higher probability. If their choice violates the ordering of current market prices, we make a bet on their behalf. That way we’re able to use odds and turn (some of) the decisions into bets.

What I find so valuable about this approach is that the users don’t need to calibrate their internal probability models. I think this is probably a much better way of eliciting subjective probability distributions than asking for a number.

Regarding consistency of the probability distribution, it falls out naturally from our market maker approach (we enforce that a categorical distribution constrains the sum of the probabilities). I’m working on a bit of constraint-enforcer code that can accommodate complex dependencies between statements. Creds should have something like that in time for the NFL season, so when you bet on playoff odds, the individual game probabilities will change and vice versa.

## dartthrowingchimp

/ August 12, 2013

Great stuff, Sean, thanks.

On eliciting subjective probability distributions, I’m now aware of at least three ways to do this: 1) ask people for a number, 2) ask people to adjust a plotted distribution, and 3) let people trade on a market. I’ve seen evidence from the Good Judgment Project comparing the accuracy of aggregates from 1 & 3, and so far the results favor 3, but the differences aren’t dramatic. Meanwhile, I’ve only seen one implementation of 2 and no comparison of the accuracy of those results to other strategies. I guess the pairwise comparison approach I’m proposing here is a fourth option, and you’re already tinkering with a method that blends that elicitation strategy with a market mechanism for aggregation. Wow: so many possibilities, so little time…