Pro football starts back up for real in just a few weeks. Who’s going to win the next Super Bowl?

This is a hard forecasting problem. The real action hasn’t even started yet, so most of the information we have about how a team will play this season is just extrapolated from other recent seasons, and even without big personnel changes, team performance can vary quite a bit from year to year. Also, luck plays a big role in pro football; one bad injury to a key player or one unlucky break in a playoff game can visibly bend (or end) the arc of a team’s season. I suspect that most fans could do a pretty good job sorting teams now by their expected strength, but correctly guessing exactly who will win the championship is a much tougher nut to crack.

Of course, that doesn’t mean people aren’t trying. PredictWise borrows from online betting site Betfair to give us one well-informed set of forecasts about that question. Here’s a screenshot from this morning (August 11, 2013) of PredictWise’s ten most likely winners of Super Bowl XLVIII:

I’m not trying to make the leap into sports bookmaking, but I am interested in hard forecasting questions, and I’m currently using this Super Bowl-champs question as a toy problem to explore how we might apply a relatively new crowdsourced survey technology to forecasting rare events. So far, I think the results look promising.

The technology is a pairwise wiki survey. It’s being developed by a Princeton-based research project called All Our Ideas, and according to its creators, here’s how it works:

[A pairwise wiki survey] consists of a single question with many possible answer items. Respondents can participate in a pairwise wiki survey in two ways: first, they can make pairwise comparisons between items (i.e., respondents vote between item A and item B); and second, they can add new items that are then presented to future respondents.

The resulting pairwise votes are converted into aggregate ratings using a Bayesian hierarchical model that estimates collective preferences most consistent with the observed data.

Pairwise wiki surveys weren’t designed specifically for forecasting, but I think we can readily adapt them to versions of that task that involve comparisons of risk. Instead of asking which item in a pair respondents prefer, we can ask them which item in a pair is more likely. The resulting scores won’t quite be the forecasts we’re looking for, but they’ll contain a lot of useful information about rank ordering and relative likelihoods.

My Super Bowl survey has only been running for a few days now, but I’ve already got more than 1,300 votes, and the results it’s produced so far look pretty credible to me. Here’s a screenshot of the top 10 from my wiki survey on the next NFL champs, as of 8 AM on August 11, 2013. As you can see, the top 10 is nearly identical to the top 10 at Betfair—they’ve got the Bengals where my survey has the NY Giants—and even within the top 10, the teams sort pretty similarly.

The 0-100 scores in that chart aren’t estimates of the probability that a team will win the Super Bowl. Because only one team can win, those estimates would have to sum to 100 across all 32 teams in the league. Instead, the scores shown here are the model-estimated chances that each team will win if pitted against another team chosen at random. As such, they’re better thought of as estimates of relative strength with an as-yet unspecified relationship to the probability of winning the championship.
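To make the idea of a "chance of winning against a random opponent" score concrete, here is a minimal sketch that turns pairwise votes into 0-100 scores. It is not the Bayesian hierarchical model All Our Ideas uses. It is a much cruder stand-in (a smoothed share of contests won), and the function name, the smoothing parameters, and the votes are all hypothetical.

```python
from collections import defaultdict


def win_rate_scores(votes, prior_wins=1.0, prior_losses=1.0):
    """Return a 0-100 score per item from (winner, loser) vote pairs.

    Crude stand-in for the real model: each item's score is its smoothed
    share of the contests it appeared in. The priors (an assumption) keep
    items with few votes away from 0 and 100.
    """
    wins = defaultdict(float)
    appearances = defaultdict(float)
    for winner, loser in votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return {
        item: 100.0 * (wins[item] + prior_wins)
        / (appearances[item] + prior_wins + prior_losses)
        for item in appearances
    }


# Hypothetical votes: each tuple is (item judged more likely, other item).
votes = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B")]
scores = win_rate_scores(votes)
for item, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{item}: {score:.0f}")
```

Like the scores in the chart, these numbers are not constrained to sum to 100 across items, which is why they read as relative strength rather than as probabilities of winning it all.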

This scalar measure of relative strength will often be interesting on its own, but for forecasting applications, we’d usually prefer to have these values expressed as probabilities. Following PredictWise, we can get there by summing all the scores and treating that sum as the denominator of a ratio that behaves like a true probability: each team’s share falls between 0 and 1, and the shares sum to 1 across the league. For example, when that screenshot of my wiki survey was taken, the scores across all 32 NFL teams summed to 1,607, so the estimated probability of the Atlanta Falcons winning Super Bowl XLVIII was 5.4% (87/1,607), while the chances that my younger son’s Ravens will repeat were pegged at about 4.9% (79/1,607).
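The arithmetic is simple enough to sketch in a few lines of Python. The Falcons and Ravens scores and the 1,607 total come from the survey snapshot described above. I have lumped the other 30 teams into a single remainder entry because their individual scores aren't listed here, and the function name is just my own label.

```python
def scores_to_probabilities(scores):
    """Divide each score by the sum of all scores so the results sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must sum to a positive number")
    return {name: score / total for name, score in scores.items()}


# Scores from the August 11, 2013 snapshot; the remainder is 1607 - 87 - 79.
scores = {
    "Falcons": 87,
    "Ravens": 79,
    "all other teams combined": 1441,
}
probs = scores_to_probabilities(scores)
print(f"Falcons: {probs['Falcons']:.1%}")  # Falcons: 5.4%
print(f"Ravens: {probs['Ravens']:.1%}")    # Ravens: 4.9%
```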

For problems with a unique outcome, this conversion is easy to do, because the contours of the event space are known in advance. As the immortals in *Highlander* would have it, “There can be only one.”

Things get tougher if we want to apply this technique to problems where there isn’t a fixed number of events—say, trying to anticipate where coups or insurgencies or mass atrocities are likely to happen in the coming year. One way to extend the method to these kinds of problems would be to use historical data to identify the base rate of relevant events and then use that base rate as a multiplier in the transformation math as follows:

predicted probability = base rate * [ score / (sum of scores) ]

When a rare-events model is well calibrated, the sum of the predicted probabilities it produces should roughly equal the number of events that actually occur. The approach I’m proposing just works that logic in reverse, starting with the (reasonable) assumption that the base rate is a good forecast of the number of events that will occur and then inflating or deflating the estimated probabilities accordingly.

For example, say I’m interested in forecasting onsets of state-sponsored mass killing around the world, and I know the annual base rate of these events over the past few decades has only been about 1.2 onsets per year worldwide. I could use a pairwise wiki survey to ask respondents “Which country is more likely to experience an onset of state-sponsored mass killing in 2014?” and get scores like the ones in the All Our Ideas chart above. To convert the score for Country X to a predicted probability, I could sum the resulting scores for all countries, divide Country X’s score by that sum, and then multiply the result by 1.2.
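Here is a rough sketch of that transformation. The 1.2 base rate is the figure from the example above, but the country labels and scores are invented placeholders, and the function name is hypothetical. One design choice is my own assumption: with a base rate above 1, a dominant score could produce a value greater than 1, so the sketch raises an error instead of quietly returning something that isn't a probability.

```python
def rare_event_probabilities(scores, base_rate):
    """Apply: probability = base_rate * score / (sum of scores)."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must sum to a positive number")
    probs = {name: base_rate * s / total for name, s in scores.items()}
    too_big = [name for name, p in probs.items() if p > 1.0]
    if too_big:
        raise ValueError(f"implied probability above 1 for: {too_big}")
    return probs


# Invented placeholder scores; a real survey would cover many more countries,
# which would spread the same 1.2 expected events much more thinly.
hypothetical_scores = {
    "Country X": 80.0,
    "Country Y": 40.0,
    "Country Z": 30.0,
    "Country W": 10.0,
}
probs = rare_event_probabilities(hypothetical_scores, base_rate=1.2)
print(probs["Country X"])
# By construction, the probabilities sum to the base rate.
print(sum(probs.values()))
```

The sum matching the base rate is the calibration logic run in reverse: the expected number of events implied by the forecasts equals the historical average.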

This process might seem a bit ad hoc, but I think it’s one reasonable solution to a tough problem. In fact, this is basically the same thing that happens in a logistic regression model, which statisticians (and wannabes like me) often use to forecast discrete events. In the equation we get from a logistic regression model, the intercept captures information about the base rate and uses that as a starting point for all of the responses, which are initially expressed as log odds. The predictors and their associated weights just slide the log odds up or down from that benchmark, and a final operation converts those log odds to a familiar 0-1 probability.
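To illustrate the analogy, here is a small sketch of how a logistic regression prediction is assembled. Every number in it is made up: the 1% base rate, the single predictor, and its weight of 0.8 are illustrative assumptions, not estimates from any fitted model. The intercept equals the log odds of the base rate only in the simplified case where it describes a case with all predictors at zero.

```python
import math


def logit(p):
    """Convert a probability to log odds."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Convert log odds back to a 0-1 probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(intercept, weights, features):
    """Start at the intercept, slide the log odds by weights * features."""
    log_odds = intercept + sum(w * x for w, x in zip(weights, features))
    return inv_logit(log_odds)


base_rate = 0.01              # hypothetical per-case base rate
intercept = logit(base_rate)  # the starting point, in log odds
weights = [0.8]               # hypothetical weight on one risk factor

baseline = predict(intercept, weights, [0.0])  # risk factor absent
elevated = predict(intercept, weights, [1.0])  # risk factor present
print(f"baseline: {baseline:.4f}, elevated: {elevated:.4f}")
```

With the risk factor absent, the prediction is just the base rate. With it present, the log odds slide upward and the probability rises.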

On the football problem, I would expect Betfair to be more accurate than my wiki survey, because Betfair’s odds are based on the behavior of people who are putting money on the line. For rare events in international politics, though, there is no Betfair equivalent. In situations where statistical modeling is inefficient or impossible, or where we just want to know what some pool of respondents believes the risks are, I think this wiki-survey approach could be a useful tool.