The Wisdom of Crowds, Oscars Edition

Forecasts derived from prediction markets did an excellent job predicting last night’s Academy Awards.

PredictWise uses odds from online bookmaker Betfair for its Oscars forecasts, and it nearly ran the table. PredictWise assigned the highest probability to the eventual winner in 21 of 24 awards, and its three “misses” came in less prominent categories (Best Documentary, Best Short, Best Animated Short). Even more telling, its calibration was excellent. The probability assigned to the eventual winner in each category averaged 87 percent, and most winners were correctly identified as nearly sure things.

Inkling Markets also did quite well. This public, play-money prediction market has a lot less liquidity than Betfair, but it still assigned the highest probability of winning to the eventual winner in 17 of the 18 categories it covered (it “missed” on Best Original Song), and for extra credit it correctly identified Gravity as the film that would probably win the most Oscars. Just by eyeballing, it’s clear that Inkling’s calibration wasn’t as good as PredictWise’s, but that’s what we’d expect from a market with a much smaller pool of participants. In any case, you still probably would have won your Oscars pool if you’d relied on it.

This is the umpteen-gajillionth reminder that crowds are powerful forecasters. “When our imperfect judgments are aggregated in the right way,” James Surowiecki wrote (p. xiv), “our collective intelligence is often excellent.” Or, as PredictWise’s David Rothschild said in his live blog last night,

This is another case of pundits and insiders advertising a close event when the proper aggregation of data said it was not. As I noted on Twitter earlier, my acceptance speech is short. I would like to thank prediction markets for efficiently aggregating dispersed and idiosyncratic data.

Using Wiki Surveys to Forecast Rare Events

Pro football starts back up for real in just a few weeks. Who’s going to win the next Super Bowl?

This is a hard forecasting problem. The real action hasn’t even started yet, so most of the information we have about how a team will play this season is just extrapolated from other recent seasons, and even without big personnel changes, team performance can vary quite a bit from year to year. Also, luck plays a big role in pro football; one bad injury to a key player or one unlucky break in a playoff game can visibly bend (or end) the arc of a team’s season. I suspect that most fans could do a pretty good job sorting teams now by their expected strength, but correctly guessing exactly who will win the championship is a much tougher nut to crack.

Of course, that doesn’t mean people aren’t trying. PredictWise borrows from online betting site Betfair to give us one well-informed set of forecasts about that question. Here’s a screenshot from this morning (August 11, 2013) of PredictWise’s ten most likely winners of Super Bowl XLVIII:

[Screenshot: PredictWise forecast for the 2014 Super Bowl, August 11, 2013]

I’m not trying to make the leap into sports bookmaking, but I am interested in hard forecasting questions, and I’m currently using this Super Bowl-champs question as a toy problem to explore how we might apply a relatively new crowdsourced survey technology to forecasting rare events. So far, I think the results look promising.

The technology is a pairwise wiki survey. It’s being developed by a Princeton-based research project called All Our Ideas, and according to its creators, here’s how it works:

[A pairwise wiki survey] consists of a single question with many possible answer items. Respondents can participate in a pairwise wiki survey in two ways: first, they can make pairwise comparisons between items (i.e., respondents vote between item A and item B); and second, they can add new items that are then presented to future respondents.

The resulting pairwise votes are converted into aggregate ratings using a Bayesian hierarchical model that estimates collective preferences most consistent with the observed data.

Pairwise wiki surveys weren’t designed specifically for forecasting, but I think we can readily adapt them to versions of that task that involve comparisons of risk. Instead of asking which item in a pair respondents prefer, we can ask them which item in a pair is more likely. The resulting scores won’t quite be the forecasts we’re looking for, but they’ll contain a lot of useful information about rank ordering and relative likelihoods.

My Super Bowl survey has only been running for a few days now, but I’ve already got more than 1,300 votes, and the results it’s produced so far look pretty credible to me. Here’s a screenshot of the top 10 from my wiki survey on the next NFL champs, as of 8 AM on August 11, 2013. As you can see, the top 10 is nearly identical to the top 10 at Betfair—they’ve got the Bengals where my survey has the NY Giants—and even within the top 10, the teams sort pretty similarly.

[Screenshot: All Our Ideas wiki survey results for the 2014 Super Bowl, August 11, 2013]

The 0-100 scores in that chart aren’t estimates of the probability that a team will win the Super Bowl. Because only one team can win, those estimates would have to sum to 100 across all 32 teams in the league. Instead, the scores shown here are the model-estimated chances that each team will win if pitted against another team chosen at random. As such, they’re better thought of as estimates of relative strength with an as-yet unspecified relationship to the probability of winning the championship.

This scalar measure of relative strength will often be interesting on its own, but for forecasting applications, we’d usually prefer to have these values expressed as probabilities. Following PredictWise, we can get there by summing all the scores and dividing each team’s score by that sum, which yields shares that sum to 1 and so behave like probabilities. For example, when that screenshot of my wiki survey was taken, the scores across all 32 NFL teams summed to 1,607, so the estimated probability of the Atlanta Falcons winning Super Bowl XLVIII was 5.4% (87/1,607), while the chances that my younger son’s Ravens will repeat were pegged at about 4.9% (79/1,607).
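To make that arithmetic concrete, here is a minimal Python sketch of the normalization. The function name is my own invention, and only the Falcons’ score (87), the Ravens’ score (79), and the league-wide total (1,607) come from the screenshot; the other 30 teams are lumped into a single placeholder entry.

```python
def scores_to_probabilities(scores):
    """Divide each score by the sum of all scores so the results sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must sum to a positive number")
    return {name: score / total for name, score in scores.items()}


# Only these two scores and the total of 1,607 are taken from the survey;
# the third entry is a placeholder standing in for the other 30 teams.
scores = {
    "Falcons": 87,
    "Ravens": 79,
    "other 30 teams (placeholder)": 1607 - 87 - 79,
}

probs = scores_to_probabilities(scores)
print(f"Falcons: {probs['Falcons']:.1%}")
print(f"Ravens: {probs['Ravens']:.1%}")
```

Because every score is divided by the same total, the rank ordering from the survey is preserved; only the scale changes.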

For problems with a unique outcome, this conversion is easy to do, because the contours of the event space are known in advance. As the immortals in Highlander would have it, “There can be only one.”

Things get tougher if we want to apply this technique to problems where there isn’t a fixed number of events—say, trying to anticipate where coups or insurgencies or mass atrocities are likely to happen in the coming year.  One way to extend the method to these kinds of problems would be to use historical data to identify the base rate of relevant events and then use that base rate as a multiplier in the transformation math as follows:

predicted probability = base rate * [ score / (sum of scores) ]

When a rare-events model is well calibrated, the sum of the predicted probabilities it produces should roughly equal the number of events that actually occur. The approach I’m proposing just works that logic in reverse, starting with the (reasonable) assumption that the base rate is a good forecast of the number of events that will occur and then inflating or deflating the estimated probabilities accordingly.

For example, say I’m interested in forecasting onsets of state-sponsored mass killing around the world, and I know the annual base rate of these events over the past few decades has only been about 1.2. I could use a pairwise wiki survey to ask respondents “Which country is more likely to experience an onset of state-sponsored mass killing in 2014?” and get scores like the ones in the All Our Ideas chart above. To convert the score for Country X to a predicted probability, I could sum the resulting scores for all countries, divide Country X’s score by that sum, and then multiply the result by 1.2.
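Here is a rough Python sketch of that base-rate transformation. The country names and scores are made up for illustration (a real survey would cover far more countries), and the function name is mine. The cap at 1.0 is also my own guard rather than part of the formula above: with a base rate greater than 1, a country holding a very large share of the total score could otherwise end up with a “probability” above 1.

```python
def scores_to_event_probabilities(scores, base_rate):
    """Scale each item's share of the total score by the expected event count."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must sum to a positive number")
    # base rate * [ score / (sum of scores) ], capped at 1.0 (my assumption)
    return {
        name: min(1.0, base_rate * score / total)
        for name, score in scores.items()
    }


# Hypothetical wiki-survey scores; these are not real survey results.
scores = {"Country A": 80, "Country B": 55, "Country C": 40, "Country D": 25}

probs = scores_to_event_probabilities(scores, base_rate=1.2)
expected_events = sum(probs.values())

for name, p in probs.items():
    print(f"{name}: {p:.2f}")
print(f"Sum of predicted probabilities: {expected_events:.2f}")
```

As long as the cap never binds, the predicted probabilities sum to the base rate, which is the calibration property the transformation is meant to build in.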

This process might seem a bit ad hoc, but I think it’s one reasonable solution to a tough problem. In fact, this is basically the same thing that happens in a logistic regression model, which statisticians (and wannabes like me) often use to forecast discrete events. In the equation we get from a logistic regression model, the intercept captures information about the base rate and uses that as a starting point for all of the responses, which are initially expressed as log odds. The vector of predictors and the associated weights just slides the log odds up or down from that benchmark, and a final operation converts those log odds to a familiar 0-1 probability.
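For readers who want to see those mechanics, here is a small Python sketch of the logistic transformation. Everything numeric in it is hypothetical: the base probability, the weights, and the predictor values are placeholders I chose for illustration, not estimates from any fitted model. Setting the intercept to the log odds of the base probability is also a simplification that holds exactly only when all of the predictors are zero.

```python
import math


def logit(p):
    """Convert a probability to log odds."""
    return math.log(p / (1 - p))


def inv_logit(x):
    """Convert log odds back to a 0-1 probability."""
    return 1 / (1 + math.exp(-x))


def predict(intercept, weights, features):
    """Slide the log odds up or down from the intercept, then convert."""
    log_odds = intercept + sum(w * x for w, x in zip(weights, features))
    return inv_logit(log_odds)


base_prob = 0.01               # hypothetical base rate for a single case
intercept = logit(base_prob)   # the starting point, in log odds
weights = [0.8, -0.5]          # hypothetical weights on two predictors

baseline = predict(intercept, weights, [0.0, 0.0])
higher_risk = predict(intercept, weights, [2.0, 0.0])

print(f"Baseline case: {baseline:.4f}")
print(f"Higher-risk case: {higher_risk:.4f}")
```

With both predictors at zero, the prediction is just the base probability; a positive value on a positively weighted predictor pushes the prediction above that benchmark.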

On the football problem, I would expect Betfair to be more accurate than my wiki survey, because Betfair’s odds are based on the behavior of people who are putting money on the line. For rare events in international politics, though, there is no Betfair equivalent. In situations where statistical modeling is inefficient or impossible—or we just want to know what some pool of respondents believe the risks are—I think this wiki-survey approach could be a useful tool.

A Quick Post Mortem on Oscars Forecasting

I was intrigued to see that statistical forecasts of the Academy Awards from PredictWise and FiveThirtyEight performed pretty well this year. Neither nailed it, but they both used sound processes to generate probabilistic estimates that turned out to be fairly accurate.

In the six categories both sites covered, PredictWise assigned very high probabilities to the eventual winner in four: Picture, Actor, Actress, and Supporting Actress. PredictWise didn’t miss by much in one more—Supporting Actor, where winner Christoph Waltz ran a close second to Tommy Lee Jones (40% to 44%). Its biggest miss came in the Best Director category, where PredictWise’s final forecast favored Steven Spielberg (76%) over winner Ang Lee (22%).

At FiveThirtyEight, Nate Silver and co. also gave the best odds to the same four of six eventual winners, but they were a little less confident than PredictWise about a couple of them. FiveThirtyEight also had a bigger miss in the Best Supporting Actor category, putting winner Christoph Waltz neck and neck with Philip Seymour Hoffman and both of them a ways behind Jones. FiveThirtyEight landed closer to the mark than PredictWise in the Best Director category, however, putting Lee just a hair’s breadth behind Spielberg (0.56 to 0.58 on its index).

If this were a showdown, I’d give the edge to PredictWise for three reasons. First, my eyeballing of the results tells me that PredictWise’s forecasts were slightly better calibrated. Both put four of the six winners in front and didn’t miss by much on one more, but PredictWise was more confident in the four they both got “right.” Second, PredictWise expressed its forecasts as probabilities, while FiveThirtyEight used some kind of unitless index that I found harder to understand. Last but not least, PredictWise also gets bonus points for forecasting all 24 of the categories presented on Sunday night, and against that larger list it went an impressive 19 for 24.

It’s also worth noting the two forecasters used different methods. Silver and co. based their index on lists of awards that were given out before the Oscars, treating those results like the pre-election polls they used to accurately forecast the last couple of U.S. general elections. Meanwhile, PredictWise used an algorithm to combine forecasts from a few different prediction markets, which themselves combine the judgments of thousands of traders. PredictWise’s use of prediction markets gave it the added advantage of making its forecasts dynamic; as the prediction markets moved in the weeks before the awards ceremony, its forecasts updated in real time. We don’t have enough data to say yet, but it may also be that prediction markets are better predictors than the other award results, and that’s why PredictWise did a smidgen better.

If I’m looking to handicap the Oscars next year and both of these guys are still in the game, I would probably convert Silver’s index to a probability scale and then average the forecasts from the two of them. That approach wouldn’t have improved on the four-of-six record they each managed this year, but the results would have been better calibrated than either one alone, and that bodes well for future iterations. Again and again, we’re seeing that model averaging just works, so whenever the opportunity presents itself, do it.
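Here is a minimal Python sketch of what that averaging might look like for a single category. The nominee names and all of the numbers are hypothetical, and rescaling the index so that it sums to 1 within a category is just one simple way I could imagine converting it to a probability scale; I don’t know how FiveThirtyEight’s index is actually constructed.

```python
def normalize(values):
    """Rescale a set of nonnegative scores so they sum to 1."""
    total = sum(values.values())
    if total <= 0:
        raise ValueError("values must sum to a positive number")
    return {name: v / total for name, v in values.items()}


def average_forecasts(*forecasts):
    """Take the unweighted mean of several forecasts over the same nominees."""
    names = set(forecasts[0])
    if any(set(f) != names for f in forecasts):
        raise ValueError("all forecasts must cover the same nominees")
    return {
        name: sum(f[name] for f in forecasts) / len(forecasts)
        for name in forecasts[0]
    }


# Hypothetical inputs for one category; these are not real forecasts.
market_probs = {"Nominee A": 0.70, "Nominee B": 0.20, "Nominee C": 0.10}
index_scores = {"Nominee A": 0.60, "Nominee B": 0.45, "Nominee C": 0.15}

index_probs = normalize(index_scores)
combined = average_forecasts(market_probs, index_probs)

for name, p in combined.items():
    print(f"{name}: {p:.3f}")
```

An unweighted mean is the simplest choice; with a track record for each forecaster, the weights could be tilted toward whichever one has been more accurate.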

UPDATE: Later on Monday, Harry Enten did a broader version of this scan for the Guardian’s Film Blog and reached a similar conclusion:

A more important point to take away is that there was at least one statistical predictor got it right in all six major categories. That suggests that a key fact about political forecasting holds for the Oscars: averaging of the averages works. You get a better idea looking at multiple models, even if they themselves include multiple factors, than just looking at one.
