The Wisdom of Crowds, Oscars Edition

Forecasts derived from prediction markets did an excellent job predicting last night’s Academy Awards.

PredictWise uses odds from online bookmaker Betfair for its Oscars forecasts, and it nearly ran the table. PredictWise assigned the highest probability to the eventual winner in 21 of 24 awards, and its three “misses” came in less prominent categories (Best Documentary, Best Short, Best Animated Short). Even more telling, its calibration was excellent. The probability assigned to the eventual winner in each category averaged 87 percent, and most winners were correctly identified as nearly sure things.
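The calibration check described above amounts to averaging the probability a forecaster assigned to each eventual winner. A minimal sketch in Python, using hypothetical probabilities rather than PredictWise’s actual category-by-category numbers:

```python
# Sketch: score a forecaster's calibration on a set of awards by
# averaging the probability it assigned to each eventual winner.
# The numbers below are hypothetical placeholders.

winner_probs = [0.95, 0.90, 0.85, 0.78]  # prob. assigned to each winner

def average_winner_probability(probs):
    """Mean probability assigned to the outcomes that actually occurred."""
    return sum(probs) / len(probs)

print(round(average_winner_probability(winner_probs), 2))  # 0.87
```

A fuller evaluation would bin forecasts by stated probability and compare each bin’s average to its observed hit rate, but the simple average is a quick first pass.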

Inkling Markets also did quite well. This public, play-money prediction market has a lot less liquidity than Betfair, but it still assigned the highest probability of winning to the eventual winner in 17 of the 18 categories it covered—it “missed” on Best Original Song—and for extra credit it correctly identified Gravity as the film that would probably win the most Oscars. Just by eyeballing, it’s clear that Inkling’s calibration wasn’t as good as PredictWise’s, but that’s what we’d expect from a market with a much smaller pool of participants. In any case, you still probably would have won your Oscars pool if you’d relied on it.

This is the umpteen-gajillionth reminder that crowds are powerful forecasters. “When our imperfect judgments are aggregated in the right way,” James Surowiecki wrote (p. xiv), “our collective intelligence is often excellent.” Or, as PredictWise’s David Rothschild said in his live blog last night,

This is another case of pundits and insiders advertising a close event when the proper aggregation of data said it was not. As I noted on Twitter earlier, my acceptance speech is short. I would like to thank prediction markets for efficiently aggregating dispersed and idiosyncratic data.

Will Unarmed Civilians Soon Get Massacred in Ukraine?

According to one pool of forecasters, most probably not.

As part of a public atrocities early-warning system I am currently helping to build for the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide (see here), we are running a kind of always-on forecasting survey called an opinion pool. An opinion pool is similar in spirit to a prediction market, but instead of having participants trade shares tied to the occurrence of some future event, we simply ask participants to estimate the probability of each event’s occurrence. In contrast to a traditional survey, every question remains open until the event occurs or the forecasting window closes. This way, participants can update their forecasts as often as they like, as they see or hear relevant information or just change their minds.

With generous support from Inkling, we started up our opinion pool in October, aiming to test and refine it before our larger early-warning system makes its public debut this spring (we hope). So far, we have only recruited opportunistically among colleagues and professional acquaintances, but we already have more than 70 registered participants. In the first four months of operation, we have used the system to ask more than two dozen questions, two of which have since closed because the relevant events occurred (mass killing in CAR and the Geneva II talks on Syria).

Over the next few years, we aim to recruit a large and diverse pool of volunteer forecasters from around the world with some claim to topical expertise or relevant local knowledge. The larger and more diverse our pool, the more accurate we expect our forecasts to be, and the wider the array of questions we can ask. (If you are interested in participating, please drop me a line at ulfelder <at> gmail <dot> com.)

A few days ago, prompted by a couple of our more active members, I posted a question to our pool asking, “Before 1 March 2014, will any massacres occur in Ukraine?” As of this morning, our pool had made a total of 13 forecasts, and the unweighted average of the latest of those estimates from each participating forecaster was just 15 percent. Under the criteria we specified (see Background Information below), this forecast does not address the risk of large-scale violence against or among armed civilians, nor does it exclude the possibility of a series of small but violent encounters that cumulatively produce a comparable or larger death toll. Still, for those of us concerned that security forces or militias will soon kill nonviolent protesters in Ukraine on a large scale, our initial forecast implies that those fears are probably unwarranted.
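The aggregation rule described above—keep only each forecaster’s most recent estimate, then take the unweighted average—can be sketched as follows. The forecaster labels and numbers are hypothetical, not the pool’s actual data:

```python
# Sketch of the opinion-pool aggregation rule: for each forecaster,
# keep only the most recent estimate, then average across forecasters.
# Names and numbers below are hypothetical.

forecasts = [
    # (forecaster, day, estimated probability)
    ("A", 1, 0.30),
    ("B", 1, 0.20),
    ("A", 3, 0.10),  # forecaster A revised downward on day 3
    ("C", 4, 0.15),
]

def pooled_probability(forecasts):
    """Unweighted average of each forecaster's latest estimate."""
    latest = {}
    for who, day, prob in sorted(forecasts, key=lambda f: f[1]):
        latest[who] = prob  # later entries overwrite earlier ones
    return sum(latest.values()) / len(latest)

print(round(pooled_probability(forecasts), 2))  # 0.15
```

Weighting schemes (by past accuracy, say, or recency) are a natural refinement, but the unweighted average is the baseline we report.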

Crowd-Estimated Probability of Any Massacres in Ukraine Before 1 March 2014

Obviously, we don’t have a crystal ball, and this is just an aggregation of subjective estimates from a small pool of people, none of whom (I think) is on the scene in Ukraine or has inside knowledge of the decision-making of relevant groups. Still, a growing body of evidence shows that aggregations of subjective forecasts like this one can often be usefully accurate (see here), even with a small number of contributing forecasters (see here). On this particular question, I very much hope our crowd is right. Whatever happens in Ukraine over the next few weeks, though, principle and evidence suggest that the method is sound, and we soon expect to be using this system to help assess risks of mass atrocities all over the world in real time.

Background Information

We define a “massacre” as an event that has the following features:

  • At least 10 noncombatant civilians are killed in one location (e.g., neighborhood, town, or village) in less than 48 hours. A noncombatant civilian is any person who is not a current member of a formal or irregular military organization and who does not apparently pose an immediate threat to the life, physical safety, or property of other people.
  • The victims appear to have been the primary target of the violence that killed them.
  • The victims do not appear to have been engaged in violent action or criminal activity when they were killed, unless that violent action was apparently in self-defense.
  • The relevant killings were carried out by individuals affiliated with a social group or organization engaged in a wider political conflict and appear to be connected to each other and to that wider conflict.

Those features will not always be self-evident or uncontroversial, so we use the following series of ad hoc rules to make more consistent judgments about ambiguous events.

  • Police, soldiers, prison guards, and other agents of state security are never considered noncombatant civilians, even if they are killed while off duty or out of uniform.
  • State officials and bureaucrats are not considered civilians when they are apparently targeted because of their professional status (e.g., assassinated).
  • Civilian deaths that occur in the context of operations by uniformed military-service members against enemy combatants are considered collateral damage, not atrocities, and should be excluded unless there is strong evidence that the civilians were targeted deliberately. We will err on the side of assuming that they were not.
  • Deaths from state repression of civilians engaged in nonviolent forms of protest are considered atrocities. Deaths resulting from state repression targeting civilians who were clearly engaged in rioting, looting, attacks on property, or other forms of collective aggression or violence are not.
  • Non-state militant or paramilitary groups, such as militias, gangs, vigilante groups, or raiding parties, are considered combatants, not civilians.

We will use contextual knowledge to determine whether or not a discrete event is linked to a wider conflict or campaign of violence, and we will err on the side of assuming that it is.

Determinations of whether or not a massacre has occurred will be made by the administrator of this system using publicly available secondary sources. Relevant evidence will be summarized in a blog post published when the determination is announced, and any dissenting views will be discussed as well.

Disclosure

I have argued on this blog that scholars have an obligation to disclose potential conflicts of interest when discussing their research, so let me do that again here: For the past two years, I have been paid as a contractor by the U.S. Holocaust Memorial Museum for my work on the atrocities early-warning system discussed in this post. Since the spring of 2013, I have also been paid to write questions for the Good Judgment Project, in which I participated as a forecaster the year before. To the best of my knowledge, I have no financial interests in, and have never received any payments from, any companies that commercially operate prediction markets or opinion pools.

A Quick Post Mortem on Oscars Forecasting

I was intrigued to see that statistical forecasts of the Academy Awards from PredictWise and FiveThirtyEight performed pretty well this year. Neither nailed it, but they both used sound processes to generate probabilistic estimates that turned out to be fairly accurate.

In the six categories both sites covered, PredictWise assigned very high probabilities to the eventual winner in four: Picture, Actor, Actress, and Supporting Actress. PredictWise didn’t miss by much in one more—Supporting Actor, where winner Christoph Waltz ran a close second to Tommy Lee Jones (40% to 44%). Its biggest miss came in the Best Director category, where PredictWise’s final forecast favored Steven Spielberg (76%) over winner Ang Lee (22%).

At FiveThirtyEight, Nate Silver and co. also gave the best odds to the same four of six eventual winners, but they were a little less confident than PredictWise about a couple of them. FiveThirtyEight also had a bigger miss in the Best Supporting Actor category, putting winner Christoph Waltz neck and neck with Philip Seymour Hoffman and both of them a ways behind Jones. FiveThirtyEight landed closer to the mark than PredictWise in the Best Director category, however, putting Lee just a hair’s breadth behind Spielberg (0.56 to 0.58 on its index).

If this were a showdown, I’d give the edge to PredictWise for three reasons. First, my eyeballing of the results tells me that PredictWise’s forecasts were slightly better calibrated. Both put four of the six winners in front and didn’t miss by much on one more, but PredictWise was more confident in the four they both got “right.” Second, PredictWise expressed its forecasts as probabilities, while FiveThirtyEight used some kind of unitless index that I found harder to interpret. Last but not least, PredictWise gets bonus points for forecasting all 24 of the categories presented on Sunday night, and against that larger list it went an impressive 19 for 24.

It’s also worth noting the two forecasters used different methods. Silver and co. based their index on lists of awards that were given out before the Oscars, treating those results like the pre-election polls they used to accurately forecast the last couple of U.S. general elections. Meanwhile, PredictWise used an algorithm to combine forecasts from a few different prediction markets, which themselves combine the judgments of thousands of traders. PredictWise’s use of prediction markets gave it the added advantage of making its forecasts dynamic; as the prediction markets moved in the weeks before the awards ceremony, its forecasts updated in real time. We don’t have enough data to say yet, but it may also be that prediction markets are better predictors than the other award results, and that’s why PredictWise did a smidgen better.

If I’m looking to handicap the Oscars next year and both of these guys are still in the game, I would probably convert Silver’s index to a probability scale and then average the forecasts from the two of them. That approach wouldn’t have improved on the four-of-six record they each managed this year, but the results would have been better calibrated than either one alone, and that bodes well for future iterations. Again and again, we’re seeing that model averaging just works, so whenever the opportunity presents itself, do it.
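The combination step suggested here could look something like the sketch below. How best to map FiveThirtyEight’s index onto probabilities is an open question; proportional rescaling is just one assumption. The Lee and Spielberg figures echo the numbers quoted above, while the “Others” scores are hypothetical placeholders so each forecast covers the whole field:

```python
# Sketch: convert an index over nominees to probabilities by
# proportional rescaling (an assumption, not FiveThirtyEight's
# actual method), then average with a probability-based forecast.

def index_to_probs(index_scores):
    """Rescale nonnegative index scores so they sum to 1."""
    total = sum(index_scores.values())
    return {name: score / total for name, score in index_scores.items()}

def average_forecasts(p1, p2):
    """Unweighted average of two probability forecasts over the same names."""
    return {name: (p1[name] + p2[name]) / 2 for name in p1}

# Lee/Spielberg numbers are from the post; "Others" is a hypothetical
# placeholder covering the rest of the field.
index = {"Lee": 0.56, "Spielberg": 0.58, "Others": 0.26}
market = {"Lee": 0.22, "Spielberg": 0.76, "Others": 0.02}

blended = average_forecasts(index_to_probs(index), market)
```

On these illustrative numbers the blend still favors Spielberg, since the two inputs disagree mostly about how confident to be, not about the ordering.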

UPDATE: Later on Monday, Harry Enten did a broader version of this scan for the Guardian’s Film Blog and reached a similar conclusion:

A more important point to take away is that there was at least one statistical predictor that got it right in all six major categories. That suggests that a key fact about political forecasting holds for the Oscars: averaging of the averages works. You get a better idea looking at multiple models, even if they themselves include multiple factors, than just looking at one.

It’s Not Just The Math

This week, statistics-driven political forecasting won a big slab of public vindication after the U.S. election predictions of an array of number-crunching analysts turned out to be remarkably accurate. As John Sides said over at the Monkey Cage, “2012 was the Moneyball election.” The accuracy of these forecasts, some of them made many months before Election Day,

…shows us that we can use systematic data—economic data, polling data—to separate momentum from no-mentum, to dispense with the gaseous emanations of pundits’ “guts,” and ultimately to forecast the winner.  The means and methods of political science, social science, and statistics, including polls, are not perfect, and Nate Silver is not our “algorithmic overlord” (a point I don’t think he would disagree with). But 2012 has showed how useful and necessary these tools are for understanding how politics and elections work.

Now I’ve got a short piece up at Foreign Policy explaining why I think statistical forecasts of world politics aren’t at the same level and probably won’t be very soon. I hope you’ll read the whole thing over there, but the short version is: it’s the data. If U.S. electoral politics is a data hothouse, most of international politics is a data desert. Statistical models make very powerful forecasting tools, but they can’t run on thin air, and the density and quality of the data available for political forecasting drops off precipitously as you move away from U.S. elections.

Seriously: you don’t have to travel far in the data landscape to start running into trouble. In a piece posted yesterday, Stephen Tall asks rhetorically why there isn’t a British Nate Silver and then explains that it’s because “we [in the U.K.] don’t have the necessary quality of polls.” And that’s the U.K., for crying out loud. Now imagine how things look in, say, Ghana or Sierra Leone, both of which are holding their own national elections this month.

Of course, difficult does not mean impossible. I’m a bit worried, actually, that some readers of that Foreign Policy piece will hear me saying that most political forecasting is still stuck in the Dark Ages, when that’s really not what I meant. I think we actually do pretty well with statistical forecasting on many interesting problems in spite of the dearth of data, as evidenced by the predictive efforts of colleagues like Mike Ward and Phil Schrodt and some of the work I’ve posted here on things like coups and popular uprisings.

I’m also optimistic that the global spread of digital connectivity and associated developments in information-processing hardware and software are going to help fill some of those data gaps in ways that will substantially improve our ability to forecast many political events. I haven’t seen any big successes along those lines yet, but the changes in the enabling technologies are pretty radical, so it’s plausible that the gains in data quality and forecasting power will happen in big leaps, too.

Meanwhile, while we wait for those leaps to happen, there are some alternatives to statistical models that can help fill some of the gaps. Based partly on my own experiences and partly on my read of relevant evidence (see here, here, and here for a few tidbits), I’m now convinced that prediction markets and other carefully designed systems for aggregating judgments can produce solid forecasts. These tools are most useful in situations where the outcome isn’t highly predictable but relevant information is available to those who dig for it. They’re somewhat less useful for forecasting the outcomes of decision processes that are idiosyncratic and opaque, like those of the North Korean government or even the U.S. Supreme Court. There’s no reason to let the perfect be the enemy of the good, but we should use these tools with full awareness of their limitations as well as their strengths.

More generally, though, I remain convinced that, when trying to forecast political events around the world, there’s a complexity problem we will never overcome no matter how many terabytes of data we produce and consume, how fast our processors run, and how sophisticated our methods become. Many of the events that observers of international politics care about are what Nassim Nicholas Taleb calls “gray swans”—“rare and consequential, but somewhat predictable, particularly to those who are prepared for them and have the tools to understand them.”

These events are hard to foresee because they bubble up from a complex adaptive system that’s constantly evolving underfoot. The patterns we think we discern in one time and place can’t always be generalized to others, and the farther into the future we try to peer, the thinner those strands get stretched. Events like these “are somewhat tractable scientifically,” as Taleb puts it, but we should never expect to predict their arrival the way we can foresee the outcomes of more orderly processes like U.S. elections.
