Galton’s Experiment Revisited

This is another cross-post from the blog of the Good Judgment Project.

One of my cousins, Steve Ulfelder, writes good mystery novels. He left a salaried writer’s job 13 years ago to freelance and make time to pen those books. In March, he posted this announcement on Facebook:

CONTEST! When I began freelancing, I decided to track the movies I saw to remind myself that this was a nice bennie you can’t have when you’re an employee (I like to see early-afternoon matinees in near-empty theaters). I don’t review them or anything; I simply keep a Word file with dates and titles.

Here’s the contest: How many movies have I watched in the theater since January 1, 2001? Type your answer as a comment. Entries close at 8pm tonight, east coast time. Closest guess gets a WOLVERINE BROS. T-shirt and a signed copy of the Conway Sax novel of your choice. The eminently trustworthy Martha Ruch Ulfelder is holding a slip of paper with the correct answer.

I read that post and thought: Now, that’s my bag. I haven’t seen Steve in a while and didn’t have a clear idea of how many movies he’s seen in the past 13 years, but I do know about Francis Galton and that ox at the county fair. Instead of just hazarding a guess of my own, I would give myself a serious shot at outwitting Steve’s Facebook crowd by averaging their guesses.

After a handful of Steve’s friends had submitted answers, I posted the average of them as a comment of my own, then updated it periodically as more guesses came in. I had to leave the house not long before the contest was set to close, so I couldn’t include the last few entrants in my final answer. Still, I had about 40 guesses in my tally at the time and was feeling pretty good about my chances of winning that t-shirt and book.

In the end, 45 entries got posted before Steve’s 8 PM deadline, and my unweighted average wasn’t even close. The histogram below shows the distribution of the crowd’s guesses and the actual answer. Most people guessed fewer than 300 movies, but a couple of extreme values on the high side pulled the average up to 346.  Meanwhile, the correct answer was 607, nearly one standard deviation (286) above that mean. I hadn’t necessarily expected to win, but I was surprised to see that 12 of the 45 guesses—including the winner at 600—landed closer to the truth than the average did.

[Figure: histogram of the crowd’s 45 guesses and the actual answer of 607]

I read the results of my impromptu experiment as a reminder that crowds are often smart, but they aren’t magically clairvoyant. Retellings of Galton’s experiment sometimes make it seem like even pools of poorly informed guessers will automatically produce an accurate estimate, but, apparently, that’s not true.

As I thought about how I might have done better, I got to wondering if there was something about Galton’s crowd that made it particularly powerful for his task. Maybe we should expect a bunch of county fair–goers in early twentieth-century England to be good at guessing the weight of farm animals. Still, the replication of Galton’s experiment under various conditions suggests that domain knowledge helps, but it isn’t essential. So maybe this was just an unusually hard problem. Steve has seen an average of nearly one movie in theaters each week for the past 13 years. In my experience, that’s pretty extreme, so even with the hint he dropped in his post about being a frequent moviegoer, it’s easy to see why the crowd would err on the low side. Or maybe this result was just a fluke, and if we could rerun the process with different or larger pools, the average would usually do much better.

Whatever the reason for this particular failure, though, the results of my experiment also got me thinking again about ways we might improve on the unweighted average as a method of gleaning intelligence from crowds. Unweighted averages are a reasonable strategy when we don’t have reliable information about variation in the quality of the individual guesses (see here), but that’s not always the case. For example, if Steve’s wife or kids had posted answers in this contest, it probably would have been wise to give their guesses more weight on the assumption that they knew better than acquaintances or distant relatives like me.
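For anyone who wants to see the mechanics, here is a minimal sketch in Python of the two aggregation rules in question: a plain unweighted average, and a weighted average that leans harder on guessers presumed to know more. The guesses and weights below are invented for illustration; they are not the contest’s actual entries.

```python
def unweighted_average(guesses):
    return sum(guesses) / len(guesses)

def weighted_average(guesses, weights):
    # Weight each guess, then normalize by the total weight.
    return sum(g * w for g, w in zip(guesses, weights)) / sum(weights)

guesses = [250, 300, 150, 600, 400]   # hypothetical movie counts, not the real entries
weights = [1.0, 1.0, 1.0, 3.0, 1.0]   # e.g., triple-weight a family member's guess

print(unweighted_average(guesses))                    # 340.0
print(round(weighted_average(guesses, weights), 1))   # 414.3
```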

Figuring out smarter ways to aggregate forecasts is also an area of active experimentation for the Good Judgment Project (GJP), and the results so far are encouraging. The project’s core strategy involves discovering who the most accurate forecasters are and leaning more heavily on them. I couldn’t do this in Steve’s single-shot contest, but GJP gets to see forecasters’ track records on large numbers of questions and has been using them to great effect. In the recently ended Season 3, GJP’s “super forecasters” were grouped into teams and encouraged to collaborate, and this approach has worked quite well. In a paper published this spring, GJP has also shown that it can do well with nonlinear aggregations derived from a simple statistical model that adjusts for systematic bias in forecasters’ judgments. Team GJP’s bias-correction model beats not only the unweighted average but also a number of widely used and more complex nonlinear algorithms.
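To give a flavor of what a nonlinear aggregation can look like, here is a sketch of one widely discussed approach in this family, usually called extremizing: average the forecasts on the log-odds scale, then push the result away from 0.5 to offset the over-caution that pooling tends to produce. To be clear, this is a generic illustration, not the model from the GJP paper, and the exponent is arbitrary.

```python
import math

def extremized_mean(probs, a=2.5):
    """Illustrative nonlinear aggregator: average forecasts on the
    log-odds scale, then multiply by an exponent a > 1 to push the
    pooled probability away from 0.5. This is a generic 'extremizing'
    rule, not the model from the GJP paper, and a = 2.5 is arbitrary.
    """
    logits = [math.log(p / (1 - p)) for p in probs]
    mean_logit = sum(logits) / len(logits)
    return 1 / (1 + math.exp(-a * mean_logit))

# Three hypothetical forecasts that all lean the same way:
print(round(extremized_mean([0.6, 0.7, 0.65]), 2))  # about 0.83, vs. a simple mean of 0.65
```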

Those are just a couple of the possibilities that are already being explored, and I’m sure people will keep coming up with new and occasionally better ones. After all, there’s a lot of money to be made and bragging rights to be claimed in those margins. In the meantime, we can use Steve’s movie-counting contest to remind ourselves that crowds aren’t automatically as clairvoyant as we might hope, so we should keep thinking about ways to do better.

Asking the Right Questions

This is a cross-post from the Good Judgment Project’s blog.

I came to the Good Judgment Project (GJP) two years ago, in Season 2, as a forecaster, excited about contributing to an important research project and curious to learn more about my skill at prediction. I did pretty well at the latter, and GJP did very well at the former. I’m also a political scientist who happened to have more time on my hands than many of my colleagues, because I work as an independent consultant and didn’t have a full plate at that point. So, in Season 3, the project hired me to work as one of its lead question writers.

Going into that role, I had anticipated that one of the main challenges would be negotiating what Phil Tetlock calls the “rigor-relevance trade-off”—finding questions that are relevant to the project’s U.S. government sponsors and can be answered as unambiguously as possible. That forecast was correct, but even armed with that information, I failed to anticipate just how hard it often is to strike this balance.

The rigor-relevance trade-off exists because most of the big questions about global politics concern latent variables. Sometimes we care about specific political events because of their direct consequences, but more often we care about those events because of what they reveal to us about deeper forces shaping the world. For example, we can’t just ask if China will become more cooperative or more belligerent, because cooperation and belligerence are abstractions that we can’t directly observe. Instead, we have to find events or processes that (a) we can see and (b) that are diagnostic of that latent quality. For example, we can tell when China issues another statement reiterating its claim to the Senkaku Islands, but that happens a lot, so it doesn’t give us much new information about China’s posture. If China were to fire on Japanese aircraft or vessels in the vicinity of the islands—or, for that matter, to renounce its claim to them—now that would be interesting.

It’s tempting to forgo some rigor and ask directly about the latent stuff, but doing so is also problematic. For the forecast’s consumers, we need to be able to explain clearly what a forecast does and does not cover, so they can use the information appropriately. As forecasters, we need to understand what we’re being asked to anticipate so we can think clearly about the forces and pathways that might or might not produce the relevant outcome. And then there’s the matter of scoring the results. If we can’t agree on what eventually happened, we won’t agree on the accuracy of the predictions. Then the consumers don’t know how reliable those forecasts are, the producers don’t get the feedback they need, and everyone gets frustrated and demotivated.

It’s harder to formulate rigorous questions than many people realize until they try to do it, even on things that seem like they should be easy to spot. Take coups. It’s not surprising that the U.S. government might be keen on anticipating coups in various countries for various reasons. It is, however, surprisingly hard to define a “coup” in such a way that virtually everyone would agree on whether or not one had occurred.

In the past few years, Egypt has served up a couple of relevant examples. Was the departure of Hosni Mubarak in 2011 a coup? On that question, two prominent scholarly projects that use similar definitions to track coups and coup attempts couldn’t agree. Where one source saw an “overt attempt by the military or other elites within the state apparatus to unseat the sitting head of state using unconstitutional means,” the other saw the voluntary resignation of a chief executive due to a loss of his authority and a prompt return to civilian-led government. And what about the ouster of Mohammed Morsi in July 2013? On that, those academic sources could readily agree, but many Egyptians who applauded Morsi’s removal—and, notably, the U.S. government—could not.

We see something similar with Russian military intervention in Ukraine. Not long after Russia annexed Crimea, GJP posted a question asking whether or not Russian armed forces would invade the eastern Ukrainian cities of Kharkiv or Donetsk before 1 May 2014. The arrival of Russian forces in Ukrainian cities would obviously be relevant to U.S. policy audiences, and with Ukraine under such close international scrutiny, it seemed like that turn of events would be relatively easy to observe as well.

Unfortunately, that hasn’t been the case. As Mark Galeotti explained in a mid-April blog post,

When the so-called “little green men” deployed in Crimea, they were very obviously Russian forces, simply without their insignia. They wore Russian uniforms, followed Russian tactics and carried the latest, standard Russian weapons.

However, the situation in eastern Ukraine is much less clear. U.S. Secretary of State John Kerry has asserted that it was “clear that Russian special forces and agents have been the catalyst behind the chaos of the last 24 hours.” However, it is hard to find categorical evidence of this.

Even evidence that seemed incontrovertible when it emerged, like video of a self-proclaimed Russian lieutenant colonel in the Ukrainian city of Horlivka, has often been debunked.

This doesn’t mean we were wrong to ask about Russian intervention in eastern Ukraine. If anything, the intensity of the debate over whether or not that’s happened simply confirms how relevant this topic was. Instead, it implies that we chose the wrong markers for it. We correctly anticipated that further Russian intervention was possible if not probable, but we—like many others—failed to anticipate the unconventional forms that intervention would take.

Both of these examples show how hard it can be to formulate rigorous questions for forecasting tournaments, even on topics that are of keen interest to everyone involved and seem like naturals for the task. In an ideal world, we could focus exclusively on relevance and ask directly about all the deeper forces we want to understand and anticipate. As usual, though, that ideal world isn’t the one we inhabit. Instead, we struggle to find events and processes whose outcomes we can discern that will also reveal something telling about those deeper forces at play.

 

Forecasting Round-Up No. 6

The latest in a very occasional series.

1. The Boston Globe ran a story a few days ago about a company that’s developing algorithms to predict which patients in cardiac intensive care units are most likely to take a turn for the worse (here). The point of this exercise is to help doctors and nurses allocate their time and resources more efficiently and, ideally, to give them more lead time to try to stop those bad turns from happening.

The story suffers some rhetorical tics common to press reports on “predictive analytics.” For example, we never hear any specifics about the analytic techniques used or the predictive accuracy of the tool, and the descriptions of machine learning tilt toward the ingenuous (e.g., “The more data fed into the model, the more accurate the prediction becomes”). On the whole, though, I think this article does a nice job representing the promise and reality of this kind of work. The following passage especially resonated with me, because it describes a process for applying these predictions that sounds like the one I have in mind when building my own forecasting tools:

The unit’s medical director, Dr. Melvin Almodovar, uses [the prediction tool] to double-check his own clinical assessment of patients. Etiometry’s founders are careful to note that physicians will always be the ultimate bedside decision makers, using the Stability Index to confirm or inform their own diagnoses.

Butler said that an information-overload environment like the intensive care unit is ideal for a data-driven risk assessment tool, because the patients teeter between life and death. A predictive model can act as an early warning system, pointing out risky changes in multiple vital signs in a more sophisticated way than bedside alarms.

When our predictive models aren’t as accurate as we’d like or don’t yet have a clear track record, this hybrid approach—decisions are informed by the forecasts but not determined by them—is a prudent way to go. In the cardiac intensive care unit, doctors are already applying their own mental models to these data, so the idea of developing explicit algorithms to do the same isn’t a stretch (or shouldn’t be, but…). Unlike those doctors, though, statistical models won’t suffer from low blood sugar or distraction or become emotionally attached to some patients but not others. Also unlike the mental models doctors use now, statistical models will produce explicit forecasts that can be collected and assessed over time. The resulting feedback will give the stats guys many opportunities to improve their models, and the hospital staff a chance to get a feel for the models’ strengths and limitations. When you’re making such weighty decisions, why wouldn’t you want that additional information?

2. Lyle Ungar recently discussed forecasting with the Machine Intelligence Research Institute (here). The whole thing deserves a read, but I especially liked this framework for thinking about when different methods work best:

I think one can roughly characterize forecasting problems into categories—each requiring different forecasting methods—based, in part, on how much historical data is available.

Some problems, like the geo-political forecasting [the Good Judgment Project is] doing, require lots of collection of information and human thought. Prediction markets and team-based forecasts both work well for sifting through the conflicting information about international events. Computer models mostly don’t work as well here—there isn’t a long enough track record of, say, elections or coups in Mali to fit a good statistical model, and it isn’t obvious what other countries are ‘similar.’

Other problems, like predicting energy usage in a given city on a given day, are well suited to statistical models (including neural nets). We know the factors that matter (day of the week, holiday or not, weather, and overall trends), and we have thousands of days of historical observation. Human intuition is not going to beat computers on that problem.

Yet other classes of problems, like economic forecasting (What will the GDP of Germany be next year? What will unemployment in California be in two years?) are somewhere in the middle. One can build big econometric models, but there is still human judgement about the factors that go into them. (What if Merkel changes her mind or Greece suddenly adopts austerity measures?) We don’t have enough historical data to accurately predict economic decisions of politicians.

The bottom line is that if you have lots of data and the world isn’t changing too much, you can use statistical methods. For questions with more uncertainty, human experts become more important.

I might disagree on the particular problem of forecasting coups in Mali, but I think the basic framework that Lyle proposes is right.

3. Speaking of the Good Judgment Project (GJP), a bevy of its researchers, including Ungar, have an article in the March 2014 issue of Psychological Science (here) that shows how certain behavioral interventions can significantly boost the accuracy of forecasts derived from subjective judgments. Here’s the abstract:

Five university-based research groups competed to recruit forecasters, elicit their predictions, and aggregate those predictions to assign the most accurate probabilities to events in a 2-year geopolitical forecasting tournament. Our group tested and found support for three psychological drivers of accuracy: training, teaming, and tracking. Probability training corrected cognitive biases, encouraged forecasters to use reference classes, and provided forecasters with heuristics, such as averaging when multiple estimates were available. Teaming allowed forecasters to share information and discuss the rationales behind their beliefs. Tracking placed the highest performers (top 2% from Year 1) in elite teams that worked together. Results showed that probability training, team collaboration, and tracking improved both calibration and resolution. Forecasting is often viewed as a statistical problem, but forecasts can be improved with behavioral interventions. Training, teaming, and tracking are psychological interventions that dramatically increased the accuracy of forecasts. Statistical algorithms (reported elsewhere) improved the accuracy of the aggregation. Putting both statistics and psychology to work produced the best forecasts 2 years in a row.

The atrocities early-warning project on which I’m working is learning from GJP in real time, and we hope to implement some of these lessons in the opinion pool we’re running (see this conference paper for details).

Speaking of which: If you know something about conflict or atrocities risk or a particular part of the world and are interested in volunteering as a forecaster, please send an email to ewp@ushmm.org.

4. Finally, Daniel Little writes about the partial predictability of social upheaval on his terrific blog, Understanding Society (here). The whole post deserves reading, but here’s the nub (emphasis in the original):

Take unexpected moments of popular uprising—for example, the Arab Spring uprisings or the 2013 riots in Stockholm. Are these best understood as random events, the predictable result of long-running processes, or something else? My preferred answer is something else—in particular, conjunctural intersections of independent streams of causal processes (link). So riots in London or Stockholm are neither fully predictable nor chaotic and random.

This matches my sense of the problem and helps explain why predictive models of these events will never be as accurate as we might like but are still useful, as are properly elicited and combined forecasts from people using their noggins.

Will Unarmed Civilians Soon Get Massacred in Ukraine?

According to one pool of forecasters, most probably not.

As part of a public atrocities early-warning system I am currently helping to build for the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide (see here), we are running a kind of always-on forecasting survey called an opinion pool. An opinion pool is similar in spirit to a prediction market, but instead of having participants trade shares tied to the occurrence of some future event, we simply ask participants to estimate the probability of each event’s occurrence. In contrast to a traditional survey, every question remains open until the event occurs or the forecasting window closes. This way, participants can update their forecasts as often as they like, as they see or hear relevant information or just change their minds.

With generous support from Inkling, we started up our opinion pool in October, aiming to test and refine it before our larger early-warning system makes its public debut this spring (we hope). So far, we have only recruited opportunistically among colleagues and professional acquaintances, but we already have more than 70 registered participants. In the first four months of operation, we have used the system to ask more than two dozen questions, two of which have since closed because the relevant events occurred (mass killing in CAR and the Geneva II talks on Syria).

Over the next few years, we aim to recruit a large and diverse pool of volunteer forecasters from around the world with some claim to topical expertise or relevant local knowledge. The larger and more diverse our pool, the more accurate we expect our forecasts to be, and the wider the array of questions we can ask. (If you are interested in participating, please drop me a line at ulfelder <at> gmail <dot> com.)

A few days ago, prompted by a couple of our more active members, I posted a question to our pool asking, “Before 1 March 2014, will any massacres occur in Ukraine?” As of this morning, our pool had made a total of 13 forecasts, and the unweighted average of each participating forecaster’s latest estimate was just 15 percent. Under the criteria we specified (see Background Information below), this forecast does not address the risk of large-scale violence against or among armed civilians, nor does it exclude the possibility of a series of small but violent encounters that cumulatively produce a comparable or larger death toll. Still, for those of us concerned that security forces or militias will soon kill nonviolent protesters in Ukraine on a large scale, our initial forecast implies that those fears are probably unwarranted.
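For readers curious about the mechanics, the aggregation rule is deliberately simple: keep only each participant’s most recent estimate on the question and average those. Here is a minimal sketch with invented forecast records, not our pool’s actual data.

```python
# Hypothetical forecast log: (forecaster, date, estimated probability).
# These records are invented for illustration, not our pool's actual data.
forecasts = [
    ("ana",  "2014-02-10", 0.30),
    ("ben",  "2014-02-11", 0.20),
    ("ana",  "2014-02-13", 0.10),   # ana revises her earlier estimate
    ("carl", "2014-02-13", 0.15),
]

def pooled_probability(forecasts):
    """Unweighted average of each forecaster's most recent estimate."""
    latest = {}
    for name, date, prob in sorted(forecasts, key=lambda f: f[1]):
        latest[name] = prob          # later entries overwrite earlier ones
    return sum(latest.values()) / len(latest)

print(round(pooled_probability(forecasts), 2))  # 0.15
```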

[Chart: Crowd-Estimated Probability of Any Massacres in Ukraine Before 1 March 2014]

Obviously, we don’t have a crystal ball, and this is just an aggregation of subjective estimates from a small pool of people, none of whom (I think) is on the scene in Ukraine or has inside knowledge of the decision-making of relevant groups. Still, a growing body of evidence shows that aggregations of subjective forecasts like this one can often be usefully accurate (see here), even with a small number of contributing forecasters (see here). On this particular question, I very much hope our crowd is right. Whatever happens in Ukraine over the next few weeks, though, principle and evidence suggest that the method is sound, and we soon expect to be using this system to help assess risks of mass atrocities all over the world in real time.

Background Information

We define a “massacre” as an event that has the following features:

  • At least 10 noncombatant civilians are killed in one location (e.g., neighborhood, town, or village) in less than 48 hours. A noncombatant civilian is any person who is not a current member of a formal or irregular military organization and who does not apparently pose an immediate threat to the life, physical safety, or property of other people.
  • The victims appear to have been the primary target of the violence that killed them.
  • The victims do not appear to have been engaged in violent action or criminal activity when they were killed, unless that violent action was apparently in self-defense.
  • The relevant killings were carried out by individuals affiliated with a social group or organization engaged in a wider political conflict and appear to be connected to each other and to that wider conflict.

Those features will not always be self-evident or uncontroversial, so we use the following series of ad hoc rules to make more consistent judgments about ambiguous events.

  • Police, soldiers, prison guards, and other agents of state security are never considered noncombatant civilians, even if they are killed while off duty or out of uniform.
  • State officials and bureaucrats are not considered civilians when they are apparently targeted because of their professional status (e.g., assassinated).
  • Civilian deaths that occur in the context of operations by uniformed military-service members against enemy combatants are considered collateral damage, not atrocities, and should be excluded unless there is strong evidence that the civilians were targeted deliberately. We will err on the side of assuming that they were not.
  • Deaths from state repression of civilians engaged in nonviolent forms of protest are considered atrocities. Deaths resulting from state repression targeting civilians who were clearly engaged in rioting, looting, attacks on property, or other forms of collective aggression or violence are not.
  • Non-state militant or paramilitary groups, such as militias, gangs, vigilante groups, or raiding parties, are considered combatants, not civilians.

We will use contextual knowledge to determine whether or not a discrete event is linked to a wider conflict or campaign of violence, and we will err on the side of assuming that it is.

Determinations of whether or not a massacre has occurred will be made by the administrator of this system using publicly available secondary sources. Relevant evidence will be summarized in a blog post published when the determination is announced, and any dissenting views will be discussed as well.

Disclosure

I have argued on this blog that scholars have an obligation to disclose potential conflicts of interest when discussing their research, so let me do that again here: For the past two years, I have been paid as a contractor by the U.S. Holocaust Memorial Museum for my work on the atrocities early-warning system discussed in this post. Since the spring of 2013, I have also been paid to write questions for the Good Judgment Project, in which I participated as a forecaster the year before. To the best of my knowledge, I have no financial interests in, and have never received any payments from, any companies that commercially operate prediction markets or opinion pools.

Lost in the Fog of Civil War in Syria

On Twitter a couple of days ago, Adam Elkus called out a recent post on Time magazine’s World blog as evidence of the way that many peoples’ expectations about the course of Syria’s civil war have zigged and zagged over the past couple of years. “Last year press was convinced Assad was going to fall,” Adam tweeted. “Now it’s that he’s going to win. Neither perspective useful.” To which the eminent civil-war scholar Stathis Kalyvas replied simply, “Agreed.”

There’s a lesson here for anyone trying to glean hints about the course of a civil war from press accounts of a war’s twists and turns. In this case, it’s a lesson I’m learning through negative feedback.

Since early 2012, I’ve been a participant/subject in the Good Judgment Project (GJP), a U.S. government-funded experiment in “wisdom of crowds” forecasting. Over the past year, GJP participants have been asked to estimate the probability of several events related to the conflict in Syria, including the likelihood that Bashar al-Assad would leave office and the likelihood that opposition forces would seize control of the city of Aleppo.

I wouldn’t describe myself as an expert on civil wars, but during my decade of work for the Political Instability Task Force, I spent a lot of time looking at data on the onset, duration, and end of civil wars around the world. From that work, I have a pretty good sense of the typical dynamics of these conflicts. Most of the civil wars that have occurred in the past half-century have lasted for many years. A very small fraction of those wars flared up and then ended within a year. The ones that didn’t end quickly—in other words, the vast majority of these conflicts—almost always dragged on for several more years at least, sometimes even for decades. (I don’t have my own version handy, but see Figure 1 in this paper by Paul Collier and Anke Hoeffler for a graphical representation of this pattern.)

On the whole, I’ve done well in the Good Judgment Project. In the year-long season that ended last month, I ranked fifth among the 303 forecasters in my experimental group, all while the project was producing fairly accurate forecasts on many topics. One thing that’s helped me do well is my adherence to what you might call the forecaster’s version of the Golden Rule: “Don’t neglect the base rate.” And, as I just noted, I’m also quite familiar with the base rates of civil-war duration.

So what did I do when asked by GJP to think about what would happen in Syria? I chucked all that background knowledge out the window and chased the very narrative that Elkus and Kalyvas rightly decry as misleading.

Here’s a chart showing how I assessed the probability that Assad wouldn’t last as president beyond the end of March 2013, starting in June 2012. The actual question asked us to divide the probability of his exiting office across several time periods, but for simplicity’s sake I’ve focused here on the part indicating that he would stick around past April 1. This isn’t the same thing as the probability that the war would end, of course, but it’s closely related, and I considered the two events as tightly linked. As you can see, until early 2013, I was pretty confident that Assad’s fall was imminent. In fact, I was so confident that at a couple of points in 2012, I gave him zero chance of hanging on past March of this year—something a trained forecaster really never should do.

[Chart: my estimated probability over time that Assad would remain in office past 1 April 2013]

Now here’s another chart showing my estimates of the likelihood that rebels would seize control of Aleppo before May 1, 2013. The numbers are a little different, but the basic pattern is the same. I started out very confident that the rebels would win the war soon and only swung hard in the opposite direction in early 2013, as the boundaries of the conflict seemed to harden.

[Chart: my estimated probability over time that rebels would seize control of Aleppo before 1 May 2013]

It’s impossible to say what the true probabilities were in this or any other uncertain situation. Maybe Assad and Aleppo really were on the brink of falling for a while and then the unlikely-but-still-possible version happened anyway.

That said, there’s no question that forecasts more tightly tied to the base rate would have scored a lot better in this case. Here’s a chart showing what my estimates might have looked like had I followed that rule, using approximations of the hazard rate from the chart in the Collier and Hoeffler paper. If anything, these numbers overstate the likelihood that a civil war will end at a given point in time.

[Chart: base-rate-anchored versions of those forecasts, using approximate hazard rates from Collier and Hoeffler]
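To make the base-rate logic concrete, here is a minimal sketch of the kind of calculation behind that chart. It assumes a constant monthly hazard that an ongoing war ends; the 2-percent figure is a placeholder for illustration, not a number taken from Collier and Hoeffler.

```python
def prob_war_ends_within(months, monthly_hazard=0.02):
    """Probability that a war under way today ends within `months` months,
    assuming a constant monthly hazard of termination. The 2% hazard is a
    placeholder for illustration, not an estimate from Collier and Hoeffler.
    """
    return 1 - (1 - monthly_hazard) ** months

# A base-rate-anchored forecast of "the war ends (and Assad falls) within
# the next N months" stays low and shifts only gradually as the window shrinks.
for months in (9, 6, 3):
    print(months, round(prob_war_ends_within(months), 2))  # 9: 0.17, 6: 0.11, 3: 0.06
```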

I didn’t keep a log spelling out my reasoning at each step, but I’m pretty confident that my poor performance here is an example of motivated reasoning. I wanted Assad to fall and the pro-democracy protesters who dominated the early stages of the uprising to win, and that desire shaped what I read and then remembered when it came time to forecast. I suspect that many of the pieces I was reading were slanted by similar hopes, creating a sort of analytic cascade akin to the herd behavior thought to drive many financial-market booms and busts. I don’t have the data to prove it, but I’m pretty sure the ups and downs in my forecasts track the evolving narrative in the many newspaper and magazine stories I was reading about the Syrian conflict.

Of course, that kind of herding happens on a lot of topics, and I was usually good at avoiding it. For example, when tensions ratcheted up on the Korean Peninsula earlier this year, I hewed to the base rate and didn’t substantially change my assessment of the risk that real clashes would follow.

What got me in the case of Syria was, I think, a sense of guilt. The Assad government has responded to a legitimate popular challenge with mass atrocities that we routinely read about and sometimes even see. In parts of the country, the resulting conflict is producing scenes of absurd brutality. This isn’t a “problem from hell,” as the title of Samantha Power’s book would have it; it is a glimpse of hell. And yet, in the face of that horror, I have publicly advocated against American military intervention. Upon reflection, I wonder if my wildly optimistic forecasting about the imminence of Assad’s fall wasn’t my unconscious attempt to escape the discomfort of feeling complicit in the prolongation of that suffering.

As a forecaster, if I were doing these questions over, I would try to discipline myself to attend to the base rate, but I wouldn’t necessarily stop there. As I’ve pointed out in a previous post, the base rate is a valuable anchoring device, but attending to it doesn’t mean automatically ignoring everything else. My preferred approach, when I remember to have one, is to take that base rate as a starting point and then use Bayes’ theorem to update my forecasts in a more disciplined way. Still, I’ll bring a newly skeptical eye to the flurry of stories predicting that Assad’s forces will soon defeat Syria’s rebels and keep their patron in power. Now that we’re a couple of years into the conflict, quantified history tells us that the most likely outcome in any modest slice of time (say, months rather than years) is, tragically, more of the same.

And, as a human, I’ll keep hoping the world will surprise us and take a different turn.

A Chimp’s-Eye View of a Forecasting Experiment

For the past several months, I’ve been participating in a forecasting tournament as one of hundreds of “experts” making and updating predictions about dozens of topics in international political and economic affairs. This tournament is funded by IARPA, a research arm of the U.S. government’s Office of the Director of National Intelligence, and it’s really a grand experiment designed to find better ways to elicit, combine, and present probabilistic forecasts from groups of knowledgeable people.

There are several teams participating in this tournament; I happen to be part of the Good Judgment team that is headed by psychologist Phil Tetlock. Good Judgment “won” the first year of the competition, but that win came before I started participating, so alas, I can’t claim even a tiny sliver of the credit for that.

I’ve been prognosticating as part of my job for more than a decade, but almost all of the forecasting I’ve done in the past was based on statistical models designed to assess risks of a specific rare event (say, a coup attempt, or the onset of a civil war) across all countries worldwide. The Good Judgment Project is my first experience with routinely making calls about the likelihood of many different events based almost entirely on my subjective beliefs. Now that I’m a few months into this exercise, I thought I’d write something about how I’ve approached the task, because I think my experiences speak to generic difficulties in forecasting rare political and economic events.

By way of background, here’s how the forecasting process works for the Good Judgment team: I start by logging into a web site that lists a bunch of questions on an odd mix of topics, everything from the Euro-to-dollar exchange rate to the outcome of the recent election in Venezuela. I click on a question and am expected to assign a numeric probability to each of the two or more possible outcomes listed (e.g., “yes or no,” or “A, B, or C”). Those outcomes are always exhaustive, so the probabilities I assign must always sum to 1. Whenever I feel like it, I can log back in and update any of the forecasts I’ve already made. Then, when the event of interest happens (e.g., results of the Venezuelan election are announced), the question is closed, and the accuracy of my forecast for that question and for all questions closed to date is summarized with a statistic called the Brier score. I can usually see my score pretty soon after the question closes, and I can also see average scores on each question for my whole team and the cumulative scores for the top 10 percent or so of the team’s leader board.
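For anyone unfamiliar with it, the basic idea of the Brier score is easy to show: the realized outcome counts as 1, the others as 0, and the score is the sum of squared gaps between those values and the forecast probabilities, so lower is better. Here is a quick sketch with hypothetical forecasts; the tournament’s exact scoring details may differ.

```python
def brier_score(forecast, outcome_index):
    """Multi-category squared-error (Brier) score: the realized outcome
    counts as 1, every other outcome as 0, and we sum the squared gaps.
    Lower is better; 0 is perfect, 2 is maximally wrong on a yes/no question.
    (The tournament's exact scoring details may differ.)
    """
    return sum((p - (1 if i == outcome_index else 0)) ** 2
               for i, p in enumerate(forecast))

# Hypothetical yes/no question where "yes" (index 0) turned out to be true:
print(round(brier_score([0.90, 0.10], 0), 2))  # 0.02 -- confident and right
print(round(brier_score([0.50, 0.50], 0), 2))  # 0.5  -- the coin-flip answer
print(round(brier_score([0.10, 0.90], 0), 2))  # 1.62 -- confident and wrong
```

As the last two lines suggest, the penalty grows steeply as a confident forecast goes wrong, which is the squared-error quirk I come back to below.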

So, how do you perform this task? In an ideal world, my forecasting process would go as follows:

1. Read the question carefully. What does “attack,” “take control of,” or “lose office” mean here, and how will it be determined? To make the best possible forecast, I need to understand exactly what it is I’m being asked to predict.

2. Forecast. In Unicorn World, I would have an ensemble of well-designed statistical models to throw at each and every question. In the real world, I’m lucky if there’s a single statistical model that applies even indirectly. Absent a sound statistical forecast, I would try to identify the class of events to which the question belongs and then determine a base rate for that class.  The “base rate” is just the historical frequency of the event’s occurrence in comparable cases—say, how often civil wars end in their first year, or how often incumbents lose elections in Africa. Where feasible, I would also check prediction markets, look for relevant polling data, seek out published predictions, and ask knowledgeable people.

In all cases, the idea is to find empirical evidence to which I can anchor my forecast, and to get a sense of how much uncertainty there is about this particular case. When there’s a reliable forecast from a statistical model or even good base-rate information, I would weight that most heavily and would only adjust far away from that prediction in cases where subject-matter experts make a compelling case about why this instance will be different. (As for what makes a case compelling, well…) If I can’t find any useful information or the information I find is all over the map, I would admit that I have no clue and do the probabilistic equivalent of a coin flip or die roll, splitting the predicted probability evenly across all of the possible outcomes.

3. Update. As Nate Silver argues in The Signal and the Noise, forecasters should adjust their predictions whenever they are presented with relevant new information. There’s no shame in reconsidering your views as new information becomes available; that’s called learning. Ideally, I would be disciplined about how I update my forecasts, using Bayes’ rule to check the human habit of leaning too hard on the freshest tidbit and to take full advantage of all the information I already had. (A bare-bones sketch of that kind of update appears after this list.)

4. Learn. Over time, I would see areas where I was doing well or poorly and would use those outcomes to deepen confidence in, or to try to improve, my mental models. For the questions I get wrong, what did I overlook? What were the forecasters who did better than I thinking about that I wasn’t? For the questions I answer well, was I right for the right reasons, or was it possibly just dumb luck? The more this process gets repeated, the better I should be able to do, within the limits determined by the basic predictability of the phenomenon in question.
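To make item 3 concrete, here is a bare-bones sketch of that kind of update in odds form: multiply the prior odds by a likelihood ratio for the new information, then convert back to a probability. The numbers are made up purely to illustrate the discipline; neither the base rate nor the likelihood ratio comes from a real question.

```python
def bayes_update(prior, likelihood_ratio):
    """Update a probability in odds form: posterior odds = prior odds times
    the likelihood ratio, i.e., P(evidence | event) / P(evidence | no event).
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Hypothetical: a base rate puts the chance of the event at 10%, and a new
# report is judged three times as likely if the event is coming than if not.
print(round(bayes_update(0.10, 3.0), 2))  # 0.25
```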

To my mind, that is how I should be making forecasts. Now here are a few things I’ve noticed so far about what I’m actually doing.

1. I’m lazy. Most of the time, I see the question, make an assumption about what it means without checking the resolution criteria, and immediately formulate a gut response on a simple four-point scale: very likely, likely, unlikely, or very unlikely. I’d like to say that “I have no clue” is also on the menu of immediate responses, but my brain almost always makes an initial guess, even if it’s a topic about which I know nothing—for example, the resolution of a particular dispute before the World Trade Organization. What’s worse, it’s hard to dislodge that gut response once I’ve had it, even when I think I have better anchoring information.

2. I’m motivated by the competition, but that motivation doesn’t necessarily make me more attentive. As I said earlier, in this tournament, we can see our own scores and the scores of all the top performers as we go. With such a clear yardstick, it’s hard not to get pulled into seeing other forecasters as competitors and trying to beat them. You’d think that urge would motivate me to be more attentive to the task, more rigorously following the idealized process I described above. Most of the time, though, it just means that I calibrate my answers to the oddities of the yardstick—the Brier score involves squaring your error term, so the cost of being wrong isn’t distributed evenly across the range of possible values—and that I check the updated scores soon after questions close.

3. It’s hard to distinguish likelihood from timing. Some of the questions can get pretty specific about the timing of the event of interest. For example, we might be asked something like: Will Syrian president Bashar al-Assad fall before the end of October 2012; before the end of December 2012; before April 2013; or some time thereafter? I find these questions excruciatingly hard to answer, and it took me a little while to figure out why.

After thinking through how I had approached a few examples, I realized that my brain was conflating probability with proximity. In other words, the more likely I thought the event was, the sooner I expected it to occur.  That makes sense for some situations, but it doesn’t always, and a careful consideration of timing will usually require lots of additional information. For example, I might look at structural conditions in Syria and conclude that Assad can’t win the civil war he’s started, but how long it’s going to take for him to lose will depend on a host of other things with complex dynamics, like the price of oil, the flow of weapons, and the logistics of military campaigns. Interestingly, even though I’m now aware of this habit, I’m still finding it hard to break.

4. I’m eager to learn from feedback, but feedback is hard to come by. This project isn’t like weather forecasting or equities trading, where you make a prediction, see how you did, tweak your model, and try again, over and over. Most of the time, the questions are pretty idiosyncratic, so you’ll have just one or a few chances to make a certain prediction. What’s more, the answers are usually categorical, so even when you do get more than one shot, it’s hard to tell how wrong or right you were. In this kind of environment, it’s really hard to build on your experience. In the several months I’ve been participating, I think I’ve learned far more about the peculiarities of the Brier score than I have about the generative process underlying any of the phenomena about which I’ve been asked to predict.

And that, for whatever it’s worth, is one dart-throwing chimp’s initial impressions of this particular experiment. As it happens, I’ve been doing pretty well so far—I’ve been at or near the top of the team’s accuracy rankings since scores for the current phase started to come in—but I still feel basically clueless most of the time.  It’s like a golf tournament where good luck tosses some journeyman to the top of the leader board after the first or second round, and I keep waiting for the inevitable regression toward the mean to kick in. I’d like to think I can pull off a Tin Cup, but I know enough about statistics to expect that I almost certainly won’t.
