How Circumspect Should Quantitative Forecasters Be?

Yesterday, I participated in a panel discussion on the use of technology to prevent and document mass atrocities as part of an event at American University’s Washington College of Law to commemorate the Rwandan genocide.* In my prepared remarks, I talked about the atrocities early-warning system I’m helping build for the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide. The chief outputs of that system are probabilistic forecasts, some from statistical models and others from a “wisdom of (expert) crowds” system called an opinion pool.

After I’d described that project, one of the other panelists, Patrick Ball, executive director of Human Rights Data Analysis Group, had this to say via Google Hangout:

As someone who uses machine learning to build statistical models—that’s what I do all day long, that’s my job—I’m very skeptical that models about conflict, about highly rare events that have very complicated and situation-unique antecedents are forecastable. I worry about early warning because when we build models we listen to people less. I know that, from my work with the U.N., when we have a room full of people who know an awful lot about what’s going on on the ground, a graph—when someone puts a graph on the table, everybody stops thinking. They just look at the graph. And that worries me a lot.

In 1994, human-rights experts warned the world about what was happening [in Rwanda]. No one listened. So as we, as technologists and people who like technology, when we ask questions of data, we have to make sure that if anybody is going to listen to us, we’d better be giving them the right answers.

Maybe I was being vain, but I heard that part of Patrick’s remarks as a rebuke of our early-warning project and pretty much every other algorithm-driven atrocities and conflict forecasting endeavor out there. I responded by acknowledging that our forecasts are far from perfect, but I also asserted that we have reason to believe they will usually be at least marginally better than the status quo, so they’re worth doing and sharing anyway.

A few minutes later, Patrick came back with this:

When we build technology for human rights, I think we need to be somewhat thoughtful about how our less technical colleagues are going to hear the things that we say. In a lot of meetings over a lot of years, I’ve listened to very sophisticated, thoughtful legal, qualitative, ethnographic arguments about very specific events occurring on the ground. But almost inevitably, when someone proposes some kind of quantitative analysis, all that thoughtful reasoning escapes the room… The practical effect of introducing any kind of quantitative argument is that it displaces the other arguments that are on the table. And we are naive to think otherwise.

What that means is that the stakes for getting these kinds of claims right are very high. If we make quantitative claims and we’re wrong—because our sampling foundations are weak, because our model is inappropriate, because we misinterpreted the error around our claim, or for any other reason—we can do a lot of harm.

From that combination of uncertainty and the possibility for harm, Patrick concludes that quantitative forecasters have a special responsibility to be circumspect in the presentation of their work:

I propose that one of the foundations of any kind of quantitative claims-making is that we need to have very strict validation before we propose a conclusion to be used by our broader community. There are all kinds of rules about validation in model-building. We know a lot about it. We have a lot of contexts in which we have ground truth. We have a lot of historical detail. Some of that historical detail is itself beset by these sampling problems, but we have opportunities to do validation. And I think that any argument, any claim that we make—especially to non-technical audiences—should lead with that validation rather than leaving it to the technical detail. By avoiding discussing the technical problems in front of non-technical audiences, we’re hiding stuff that might not be working. So I warn us all to be much stricter.

Patrick has applied statistical methods to human-rights matters for a long time, and his combined understanding of the statistics and the advocacy issues is as good as, if not better than, almost anyone else’s. Still, what he described about how people respond to quantitative arguments is pretty much the exact opposite of my experience over 15 years of working on statistical forecasts of various forms of political violence and change. Many of the audiences to which I’ve presented that work have been deeply skeptical of efforts to forecast political behavior. Like Patrick, many listeners have asserted that politics is fundamentally unquantifiable and unpredictable. Statistical forecasts in particular are often derided for connoting a level of precision that’s impossible to achieve and for being too far removed from the messy reality of specific places to produce useful information. Even in cases where we can demonstrate that the models are pretty good at distinguishing high-risk cases from low-risk ones, that evidence usually fails to persuade many listeners, who appear to reject the work on principle.

I hear loud echoes of my experiences in Daniel Kahneman’s discussion of clinical psychologists’ hostility to algorithms and enduring prejudice in favor of clinical judgment, even in situations where the former is demonstrably superior to the latter. On p. 228 of Thinking, Fast and Slow, Kahneman observes that this prejudice “is an attitude we can all recognize.”

When a human competes with a machine, whether it is John Henry a-hammerin’ on the mountain or the chess genius Garry Kasparov facing off against the computer Deep Blue, our sympathies lie with our fellow human. The aversion to algorithms making decisions that affect humans is rooted in the strong preference that many people have for the natural over the synthetic or artificial.

Kahneman further reports that

The prejudice against algorithms is magnified when the decisions are consequential. [Psychologist Paul] Meehl remarked, ‘I do not quite know how to alleviate the horror some clinicians seem to experience when they envisage a treatable case being denied treatment because a ‘blind, mechanical’ equation misclassifies him.’ In contrast, Meehl and other proponents of algorithms have argued strongly that it is unethical to rely on intuitive judgments for important decisions if an algorithm is available that will make fewer mistakes. Their rational argument is compelling, but it runs against a stubborn psychological reality: for most people, the cause of a mistake matters. The story of a child dying because an algorithm made a mistake is more poignant than the story of the same tragedy occurring as a result of human error, and the difference in emotional intensity is readily translated into a moral preference.

If our distaste for algorithms is more emotional than rational, then why do forecasters who use them have a special obligation, as Patrick asserts, to lead presentations of their work with a discussion of the “technical problems” when experts offering intuitive judgments almost never do? I’m uncomfortable with that requirement, because I think it unfairly handicaps algorithmic forecasts in what is, frankly, a competition for attention against approaches that are often demonstrably less reliable but also have real-world consequences. This isn’t a choice between action and inaction; it’s a trolley problem. Plenty of harm is already happening on the current track, and better forecasts could help reduce that harm. Under these circumstances, I think we behave ethically when we encourage the use of our forecasts in honest but persuasive ways.

If we could choose between forecasting and not forecasting, then I would be happier to set a high bar for predictive claims-making and let the validation to which Patrick alluded determine whether or not we’re going to try forecasting at all. Unfortunately, that’s not the world we inhabit. Instead, we live in a world in which governments and other organizations are constantly making plans, and those plans incorporate beliefs about future states of the world.

Conventionally, those beliefs are heavily influenced by the judgments of a small number of experts elicited in unstructured ways. That approach probably works fine in some fields, but geopolitics is not one of them. In this arena, statistical models and carefully designed procedures for eliciting and combining expert judgments will also produce forecasts that are uncertain and imperfect, but those algorithm-driven forecasts will usually be more accurate than the conventional approach of querying one or a few experts and blending their views in our heads (see here and here for some relevant evidence).

We also know that most of those subject-matter experts don’t abide by the rules Patrick proposes for quantitative forecasters. Anyone who’s ever watched cable news or read an op-ed—or, for that matter, attended a panel discussion—knows that experts often convey their judgments with little or no discussion of their cognitive biases and sources of uncertainty.

As it happens, that confidence is persuasive. As Kahneman writes (p. 263),

Experts who acknowledge the full extent of their ignorance may expect to be replaced by more confident competitors who are better able to gain the trust of clients. An unbiased appreciation of uncertainty is a cornerstone of rationality—but it is not what people and organizations want. Extreme uncertainty is paralyzing under dangerous circumstances, and the admission that one is merely guessing is especially unacceptable when the stakes are high. Acting on pretended knowledge is often the preferred solution.

The allure of confidence is dysfunctional in many analytic contexts, but it’s also not something we can wish away. And if confidence often trumps content, then I think we do our work and our audiences a disservice when we hem and haw about the validity of our forecasts as long as the other guys don’t. Instead, I believe we are behaving ethically when we present imperfect but carefully derived forecasts in a confident manner. We should be transparent about the limitations of the data and methods, and we should assess the accuracy of our forecasts and share what we learn. Until we all agree to play by the same rules, though, I don’t think quantitative forecasters have a special obligation to lead with the limitations of their work, thus conceding a persuasive advantage to intuitive forecasters who will fill that space and whose prognostications we can expect to be less reliable than ours.

* You can replay a webcast of that event here. Our panel runs from 1:00:00 to 2:47:00.

Yes, Forecasting Conflict Can Help Make Better Foreign Policy Decisions

At the Monkey Cage, Idean Salehyan has a guest post that asks, “Can forecasting conflict help to make better foreign policy decisions?” I started to respond in a comment there, but as my comment ballooned into several paragraphs and started to include hyperlinks, I figured I’d go ahead and blog it.

Let me preface my response by saying that I’ve spent most of my 16-year career since graduate school doing statistical forecasting for the U.S. government and now wider audiences and plan and expect to continue doing this kind of work for a while. That means I have a lot of experience doing it and thinking about how and why to do it, but it also means that I’m financially invested in an affirmative answer to Salehyan’s rhetorical question. Make of that what you will.

So, on to the substance. Salehyan’s main concern is actually an ethical one, not the pragmatic one I inferred when I first saw the title of his post. When Salehyan asks about making decisions “better,” he doesn’t just mean more effective. In his view,

Scholars cannot be aloof from the real-world implications of their work, but must think carefully about the potential uses of forecasts…If social scientists will not use their research to engage in policy debates about when to strike, provide aid, deploy troops, and so on, others will do so for them.  Conflict forecasting should not be seen as value-neutral by the academic community—it will certainly not be seen as such by others.

On this point, I agree completely, but I don’t think there’s anything unique about conflict forecasting in this regard. No scholarship is entirely value neutral, and research on causal inference informs policy decisions, too. In fact, my experience is that policy frames suggested by compelling causal analysis have deeper and more durable influence than statistical forecasts, which most policymakers still seem inclined to ignore.

One prominent example comes from the research program that emerged in the 2000s on the relationship between natural resources and the occurrence and persistence of armed conflict. After Paul Collier and Anke Hoeffler famously identified “greed” as an important impetus to civil war (here), numerous scholars showed that some rebel groups were using “lootable” resources to finance their insurgencies. These studies helped inspire advocacy campaigns that led, among other things, to U.S. legislation aimed at restricting trade in “conflict minerals” from the Democratic Republic of Congo. Now, several years later, other scholars and advocates have convincingly shown that this legislation was counterproductive. According to Laura Seay (here), the U.S. law

has created a de facto ban on Congolese mineral exports, put anywhere from tens of thousands up to 2 million Congolese miners out of work in the eastern Congo, and, despite ending most of the trade in Congolese conflict minerals, done little to improve the security situation or the daily lives of most Congolese.

Those are dire consequences, and forecasting is nowhere in sight. I don’t blame Collier and Hoeffler or the scholars who followed their intellectual lead on this topic for Dodd-Frank 1502, but I do hope and expect that those scholars will participate in the public conversation around related policy choices.

Ultimately, we all have a professional and ethical responsibility for the consequences of our work. For statistical forecasters, I think this means, among other things, a responsibility to be honest about the limitations, and to attend to the uses, of the forecasts we produce. The fact that we use mathematical equations to generate our forecasts and we can quantify our uncertainty doesn’t always mean that our forecasts are more accurate or more precise than what pundits offer, and it’s incumbent on us to convey those limitations. It’s easy to model things. It’s hard to model them well, and sometimes hard to spot the difference. We need to try to recognize which of those worlds we’re in and to communicate our conclusions about those aspects of our work along with our forecasts. (N.B. It would be nice if more pundits tried to abide by this rule as well. Alas, as Phil Tetlock points out in Expert Political Judgment, the market for this kind of information rewards other things.)

Salehyan doesn’t just make this general point, however. He also argues that scholars who produce statistical forecasts have a special obligation to attend to the ethics of policy informed by their work because, in his view, they are likely to be more influential.

The same scientific precision that makes statistical forecasts better than ‘gut feelings’ makes it even more imperative for scholars to engage in policy debates.  Because statistical forecasts are seen as more scientific and valid they are likely to carry greater weight in the policy community.  I would expect—indeed hope—that scholars care about how their research is used, or misused, by decision makers.  But claims to objectivity and coolheaded scientific-ness make many academics reluctant to advocate for or against a policy position.

In my experience and the experience of every policy veteran with whom I’ve ever spoken about the subject, Salehyan’s conjecture that “statistical forecasts are likely to carry greater weight in the policy community” is flat wrong. In many ways, the intellectual culture within the U.S. intelligence and policy communities mirrors the intellectual culture of the larger society from which their members are drawn. If you want to know how those communities react to statistical forecasts of the things they care about, just take a look at the public discussion around Nate Silver’s election forecasts. The fact that statistical forecasts aren’t blithely and blindly accepted doesn’t absolve statistical forecasters of responsibility for their work. Ethically speaking, though, it matters that we’re nowhere close to the world Salehyan imagines in which the layers of deliberation disappear and a single statistical forecast drives a specific foreign policy decision.

Look, these decisions are going to be made whether or not we produce statistical forecasts, and when they are made, they will be informed by many things, of which forecasts—statistical or otherwise—will be only one. That doesn’t relieve the forecaster of ethical responsibility for the potential consequences of his or her work. It just means that the forecaster doesn’t have a unique obligation in this regard. In fact, if anything, I would think we have an ethical obligation to help make those forecasts as accurate as we can in order to reduce as much as we can the uncertainty about this one small piece of the decision process. It’s a policymaker’s job to confront these kinds of decisions, and their choices are going to be informed by expectations about the probability of various alternative futures. Given that fact, wouldn’t we rather those expectations be as well informed as possible? I sure think so, and I’m not the only one.

Some Suggested Readings for Political Forecasters

A few people have recently asked me to recommend readings on political forecasting for people who aren’t already immersed in the subject. Since the question keeps coming up, I thought I’d answer with a blog post. Here, in no particular order, are books (and one article) I’d suggest to anyone interested in the subject.

Thinking, Fast and Slow, by Daniel Kahneman. A really engaging read on how we think, with special attention to cognitive biases and heuristics. I think forecasters should read it in hopes of finding ways to mitigate the effects of these biases on their own work, and of getting better at spotting them in the thinking of others.

Numbers Rule Your World, by Kaiser Fung. Even if you aren’t going to use statistical models to forecast, it helps to think statistically, and Fung’s book is the most engaging treatment of that topic that I’ve read so far.

The Signal and the Noise, by Nate Silver. A guided tour of how forecasters in a variety of fields do their work, with some useful general lessons on the value of updating and being an omnivorous consumer of relevant information.

The Theory that Would Not Die, by Sharon Bertsch McGrayne. A history of Bayesian statistics in the real world, including successful applications to some really hard prediction problems, like the risk of accidents with atomic bombs and nuclear power plants.

The Black Swan, by Nassim Nicholas Taleb. If you can get past the derisive tone (and I’ll admit, I initially found that hard to do), this book does a great job explaining why we should be humble about our ability to anticipate rare events in complex systems, and how forgetting that fact can hurt us badly.

Expert Political Judgment: How Good Is It? How Can We Know?, by Philip Tetlock. The definitive study to date on the limits of expertise in political forecasting and the cognitive styles that help some experts do a bit better than others.

Counterfactual Thought Experiments in World Politics, edited by Philip Tetlock and Aaron Belkin. The introductory chapter is the crucial one. It’s ostensibly about the importance of careful counterfactual reasoning to learning from history, but it applies just as well to thinking about plausible futures, an important skill for forecasting.

The Foundation Trilogy, by Isaac Asimov. A great fictional exploration of the Modernist notion of social control through predictive science. These books were written half a century ago, and it’s been more than 25 years since I read them, but they’re probably more relevant than ever, what with all the talk of Big Data and the Quantified Self and such.

“The Perils of Policy by P-Value: Predicting Civil Conflicts,” by Michael Ward, Brian Greenhill, and Kristin Bakke. This one’s really for practicing social scientists, but still. The point is that the statistical models we typically construct for hypothesis testing often won’t be very useful for forecasting, so proceed with caution when switching between tasks. (The fact that they often aren’t very good for hypothesis testing, either, is another matter. On that and many other things, see Phil Schrodt’s “Seven Deadly Sins of Contemporary Quantitative Political Analysis.”)

I’m sure I’ve missed a lot of good stuff and would love to hear more suggestions from readers.

And just to be absolutely clear: I don’t make any money if you click through to those books or buy them or anything like that. The closest thing I have to a material interest in this list is a set of ongoing professional collaborations with three of the authors listed here: Phil Tetlock, Phil Schrodt, and Mike Ward.

Forecasting Round-Up No. 3

1. Mike Ward and six colleagues recently posted a new working paper on “the next generation of crisis prediction.” The paper echoes themes that Mike and Nils Metternich sounded in a recent Foreign Policy piece responding to one I wrote a few days earlier, about the challenges of forecasting rare political events around the world. Here’s a snippet from the paper’s intro:

We argue that conflict research in political science can be improved by more, not less, attention to predictions. The increasing availability of disaggregated data and advanced estimation techniques are making forecasts of conflict more accurate and precise. In addition, we argue that forecasting helps to prevent overfitting, and can be used both to validate models, and inform policy makers.

I agree with everything the authors say about the scientific value and policy relevance of forecasting, and I think the modeling they’re doing on civil wars is really good. There were two things I especially appreciated about the new paper.

First, their modeling is really ambitious. In contrast to most recent statistical work on civil wars, they don’t limit their analysis to conflict onset, termination, or duration, and they don’t use country-years as their unit of observation. Instead, they look at country-months, and they try to tackle the more intuitive but also more difficult problem of predicting where civil wars will be occurring, whether or not one is already ongoing.

This version of the problem is harder because the factors that affect the risk of conflict onset might not be the same ones that affect the risk of conflict continuation. Even when they are, those factors might not affect the two risks in the same ways. As a result, it’s hard to specify a single model that can reliably anticipate continuity in, and changes from, both forms of the status quo (conflict or no conflict).

The difficulty of this problem is evident in the out-of-sample accuracy of the model these authors have developed. The performance statistics are excellent on the whole, but that’s mostly because the model is accurately forecasting that whatever is happening in one month will continue to happen in the next. Not surprisingly, the model’s ability to anticipate transitions is apparently weaker. Of the five civil-war onsets that occurred in the test set, only two “arguably…rise to probability levels that are heuristic,” as the authors put it.

I emailed Mike to ask about this issue, and he said they were working on it:

Although the paper doesn’t go into it, in a separate part of this effort we actually do have separate models for onset and continuation, and they do reasonably well.  We are at work on terminations, and developing a new methodology that predicts onsets, duration, and continuation in a single (complicated!) model.  But that is down the line a bit.
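To make the point about persistence concrete, here is a minimal sketch in Python. The monthly series is invented for illustration and has nothing to do with the paper's data, and the function names are my own. It scores a naive rule that always predicts "next month will look like this month," first on all months and then on transition months only.

```python
# Hypothetical monthly indicator for one country (1 = civil war ongoing).
# These values are made up for illustration; they are not from the paper.
series = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]


def persistence_forecast(history):
    """Predict that next month will look like the most recent month."""
    return history[-1]


def evaluate_persistence(series):
    """Score the persistence rule on all months and on transition months only."""
    hits = 0
    total = 0
    transitions = 0
    transition_hits = 0
    for t in range(1, len(series)):
        predicted = persistence_forecast(series[:t])
        actual = series[t]
        total += 1
        hits += int(predicted == actual)
        if actual != series[t - 1]:  # an onset or a termination
            transitions += 1
            transition_hits += int(predicted == actual)
    return {
        "overall_accuracy": hits / total,
        "transitions": transitions,
        "transition_accuracy": (
            transition_hits / transitions if transitions else None
        ),
    }


result = evaluate_persistence(series)
print(result)
```

On this toy series the rule gets 19 of 23 months right while missing every one of the four onsets and terminations. That is the pattern to look for when overall performance statistics seem excellent: check the transitions separately.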

Second and even more exciting to me, the authors close the paper with real, honest-to-goodness forecasts. Using the most recent data available when the paper was written, the authors generate predicted probabilities of civil war for the next six months: October 2012 through March 2013. That’s the first time I’ve seen that done in an academic paper about something other than an election, and I hope it sets a precedent that others will follow.

2. Over at Red (team) Analysis, Helene Lavoix appropriately pats The Economist on the back for publicly evaluating the accuracy of the predictions they made in their “World in 2012” issue. You can read the Economist’s own rack-up here, but I want to highlight one of the points Helene raised in her discussion of it. Toward the end of her post, in a section called “Black swans or biases?”, she quotes this bit from the Economist:

As ever, we failed at big events that came out of the blue. We did not foresee the LIBOR scandal, for example, or the Bo Xilai affair in China or Hurricane Sandy.

As Helene argues, though, it’s not self-evident that these events were really so surprising: in their specifics, yes, but not in the more general sense of the possibility of events like these occurring sometime this year. On Sandy, for example, she notes that

Any attention paid to climate change, to the statistics and documents produced by Munich-re…or Allianz, for example, to say nothing about the host of related scientific studies, show that extreme weather events have become a reality and we are to expect more of them and more often, including in the so-called rich countries.

This discussion underscores the importance of being clear about what kind of forecasting we’re trying to do, and why. Sometimes the specifics will matter a great deal. In other cases, though, we may have reason to be more concerned with risks of a more general kind, and we may need to broaden our lens accordingly. Or, as Helene writes,

The methodological problem we are facing here is as follows: Are we trying to predict discrete events (hard but not impossible, however with some constraints and limitations according to cases) or are we trying to foresee dynamics, possibilities? The answer to this question will depend upon the type of actions that should follow from the anticipation, as predictions or foresight are not done in a vacuum but to allow for the best handling of change.

3. Last but by no means least, Edge.org has just posted an interview with psychologist Phil Tetlock about his groundbreaking and ongoing research on how people forecast, how accurate (or not) their forecasts are, and whether or not we can learn to do this task better. [Disclosure: I am one of hundreds of subjects in Phil's contribution to the IARPA tournament, the Good Judgment Project.] On the subject of learning, the conventional wisdom is pessimistic, so I was very interested to read this bit (emphasis added):

Is world politics like a poker game? This is what, in a sense, we are exploring in the IARPA forecasting tournament. You can make a good case that history is different and it poses unique challenges. This is an empirical question of whether people can learn to become better at these types of tasks. We now have a significant amount of evidence on this, and the evidence is that people can learn to become better [forecasters]. It’s a slow process. It requires a lot of hard work, but some of our forecasters have really risen to the challenge in a remarkable way and are generating forecasts that are far more accurate than I would have ever supposed possible from past research in this area.

And bonus alert: the interview is introduced by Daniel Kahneman, Nobel laureate and author of one of my favorite books from the past few years, Thinking, Fast and Slow.

N.B. In case you’re wondering, you can find Forecasting Round-Up Nos. 1 and 2 here and here.

A Chimp’s-Eye View of a Forecasting Experiment

For the past several months, I’ve been participating in a forecasting tournament as one of hundreds of “experts” making and updating predictions about dozens of topics in international political and economic affairs. This tournament is funded by IARPA, a research arm of the U.S. government’s Office of the Director of National Intelligence, and it’s really a grand experiment designed to find better ways to elicit, combine, and present probabilistic forecasts from groups of knowledgeable people.

There are several teams participating in this tournament; I happen to be part of the Good Judgment team that is headed by psychologist Phil Tetlock. Good Judgment “won” the first year of the competition, but that win came before I started participating, so alas, I can’t claim even a tiny sliver of the credit for that.

I’ve been prognosticating as part of my job for more than a decade, but almost all of the forecasting I’ve done in the past was based on statistical models designed to assess risks of a specific rare event (say, a coup attempt, or the onset of a civil war) across all countries worldwide. The Good Judgment Project is my first experience with routinely making calls about the likelihood of many different events based almost entirely on my subjective beliefs. Now that I’m a few months into this exercise, I thought I’d write something about how I’ve approached the task, because I think my experiences speak to generic difficulties in forecasting rare political and economic events.

By way of background, here’s how the forecasting process works for the Good Judgment team: I start by logging into a web site that lists a bunch of questions on an odd mix of topics, everything from the Euro-to-dollar exchange rate to the outcome of the recent election in Venezuela. I click on a question and am expected to assign a numeric probability to each of the two or more possible outcomes listed (e.g., “yes or no,” or “A, B, or C”). Those outcomes are always mutually exclusive and exhaustive, so the probabilities I assign must always sum to 1. Whenever I feel like it, I can log back in and update any of the forecasts I’ve already made. Then, when the event of interest happens (e.g., results of the Venezuelan election are announced), the question is closed, and the accuracy of my forecast for that question and for all questions closed to date is summarized with a statistic called the Brier score. I can usually see my score pretty soon after the question closes, and I can also see average scores on each question for my whole team and the cumulative scores for the top 10 percent or so of the team’s leader board.
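For readers who haven't met the Brier score, here is a minimal sketch in Python. I am assuming the original multi-category formulation, in which the score for one question is the sum of squared differences between the forecast probabilities and the realized outcome (1 for what happened, 0 for everything else). The tournament's exact scoring rules may differ, for example in how they average over the days a question is open, and the function names are my own.

```python
def brier_score(probabilities, outcome_index):
    """Multi-category Brier score for one question.

    probabilities: forecast probability for each listed outcome (must sum to 1).
    outcome_index: position of the outcome that actually occurred.
    Returns a value from 0 (perfect) to 2 (all weight on a wrong outcome).
    """
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(probabilities)
    )


def mean_brier(forecasts):
    """Average score over closed questions.

    forecasts: list of (probabilities, outcome_index) pairs.
    """
    return sum(brier_score(p, k) for p, k in forecasts) / len(forecasts)


# Hypothetical closed questions: a confident hit, a coin flip, a confident miss.
closed = [([0.9, 0.1], 0), ([0.5, 0.5], 1), ([0.8, 0.1, 0.1], 2)]
print(mean_brier(closed))
```

Lower is better. Under this formulation, splitting the probability evenly on a yes/no question always scores 0.5, which makes a handy benchmark: a forecaster who can't beat 0.5 on binary questions is doing no better than a coin flip.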

So, how do you perform this task? In an ideal world, my forecasting process would go as follows:

1. Read the question carefully. What does “attack,” “take control of,” or “lose office” mean here, and how will it be determined? To make the best possible forecast, I need to understand exactly what it is I’m being asked to predict.

2. Forecast. In Unicorn World, I would have an ensemble of well-designed statistical models to throw at each and every question. In the real world, I’m lucky if there’s a single statistical model that applies even indirectly. Absent a sound statistical forecast, I would try to identify the class of events to which the question belongs and then determine a base rate for that class.  The “base rate” is just the historical frequency of the event’s occurrence in comparable cases—say, how often civil wars end in their first year, or how often incumbents lose elections in Africa. Where feasible, I would also check prediction markets, look for relevant polling data, seek out published predictions, and ask knowledgeable people.

In all cases, the idea is to find empirical evidence to which I can anchor my forecast, and to get a sense of how much uncertainty there is about this particular case. When there’s a reliable forecast from a statistical model or even good base-rate information, I would weight that most heavily and would only adjust far away from that prediction in cases where subject-matter experts make a compelling case about why this instance will be different. (As for what makes a case compelling, well…) If I can’t find any useful information or the information I find is all over the map, I would admit that I have no clue and do the probabilistic equivalent of a coin flip or die roll, splitting the predicted probability evenly across all of the possible outcomes.

3. Update. As Nate Silver argues in The Signal and the Noise, forecasters should adjust their predictions whenever they are presented with relevant new information. There’s no shame in reconsidering your views as new information becomes available; that’s called learning. Ideally, I would be disciplined about how I update my forecasts, using Bayes’ rule to check the human habit of leaning too hard on the freshest tidbit and to take full advantage of all the information I had before.

4. Learn. Over time, I would see areas where I was doing well or poorly and would use those outcomes to deepen confidence in, or to try to improve, my mental models. For the questions I get wrong, what did I overlook? What were the forecasters who did better than I thinking about that I wasn’t? For the questions I answer well, was I right for the right reasons, or was it possibly just dumb luck? The more this process gets repeated, the better I should be able to do, within the limits determined by the basic predictability of the phenomenon in question.
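Steps 2 and 3 of that idealized process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are mine, the counts and the likelihood ratio are invented, and the update uses the odds form of Bayes’ rule for a simple yes/no question.

```python
def uniform_forecast(n_outcomes):
    """The 'I have no clue' forecast: split probability evenly."""
    return [1.0 / n_outcomes] * n_outcomes


def base_rate(events, cases):
    """Historical frequency of the event in comparable cases."""
    return events / cases


def bayes_update(prior, likelihood_ratio):
    """Update a yes/no probability with the odds form of Bayes' rule.

    likelihood_ratio = P(evidence | event) / P(evidence | no event).
    A ratio of 1 means the new information is uninformative.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Invented numbers: the event occurred in 12 of 60 comparable cases.
anchor = base_rate(events=12, cases=60)

# Assumed judgment: the new report is three times as likely to appear
# if the event is coming as it is if the event is not coming.
updated = bayes_update(anchor, likelihood_ratio=3.0)

print("no-clue forecast:", uniform_forecast(3))
print("base-rate anchor:", round(anchor, 3))
print("after update:", round(updated, 3))
```

The discipline comes from having to state the likelihood ratio. A vivid new tidbit that is only slightly more likely under one outcome than the other should move the forecast only slightly.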

To my mind, that is how I should be making forecasts. Now here are a few things I’ve noticed so far about what I’m actually doing.

1. I’m lazy. Most of the time, I see the question, make an assumption about what it means without checking the resolution criteria, and immediately formulate a gut response on a simple four-point scale: very likely, likely, unlikely, or very unlikely. I’d like to say that “I have no clue” is also on the menu of immediate responses, but my brain almost always makes an initial guess, even if it’s a topic about which I know nothing (for example, the resolution of a particular dispute before the World Trade Organization). What’s worse, it’s hard to dislodge that gut response once I’ve had it, even when I think I have better anchoring information.

2. I’m motivated by the competition, but that motivation doesn’t necessarily make me more attentive. As I said earlier, in this tournament, we can see our own scores and the scores of all the top performers as we go. With such a clear yardstick, it’s hard not to get pulled into seeing other forecasters as competitors and trying to beat them. You’d think that urge would motivate me to be more attentive to the task, more rigorously following the idealized process I described above. Most of the time, though, it just means that I calibrate my answers to the oddities of the yardstick—the Brier score involves squaring your error term, so the cost of being wrong isn’t distributed evenly across the range of possible values—and that I check the updated scores soon after questions close.

3. It’s hard to distinguish likelihood from timing. Some of the questions can get pretty specific about the timing of the event of interest. For example, we might be asked something like: Will Syrian president Bashar al-Assad fall before the end of October 2012; before the end of December 2012; before April 2013; or some time thereafter? I find these questions excruciatingly hard to answer, and it took me a little while to figure out why.

After thinking through how I had approached a few examples, I realized that my brain was conflating probability with proximity. In other words, the more likely I thought the event was, the sooner I expected it to occur.  That makes sense for some situations, but it doesn’t always, and a careful consideration of timing will usually require lots of additional information. For example, I might look at structural conditions in Syria and conclude that Assad can’t win the civil war he’s started, but how long it’s going to take for him to lose will depend on a host of other things with complex dynamics, like the price of oil, the flow of weapons, and the logistics of military campaigns. Interestingly, even though I’m now aware of this habit, I’m still finding it hard to break.

4. I’m eager to learn from feedback, but feedback is hard to come by. This project isn’t like weather forecasting or equities trading, where you make a prediction, see how you did, tweak your model, and try again, over and over. Most of the time, the questions are pretty idiosyncratic, so you’ll have just one or a few chances to make a certain prediction. What’s more, the answers are usually categorical, so even when you do get more than one shot, it’s hard to tell how wrong or right you were. In this kind of environment, it’s really hard to build on your experience. In the several months I’ve been participating, I think I’ve learned far more about the peculiarities of the Brier score than I have about the generative process underlying any of the phenomena I’ve been asked to predict.
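To make the point about the yardstick concrete, here is a small sketch of how squaring the error shapes the cost of confidence on a yes/no question. It assumes the two-category form of the Brier score, which runs from 0 to 2; the helper name and the probabilities in the loop are just examples.

```python
def binary_brier(p_yes, happened):
    """Two-category Brier score for a yes/no question (0 = perfect, 2 = worst).

    The "yes" and "no" squared errors are equal in the binary case,
    so the score is twice the squared gap on the "yes" side.
    """
    outcome = 1.0 if happened else 0.0
    return 2.0 * (p_yes - outcome) ** 2


for p in (0.5, 0.6, 0.75, 0.9, 0.99):
    right = binary_brier(p, True)
    wrong = binary_brier(p, False)
    print(f"p(yes)={p:.2f}  score if yes: {right:.3f}  score if no: {wrong:.3f}")
```

Under these assumptions, moving a forecast from 60 to 90 percent improves the score by about 0.30 if the event happens and worsens it by about 0.90 if it doesn’t. That asymmetry is the “oddity” I find myself calibrating to.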

And that, for whatever it’s worth, is one dart-throwing chimp’s initial impressions of this particular experiment. As it happens, I’ve been doing pretty well so far—I’ve been at or near the top of the team’s accuracy rankings since scores for the current phase started to come in—but I still feel basically clueless most of the time.  It’s like a golf tournament where good luck tosses some journeyman to the top of the leader board after the first or second round, and I keep waiting for the inevitable regression toward the mean to kick in. I’d like to think I can pull off a Tin Cup, but I know enough about statistics to expect that I almost certainly won’t.

A Comment on Nate Silver’s The Signal and the Noise

I’ve just finished reading Nate Silver’s very good new book, The Signal and the Noise: Why So Many Predictions Fail—But Some Don’t. For me, the book was more of a tossed salad than a cake—a tasty assemblage of conceptually related parts that doesn’t quite cohere into a new whole. Still, I would highly recommend it to anyone interested in forecasting, a category that should include anybody with a pulse, as Silver persuasively argues. We can learn a lot just by listening to a skilled practitioner talk about his craft, and that, to me, is what The Signal and the Noise is really all about.

Instead of trying to review the whole book here, though, I wanted to pull on one particular thread running through it, because I worry about where that thread might lead some readers. That thread concerns the relative merits of statistical models and expert judgment as forecasting tools.

Silver is a professional forecaster who built his reputation on the clever application of statistical tools, but that doesn’t mean he’s a quantitative fundamentalist. To the contrary, one of the strongest messages in The Signal and the Noise is that our judgment may be poor, but we shouldn’t fetishize statistical models, either. Yes, human forecasters are inevitably biased, but so, in a sense, are the statistical models they build. For starters, those models entail a host of assumptions about the reliability and structure of the data, many of which will often be wrong. For another, there is often important information that’s hard to quantify but is useful for forecasting, and we ignore the signals from this space at our own peril. Third, forecasts are usually more accurate when they aggregate information from multiple, independent sources, and subjective forecasts from skilled experts can be a really useful leg in this stool.

Putting all of these concerns together, Silver arrives at a philosophy of forecasting that might be described as “model-assisted,” or maybe just “omnivorous.” Silver recognizes the power of statistics for finding patterns in noisy data and checking our mental models, but he also cautions strongly against putting blind faith in those tools and favors keeping human judgment in the cycle, including at the final stage, where we actually make a forecast about some situation of interest.

To illustrate the power of model-assisted forecasting, Silver describes how well this approach has worked in several areas, including baseball scouting, election forecasting, and meteorology. About the last of these, for example, he writes that “weather forecasting is one of the success stories in this book, a case of man and machine joining forces to understand and sometimes anticipate the complexities of nature.”

All of what Silver says about the pitfalls of statistical forecasting and the power of skilled human forecasters is true, but only to a point. I think Silver’s preferred approach depends on a couple of conditions that are often absent in real-world efforts to forecast complex political phenomena, where I’ve done most of my work. Because Silver doesn’t spell those conditions out, I thought I would, in an effort to discourage readers of The Signal and the Noise from concluding that statistical forecasts can always be improved by adjusting them according to our judgment.

First, the process Silver recommends assumes that the expert tweaking the statistical forecast is the modeler, or at least has a good understanding of the strengths and weaknesses of the model(s) being used. For example, he describes experienced meteorologists improving their forecasts by manually adjusting certain values to correct for a known flaw in the model. Those manual adjustments seem to make the forecasts better, but they depend on a pretty sophisticated knowledge of the underlying algorithm and the idiosyncrasies of the historical data.

Second and probably more important, the scenarios Silver approvingly describes all involve situations where the applied forecaster gets frequent and clear feedback on the accuracy of his or her predictions. This feedback allows the forecaster to look for patterns in the performance of the statistical tool and the adjustments being made to it. It’s the familiar process of trial and error, but that process only works when we can see where the errors are and see if the fixes we attempt are actually working.

Both of these conditions hold in several of the domains Silver discusses, including baseball scouting and meteorology. These are data-rich environments where forecasters often know the quirks of the data and statistical models they might use and can constantly see how they’re doing.

In the world of international politics, however, most forecasters—and, whether they realize it or not, every analyst is a forecaster—have little or no experience with statistical forecasting tools and are often skeptical of their value. As a result, discussions about the forecasts these tools produce are more likely to degenerate into a competitive, “he said, she said” dynamic than they are to achieve the synergy that Silver praises.

More important, feedback on the predictive performance of analysts in international politics is usually fuzzy or absent. Poker players get constant feedback from the changing size of their chip stacks. By contrast, people who try to forecast politics rarely do so with much specificity, and even when they do, they rarely keep track of their performance over time. What’s worse, the events we try to forecast, things like coups or revolutions, rarely occur, so there aren’t that many opportunities to assess our performance even if we try. Most of the score-keeping is done in our own heads, but as Phil Tetlock shows, we’re usually poor judges of our own performance. We fixate on the triumphs, forget or explain away the misses, and spin the ambiguous cases as successes.

In this context, it’s not clear to me that Silver’s ideal of “model-assisted” forecasting is really attainable, at least not without more structure being imposed from the outside. For example, I could imagine a process where a third party elicits forecasts from human prognosticators and statistical models and then combines the results in a way that accounts for the strengths and blind spots of each input. This process would blend statistics and expert judgment, just not by means of a single individual as often happened in Silver’s favored examples.
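One simple way a third party might do that combining is a linear opinion pool, which is just a weighted average of the individual forecasts. The sketch below is an illustration, not a description of any existing system: the function name, the three sample forecasts, and the weights are all invented, and in practice the weights would have to come from something like each source’s track record.

```python
def linear_pool(forecasts, weights):
    """Weighted average of probability forecasts over the same outcomes.

    forecasts: list of probability vectors, one per source.
    weights: one non-negative weight per source (need not sum to 1).
    """
    if len(forecasts) != len(weights):
        raise ValueError("need exactly one weight per forecast")
    n_outcomes = len(forecasts[0])
    if any(len(f) != n_outcomes for f in forecasts):
        raise ValueError("all forecasts must cover the same outcomes")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [
        sum(w * f[i] for f, w in zip(forecasts, weights)) / total
        for i in range(n_outcomes)
    ]


# Invented inputs: probabilities of [event, no event] from three sources.
model = [0.10, 0.90]
expert_a = [0.30, 0.70]
expert_b = [0.20, 0.80]

# Assumed weights: the model counts twice as much as either expert.
pooled = linear_pool([model, expert_a, expert_b], weights=[2.0, 1.0, 1.0])
print([round(p, 3) for p in pooled])
```

The important design choice is that the weights are set by whoever runs the pool and can be checked against outcomes over time, rather than being applied ad hoc by the person whose gut disagrees with the model.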

Meanwhile, the virtuous circle Silver describes is already built into the process of statistical modeling, at least when done well. For example, careful statistical forecasters will train their models on one sample of cases and then apply them to another sample they’ve never “seen” before. This out-of-sample validation lets modelers know if they’re onto something useful and gives them some sense of the accuracy and precision of their models before they rush out and apply them.
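Here is a toy version of that train-and-test routine in Python, using only the standard library. Everything in it is made up for illustration: the data are simulated, the single “risk factor” and its event rates are invented, and the “model” is nothing more than the event frequency within each group, estimated on the training half only and then scored on the half it has never seen.

```python
import random


def simulate_cases(n, rng):
    """Synthetic cases as (has_risk_factor, event_occurred). Rates are invented."""
    cases = []
    for _ in range(n):
        risky = rng.random() < 0.5
        event = rng.random() < (0.40 if risky else 0.05)
        cases.append((risky, event))
    return cases


def fit_rates(train):
    """'Model': event frequency within each risk group, from training data only."""
    rates = {}
    for group in (True, False):
        outcomes = [event for risky, event in train if risky == group]
        rates[group] = sum(outcomes) / len(outcomes) if outcomes else 0.5
    return rates


def mean_squared_error(predictions, outcomes):
    """Average squared gap between predicted probabilities and 0/1 outcomes."""
    return sum((p - float(o)) ** 2 for p, o in zip(predictions, outcomes)) / len(outcomes)


rng = random.Random(0)  # fixed seed so the run is repeatable
cases = simulate_cases(2000, rng)
rng.shuffle(cases)
train, test = cases[:1000], cases[1000:]

rates = fit_rates(train)
overall = sum(event for _, event in train) / len(train)  # naive benchmark

test_outcomes = [event for _, event in test]
model_score = mean_squared_error([rates[risky] for risky, _ in test], test_outcomes)
naive_score = mean_squared_error([overall] * len(test), test_outcomes)

print("out-of-sample error, model:", round(model_score, 4))
print("out-of-sample error, overall base rate:", round(naive_score, 4))
```

Comparing the model against a naive benchmark on held-out cases is the feedback loop in miniature: if the model can’t beat the overall base rate on data it hasn’t seen, there’s little reason to trust it on the next real question.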

I couldn’t help but wonder how much Silver’s philosophy was shaped by the social part of his experiences in baseball and elections forecasting. In both of those domains, there’s a running culture clash, or at least the perception of one, between statistical modelers and judgment-based forecasters: nerds and jocks in baseball, quants and pundits in politics. When you work in a field like that, you can get a lot of positive social feedback by saying “Everybody’s right!” I’ve sat in many meetings where someone proposed combining statistical forecasts and expert judgment without specifying how that process would work or insisting that we actually check how the combination is affecting forecast accuracy. Almost every time, though, that proposal is met with a murmur of assent: “Of course! Experts are experts! More is more!” I get the sense that this advice will almost always be popular, but I’m not convinced that it’s always sound.

Silver is right, of course, when he argues that we can never escape subjectivity. Modelers still have to choose the data and models they use, both of which bake a host of judgments right into the pie. What we can do with models, though, is discipline our use of those data, and in so doing, more clearly compare sets of assumptions to see which are more useful. Most political forecasters don’t currently inhabit a world where they can get to know the quirks of the statistical models and adjust for them. Most don’t have statistical models or hold them at arm’s length if they do, and they don’t get to watch them perform anywhere near enough to spot and diagnose the biases. When these conditions aren’t met, we need to be very cautious about taking forecasts from a well-designed model and tweaking them because they don’t feel right.

Forecasting Round-Up

I don’t usually post lists of links, but the flurry of great material on forecasting that hit my screen over the past few days is inspiring me to make an exception. Here in no particular order are several recent pieces that deserve wide reading:

  • “The Weatherman Is Not a Moron.” Excerpted from a forthcoming book by the New York Times’ Nate Silver, this piece deftly uses meteorology to illustrate the difficulties of forecasting in complex systems and some of the ways working forecasters deal with them. For a fantastic intellectual history on the development of the ensemble forecasting approach Silver discusses, see this July 2005 journal article by John Lewis in the American Meteorological Society’s Monthly Weather Review.
  • “Trending Upward.” Michael Horowitz and Phil Tetlock write for Foreign Policy about how the U.S. “intelligence community” can improve its long-term forecasting. The authors focus on the National Intelligence Council’s Global Trends series, which attempts the Herculean (or maybe Sisyphean) feat of trying to peer 15 years into the future, but the recommendations they offer apply to most forecasting exercises that rely on expert judgment. And, on the Duck of Minerva blog, John Western pushes back: “I think there is utility in long-range forecasting exercises, I’m just not sure I see any real benefits from improved accuracy on the margins. There may actually be some downsides.” [Disclosure: Since this summer, I have been a member of Tetlock and Horowitz's team in the IARPA-funded forecasting competition they mention in the article.]
  • “Theories, Models, and the Future of Science.” This post by Ashutosh Jogalekar on Scientific American’s Curious Waveform blog argues that “modeling and simulation are starting to be considered as a respectable ‘third leg’ of science, in addition to theory and experiment.” Why? Because “many of science’s greatest current challenges may not be amenable to rigorous theorizing, and we may have to treat models of phenomena as independent, authoritative explanatory entities in their own right.” Like Trey Causey, who pointed me toward this piece on Twitter, I think the post draws a sharper distinction between modeling for simulation and explanation than it needs to, but it’s a usefully provocative read.
  • “The Probabilities of Large Terrorist Events.” I recently finished Nassim Nicholas Taleb’s The Black Swan and was looking around for worked examples applying that book’s idea of “fractal randomness” to topics I study. Voilà! On Friday, Wired’s Social Dimensions blog spotlighted a recent paper by Aaron Clauset and Ryan Woodward that uses a mix of statistical techniques, including power-law models, to estimate the risk of this particular low-probability, high-impact political event. Their approach (model only the tail of the distribution and use an ensemble approach, as the aforementioned meteorologists do) seems really clever to me, and I like how they are transparent about the uncertainty of the resulting estimates.

How Makers of Foreign Policy Use Statistical Forecasts: They Don’t, Really

The current issue of Foreign Policy magazine includes a short piece I wrote on how statistical models can be useful for forecasting coups d’etat. With the March coup in Mali as a hook, the piece aims to show that number-crunching can sometimes do a good job assessing risks of rare events that might otherwise present themselves as strategic surprises.

In fact, statistical forecasting of international politics is a relatively young field, and decision-makers in government and the private sector have traditionally relied on subject-matter experts to prognosticate on events of interest. Unfortunately, expert judgment does not work nearly as well as a forecasting tool as we might hope or expect.

In a comprehensive study of expert political judgment, Philip Tetlock finds that forecasts made by human experts on a wide variety of political phenomena are barely better than random guesses, and they are routinely bested by statistical algorithms that simply extrapolate from recent trends. Some groups of experts perform better than others—the experts’ cognitive style is especially relevant, and feedback and knowledge of base rates can help, too—but even the best-performing sets of experts fail to match the accuracy of those simple statistical algorithms.

The finding that models outperform subjective judgments at forecasting has been confirmed repeatedly by other researchers, including one prominent 2004 study which showed that a simple statistical model could predict the outcomes of U.S. Supreme Court cases much more accurately than a large assemblage of legal experts.

Because statistical forecasts are potentially so useful, you would think that policy makers and the analysts who inform them would routinely use them. That, however, would be a bad bet. I spoke with several former U.S. policy and intelligence officials, and all of them agreed that policymakers make little use of these tools and the “watch lists” they are often used to produce. A few of those former officials noted some variation in the application of these techniques across segments of the government—military leaders seem to be more receptive to statistical forecasting than civilian ones—but, broadly speaking, sentiment runs strongly against applied modeling.

If the evidence in favor of statistical risk assessment is so strong, why is it such a tough sell?

Part of the answer surely lies in a general tendency humans have to discount or ignore evidence that doesn’t match our current beliefs. Psychologists call this tendency confirmation bias, and it affects how we respond when models produce forecasts that contradict our expectations about the future. In theory, this is when models are most useful; in practice, it may also be when they’re hardest to sell.

Jeremy Weinstein, a professor of political science at Stanford University, served as Director for Development and Democracy on the National Security Council staff at the White House from 2009 until 2011. When I asked him why statistical forecasts don’t get used more in foreign-policy decision-making, he replied, “I only recall seeing the use of quantitative assessments in one context. And in that case, I think they were accepted by folks because they generated predictions consistent with people’s priors. I’m skeptical that they would have been valued the same if they had generated surprising predictions. For example, if a quantitative model suggests instability in a country that no one is invested in following or one everyone believes is stable, I think the likely instinct of policymakers is to question the value of the model.”

The pattern of confirmation bias extends to the bigger picture on the relative efficacy of models and experts. When asked about why policymakers don’t pay more attention to quantitative risk assessments, Anne-Marie Slaughter, former director of Policy Planning at State, responded: “You may believe that [statistical forecasts] have a better track record than expert judgment, but that is not a widely shared view. Changing minds has to come first, then changing resources.”

Where Weinstein and Slaughter note doubts about the value of the forecasts, others see deeper obstacles in the organizational culture of the intelligence community. Ken Knight, now Analytic Director at Centra Technology, spent the better part of a 30-year career in government working on risk assessment, including several years in the 2000s as National Intelligence Officer for Warning. According to Knight, “Part of it is the analytic community that I grew up in. There was very little in the way of quantitative analytic techniques that was taught to me as an analyst in the courses I took. There is this bias that says this stuff is too complex to model…People are just really skeptical that this is going to tell them something they don’t already know.”

This organizational bias may simply reflect some deep grooves in human cognition. Psychological research shows that our minds routinely ignore statistical facts about groups or populations while gobbling up or even cranking out causal stories that purport to explain those facts. These different responses appear to be built-in features of the automatic and unconscious thinking that dominates our cognition. Because of them, our minds “can deal with stories in which the elements are causally linked,” Daniel Kahneman writes, but they are “weak in statistical reasoning.”

Of course, cognitive bias and organizational culture aren’t the only reasons statistical risk assessments don’t always get traction in the intelligence-production process. Stephen Krasner, a predecessor of Slaughter’s as director of Policy Planning at State, noted in an email exchange that there’s often a mismatch between the things these models can warn about and the kinds of questions policymakers are often trying to answer. Krasner’s point was echoed in a recent column by CNAS senior fellow Andrew Exum, who notes that “intelligence organizations are normally asked to answer questions regarding both capability and intent.” To that very short list, I would add “probability,” but the important point here is that estimating the likelihood of events of concern is just one part of what these organizations are asked to do, and often not the most prominent one.

Clearly, there are a host of reasons why policy-makers might not see statistical forecasts as a valuable resource. Some are rooted in cognitive bias and organizational culture, while others are related to the nature of the problems they’re trying to solve.

That said, I suspect that modelers also share some of the blame for the chilly reception their forecasts receive. When modelers are building their forecasting tools, I suspect they often imagine their watch lists landing directly on the desks of policymakers with global concerns who are looking to take preventive action or to nudge along events they’d like to see happen. “Tell me the 10 countries where civil war is most likely,” we might imagine the president saying, “so I know where to send my diplomats and position my ships now.”

In reality, the policy process is much more reactive, and by the time something has landed on the desks of the most senior decision-makers, the opportunity for useful strategic warning is often gone. What’s more, in the rare instances where quantitative forecasts do land on policy-makers’ desks, analysts may not be thrilled to see those watch lists cutting to the front of the line and competing directly with them for the scarce attention of their “customers.”

In this environment, modelers could try to make their forecasts more valuable by designing them for, and targeting them at, people earlier in the analytical process—that is, lower in the bureaucracy. Quantitative risk assessments should be more useful to the analysts, desk officers, and deputies who may be able to raise warning flags earlier and who will be called upon when their country of interest pops into the news. Statistical forecasts of relevant events can shape those specialists’ thinking about what the major risks are in their areas of concern, hopefully spurring them to revisit their assumptions in cases where the forecast diverges significantly from their own expectations. Statistical forecasts can also give those specialists some indication on how various risks might increase or decrease as other conditions change. In this model, the point isn’t to replace or overrule the analyst’s judgment, but rather to shape and inform it.

Even without strategic redirection among modelers, though, it’s possible that broader cultural trends will at least erode resistance to statistical risk assessment among senior decision-makers and the analysts who support them. Advances in computing and communications technology are spurring the rise of Big Data and even talk of a new “age of the algorithm.” The discourse often gets a bit heady, but there’s no question that statistical thinking is making new inroads into many fields. In medicine, for example—another area where subjective judgment is prized and decisions can have life-or-death consequences—improvements in data and analysis are combining with easier access to the results to encourage practitioners to lean more heavily on statistical risk assessments in their decisions about diagnosis and treatment. If the hidebound world of medicine can find new value in statistical modeling, who knows, maybe foreign policy won’t be too far behind.
