To Realize the QDDR’s Early-Warning Goal, Invest in Data-Making

The U.S. Department of State dropped its second Quadrennial Diplomacy and Development Review, or QDDR, last week (here). Modeled on the Defense Department’s Quadrennial Defense Review, the QDDR lays out the department’s big-picture concerns and objectives so that—in theory—they can guide planning and shape day-to-day decision-making.

The new QDDR establishes four main goals, one of which is to “strengthen our ability to prevent and respond to internal conflict, atrocities, and fragility.” To help do that, the State Department plans to “increase [its] use of early warning analysis to drive early action on fragility and conflict.” Specifically, State says it will:

  1. Improve our use of tools for analyzing, tracking, and forecasting fragility and conflict, leveraging improvements in analytical capabilities;
  2. Provide more timely and accurate assessments to chiefs of mission and senior decision-makers;
  3. Increase use of early warning data and conflict and fragility assessments in our strategic planning and programming;
  4. Ensure that significant early warning shifts trigger senior-level review of the mission’s strategy and, if necessary, adjustments; and
  5. Train and deploy conflict-specific diplomatic expertise to support countries at risk of conflict or atrocities, including conflict negotiation and mediation expertise for use at posts.

Unsurprisingly, that plan sounds great to me. We can’t now, and never will be able to, predict precisely where and when violent conflict and atrocities will occur, but we can assess risks with enough accuracy and lead time to enable better strategic planning and programming. These forecasts don’t have to be perfect to be earlier, clearer, and more reliable than the traditional practices of deferring to individual country or regional analysts or just reacting to the news.

Of course, quite a bit of well-designed conflict forecasting is already happening, much of it paid for by the U.S. government. To name a few of the relevant efforts: The Political Instability Task Force (PITF) and the Worldwide Integrated Crisis Early Warning System (W-ICEWS) routinely update forecasts of various forms of political crisis for U.S. government customers. IARPA’s Open Source Indicators (OSI) and Aggregative Contingent Estimation (ACE) programs are simultaneously producing forecasts now and discovering ways to make future forecasts even better. Meanwhile, outside the U.S. government, the European Union has recently developed its own Global Conflict Risk Index (GCRI), and the Early Warning Project now assesses risks of mass atrocities in countries worldwide.

That so much thoughtful risk assessment is being done now doesn’t mean it’s a bad idea to start new projects. If there are any iron laws of forecasting hard-to-predict processes like political violence, one of them is that combinations of forecasts from numerous sources should be more accurate than forecasts from a single model or person or framework. Some of the existing projects already do this kind of combining themselves, but combinations of combinations will often be even better.
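
To make the case for combining a bit more concrete, here is a minimal sketch, in Python, of averaging risk estimates from several sources, first with equal weights and then with weights that might reflect each source’s track record. The source names, probabilities, and weights are all invented for illustration; they are not drawn from any of the projects named above.

```python
# A minimal sketch of combining risk forecasts from several sources.
# The source names and probabilities below are invented for illustration.

forecasts = {
    "statistical_model": 0.12,   # hypothetical model-based probability of onset
    "expert_survey": 0.20,       # aggregated expert judgment
    "prediction_market": 0.15,   # market-implied probability
}

# Simple unweighted average: the workhorse of forecast combination.
simple_average = sum(forecasts.values()) / len(forecasts)

# Weighted average: lean harder on sources with better track records.
# These weights are also invented; in practice they might be derived from
# each source's historical accuracy.
weights = {"statistical_model": 0.5, "expert_survey": 0.2, "prediction_market": 0.3}
weighted_average = sum(weights[name] * p for name, p in forecasts.items())

print(f"Simple average:   {simple_average:.3f}")
print(f"Weighted average: {weighted_average:.3f}")
```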

Still, if I had to channel the intention expressed in this part of the QDDR into a single activity, it would not be the construction of new models, at least not initially. Instead, it would be data-making. Social science is not Newtonian physics, but it’s not astrology, either. Smart people have been studying politics for a long time, and collectively they have developed a fair number of useful ideas about what causes or precedes violent conflict. But, if you can’t track the things those theorists tell you to track, then your forecasts are going to suffer. To improve significantly on the predictive models of political violence we have now, I think we need better inputs most of all.

When I say “better” inputs, I have a few things in mind. In some cases, we need to build data sets from scratch. When I was updating my coup forecasts earlier this year, a number of people wondered why I didn’t include measures of civil-military relations, which are obviously relevant to this particular risk. The answer was simple: because global data on that topic don’t exist. If we aren’t measuring it, we can’t use it in our forecasts, and the list of relevant features that falls into this set is surprisingly long.

In other cases, we need to revive data sets that have gone dormant. Social scientists often build “boutique” data sets for specific research projects, run the tests they want to run on them, and then move on to the next project. Sometimes, the tests they or others run suggest that some features captured in those data sets would make useful predictors. Those discoveries are great in principle, but if the data sets aren’t being updated, then applied forecasters can’t use that knowledge. To get better forecasts, we need to invest in picking up where those boutique data sets left off so we can incorporate their insights into our applications.

Finally, and in almost all cases, we need to observe things more frequently. Most of the data available now to conflict forecasters are updated only once a year, often on a several-month delay and sometimes as much as two years later (e.g., data describing 2014 become available in 2016). That schedule is fine for basic research, but it is crummy for applied forecasting. If we want to give assessments and warnings that are as current as possible to those “chiefs of mission and senior decision-makers” mentioned in the QDDR, then we need to build models with data that are updated as frequently as possible. Daily or weekly updates would be ideal, but monthly updates would suffice in many cases and would mark a huge improvement over the status quo.

As I said at the start, we’re never going to get models that reliably tell us far in advance exactly where and when violent conflicts and mass atrocities will erupt. I am confident, however, that we can assess these risks even more accurately than we do now, but only if we start making more and better versions of the data our theories tell us we need.

I’ll end with a final plea to any public servants who might be reading this: if you do invest in developing better inputs, please make the results freely available to the public. When you share your data, you give the crowd a chance to help you spot and fix your mistakes, to experiment with various techniques, and to think about what else you might consider, all at no additional cost to you. What’s not to like about that?

Forecasting Round-Up No. 3

1. Mike Ward and six colleagues recently posted a new working paper on “the next generation of crisis prediction.” The paper echoes themes that Mike and Nils Metternich sounded in a recent Foreign Policy piece responding to one I wrote a few days earlier, about the challenges of forecasting rare political events around the world. Here’s a snippet from the paper’s intro:

We argue that conflict research in political science can be improved by more, not less, attention to predictions. The increasing availability of disaggregated data and advanced estimation techniques are making forecasts of conflict more accurate and precise. In addition, we argue that forecasting helps to prevent overfitting, and can be used both to validate models, and inform policy makers.

I agree with everything the authors say about the scientific value and policy relevance of forecasting, and I think the modeling they’re doing on civil wars is really good. There were two things I especially appreciated about the new paper.

First, their modeling is really ambitious. In contrast to most recent statistical work on civil wars, they don’t limit their analysis to conflict onset, termination, or duration, and they don’t use country-years as their unit of observation. Instead, they look at country-months, and they try to tackle the more intuitive but also more difficult problem of predicting where civil wars will be occurring, whether or not one is already ongoing.

This version of the problem is harder because the factors that affect the risk of conflict onset might not be the same ones that affect the risk of conflict continuation. Even when they are, those factors might not affect the two risks in the same direction or to the same degree. As a result, it’s hard to specify a single model that can reliably anticipate continuity in, and changes from, both forms of the status quo (conflict or no conflict).

The difficulty of this problem is evident in the out-of-sample accuracy of the model these authors have developed. The performance statistics are excellent on the whole, but that’s mostly because the model is accurately forecasting that whatever is happening in one month will continue to happen in the next. Not surprisingly, the model’s ability to anticipate transitions is apparently weaker. Of the five civil-war onsets that occurred in the test set, only two “arguably…rise to probability levels that are heuristic,” as the authors put it.

I emailed Mike to ask about this issue, and he said they were working on it:

Although the paper doesn’t go into it, in a separate part of this effort we actually do have separate models for onset and continuation, and they do reasonably well.  We are at work on terminations, and developing a new methodology that predicts onsets, duration, and continuation in a single (complicated!) model.  But that is down the line a bit.

Second and even more exciting to me, the authors close the paper with real, honest-to-goodness forecasts. Using the most recent data available when the paper was written, the authors generate predicted probabilities of civil war for the next six months: October 2012 through March 2013. That’s the first time I’ve seen that done in an academic paper about something other than an election, and I hope it sets a precedent that others will follow.

2. Over at Red (team) Analysis, Helene Lavoix appropriately pats The Economist on the back for publicly evaluating the accuracy of the predictions they made in their “World in 2012” issue. You can read the Economist‘s own rack-up here, but I want to highlight one of the points Helene raised in her discussion of it. Toward the end of her post, in a section called “Black swans or biases?”, she quotes this bit from the Economist:

As ever, we failed at big events that came out of the blue. We did not foresee the LIBOR scandal, for example, or the Bo Xilai affair in China or Hurricane Sandy.

As Helene argues, though, it’s not self-evident that these events were really so surprising—in their specifics, yes, but not in the more general sense of the possibility of events like these occurring sometime this year. On Sandy, for example, she notes that

Any attention paid to climate change, to the statistics and documents produced by Munich-re…or Allianz, for example, to say nothing about the host of related scientific studies, show that extreme weather events have become a reality and we are to expect more of them and more often, including in the so-called rich countries.

This discussion underscores the importance of being clear about what kind of forecasting we’re trying to do, and why. Sometimes the specifics will matter a great deal. In other cases, though, we may have reason to be more concerned with risks of a more general kind, and we may need to broaden our lens accordingly. Or, as Helene writes,

The methodological problem we are facing here is as follows: Are we trying to predict discrete events (hard but not impossible, however with some constraints and limitations according to cases) or are we trying to foresee dynamics, possibilities? The answer to this question will depend upon the type of actions that should follow from the anticipation, as predictions or foresight are not done in a vacuum but to allow for the best handling of change.

3. Last but by no means least, Edge.org has just posted an interview with psychologist Phil Tetlock about his groundbreaking and ongoing research on how people forecast, how accurate (or not) their forecasts are, and whether or not we can learn to do this task better. [Disclosure: I am one of hundreds of subjects in Phil’s contribution to the IARPA tournament, the Good Judgment Project.] On the subject of learning, the conventional wisdom is pessimistic, so I was very interested to read this bit (emphasis added):

Is world politics like a poker game? This is what, in a sense, we are exploring in the IARPA forecasting tournament. You can make a good case that history is different and it poses unique challenges. This is an empirical question of whether people can learn to become better at these types of tasks. We now have a significant amount of evidence on this, and the evidence is that people can learn to become better [forecasters]. It’s a slow process. It requires a lot of hard work, but some of our forecasters have really risen to the challenge in a remarkable way and are generating forecasts that are far more accurate than I would have ever supposed possible from past research in this area.

And bonus alert: the interview is introduced by Daniel Kahneman, Nobel laureate and author of one of my favorite books from the past few years, Thinking, Fast and Slow.

N.B. In case you’re wondering, you can find Forecasting Round-Up Nos. 1 and 2 here and here.

A Chimp’s-Eye View of a Forecasting Experiment

For the past several months, I’ve been participating in a forecasting tournament as one of hundreds of “experts” making and updating predictions about dozens of topics in international political and economic affairs. This tournament is funded by IARPA, a research arm of the U.S. government’s Office of the Director of National Intelligence, and it’s really a grand experiment designed to find better ways to elicit, combine, and present probabilistic forecasts from groups of knowledgeable people.

There are several teams participating in this tournament; I happen to be part of the Good Judgment team that is headed by psychologist Phil Tetlock. Good Judgment “won” the first year of the competition, but that win came before I started participating, so alas, I can’t claim even a tiny sliver of the credit for that.

I’ve been prognosticating as part of my job for more than a decade, but almost all of the forecasting I’ve done in the past was based on statistical models designed to assess risks of a specific rare event (say, a coup attempt, or the onset of a civil war) across all countries worldwide. The Good Judgment Project is my first experience with routinely making calls about the likelihood of many different events based almost entirely on my subjective beliefs. Now that I’m a few months into this exercise, I thought I’d write something about how I’ve approached the task, because I think my experiences speak to generic difficulties in forecasting rare political and economic events.

By way of background, here’s how the forecasting process works for the Good Judgment team: I start by logging into a web site that lists a bunch of questions on an odd mix of topics, everything from the Euro-to-dollar exchange rate to the outcome of the recent election in Venezuela. I click on a question and am expected to assign a numeric probability to each of the two or more possible outcomes listed (e.g., “yes or no,” or “A, B, or C”). Those outcomes are always exhaustive, so the probabilities I assign must always sum to 1. Whenever I feel like it, I can log back in and update any of the forecasts I’ve already made. Then, when the event of interest happens (e.g., results of the Venezuelan election are announced), the question is closed, and the accuracy of my forecast for that question and for all questions closed to date is summarized with a statistic called the Brier score. I can usually see my score pretty soon after the question closes, and I can also see average scores on each question for my whole team and the cumulative scores for the top 10 percent or so of the team’s leader board.
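
For readers who haven’t met it, here is a minimal sketch of how a Brier score can be computed for a question like the ones described above. The question, forecast, and outcome are hypothetical, and this is the standard multi-category form of the score; the tournament’s exact scoring rules may differ in their details.

```python
def brier_score(forecast, outcome):
    """Multi-category Brier score: the sum of squared differences between
    the forecast probabilities and what actually happened (1 for the
    realized outcome, 0 for the others). Lower is better; 0 is perfect."""
    return sum((forecast[k] - outcome[k]) ** 2 for k in forecast)

# Hypothetical question with three possible outcomes, A, B, and C.
my_forecast = {"A": 0.60, "B": 0.30, "C": 0.10}
what_happened = {"A": 1, "B": 0, "C": 0}  # outcome A occurred

print(brier_score(my_forecast, what_happened))  # (0.6-1)^2 + 0.3^2 + 0.1^2 = 0.26
```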

So, how do you perform this task? In an ideal world, my forecasting process would go as follows:

1. Read the question carefully. What does “attack,” “take control of,” or “lose office” mean here, and how will it be determined? To make the best possible forecast, I need to understand exactly what it is I’m being asked to predict.

2. Forecast. In Unicorn World, I would have an ensemble of well-designed statistical models to throw at each and every question. In the real world, I’m lucky if there’s a single statistical model that applies even indirectly. Absent a sound statistical forecast, I would try to identify the class of events to which the question belongs and then determine a base rate for that class.  The “base rate” is just the historical frequency of the event’s occurrence in comparable cases—say, how often civil wars end in their first year, or how often incumbents lose elections in Africa. Where feasible, I would also check prediction markets, look for relevant polling data, seek out published predictions, and ask knowledgeable people.

In all cases, the idea is to find empirical evidence to which I can anchor my forecast, and to get a sense of how much uncertainty there is about this particular case. When there’s a reliable forecast from a statistical model or even good base-rate information, I would weight that most heavily and would only adjust far away from that prediction in cases where subject-matter experts make a compelling case about why this instance will be different. (As for what makes a case compelling, well…) If I can’t find any useful information or the information I find is all over the map, I would admit that I have no clue and do the probabilistic equivalent of a coin flip or die roll, splitting the predicted probability evenly across all of the possible outcomes.

3. Update. As Nate Silver argues in The Signal and the Noise, forecasters should adjust their predictions whenever relevant new information arrives. There’s no shame in reconsidering your views as new information becomes available; that’s called learning. Ideally, I would be disciplined about how I update my forecasts, using Bayes’ rule to check the human habit of leaning too hard on the freshest tidbit and to take full advantage of all the information I had before. (A rough sketch of this kind of updating appears after this list.)

4. Learn. Over time, I would see areas where I was doing well or poorly and would use those outcomes to deepen confidence in, or to try to improve, my mental models. For the questions I get wrong, what did I overlook? What were the forecasters who did better than I thinking about that I wasn’t? For the questions I answer well, was I right for the right reasons, or was it possibly just dumb luck? The more this process gets repeated, the better I should be able to do, within the limits determined by the basic predictability of the phenomenon in question.
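
To make steps 2 and 3 a bit more concrete, here is a rough sketch, in Python, of anchoring a forecast on a base rate and then updating it with Bayes’ rule as new evidence arrives. The base rate and the likelihoods attached to the evidence are invented for illustration; they are not estimates for any real question.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return the posterior probability of an event after seeing one piece
    of evidence, via Bayes' rule."""
    numerator = p_evidence_if_true * prior
    denominator = numerator + p_evidence_if_false * (1 - prior)
    return numerator / denominator

# Step 2: anchor on a base rate, e.g., a (made-up) 10-percent historical
# frequency of the event in comparable cases.
prior = 0.10

# Step 3: update when relevant news arrives. Suppose the news would show up
# 60 percent of the time if the event were coming and 20 percent of the time
# if it were not (both figures invented for illustration).
posterior = bayes_update(prior, p_evidence_if_true=0.6, p_evidence_if_false=0.2)
print(f"prior {prior:.2f} -> posterior {posterior:.2f}")  # 0.10 -> 0.25
```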

To my mind, that is how I should be making forecasts. Now here are a few things I’ve noticed so far about what I’m actually doing.

1. I’m lazy. Most of the time, I see the question, make an assumption about what it means without checking the resolution criteria, and immediately formulate a gut response on a simple four-point scale: very likely, likely, unlikely, or very unlikely. I’d like to say that “I have no clue” is also on the menu of immediate responses, but my brain almost always makes an initial guess, even if it’s a topic about which I know nothing—for example, the resolution of a particular dispute before the World Trade Organization. What’s worse, it’s hard to dislodge that gut response once I’ve had it, even when I think I have better anchoring information.

2. I’m motivated by the competition, but that motivation doesn’t necessarily make me more attentive. As I said earlier, in this tournament, we can see our own scores and the scores of all the top performers as we go. With such a clear yardstick, it’s hard not to get pulled into seeing other forecasters as competitors and trying to beat them. You’d think that urge would motivate me to be more attentive to the task and to follow the idealized process I described above more rigorously. Most of the time, though, it just means that I calibrate my answers to the oddities of the yardstick—the Brier score involves squaring your error term, so the cost of being wrong isn’t distributed evenly across the range of possible values (a small illustration of this appears after this list)—and that I check the updated scores soon after questions close.

3. It’s hard to distinguish likelihood from timing. Some of the questions can get pretty specific about the timing of the event of interest. For example, we might be asked something like: Will Syrian president Bashar al-Assad fall before the end of October 2012; before the end of December 2012; before April 2013; or some time thereafter? I find these questions excruciatingly hard to answer, and it took me a little while to figure out why.

After thinking through how I had approached a few examples, I realized that my brain was conflating probability with proximity. In other words, the more likely I thought the event was, the sooner I expected it to occur.  That makes sense for some situations, but it doesn’t always, and a careful consideration of timing will usually require lots of additional information. For example, I might look at structural conditions in Syria and conclude that Assad can’t win the civil war he’s started, but how long it’s going to take for him to lose will depend on a host of other things with complex dynamics, like the price of oil, the flow of weapons, and the logistics of military campaigns. Interestingly, even though I’m now aware of this habit, I’m still finding it hard to break.

4. I’m eager to learn from feedback, but feedback is hard to come by. This project isn’t like weather forecasting or equities trading, where you make a prediction, see how you did, tweak your model, and try again, over and over. Most of the time, the questions are pretty idiosyncratic, so you’ll have just one or a few chances to make a certain prediction. What’s more, the answers are usually categorical, so even when you do get more than one shot, it’s hard to tell how wrong or right you were. In this kind of environment, it’s really hard to build on your experience. In the several months I’ve been participating, I think I’ve learned far more about the peculiarities of the Brier score than I have about the generative processes underlying the phenomena I’ve been asked to forecast.
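
To illustrate the point about the yardstick in item 2 above: because the Brier score squares the error, the same wrong call costs far more when it is made with high confidence. A tiny, purely illustrative calculation for a two-outcome question:

```python
# Purely illustrative: for a yes/no question, if you assign probability p to
# an event that does NOT happen, the two-category Brier score is
# (p - 0)^2 + ((1 - p) - 1)^2 = 2 * p**2, so the penalty grows with the
# square of your confidence.
for p in (0.6, 0.7, 0.8, 0.9, 0.99):
    penalty = 2 * p ** 2
    print(f"forecast {p:.2f}, event does not happen -> Brier score {penalty:.2f}")
```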

And that, for whatever it’s worth, is one dart-throwing chimp’s initial impressions of this particular experiment. As it happens, I’ve been doing pretty well so far—I’ve been at or near the top of the team’s accuracy rankings since scores for the current phase started to come in—but I still feel basically clueless most of the time.  It’s like a golf tournament where good luck tosses some journeyman to the top of the leader board after the first or second round, and I keep waiting for the inevitable regression toward the mean to kick in. I’d like to think I can pull off a Tin Cup, but I know enough about statistics to expect that I almost certainly won’t.

House Votes to Defund Political Science Program: The Irony, It Burns

From the Monkey Cage this morning:

The Flake amendment Henry wrote about appears to have passed the House last night with a 218-208 vote. The amendment prohibits funding for NSF’s political science program, which among others funds many valuable data collection efforts including the National Election Studies. No other program was singled out like this…This is obviously not the last word on this. The provision may be scrapped in the conference committee (Sara?). But it is clear that political science research is in real danger of a very serious setback.

There’s real irony here in a Republican-controlled House of Representatives voting to defund a political-science program at a time when the Department of Defense and the “intelligence community” are apparently increasing their spending on similar work. With things like the Minerva Initiative, the Political Instability Task Force (on which I worked for 10 years), ICEWS, and IARPA’s Open Source Indicators program, the parts of the government concerned with protecting national security seem to find growing value in social-science research and are spending accordingly. Meanwhile, the party that claims to be the stalwart defender of national security pulls in the opposite direction, like the opposing head on Dr. Dolittle’s pushmi-pullyu. Nice work, fellas.
