Will Unarmed Civilians Soon Get Massacred in Ukraine?

According to one pool of forecasters, most probably not.

As part of a public atrocities early-warning system I am currently helping to build for the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide (see here), we are running a kind of always-on forecasting survey called an opinion pool. An opinion pool is similar in spirit to a prediction market, but instead of having participants trade shares tied to the occurrence of some future event, we simply ask them to estimate the probability of each event’s occurrence. In contrast to a traditional survey, every question remains open until the event occurs or the forecasting window closes. This way, participants can update their forecasts as often as they like, as they see or hear relevant information or just change their minds.

With generous support from Inkling, we started up our opinion pool in October, aiming to test and refine it before our larger early-warning system makes its public debut this spring (we hope). So far, we have only recruited opportunistically among colleagues and professional acquaintances, but we already have more than 70 registered participants. In the first four months of operation, we have used the system to ask more than two dozen questions, two of which have since closed because the relevant events occurred (mass killing in CAR and the Geneva II talks on Syria).

Over the next few years, we aim to recruit a large and diverse pool of volunteer forecasters from around the world with some claim to topical expertise or relevant local knowledge. The larger and more diverse our pool, the more accurate we expect our forecasts to be, and the wider the array of questions we can ask. (If you are interested in participating, please drop me a line at ulfelder <at> gmail <dot> com.)

A few days ago, prompted by a couple of our more active members, I posted a question to our pool asking, “Before 1 March 2014, will any massacres occur in Ukraine?” As of this morning, our pool had made a total of 13 forecasts, and the unweighted average of the latest estimate from each participating forecaster was just 15 percent. Under the criteria we specified (see Background Information below), this forecast does not address the risk of large-scale violence against or among armed civilians, nor does it exclude the possibility of a series of small but violent encounters that cumulatively produce a comparable or larger death toll. Still, for those of us concerned that security forces or militias will soon kill nonviolent protesters in Ukraine on a large scale, our initial forecast implies that those fears are probably unwarranted.

[Chart: Crowd-Estimated Probability of Any Massacres in Ukraine Before 1 March 2014]
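
For concreteness, here is a minimal sketch of the aggregation rule behind that 15 percent figure: keep only each forecaster’s most recent estimate, then average those estimates with equal weight. The sample data and field names below are hypothetical; the actual pool runs on Inkling’s platform.

```python
from datetime import datetime

# Hypothetical forecast log: (forecaster, timestamp, estimated probability).
forecasts = [
    ("forecaster_a", datetime(2014, 1, 28), 0.30),
    ("forecaster_a", datetime(2014, 2, 2), 0.20),  # later update supersedes the earlier one
    ("forecaster_b", datetime(2014, 1, 30), 0.10),
    ("forecaster_c", datetime(2014, 2, 1), 0.15),
]

def unweighted_average_of_latest(forecasts):
    """Average the most recent estimate from each forecaster, all weighted equally."""
    latest = {}
    for name, when, prob in forecasts:
        if name not in latest or when > latest[name][0]:
            latest[name] = (when, prob)
    probs = [prob for _, prob in latest.values()]
    return sum(probs) / len(probs)

print(round(unweighted_average_of_latest(forecasts), 2))  # 0.15
```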

Obviously, we don’t have a crystal ball, and this is just an aggregation of subjective estimates from a small pool of people, none of whom (I think) is on the scene in Ukraine or has inside knowledge of the decision-making of relevant groups. Still, a growing body of evidence shows that aggregations of subjective forecasts like this one can often be usefully accurate (see here), even with a small number of contributing forecasters (see here). On this particular question, I very much hope our crowd is right. Whatever happens in Ukraine over the next few weeks, though, principle and evidence suggest that the method is sound, and we soon expect to be using this system to help assess risks of mass atrocities all over the world in real time.

Background Information

We define a “massacre” as an event that has the following features:

  • At least 10 noncombatant civilians are killed in one location (e.g., neighborhood, town, or village) in less than 48 hours. A noncombatant civilian is any person who is not a current member of a formal or irregular military organization and who does not apparently pose an immediate threat to the life, physical safety, or property of other people.
  • The victims appear to have been the primary target of the violence that killed them.
  • The victims do not appear to have been engaged in violent action or criminal activity when they were killed, unless that violent action was apparently in self-defense.
  • The relevant killings were carried out by individuals affiliated with a social group or organization engaged in a wider political conflict and appear to be connected to each other and to that wider conflict.

Those features will not always be self-evident or uncontroversial, so we use the following series of ad hoc rules to make more consistent judgments about ambiguous events.

  • Police, soldiers, prison guards, and other agents of state security are never considered noncombatant civilians, even if they are killed while off duty or out of uniform.
  • State officials and bureaucrats are not considered civilians when they are apparently targeted because of their professional status (e.g., assassinated).
  • Civilian deaths that occur in the context of operations by uniformed military-service members against enemy combatants are considered collateral damage, not atrocities, and should be excluded unless there is strong evidence that the civilians were targeted deliberately. We will err on the side of assuming that they were not.
  • Deaths from state repression of civilians engaged in nonviolent forms of protest are considered atrocities. Deaths resulting from state repression targeting civilians who were clearly engaged in rioting, looting, attacks on property, or other forms of collective aggression or violence are not.
  • Non-state militant or paramilitary groups, such as militias, gangs, vigilante groups, or raiding parties, are considered combatants, not civilians.

We will use contextual knowledge to determine whether or not a discrete event is linked to a wider conflict or campaign of violence, and we will err on the side of assuming that it is.

Determinations of whether or not a massacre has occurred will be made by the administrator of this system using publicly available secondary sources. Relevant evidence will be summarized in a blog post published when the determination is announced, and any dissenting views will be discussed as well.
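
Purely as an illustration of how those thresholds might be encoded for consistent record-keeping, here is a hypothetical sketch; in practice, as noted above, the determination is a human judgment made from secondary sources, not an automated check, and all of the field names below are invented.

```python
from dataclasses import dataclass

@dataclass
class CandidateEvent:
    """Hypothetical coding of a reported incident, drawn from secondary sources."""
    noncombatant_deaths: int        # noncombatant civilians killed, per the definition above
    hours_elapsed: float            # duration of the incident
    single_location: bool           # one neighborhood, town, or village
    victims_primary_target: bool    # victims appear to be the primary target of the violence
    victims_nonviolent: bool        # not engaged in violence or crime (self-defense excepted)
    linked_to_wider_conflict: bool  # perpetrators tied to a group in a wider political conflict

def meets_massacre_criteria(e: CandidateEvent) -> bool:
    """Check a coded incident against the four features listed above."""
    return (
        e.noncombatant_deaths >= 10
        and e.hours_elapsed < 48
        and e.single_location
        and e.victims_primary_target
        and e.victims_nonviolent
        and e.linked_to_wider_conflict
    )
```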

Disclosure

I have argued on this blog that scholars have an obligation to disclose potential conflicts of interest when discussing their research, so let me do that again here: For the past two years, I have been paid as a contractor by the U.S. Holocaust Memorial Museum for my work on the atrocities early-warning system discussed in this post. Since the spring of 2013, I have also been paid to write questions for the Good Judgment Project, in which I participated as a forecaster the year before. To the best of my knowledge, I have no financial interests in, and have never received any payments from, any companies that commercially operate prediction markets or opinion pools.

Lost in the Fog of Civil War in Syria

On Twitter a couple of days ago, Adam Elkus called out a recent post on Time magazine’s World blog as evidence of the way that many people’s expectations about the course of Syria’s civil war have zigged and zagged over the past couple of years. “Last year press was convinced Assad was going to fall,” Adam tweeted. “Now it’s that he’s going to win. Neither perspective useful.” To which the eminent civil-war scholar Stathis Kalyvas replied simply, “Agreed.”

There’s a lesson here for anyone trying to glean hints about the course of a civil war from press accounts of a war’s twists and turns. In this case, it’s a lesson I’m learning through negative feedback.

Since early 2012, I’ve been a participant/subject in the Good Judgment Project (GJP), a U.S. government-funded experiment in “wisdom of crowds” forecasting. Over the past year, GJP participants have been asked to estimate the probability of several events related to the conflict in Syria, including the likelihood that Bashar al-Assad would leave office and the likelihood that opposition forces would seize control of the city of Aleppo.

I wouldn’t describe myself as an expert on civil wars, but during my decade of work for the Political Instability Task Force, I spent a lot of time looking at data on the onset, duration, and end of civil wars around the world. From that work, I have a pretty good sense of the typical dynamics of these conflicts. Most of the civil wars that have occurred in the past half-century have lasted for many years. A very small fraction of those wars flared up and then ended within a year. The ones that didn’t end quickly—in other words, the vast majority of these conflicts—almost always dragged on for several more years at least, sometimes even for decades. (I don’t have my own version handy, but see Figure 1 in this paper by Paul Collier and Anke Hoeffler for a graphical representation of this pattern.)

On the whole, I’ve done well in the Good Judgment Project. In the year-long season that ended last month, I ranked fifth among the 303 forecasters in my experimental group, all while the project was producing fairly accurate forecasts on many topics. One thing that’s helped me do well is my adherence to what you might call the forecaster’s version of the Golden Rule: “Don’t neglect the base rate.” And, as I just noted, I’m also quite familiar with the base rates of civil-war duration.

So what did I do when asked by GJP to think about what would happen in Syria? I chucked all that background knowledge out the window and chased the very narrative that Elkus and Kalyvas rightly decry as misleading.

Here’s a chart showing, from June 2012 onward, how I assessed the probability that Assad would still be in office after the end of March 2013. The actual question asked us to divide the probability of his exiting office across several time periods, but for simplicity’s sake I’ve focused here on the part indicating that he would stick around past April 1. This isn’t the same thing as the probability that the war would end, of course, but it’s closely related, and I treated the two events as tightly linked. As you can see, until early 2013 I was pretty confident that Assad’s fall was imminent. In fact, I was so confident that at a couple of points in 2012 I gave him zero chance of hanging on past March of this year, which is something a trained forecaster should never do.

[Chart: my GJP forecasts of the probability that Assad would remain in office past 1 April 2013]

Now here’s another chart showing my estimates of the likelihood that rebels would seize control of Aleppo before May 1, 2013. The numbers are a little different, but the basic pattern is the same. I started out very confident that the rebels would win the war soon and only swung hard in the opposite direction in early 2013, as the boundaries of the conflict seemed to harden.

[Chart: my GJP forecasts of the probability that rebels would seize control of Aleppo before 1 May 2013]

It’s impossible to say what the true probabilities were in this or any other uncertain situation. Maybe Assad and Aleppo really were on the brink of falling for a while and then the unlikely-but-still-possible version happened anyway.

That said, there’s no question that forecasts more tightly tied to the base rate would have scored a lot better in this case. Here’s a chart showing what my estimates might have looked like had I followed that rule, using approximations of the hazard rate from the chart in the Collier and Hoeffler paper. If anything, these numbers overstate the likelihood that a civil war will end at a given point in time.

[Chart: what my estimates might have looked like if anchored to approximate hazard rates from Collier and Hoeffler]
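
To make that base-rate logic concrete, here is a rough sketch of the arithmetic behind such a chart. The monthly hazard used below is a made-up placeholder, not Collier and Hoeffler’s estimate; the point is only that a small, roughly constant chance of the war ending in any given month compounds into modest probabilities over the horizons these questions asked about.

```python
def prob_war_ends_within(months: int, monthly_hazard: float) -> float:
    """P(an ongoing war ends within `months`) under a constant per-month hazard."""
    return 1 - (1 - monthly_hazard) ** months

# Placeholder hazard: assume roughly a 1.5 percent chance of the war ending in any given month.
MONTHLY_HAZARD = 0.015

for horizon in (3, 6, 12):
    print(horizon, round(prob_war_ends_within(horizon, MONTHLY_HAZARD), 2))
# 3 0.04, 6 0.09, 12 0.17 -- "more of the same" stays the most likely outcome at every horizon
```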

I didn’t keep a log spelling out my reasoning at each step, but I’m pretty confident that my poor performance here is an example of motivated reasoning. I wanted Assad to fall and the pro-democracy protesters who dominated the early stages of the uprising to win, and that desire shaped what I read and then remembered when it came time to forecast. I suspect that many of the pieces I was reading were slanted by similar hopes, creating a sort of analytic cascade similar to the herd behavior thought to drive many financial-market booms and busts. I don’t have the data to prove it, but I’m pretty sure the ups and downs in my forecasts track the evolving narrative in the many newspaper and magazine stories I was reading about the Syrian conflict.

Of course, that kind of herding happens on a lot of topics, and I was usually good at avoiding it. For example, when tensions ratcheted up on the Korean Peninsula earlier this year, I hewed to the base rate and didn’t substantially change my assessment of the risk that real clashes would follow.

What got me in the case of Syria was, I think, a sense of guilt. The Assad government has responded to a legitimate popular challenge with mass atrocities that we routinely read about and sometimes even see. In parts of the country, the resulting conflict is producing scenes of absurd brutality. This isn’t a “problem from hell,” as Samantha Power’s book title would have it; it is a glimpse of hell. And yet, in the face of that horror, I have publicly advocated against American military intervention. Upon reflection, I wonder if my wildly optimistic forecasting about the imminence of Assad’s fall wasn’t my unconscious attempt to escape the discomfort of feeling complicit in the prolongation of that suffering.

As a forecaster, if I were doing these questions over, I would try to discipline myself to attend to the base rate, but I wouldn’t necessarily stop there. As I’ve pointed out in a previous post, the base rate is a valuable anchoring device, but attending to it doesn’t mean automatically ignoring everything else. My preferred approach, when I remember to have one, is to take that base rate as a starting point and then use Bayes’ theorem to update my forecasts in a more disciplined way. Still, I’ll bring a newly skeptical eye to the flurry of stories predicting that Assad’s forces will soon defeat Syria’s rebels and keep their patron in power. Now that we’re a couple of years into the conflict, quantified history tells us that the most likely outcome in any modest slice of time (say, months rather than years) is, tragically, more of the same.
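
In practice, that kind of disciplined updating can be as simple as the arithmetic below: start from the base rate as a prior, then revise it according to how much more likely the new evidence would be if the regime really were about to fall than if it were not. The likelihood numbers here are invented for illustration.

```python
def bayes_update(prior: float, p_signal_if_true: float, p_signal_if_false: float) -> float:
    """Posterior probability of an event after observing a signal, via Bayes' theorem."""
    numerator = p_signal_if_true * prior
    return numerator / (numerator + p_signal_if_false * (1 - prior))

# Prior anchored to the base rate: say, a 10 percent chance the regime falls this quarter.
prior = 0.10

# A wave of "regime on the brink" reporting, with invented likelihoods: such coverage
# appears 60 percent of the time when collapse really is imminent, but still 30 percent
# of the time when it is not.
posterior = bayes_update(prior, p_signal_if_true=0.60, p_signal_if_false=0.30)
print(round(posterior, 2))  # 0.18 -- a genuine update, but nowhere near certainty
```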

And, as a human, I’ll keep hoping the world will surprise us and take a different turn.

A Chimp’s-Eye View of a Forecasting Experiment

For the past several months, I’ve been participating in a forecasting tournament as one of hundreds of “experts” making and updating predictions about dozens of topics in international political and economic affairs. This tournament is funded by IARPA, a research arm of the U.S. government’s Office of the Director of National Intelligence, and it’s really a grand experiment designed to find better ways to elicit, combine, and present probabilistic forecasts from groups of knowledgeable people.

There are several teams participating in this tournament; I happen to be part of the Good Judgment team that is headed by psychologist Phil Tetlock. Good Judgment “won” the first year of the competition, but that win came before I started participating, so alas, I can’t claim even a tiny sliver of the credit for that.

I’ve been prognosticating as part of my job for more than a decade, but almost all of the forecasting I’ve done in the past was based on statistical models designed to assess risks of a specific rare event (say, a coup attempt, or the onset of a civil war) across all countries worldwide. The Good Judgment Project is my first experience with routinely making calls about the likelihood of many different events based almost entirely on my subjective beliefs. Now that I’m a few months into this exercise, I thought I’d write something about how I’ve approached the task, because I think my experiences speak to generic difficulties in forecasting rare political and economic events.

By way of background, here’s how the forecasting process works for the Good Judgment team: I start by logging into a web site that lists a bunch of questions on an odd mix of topics, everything from the Euro-to-dollar exchange rate to the outcome of the recent election in Venezuela. I click on a question and am expected to assign a numeric probability to each of the two or more possible outcomes listed (e.g., “yes or no,” or “A, B, or C”). Those outcomes are always exhaustive, so the probabilities I assign must always sum to 1. Whenever I feel like it, I can log back in and update any of the forecasts I’ve already made. Then, when the event of interest happens (e.g., results of the Venezuelan election are announced), the question is closed, and the accuracy of my forecast for that question and for all questions closed to date is summarized with a statistic called the Brier score. I can usually see my score pretty soon after the question closes, and I can also see average scores on each question for my whole team and the cumulative scores for the top 10 percent or so of the team’s leader board.
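
For readers who haven’t met it, the Brier score is just the squared error of a probabilistic forecast, summed across the possible outcomes, so lower is better. A minimal sketch of the calculation, using the common multi-outcome convention (the tournament’s exact scoring rules may differ in detail):

```python
def brier_score(forecast: dict, outcome: str) -> float:
    """Sum of squared gaps between forecast probabilities and the realized outcome,
    which is coded 1 for what actually happened and 0 for everything else.
    0 is a perfect score; 2 is the worst possible on a two-outcome question."""
    return sum((p - (1.0 if option == outcome else 0.0)) ** 2
               for option, p in forecast.items())

# Hypothetical forecasts on a yes/no question that resolves "yes":
print(brier_score({"yes": 0.75, "no": 0.25}, outcome="yes"))  # 0.125
print(brier_score({"yes": 0.25, "no": 0.75}, outcome="yes"))  # 1.125
```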

So how do I actually perform this task? In an ideal world, my forecasting process would go as follows:

1. Read the question carefully. What does “attack,” “take control of,” or “lose office” mean here, and how will it be determined? To make the best possible forecast, I need to understand exactly what it is I’m being asked to predict.

2. Forecast. In Unicorn World, I would have an ensemble of well-designed statistical models to throw at each and every question. In the real world, I’m lucky if there’s a single statistical model that applies even indirectly. Absent a sound statistical forecast, I would try to identify the class of events to which the question belongs and then determine a base rate for that class.  The “base rate” is just the historical frequency of the event’s occurrence in comparable cases—say, how often civil wars end in their first year, or how often incumbents lose elections in Africa. Where feasible, I would also check prediction markets, look for relevant polling data, seek out published predictions, and ask knowledgeable people.

In all cases, the idea is to find empirical evidence to which I can anchor my forecast, and to get a sense of how much uncertainty there is about this particular case. When there’s a reliable forecast from a statistical model or even good base-rate information, I would weight that most heavily and would only adjust far away from that prediction in cases where subject-matter experts make a compelling case about why this instance will be different. (As for what makes a case compelling, well…) If I can’t find any useful information or the information I find is all over the map, I would admit that I have no clue and do the probabilistic equivalent of a coin flip or die roll, splitting the predicted probability evenly across all of the possible outcomes.

3. Update. As Nate Silver argues in The Signal and the Noise, forecasters should adjust their predictions whenever they are presented with relevant new information. There’s no shame in reconsidering your views as new information becomes available; that’s called learning. Ideally, I would be disciplined about how I update my forecasts, using Bayes’ rule to check the human habit of leaning too hard on the freshest tidbit and to take full advantage of all the information I had before.

4. Learn. Over time, I would see areas where I was doing well or poorly and would use those outcomes to deepen confidence in, or to try to improve, my mental models. For the questions I get wrong, what did I overlook? What were the forecasters who did better than I thinking about that I wasn’t? For the questions I answer well, was I right for the right reasons, or was it possibly just dumb luck? The more this process gets repeated, the better I should be able to do, within the limits determined by the basic predictability of the phenomenon in question.

To my mind, that is how I should be making forecasts. Now here are a few things I’ve noticed so far about what I’m actually doing.

1. I’m lazy. Most of the time, I see the question, make an assumption about what it means without checking the resolution criteria, and immediately formulate a gut response on a simple four-point scale: very likely, likely, unlikely, or very unlikely. I’d like to say that “I have no clue” is also on the menu of immediate responses, but my brain almost always makes an initial guess, even if it’s a topic about which I know nothing, such as the resolution of a particular dispute before the World Trade Organization. What’s worse, it’s hard to dislodge that gut response once I’ve had it, even when I think I have better anchoring information.

2. I’m motivated by the competition, but that motivation doesn’t necessarily make me more attentive. As I said earlier, in this tournament, we can see our own scores and the scores of all the top performers as we go. With such a clear yardstick, it’s hard not to get pulled into seeing other forecasters as competitors and trying to beat them. You’d think that urge would motivate me to be more attentive to the task and to follow the idealized process I described above more rigorously. Most of the time, though, it just means that I calibrate my answers to the oddities of the yardstick (the Brier score involves squaring your error term, so the cost of being wrong isn’t distributed evenly across the range of possible values; see the toy comparison after this list) and that I check the updated scores soon after questions close.

3. It’s hard to distinguish likelihood from timing. Some of the questions can get pretty specific about the timing of the event of interest. For example, we might be asked something like: Will Syrian president Bashar al-Assad fall before the end of October 2012; before the end of December 2012; before April 2013; or some time thereafter? I find these questions excruciatingly hard to answer, and it took me a little while to figure out why.

After thinking through how I had approached a few examples, I realized that my brain was conflating probability with proximity. In other words, the more likely I thought the event was, the sooner I expected it to occur.  That makes sense for some situations, but it doesn’t always, and a careful consideration of timing will usually require lots of additional information. For example, I might look at structural conditions in Syria and conclude that Assad can’t win the civil war he’s started, but how long it’s going to take for him to lose will depend on a host of other things with complex dynamics, like the price of oil, the flow of weapons, and the logistics of military campaigns. Interestingly, even though I’m now aware of this habit, I’m still finding it hard to break.

4. I’m eager to learn from feedback, but feedback is hard to come by. This project isn’t like weather forecasting or equities trading, where you make a prediction, see how you did, tweak your model, and try again, over and over. Most of the time, the questions are pretty idiosyncratic, so you’ll have just one or a few chances to make a certain prediction. What’s more, the answers are usually categorical, so even when you do get more than one shot, it’s hard to tell how wrong or right you were. In this kind of environment, it’s really hard to build on your experience. In the several months I’ve been participating, I think I’ve learned far more about the peculiarities of the Brier score than I have about the generative process underlying any of the phenomena about which I’ve been asked to predict.
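
On the lopsidedness mentioned in point 2: because the Brier score squares the error, the last increment of confidence costs far more when you’re wrong than the first does. A toy comparison with made-up numbers:

```python
# On a yes/no question that resolves "no", compare two wrong forecasts.
for p_yes in (0.6, 0.9):
    penalty = (p_yes - 0.0) ** 2 + ((1 - p_yes) - 1.0) ** 2  # Brier score when "no" occurs
    print(p_yes, round(penalty, 2))
# 0.6 -> 0.72, but 0.9 -> 1.62: bumping a wrong forecast from 0.6 to 0.9 more than doubles the penalty
```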

And that, for whatever it’s worth, is one dart-throwing chimp’s initial impressions of this particular experiment. As it happens, I’ve been doing pretty well so far—I’ve been at or near the top of the team’s accuracy rankings since scores for the current phase started to come in—but I still feel basically clueless most of the time.  It’s like a golf tournament where good luck tosses some journeyman to the top of the leader board after the first or second round, and I keep waiting for the inevitable regression toward the mean to kick in. I’d like to think I can pull off a Tin Cup, but I know enough about statistics to expect that I almost certainly won’t.
