Road-Testing GDELT as a Resource for Monitoring Atrocities

As I said here a few weeks ago, I think the Global Dataset on Events, Location, and Tone (GDELT) is a fantastic new resource that really embodies some of the ways in which technological changes are coming together to open lots of new doors for social-scientific research. GDELT’s promise is obvious: more than 200 million political events from around the world over the past 30 years, all spotted and coded by well-trained software instead of the traditional armies of undergrad RAs, and with daily updates coming online soon. Or, as Adam Elkus’ t-shirt would have it, “200 million observations. Only one boss.”

BUT! Caveat emptor! Like every other data-collection effort ever, GDELT is not alchemy, and it’s important that people planning to use the data, or even just to consume analysis based on it, understand what its limitations are.

I’m starting to get a better feel for those limitations from my own efforts to use GDELT to help observe atrocities around the world, as part of a consulting project I’m doing for the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide. The core task of that project is to develop plans for a public early-warning system that would allow us to assess the risk of onsets of atrocities in countries worldwide more accurately and earlier than current practice.

When I heard about GDELT last fall, though, it occurred to me that we could use it (and similar data sets in the pipeline) to support efforts to monitor atrocities as well. The CAMEO coding scheme on which GDELT is based includes a number of event types that correspond to various forms of violent attack, along with other variables indicating who was attacking whom. If we could develop a filter that reliably pulled events of interest to us from the larger stream of records, we could produce something like a near-real-time bulletin on recent violence against civilians around the world. Our record would surely have some blind spots (GDELT only tracks a limited number of news sources, and some atrocities just don’t get reported, period), but I thought it would reliably and efficiently alert us to new episodes of violence against civilians and help us identify trends in ongoing ones.

Well, you know what they say about plans and enemies and first contact. After digging into GDELT, I still think we can accomplish those goals, but it’s going to take more human effort than I originally expected. Put bluntly, GDELT is noisier than I had anticipated, and for the time being the only way I can see to sharpen that signal is to keep a human in the loop.

Imagine (fantasize?) for a moment that there’s a perfect record somewhere of all the political interactions GDELT is trying to identify. For kicks, let’s call it the Encyclopedia Eventum (EE). Like any detection system, GDELT can mess up in two basic ways: 1) errors of omission, in which GDELT fails to spot something that’s in the EE; and 2) errors of commission, in which it mistakenly records an event that isn’t in the EE (or, relatedly, is in the EE but in a different place). We might also call these false negatives and false positives, respectively.

At this point, I can’t say anything about how often GDELT is making errors of omission, because I don’t have that Encyclopedia Eventum handy. A more realistic strategy for assessing the rate of errors of omission would involve comparing a subset of GDELT to another event data set that’s known to be a fairly reliable measure for some time and place of something GDELT is meant to track—say, protest and coercion in Europe—and see how well they match up, but that’s not a trivial task, and I haven’t tried it yet.
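To make the two error types concrete, here is a minimal sketch in Python of the kind of comparison described above. It assumes you already have a trusted reference record and that events in both sources can be matched on a shared key; the function name `detection_errors` and the event labels are hypothetical, invented purely for illustration.

```python
def detection_errors(detected, reference):
    """Compare detected events against a trusted reference record.

    Both arguments are collections of hashable event keys. How to build
    keys that match across two data sets is the hard part and is assumed
    away here.
    """
    detected, reference = set(detected), set(reference)
    hits = detected & reference
    omissions = reference - detected    # false negatives: in the reference, missed
    commissions = detected - reference  # false positives: recorded, not in the reference
    return {
        "hits": len(hits),
        "omissions": len(omissions),
        "commissions": len(commissions),
        "precision": len(hits) / len(detected) if detected else None,
        "recall": len(hits) / len(reference) if reference else None,
    }


# Made-up event labels, for illustration only.
reference = {"event_a", "event_b"}
detected = {"event_a", "event_c"}
result = detection_errors(detected, reference)
print(result)
```

In this toy case, one event is found in both records, one is missed, and one is spurious, so precision and recall are both one half.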

Instead, the noise I’m seeing is on the other side of that coin: the errors of commission, or false positives. Here’s what I mean:

To start developing my atrocities-monitoring filter, I downloaded the reduced and compressed version of GDELT recently posted on the Penn State Event Data Project page and pulled the tab-delimited text files for a couple of recent years. I’ve worked with event data before, so I’m familiar with basic issues in their analysis, but every data set has its own idiosyncrasies. After trading emails with a few CAMEO pros and reading Jay Yonamine’s excellent primer on event aggregation strategies, I started tinkering with a function in R that would extract the subset of events that appeared to involve lethal force against civilians. That function would involve rules to select on three features: event type, source (the doer), and target.

  • Event Type. For observing atrocities, type 20 (“Engage in Unconventional Mass Violence”) was an obvious choice. Based on advice from those CAMEO pros, I also focused on 18 (“Assault”) and 19 (“Fight”) but was expecting that I would need to be more restrictive about the subtypes, sources, and targets in those categories to avoid errors of commission.
  • Source. I’m trying to track violence by state and non-state agents, so I focused on GOV (government), MIL (military), COP (police), and SPY (intelligence agencies) for the former and REB (militarized opposition groups) and SEP (separatist groups) for the latter. The big question mark was how to handle records with just a country code (e.g., “SYR” for Syria) and no indication of the source’s type. My CAMEO consultants told me these would usually refer in some way to the state, so I should at least consider including them.
  • Target. To identify violence against civilians, I figured I would get the most mileage out of the OPP (non-violent political opposition), CVL (“civilians,” people in general), and REF (refugees) codes, but I wanted to see if the codes for more specific non-state actors (e.g., LAB for labor, EDU for schools or students, HLH for health care) would also help flag some events of interest.
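For readers who want to see how those three rules fit together, here is a rough sketch in Python of a filter for the state-perpetrator case. It is not the R function discussed in this post. The field names (`code`, `source`, `target`), the assumption that actor codes are strings built from three-letter chunks, and the choice of target codes are all illustrative assumptions.

```python
VIOLENT_ROOTS = {"18", "19", "20"}           # Assault, Fight, Unconventional Mass Violence
STATE_ROLES = {"GOV", "MIL", "COP", "SPY"}   # state agents listed above
CIVILIAN_ROLES = {"OPP", "CVL", "REF"}       # core civilian targets listed above


def chunks(actor):
    """Split an actor code such as 'USAMIL' into three-letter pieces."""
    return [actor[i:i + 3] for i in range(0, len(actor), 3)]


def is_state_on_civilian(event, include_bare_country=False):
    """Return True if an event record passes all three selection rules.

    `event` is a dict with hypothetical keys 'code', 'source', and 'target'.
    With include_bare_country=True, any lone three-letter source is treated
    as a state actor; a real version would check it against a country list.
    """
    if event["code"][:2] not in VIOLENT_ROOTS:
        return False
    source_parts = chunks(event["source"])
    state_source = any(part in STATE_ROLES for part in source_parts)
    if not state_source and include_bare_country:
        state_source = len(source_parts) == 1 and len(source_parts[0]) == 3
    civilian_target = any(part in CIVILIAN_ROLES for part in chunks(event["target"]))
    return state_source and civilian_target


sample = [
    {"code": "202", "source": "USAMIL", "target": "VNMCVL"},
    {"code": "193", "source": "USAGOV", "target": "USAMED"},
    {"code": "202", "source": "SYR", "target": "SYRCVL"},
    {"code": "043", "source": "USAGOV", "target": "SYRCVL"},
]
kept = [e for e in sample if is_state_on_civilian(e)]
print(kept)
```

The `include_bare_country` switch reflects the open question about records that carry only a country code. Leaving it off by default is just one defensible choice.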

After tinkering with the data a bit, I decided to write two separate functions, one for events with state perpetrators and another for events with non-state perpetrators. If you’re into that sort of thing, you can see the state-perpetrator version of that filtering function on GitHub, here.

When I ran the more than 9 million records in the “2011.reduced.txt” file through that function, I got back 2,958 events. So far, so good. As soon as I started poking around in the results, though, I saw a lot of records that looked off. The current release of GDELT doesn’t include text from or links to the source material, so it’s hard to say for sure what real-world event any one record describes. Still, some of the perpetrator-and-target combos looked odd to me, and web searches for relevant stories either came up empty or reinforced my suspicions that the records were probably errors of commission. Here are a few examples, showing the date, event type, source, and target:

  • 1/8/2011 193 USAGOV USAMED. Type 193 is “Fight with small arms and light weapons,” but I don’t think anyone from the U.S. government actually got in a shootout or knife fight with American journalists that day. In fact, that event-source-target combination popped up a lot in my subset.
  • 1/9/2011 202 USAMIL VNMCVL. Taken on its face, this record says that U.S. military forces killed Vietnamese civilians on January 9, 2011. My hunch is that the story on which this record is based was actually talking about something from the Vietnam War.
  • 4/11/2011 202 RUSSPY POLCVL. This record seems to suggest that Russian intelligence agents “engaged in mass killings” of Polish civilians in central Siberia two years ago. I suspect the story behind this record was actually talking about the Katyn Massacre and associated mass deportations that occurred in April 1940.

That’s not to say that all the records looked wacky. Interleaved with these suspicious cases were records representing exactly the kinds of events I was trying to find. For example, my filter also turned up a 202 GOV SYRCVL for June 10, 2011, a day on which one headline blared “Dozens Killed During Syrian Protests.”

Still, it’s immediately clear to me that GDELT’s parsing process is not quite at the stage where we can peruse the codebook like a menu, identify the morsels we’d like to consume, phone our order in, and expect to have exactly the meal we imagined waiting for us when we go to pick it up. There’s lots of valuable information in there, but there’s plenty of chaff, too, and for the time being it’s on us as researchers to take time to try to sort the two out. This sorting will get easier to do if and when the posted version adds information about the source article and relevant text, but “easier” in this case will still require human beings to review the results and do the cross-referencing.

Over time, researchers who work on specific topics (like atrocities, or interstate war, or protest activity in specific countries) will probably be able to develop supplemental coding rules and tweak their filters to automate some of what they learn. I’m also optimistic that the public release of GDELT will accelerate improvements in the software and dictionaries it uses, expanding its reach while shrinking the error rates. In the meantime, researchers are advised to stick to the same practices they’ve always used (or should have, anyway): take time to get to know your data; parse it carefully; and, when there’s no single parsing that’s obviously superior, check the sensitivity of your results to different permutations.

PS. If you have any suggestions on how to improve the code I’m using to spot potential atrocities or otherwise improve the monitoring process I’ve described, please let me know. That’s an ongoing project, and even marginal improvements in the fidelity of the filter would be a big help.

PPS. For more on these issues and the wider future of automated event coding, see this ensuing post from Phil Schrodt on his blog.

In Praise of Fun Projects

Over the past year, I’ve watched a few people I know in digital life sink a fair amount of time into statistical modeling projects that other people might see as “just for fun,” if not downright frivolous. Last April, for example, public-health grad student Brett Keller delivered an epic blog post that used event history models to explore why some competitors survive longer than others in the fictional Hunger Games. More recently, sociology Ph.D. student Alex Hanna has been using the same event history techniques to predict who’ll get booted each week from the reality TV show RuPaul’s Drag Race (see here and here so far). And then there’s Against the Spread, a nascent pro-football forecasting project from sociology Ph.D. candidate Trey Causey, whose dissertation uses natural language processing and agent-based modeling to examine information ecology in authoritarian regimes.

I happen to think these kinds of projects are a great idea, if you can find the time to do them–and if you’re reading this blog post, you probably can. Based on personal experience, I’m a big believer in learning by doing. Concepts don’t stick in my brain when I only read about them; I’ve got to see the concepts in action and attach them to familiar contexts and examples to really see what’s going on. Blog posts like Brett’s and Alex’s are a terrific way to teach yourself new methods by applying them to toy problems where the data sets are small, the domain is familiar and interesting, and the costs of being wrong are negligible.

[Image: Lego Raspberry Pi case]

A bigger project like Trey’s requires you to solve a lot of complex procedural and methodological problems, but all the skills you develop along the way transfer to other domains. If you can build and run a decent forecasting system from scratch for something as complex as pro football, you can do the same for “seriouser” problems, too. I think that demonstrated skill on fun tasks says as much about someone’s ability to execute complex research in the real world as any job talk or publication in a peer-reviewed journal. Done well, these hobby projects can even evolve into rewarding enterprises of their own. Just ask Nate Silver, who kickstarted his now-prodigious career as a statistical forecaster with PECOTA, a baseball forecasting system that he ginned up for fun while working for pay as a consultant.

I suspect that a lot of people in the private sector already get this. Academia, not so much, but then they’re the ones who wind up poorer for it.

Forecasting Politics Is Still Hard to Do (Well)

Last November, after the U.S. elections, I wrote a thing for Foreign Policy about persistent constraints on the accuracy of statistical forecasts of politics. The editors called it “Why the World Can’t Have a Nate Silver,” and the point was that much of what people who follow international affairs care about is still a lot harder to forecast accurately than American presidential elections.

One of the examples I cited in that piece was Silver’s poor performance on the U.K.’s 2010 parliamentary elections. Just two years before his forecasts became a conversation piece in American politics, the guy the Economist called “the finest soothsayer this side of Nostradamus” missed pretty badly in what is arguably another of the most information-rich election environments in the world.

A couple of recent election-forecasting efforts only reinforce the point that, the Internet and polling and “math” notwithstanding, this is still hard to do.

The first example comes from political scientist Chris Hanretty, who applied a statistical model to opinion polls to forecast the outcome of Italy’s parliamentary elections. Hanretty’s algorithm indicated that a coalition of center-left parties was virtually certain to win a majority and form the next government, but that’s not what happened. After the dust had settled, Hanretty sifted through the rubble and concluded that “the predictions I made were off because the polls were off.”

Had the exit polls given us reliable information, I could have made an instant prediction that would have been proved right. As it was, the exit polls were wrong, and badly so. This, to me, suggests that the polling industry has made a collective mistake.

The second recent example comes from doctoral candidate Ken Opalo, who used polling as grist for a statistical mill to forecast the outcome of Kenya’s presidential election. Ken’s forecast indicated that Uhuru Kenyatta would get the most votes but would fall short of the 50-percent-plus-one-vote required to win in the first round, making a run-off “almost inevitable.” In fact, Kenyatta cleared the 50-percent threshold in the first try, making him Kenya’s new president-elect. Once again, noisy polling data was apparently to blame. As Ken noted in a blog post before the results were finalized,

Mr. Kenyatta significantly outperformed the national polls leading to the election. I estimated that the national polls over-estimated Odinga’s support by about 3 percentage points. It appears that I may have underestimated their overestimation. I am also beginning to think that their regional weighting was worse than I thought.

As I see it, both of these forecasts were, as Nate Silver puts it in his book, wrong for the right reasons. Both Hanretty and Opalo built models that used the best and most relevant information available to them in a thoughtful way, and neither forecast was wildly off the mark. Instead, it just so happened that modest errors in the forecasts interacted with each country’s electoral rules to produce categorical outcomes that were quite different from the ones the forecasts had led us to expect.
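A quick way to see how a modest forecast error can flip a categorical outcome is to ask how likely a candidate is to clear a fixed threshold once you allow for some error around the forecast. The sketch below uses Python’s standard library and assumes normally distributed forecast error. The numbers are hypothetical and are not taken from either forecast discussed here.

```python
from statistics import NormalDist


def prob_clears_threshold(forecast_share, error_sd, threshold=50.0):
    """Probability that the true vote share exceeds the threshold.

    Assumes the true share is normally distributed around the forecast
    with standard deviation error_sd (all values in percentage points).
    """
    return 1.0 - NormalDist(mu=forecast_share, sigma=error_sd).cdf(threshold)


# Hypothetical: a forecast of 47 percent with a 3-point standard error.
p = prob_clears_threshold(47.0, 3.0)
print(round(p, 3))
```

Under those made-up inputs, a forecast that sits three points below the threshold still leaves roughly a one-in-six chance of a first-round win, which is why a “wrong” categorical call can come from a forecast that was not far off.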

But that’s the rub, isn’t it? Even in the European Union in the Internet age, it’s still hard to predict the outcome of national elections. We’re getting smarter about how to model these things, and our computers can now process more of the models we can imagine, but polling data are still noisy and electoral systems complex.

And that’s elections, where polling data nicely mimic the data-generating process that underlies the events we’re trying to forecast. We don’t have polls telling us what share of the population plans to turn out for anti-government demonstrations or join a rebel group or carry out a coup—and even if we did, we probably wouldn’t trust them. Absent these micro-level data, we turn to proxy measures and indicators of structural opportunities and constraints, but every step away from the choices we’re trying to forecast adds more noise to the result. Agent-based computational models represent a promising alternative, but when it comes to macro-political phenomena like revolutions and state collapses, these systems are still in their infancy.

Don’t get me wrong. I’m thrilled to see more people using statistical models to try to forecast important events in international politics, and I would eagerly pit the forecasts from models like Hanretty’s and Opalo’s against the subjective judgments of individual experts any day. I just think it’s important to avoid prematurely declaring the arrival of a revolution in forecasting political events, to keep reminding ourselves how hard this problem still is. As if the (in)accuracy of our forecasts would let us have it any other way.

Coup Risk in 2013, Mapped My Way

This blog’s gotten a lot more traffic than usual since yesterday, when Max Fisher of the Washington Post called out my 2013 coup forecasts in a post on WorldViews.

I’m grateful for the attention Max has drawn to my work, but if it had been up to me, I would have done the mapping a little differently. As I said to Max in an email from which he later excerpted, the forecasts simply aren’t sharp enough to parse the world as finely as their map did. Our theories of what causes coup attempts are too fuzzy and our measures of the things in those theories are too spotty to estimate the probability of these rare events with that much precision.

But, hey, I’m a data guy. I don’t have to stick to grumbling about the Post’s map; I can make my own! So…

The map below sorts the countries of the world into three groups based on their relative coup risk for 2013: highest (red), moderate (orange), and lowest (beige). I emphasize “relative” because coup attempts are very rare, so the estimated risk of coup attempts in any given country in any single year is pretty small. For example, Guinea-Bissau tops my list for 2013, and the estimated probability of at least one coup attempt occurring there this year is only 25%. Most countries worldwide are under 2%.

Consistent with an emphasis on relative risk, the categories I’ve mapped are based on rank order, not predicted probability. The riskiest fifth of the world (33 countries) makes up the “highest” group, the second fifth the “moderate” group, and the bottom three-fifths the “lowest” group.
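The rank-based grouping is simple enough to sketch in a few lines of Python. This is an illustration of the idea rather than the code behind the map: the function name `risk_tiers` and the toy scores are made up, and ties are broken arbitrarily by sort order.

```python
def risk_tiers(scores):
    """Assign each country a tier from its rank, not its raw probability.

    `scores` maps country name to estimated risk. The top fifth by rank is
    'highest', the next fifth 'moderate', and the remaining three-fifths 'lowest'.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    tiers = {}
    for rank, country in enumerate(ranked):
        if rank < n / 5:
            tiers[country] = "highest"
        elif rank < 2 * n / 5:
            tiers[country] = "moderate"
        else:
            tiers[country] = "lowest"
    return tiers


# Toy scores for ten placeholder countries, highest risk first.
toy_scores = {f"country_{i}": (10 - i) / 100 for i in range(10)}
toy_tiers = risk_tiers(toy_scores)
print(toy_tiers)
```

Because the cut points depend only on rank, rescaling every probability would leave the map unchanged, which is the point of emphasizing relative risk.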

This forecasting process doesn’t have enough of a track record for me to say exactly how those categories relate to real-world risk, but based on my experience working with similar data and models, I would expect roughly four of every five coup attempts to occur in countries identified here as high risk, and the occasional “miss” to come from the moderate-risk set. Only very rarely should coup attempts come from the 100 or so countries in the low-risk group.

[Figure: world map of relative coup risk for 2013, by country]

FTR, this map was made in R using the ‘rworldmap’ package.

If Only It Were That Simple

Hans Rosling has done a lot to popularize statistical thinking about human development, and that’s a very good thing, but yesterday he did something that drives me crazy. After word spread of an apparent coup attempt in Eritrea (more on that later), Rosling tweeted this:

If you create a Democracy x Income score Eritrea the lowest in the world! See the graphic predicting the coup.

And here’s a shot of that graphic:

[Figure: Rosling’s scatterplot of democracy against income, with Eritrea highlighted]

Brilliant, right? Just cross-tabulate a couple of commonly used measures of economic and political development and you get an index that accurately predicts this attempted coup in Eritrea that seemed to catch so many people by surprise.

Well, modelers have a name for this strategy, and it’s “overfitting.” “Cherry picking” works, too. After the fact, it’s easy to construct a predictive index that does very well at spotlighting any single event. If you poke around enough in the data, you can usually find some combination of measures under which the case of the moment rises to the top. If yesterday’s coup attempt had happened in China, for example (the big red ball in the bottom middle of Rosling’s chart), Rosling could have treated population size as a third dimension in his index, and China would have occupied the bottom corner of the resulting cube. We saw a lot of this right after the uprisings in Tunisia and Egypt in early 2011, too, when for example New York Times columnist Charles Blow found a handful of factors that seemed to differentiate those two countries from many of their regional neighbors.

What those after-the-fact snapshots won’t tell you, however, is how reliable that forecasting strategy would be over time. Most of us don’t need an index that’s optimized to predict a specific event, and even if we did, we would still need to build it before the event happened in order for it to be useful. To build a good predictive model, we need to find things that consistently help separate the situations where events of interest will happen from the ones where they won’t. Going back to Rosling’s chart, we see that his index also puts North Korea, Myanmar, Togo, the Gambia, and Cameroon in the lower left-hand corner, yet none of those countries has suffered any coup attempts for many years. Meanwhile, the two countries that saw successful coups d’etat in 2012—Mali and Guinea-Bissau—are both in the upper left, poor but democratic. Dig a little deeper, and that scatterplot’s not looking quite as useful.

So how reliable is Rosling’s two-dimensional index as a device for forecasting coups? To get an empirical answer to that question, I used the two variables Rosling picked (GDP per capita and degree of democracy) to estimate a simple logistic regression model in a training data set covering the period 1960-1994. I then applied that model to data from the period 1995-2010 to see how well it worked on cases it hadn’t already “seen.” The thing this model is trying to predict is the occurrence of any coup attempts (successful or failed) in a country during a particular calendar year, based on the value of the two predictors at the end of the previous year. Data on coup attempts come from the Center for Systemic Peace, and data for the two risk factors come from the World Bank’s World Development Indicators and the Polity project, respectively.

Before seeing how that model fared, it’s important to note that, just by modeling, we’ve already added some valuable information to the mix that isn’t in Rosling’s scatterplot. First, the logistic regression model includes an intercept that captures information about the historical base rate of coup attempts worldwide, and most forecasters can tell you that the base rate is a powerful predictor in its own right. Second, where Rosling’s scatterplot implicitly gives its two elements equal weight in its predictions, the statistical model estimates parameters for those two variables that incorporate historical evidence about the strength and direction of their association with coup risk. Ideally we would use Rosling’s two variables on their own, but we need a model to convert values of those variables into predicted probabilities, and the process of modeling itself already carries us a couple of steps beyond the two-dimensional plot.
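The role of the intercept and the two weights is easier to see in code. The sketch below shows the general form of a two-predictor logistic model in Python. The base rate and coefficients are invented for illustration and are not the estimates from the model described in this post.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))


def inv_logit(z):
    """Convert log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))


def predicted_probability(intercept, coefficients, values):
    """Logistic model: probability = inv_logit(intercept + sum of b * x)."""
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    return inv_logit(z)


# Hypothetical numbers. If both weights were zero, every case would simply
# get the base rate encoded in the intercept.
base_rate = 0.05
intercept = logit(base_rate)
flat = predicted_probability(intercept, [0.0, 0.0], [3.0, -2.0])

# With made-up negative weights on (centered) income and democracy, a case
# that is above average on both gets a prediction below the base rate.
richer_freer = predicted_probability(intercept, [-0.5, -0.1], [1.0, 4.0])
print(round(flat, 3), round(richer_freer, 3))
```

A scatterplot has nothing that plays the part of `intercept`, and it weights both axes equally. The model lets the historical data set the starting point and the weights.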

Now, the results. Area under the ROC curve (AUC) is commonly used as a measure of predictive power for classification models like this one. AUC represents the probability that a randomly selected positive case (here, a country-year with any coup attempts) will have a higher predicted probability than a randomly selected negative case (a country-year with no coup attempts). It runs from 0 to 1, with 0.5 indicating discrimination no better than chance and higher values indicating better discrimination. The bar chart below plots AUC for three models: 1) one with Rosling’s two variables as linear predictors of coup risk; 2) another with nonlinear versions of Rosling’s variables (logged GDP and a quadratic term for the degree of democracy); and 3) a more complex model that adds information about recent coup activity, the age of a country’s political institutions, and participation in international human-rights treaty regimes, among other things.

[Figure: bar chart of AUC for the three models]

As the chart shows, a model with linear versions of Rosling’s axes does reasonably well at forecasting coup attempts, with an AUC of about 0.75. Transforming those variables to capture nonlinearities in those associations improves the predictive accuracy, but only a smidgen, to 0.76. Finally, the model that includes several other risk factors produces a bigger bump, pushing the AUC up to 0.80.
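The pairwise definition of AUC can be computed directly, which is a handy way to check your intuition about what the statistic means. Here is a minimal sketch in Python that counts ties as half a win. The scores are toy values, and for real work you would want a tested library routine rather than this quadratic-time loop.

```python
def auc(scores, labels):
    """Share of (positive, negative) pairs in which the positive case has
    the higher predicted score, counting ties as one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: two country-years with coup attempts, two without.
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_labels = [1, 1, 0, 0]
print(auc(toy_scores, toy_labels))  # 0.75
```

Of the four positive-negative pairs in the toy data, the positive case outranks the negative one in three, hence 0.75.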

Based on those results, I think it’s fair to say that Rosling’s scatterplot is on the right track, but we can do a lot better by a) estimating a model instead of just using a scatterplot and b) including other useful predictors in that model. The fact that a modeled version of Rosling’s index did okay won’t surprise anyone who’s done quantitative analysis of political instability. If you want to assess the relative risk of various forms of domestic political crisis across many countries, you can get a pretty good handle on the problem just by seeing how poor and authoritarian each country is. Still, a scatterplot alone doesn’t get us very far, and adding a few more things to the model that are specifically indicative of coup risk helps us do even better.

I’ll close this post on an ironic note: at this point, it’s not even clear that yesterday’s tumult in Eritrea was really an attempted coup. According to an initial report from Reuters, the soldiers who occupied the Ministry of Information demanded the release of political prisoners but did not threaten to topple the government. Political scientists generally reserve the term “coup” for situations where challengers use or threaten violence to capture state power and call cases where disobedient soldiers demand policy changes “mutinies.” This might seem like hair splitting, but the latter is more common and usually less consequential than the former, and we wouldn’t necessarily expect a predictive model designed for the one to work well for the other.

Rules of Thumb vs. Statistical Models, or the Misconception that Will Not Die

Steve LeVine kicked off the new year on Quartz with a nice post called “14 rules for predicting future geopolitical events.” According to LeVine,

Nations are eccentric. But they also have threads of repeated history through which we can discern what comes next…Many political scientists dismiss the detection of such trends as “deterministic.” Some insist that, unlike in economics and statistics, there is as yet in fact no useful algorithm for foreseeing events—the only tool available to political forecasters is their own intuition. But it is vapid to observe the world, its nations and peoples as an unfathomable mob. History is not a science—but neither is it pure chaos.

If you’re a regular reader of this blog, you know I basically agree. Borrowing Almond and Genco’s classic metaphor, politics isn’t clock-like, but it’s not purely random, either. I also found little to dispute in the 14 rules that followed. For example, LeVine’s Muddle-Along Rule and its corollary, the Precipice Rule, are really just admonitions to take a deep breath when considering the risk of big but rare crises and recognize that, most of the time, the crisis won’t materialize. In statistical terms, that’s analogous to forecasting the base rate, and that’s actually a pretty powerful rule of thumb.

Still, after reading LeVine’s piece, I felt frustrated. As someone who uses statistical models to do the kind of forecasting he seems to be proposing, I couldn’t help but wonder: Why stop halfway? Rules of thumb can be very helpful, but they are often pretty coarse. Okay, so most cases will “tend to muddle along regardless of the trouble, and not collapse,” but can’t we say something more specific about just how unlikely that collapse is? Does it vary across forms of crisis or types of countries? LeVine proposes using history as a resource for gleaning useful patterns but then stops short of doing so in anything but the fuzziest terms.

Equally important, it’s often not clear how to use rules of thumb together, especially when they’re in tension with one another. Some of the rules on LeVine’s list contradict each other, and it’s not clear to me how you’d adjudicate between them when trying to make judgments about specific cases. For example, in addition to the Muddle-Along and Precipice Rules, LeVine gives us the True-Believer Rule:

While people and countries tend toward the middle, events can turn on exceptions operating on the extremes. Hitler’s Germany is an example. Today, Khameini’s Iran, Afghanistan’s Taliban, Kim’s North Korea and Chavez’s Venezuela punch above their weight in influencing the geopolitical landscape.

Now imagine you’re trying to apply these rules to a case that isn’t already on that short list of exceptions. How can we tell in advance whether it’s a muddler or a true believer? If you’re not sure, what’s the forecast?

I don’t know LeVine personally, so I won’t make any assumptions about his motivations, but I do think the preference for rules of thumb over quantified forecasts exemplified in his Quartz post is pretty common to political forecasting. And I wonder if this aversion to statistics isn’t born, in part, of ignorance of what the use of statistics implies. A couple of days ago, I asked on Twitter: “Why do lay audiences consume weather forecasts w/o asking how they’re made but want peek under hood of stat forecasts of pol crises?” To which Dan Drezner replied, “My (obvious) answer is that people accept meteorology as an actual science, don’t believe the same about political science.”

But here’s the thing: statistics isn’t science, it’s a set of tools for doing science. The decision to use statistics does not presume either regularity in, or certainty about, the object of study. If anything, that decision is a reasoned choice to search for empirical evidence of regularity, an attempt to clarify our un-certainty. The whole point of statistical modeling for forecasting is to take a bunch of conjectures like LeVine’s and run them through a mill that provides clearer answers to the questions that naturally arise when we try to apply those conjectures to specific situations.

Put another way, a statistical forecasting model is really nothing more than a meta-rule of thumb, a flow chart for moving from those initial conjectures to a single best estimate. That the estimate is presented as a number does not automatically imply that its presenter believes it’s any more true or certain than an estimate described in a phrase. It’s just another form of representation for our ideas, and one that happens to be especially useful because it lends itself to the application of some really powerful tools for pattern recognition we’ve finally devised after a few million years of human evolution.

Yes, there was a time when statistics was new and notions of science and modernity and quantification all got mashed together in some professional and social circles into an extreme optimism about the predictability of human behaviors. As far as I can tell, though, very few practicing social scientists think that way any more. And, honestly, I’m just tired of carrying the intellectual baggage those 19th-century hacks left behind.

PS. In a follow-up post, LeVine applies his rules of thumb to produce “six geopolitical predictions for 2013.” On the whole, I think this is a thoughtful exercise, and I only wish more qualitative analysts would be as transparent as Steve is here about the mental models underlying their predictions.

Dr. Bayes, or How I Learned to Stop Worrying and Love Updating

Over the past week, I spent a chunk of every morning cycling through the desert around Tucson, Arizona. On Friday, while riding toward my in-laws’ place in the mountains west of town, I heard the roar of a jet overhead. My younger son’s really into flying right now, and Tucson’s home to a bunch of fighter jets, so I reflexively glanced up toward the noise, hoping to spot something that would interest him.

On that first glance, all I saw was an empty patch of deep blue sky. Without effort, my brain immediately retrieved a lesson from middle-school physics, reminding me that the relative speeds of light and sound meant any fast-moving plane would appear ahead of its roar. But which way? Before I glanced up again, I drew on prior knowledge of local patterns to guess that it would almost certainly be to my left, traveling east, and not so far ahead of the sound because it would be flying low as it approached either the airport or the Air Force Base.  Moments after my initial glance, I looked up a second time and immediately spotted the plane where I’d now expected to find it. When I did, I wasn’t surprised to see that it was a commercial jet, not a military one, because most of the air traffic in the area is civilian.

This is Bayesian thinking, and it turns out that we do it all the time.  The essence of Bayesian inference is updating. We humans intuitively form and hold beliefs (estimates) about all kinds of things. Those beliefs are often erroneous, but it turns out that we can make them better by revising (updating) them whenever we encounter new information that pertains to them. Updating is really just a form of learning, but Bayes’ theorem gives us a way to structure that learning that turns out to be very powerful. As cognitive scientists Tom Griffiths and Joshua Tenenbaum summarize in a nice 2006 paper [PDF] called “Statistics and the Bayesian Mind,”

The mathematics of Bayesian belief is set out in the box. The degree to which one should believe in a particular hypothesis h after seeing data d is determined by two factors: the degree to which one believed in it before seeing d, as reflected by the prior probability P(h), and how well it predicts the data d, as reflected in the likelihood, P(d|h).
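
The box that passage refers to isn’t reproduced here. For reference, the standard statement of Bayes’ rule in the same notation is sketched below; h, d, P(h), and P(d|h) come from the quotation, while the sum over alternative hypotheses h′ in the denominator is the usual textbook normalizing term and not something quoted from the paper.

```latex
P(h \mid d) \;=\; \frac{P(d \mid h)\, P(h)}{\sum_{h'} P(d \mid h')\, P(h')}
```

The numerator multiplies the two factors the quotation names, the prior and the likelihood. The denominator just rescales the result so that the posterior beliefs across all of the hypotheses under consideration sum to one.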

This might sound like a lot of work or just too arcane to bother, but Griffiths and Tenenbaum argue that we often think that way intuitively. Their paper gives several examples, including predictions about the next result in a series of coin flips and the common tendency to infer causality from clusters that actually arise at random.

The same process appears in my airplane-spotting story. My initial glance is akin to the base rates that are often used as the starting point for Bayesian inference: to see something you hear, look where the sound is coming from. When that prediction failed, I went through three rounds of updating before I looked up again: one based on general knowledge about the relative speeds of light and sound, and then a second (direction) and third (commercial vs. military) based on prior observations of local air traffic. My final “prediction” turned out to be right because those local patterns are strong, but even with all that objective information, there was still some uncertainty. Who knows, there could have been an emergency, or a rogue pilot, or an alien invasion…

I’m writing about this because I think it’s interesting, but I also have ulterior motives. A big part of my professional life involves using statistical models to forecast rare political events, and I am deeply frustrated by frequent encounters with people who dismiss statistical forecasts out of hand (see here and here for previous posts on the subject). It’s probably unrealistic of me to think so, but I am hopeful that recognition of the intuitive nature and power of Bayesian updating might make it easier for skeptics to make use of my statistical forecasts and others like them.

I’m a firm believer in the forecasting power of statistical models, so I usually treat a statistical forecast as my initial belief (or prior, in Bayesian jargon) and then only revise that forecast as new information arrives. That strategy is based on another prior, namely, the body of evidence amassed by Phil Tetlock and others that the predictive judgments of individual experts often aren’t very reliable, and that statistical models usually produce more accurate forecasts.

From personal experience I gather that most people, including many analysts and policymakers, don’t share that belief about the power of statistical models for forecasting. Even so, I would like to think those skeptics might still see how Bayes’ rule would allow them to make judicious use of statistical forecasts, even if they trust their own or other experts’ judgments more. After all, to ignore a statistical forecast is equivalent to holding the extreme view that the forecast contains absolutely no useful information. In The Theory That Would Not Die, an entertaining lay history of Bayes’ rule, Sharon Bertsch McGrayne quotes Larry Stone, a statistician who used the theorem to help find a nuclear submarine that went missing in 1968, as saying that, “Discarding one of the pieces of information is in effect making the subjective judgment that its weight is zero and the other weight is one.”

So instead of rejecting the statistical forecast out of hand, why not update in response to it? When the statistical forecast closely accords with your prior belief, it will strengthen your confidence in that judgment, and rightly so. When the statistical forecast diverges from your prior belief, Bayes’ theorem offers a structured but simple way to arrive at a new estimate. Experience shows that this deliberate updating will produce more accurate forecasts than the willful myopia involved in ignoring the new information the statistical model has provided. And, as a kind of bonus, the deliberation involved in estimating the conditional probabilities Bayes’ theorem requires may help to clarify your thinking about the underlying processes involved and the sensitivity of your forecasts to certain assumptions.
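
To make that updating step concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than a description of any real forecasting system: the function name `bayes_update`, the idea of reducing the statistical forecast to a simple “high-risk flag,” and all of the numbers are made up for the example.

```python
def bayes_update(prior: float, p_signal_if_event: float, p_signal_if_no_event: float) -> float:
    """Return P(event | signal) via Bayes' rule for a yes/no event.

    prior                -- belief in the event before seeing the signal
    p_signal_if_event    -- chance of seeing the signal when the event is coming
    p_signal_if_no_event -- chance of seeing the signal when it is not
    """
    for value in (prior, p_signal_if_event, p_signal_if_no_event):
        if not 0.0 <= value <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    numerator = p_signal_if_event * prior
    denominator = numerator + p_signal_if_no_event * (1.0 - prior)
    if denominator == 0.0:
        raise ValueError("the signal is impossible under both hypotheses")
    return numerator / denominator


# Hypothetical case: an analyst puts the chance of a crisis at 10 percent, and a
# statistical model then flags the country as high-risk. Suppose (assumption)
# the model flags 80 percent of countries that go on to have a crisis and
# 20 percent of those that do not.
analyst_prior = 0.10
updated = bayes_update(analyst_prior, 0.80, 0.20)
print(f"belief after seeing the flag: {updated:.3f}")

# If the flag were equally likely either way, it would carry no information
# and the belief would not move at all.
unchanged = bayes_update(analyst_prior, 0.50, 0.50)
print(f"belief after an uninformative flag: {unchanged:.3f}")
```

The second call is the arithmetic version of ignoring the forecast: the belief stays put only when the signal is assumed to be equally likely whether or not the event is coming.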

PS. For some nice worked examples of Bayesian updating, see Appendix B of The Theory That Would Not Die or Chapter 8 of Nate Silver’s book, The Signal and the Noise. And thanks to Paul Meinshausen for pointing out the paper by Griffiths and Tenenbaum, and to Jay Yonamine for recommending The Theory That Would Not Die.

A Comment on Nate Silver’s The Signal and the Noise

I’ve just finished reading Nate Silver’s very good new book, The Signal and the Noise: Why So Many Predictions Fail—But Some Don’t. For me, the book was more of a tossed salad than a cake—a tasty assemblage of conceptually related parts that doesn’t quite cohere into a new whole. Still, I would highly recommend it to anyone interested in forecasting, a category that should include anybody with a pulse, as Silver persuasively argues. We can learn a lot just by listening to a skilled practitioner talk about his craft, and that, to me, is what The Signal and the Noise is really all about.

Instead of trying to review the whole book here, though, I wanted to pull on one particular thread running through it, because I worry about where that thread might lead some readers. That thread concerns the relative merits of statistical models and expert judgment as forecasting tools.

Silver is a professional forecaster who built his reputation on the clever application of statistical tools, but that doesn’t mean he’s a quantitative fundamentalist. To the contrary, one of the strongest messages in The Signal and the Noise is that our judgment may be poor, but we shouldn’t fetishize statistical models, either. Yes, human forecasters are inevitably biased, but so, in a sense, are the statistical models they build. First, those models entail a host of assumptions about the reliability and structure of the data, many of which will often be wrong. Second, there is often important information that’s hard to quantify but is useful for forecasting, and we ignore the signals from this space at our own peril. Third, forecasts are usually more accurate when they aggregate information from multiple, independent sources, and subjective forecasts from skilled experts can be a really useful leg in this stool.

Putting all of these concerns together, Silver arrives at a philosophy of forecasting that might be described as “model-assisted,” or maybe just “omnivorous.” Silver recognizes the power of statistics for finding patterns in noisy data and checking our mental models, but he also cautions strongly against putting blind faith in those tools and favors keeping human judgment in the cycle, including at the final stage where we finally make a forecast about some situation of interest.

To illustrate the power of model-assisted forecasting, Silver describes how well this approach has worked in several areas, including baseball scouting, election forecasting, and meteorology. About the last of these, for example, he writes that “weather forecasting is one of the success stories in this book, a case of man and machine joining forces to understand and sometimes anticipate the complexities of nature.”

All of what Silver says about the pitfalls of statistical forecasting and the power of skilled human forecasters is true, but only to a point. I think Silver’s preferred approach depends on a couple of conditions that are often absent in real-world efforts to forecast complex political phenomena, where I’ve done most of my work. Because Silver doesn’t spell those conditions out, I thought I would, in an effort to discourage readers of The Signal and the Noise from concluding that statistical forecasts can always be improved by adjusting them according to our judgment.

First, the process Silver recommends assumes that the expert tweaking the statistical forecast is the modeler, or at least has a good understanding of the strengths and weaknesses of the model(s) being used. For example, he describes experienced meteorologists improving their forecasts by manually adjusting certain values to correct for a known flaw in the model. Those manual adjustments seem to make the forecasts better, but they depend on a pretty sophisticated knowledge of the underlying algorithm and the idiosyncrasies of the historical data.

Second and probably more important, the scenarios Silver approvingly describes all involve situations where the applied forecaster gets frequent and clear feedback on the accuracy of his or her predictions. This feedback allows the forecaster to look for patterns in the performance of the statistical tool and the adjustments being made to it. It’s the familiar process of trial and error, but that process only works when we can see where the errors are and see if the fixes we attempt are actually working.

Both of these conditions hold in several of the domains Silver discusses, including baseball scouting and meteorology. These are data-rich environments where forecasters often know the quirks of the data and statistical models they might use and can constantly see how they’re doing.

In the world of international politics, however, most forecasters—and, whether they realize it or not, every analyst is a forecaster—have little or no experience with statistical forecasting tools and are often skeptical of their value. As a result, discussions about the forecasts these tools produce are more likely to degenerate into a competitive, “he said, she said” dynamic than they are to achieve the synergy that Silver praises.

More important, feedback on the predictive performance of analysts in international politics is usually fuzzy or absent. Poker players get constant feedback from the changing size of their chip stacks. By contrast, people who try to forecast politics rarely do so with much specificity, and even when they do, they rarely keep track of their performance over time. What’s worse, the events we try to forecast, things like coups or revolutions, rarely occur, so there aren’t many opportunities to assess our performance even if we try. Most of the score-keeping is done in our own heads, but as Phil Tetlock shows, we’re usually poor judges of our own performance. We fixate on the triumphs, forget or explain away the misses, and spin the ambiguous cases as successes.

In this context, it’s not clear to me that Silver’s ideal of “model-assisted” forecasting is really attainable, at least not without more structure being imposed from the outside. For example, I could imagine a process where a third party elicits forecasts from human prognosticators and statistical models and then combines the results in a way that accounts for the strengths and blind spots of each input. This process would blend statistics and expert judgment, just not by means of a single individual as often happened in Silver’s favored examples.
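
One way such a third-party combination could work is sketched below, purely as an illustration. The function `pool_forecasts`, the choice to average on the log-odds scale, and the example weights are all assumptions made for this sketch; nothing here describes a procedure that has actually been used. In practice the weights would have to come from somewhere defensible, such as each source’s track record on past forecasts.

```python
import math


def logit(p: float) -> float:
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def pool_forecasts(probs: list[float], weights: list[float]) -> float:
    """Combine probability forecasts as a weighted average of their log-odds."""
    if not probs or len(probs) != len(weights):
        raise ValueError("need one weight per forecast")
    if any(not 0.0 < p < 1.0 for p in probs):
        raise ValueError("forecasts must be strictly between 0 and 1")
    if any(w < 0.0 for w in weights) or sum(weights) <= 0.0:
        raise ValueError("weights must be non-negative and not all zero")
    total = sum(weights)
    pooled_log_odds = sum(w * logit(p) for p, w in zip(probs, weights)) / total
    return inv_logit(pooled_log_odds)


# Hypothetical inputs: a statistical model says 30 percent, an expert says
# 10 percent, and the model is (by assumption) given twice the expert's weight.
model_forecast = 0.30
expert_forecast = 0.10
combined = pool_forecasts([model_forecast, expert_forecast], [2.0, 1.0])
print(f"combined forecast: {combined:.3f}")
```

Setting one weight to zero reproduces the other source’s forecast exactly, which is the “throw one input away” case made explicit. The point of writing the weights down is that they can then be argued about and checked against results, which is harder to do with an adjustment made in someone’s head.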

Meanwhile, the virtuous circle Silver describes is already built into the process of statistical modeling, at least when done well. For example, careful statistical forecasters will train their models on one sample of cases and then apply them to another sample they’ve never “seen” before. This out-of-sample validation lets modelers know if they’re onto something useful and gives them some sense of the accuracy and precision of their models before they rush out and apply them.
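
Here is a small sketch of that train-then-test routine in Python. The data are simulated, the “model” is deliberately crude (it just estimates an event rate for each of two made-up risk groups), and every name and number is an assumption for illustration only. The Brier score used to grade the forecasts is the mean squared difference between forecast probabilities and 0/1 outcomes, so lower is better.

```python
import random


def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)


def fit_rate_by_group(cases):
    """'Train' a toy model: the observed event rate within each group."""
    totals, events = {}, {}
    for group, outcome in cases:
        totals[group] = totals.get(group, 0) + 1
        events[group] = events.get(group, 0) + outcome
    return {group: events[group] / totals[group] for group in totals}


def make_cases(n, rng):
    """Simulate (group, outcome) pairs with rare events (made-up rates)."""
    cases = []
    for _ in range(n):
        high_risk = rng.random() < 0.3
        event_prob = 0.20 if high_risk else 0.02
        outcome = int(rng.random() < event_prob)
        cases.append(("high" if high_risk else "low", outcome))
    return cases


rng = random.Random(42)          # fixed seed so the run is repeatable
cases = make_cases(2000, rng)
train, test = cases[:1000], cases[1000:]   # the model never sees `test`

model = fit_rate_by_group(train)
base_rate = sum(outcome for _, outcome in train) / len(train)

test_outcomes = [outcome for _, outcome in test]
model_forecasts = [model.get(group, base_rate) for group, _ in test]
naive_forecasts = [base_rate] * len(test)

model_brier = brier_score(model_forecasts, test_outcomes)
naive_brier = brier_score(naive_forecasts, test_outcomes)
print(f"out-of-sample Brier score, toy model: {model_brier:.4f}")
print(f"out-of-sample Brier score, base rate: {naive_brier:.4f}")
```

Comparing the model against a naive base-rate forecast on cases it has never seen is the useful part. A model that cannot beat the base rate out of sample is not telling us much, however well it fit the training data.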

I couldn’t help but wonder how much Silver’s philosophy was shaped by the social part of his experiences in baseball and elections forecasting. In both of those domains, there’s a running culture clash, or at least the perception of one, between statistical modelers and judgment-based forecasters: nerds and jocks in baseball, quants and pundits in politics. When you work in a field like that, you can get a lot of positive social feedback by saying “Everybody’s right!” I’ve sat in many meetings where someone proposed combining statistical forecasts and expert judgment without specifying how that process would work or proposing that we actually check how the combination affects forecast accuracy. Almost every time, though, that proposal is met with a murmur of assent: “Of course! Experts are experts! More is more!” I get the sense that this advice will almost always be popular, but I’m not convinced that it’s always sound.

Silver is right, of course, when he argues that we can never escape subjectivity. Modelers still have to choose the data and models they use, both of which bake a host of judgments right into the pie. What we can do with models, though, is discipline our use of those data, and in so doing, more clearly compare sets of assumptions to see which are more useful. Most political forecasters don’t currently inhabit a world where they can get to know the quirks of the statistical models and adjust for them. Most don’t have statistical models or hold them at arm’s length if they do, and they don’t get to watch them perform anywhere near enough to spot and diagnose the biases. When these conditions aren’t met, we need to be very cautious about taking forecasts from a well-designed model and tweaking them because they don’t feel right.

Forecasting Political Instability: Results from a Tournament of Methods

I’ve just posted to SSRN a report describing a statistical forecasting “tournament” undertaken by the CIA-funded Political Instability Task Force (PITF) in 2009–2010. I was PITF’s research director from 2001 until the start of 2011, and I designed and participated in this melee. You can download the full report here. As the abstract states,

The purpose of the tournament was to evaluate systematically the relative merits of several statistical techniques for forecasting various forms of political change in countries worldwide. Among other things, the tournament confirmed our belief that domain expertise and familiarity with relevant data help lead to more accurate forecasts. When knowledge of theory and data were held constant, the forecasts produced by most of the techniques we tried did not diverge by much. Unsurprisingly, this tournament also confirmed that forecasting rare forms of political instability as far as two years in advance is hard to do well. The forecasting tools the participants produced were generally quite good at discriminating high-risk cases from low-risk ones, but none was very precise.

The idea for the tournament came in 2009 from a story about the Netflix Prize, and I was really gratified to get to implement something a little bit like that process within PITF. I hope the report is useful to other practicing forecasters and would love to hear what folks make of the results.

Episodes of Democracy and Autocracy: A New Data Set

To look for patterns in the occurrence of transitions to democracy and democratic breakdowns around the world over time, we need reliable observations of where and when those events have happened. Most statistical analyses of democratic transitions in the past 15 years have used either Polity or the Democracy-Dictatorship (DD) data set to do that. As part of my work for the Political Instability Task Force (PITF), I developed yet another data set on episodes of democratic and authoritarian government in countries worldwide with populations larger than 500,000. The results of that work—I’m calling it the Democracy/Autocracy Data Set (DAD)—are now publicly available on the Dataverse Network, a data-sharing platform operated by Harvard University’s Institute for Quantitative Social Science.*

Like DD, DAD sorts cases annually into two bins: democracies and non-democracies.  Countries are identified as democracies when they satisfy all of several criteria, like items on a checklist. Cases that fail to satisfy one or more of those criteria are identified as non-democracies. Those criteria are meant to be indicative of four broader conditions essential to democracy:

  • Elected officials rule. Representatives chosen by citizens actually make policy, and unelected individuals, bodies, and organizations cannot veto those representatives’ decisions.
  • Elections are fair and competitive. The process by which citizens elect their rulers provides voters with meaningful choice and is free from deliberate fraud or abuse.
  • Politics is inclusive. Adult citizens have equal rights to vote and participate in government and fair opportunity to exercise those rights.
  • Civil liberties are protected. Freedoms of speech, association, and assembly give citizens the chance to deliberate on their interests, to organize in pursuit of those interests, and to monitor the performance of their elected representatives and the bureaucracies on which those officials depend.
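
The checklist logic is simple enough to sketch in a few lines of Python. Note the assumptions: the field names are invented for this illustration, and the four broad conditions above stand in for the more specific coding criteria, which aren’t listed in this post. This is not the actual DAD coding procedure.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CountryYear:
    """One country-year, scored yes/no on each (hypothetical) checklist item."""
    elected_officials_rule: bool
    elections_fair_and_competitive: bool
    politics_inclusive: bool
    civil_liberties_protected: bool


def is_democracy(case: CountryYear) -> bool:
    """A case counts as a democracy only if every checklist item is satisfied."""
    return all(getattr(case, f.name) for f in fields(case))


# Two made-up cases: one passes every item, one fails on a single item.
passes_all = CountryYear(True, True, True, True)
fails_one = CountryYear(True, False, True, True)

print(is_democracy(passes_all))
print(is_democracy(fails_one))
```

Because the rule is “all or nothing,” a single failed item is enough to put a case in the non-democracy bin, no matter how well it does on the others.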

Conceptually, these conditions are very similar to the ones used in constructing the DD data set. So why bother doing it all over again? The impetus to re-invent this particular wheel came from concerns I had about the effects of a couple of ancillary rules the makers of the original DD data set used to make decisions about ambiguous cases. As I saw it, those rules systematically skewed the resulting data in ways that are especially problematic for the kind of survival analysis those authors and many others have done with them. I won’t belabor the issue here, but interested readers can find more on the subject in this paper of mine on SSRN.

DAD was designed with survival analysis in mind, so it includes duration of current status, indicators of change from current status, and running counts of past events of both types (transitions to and from democracy). Importantly, those running counts include episodes before 1955, so at least that portion of the data set is not left-censored. Unlike DD, DAD does not differentiate within the two bins among types of democracy and dictatorship. Also unlike DD, however, DAD does track times to first alternations in power within democratic episodes—by individual chief executive and by ruling party/coalition—and it differentiates among democratic breakdowns by their form: executive coup (a.k.a. consolidation of incumbent advantage), military coup, rebellion, or other.
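
For readers who haven’t built survival-style variables before, here is a rough sketch of how duration, change indicators, and running event counts can be derived from a yearly democracy/non-democracy series. The function and field names are invented for this example and are not DAD’s actual variable names. One simplification to flag: the sketch starts its clock and its counts at the first year it is given, whereas DAD’s running counts also include earlier episodes.

```python
def add_survival_fields(years, is_democracy):
    """Derive duration, change flags, and running event counts from a yearly series.

    years        -- list of years in ascending order
    is_democracy -- list of booleans, one per year
    """
    rows = []
    duration = 0
    transitions_to_democracy = 0
    democratic_breakdowns = 0
    previous = None
    for year, status in zip(years, is_democracy):
        changed = previous is not None and status != previous
        if changed:
            if status:
                transitions_to_democracy += 1
            else:
                democratic_breakdowns += 1
            duration = 1      # first year of the new episode
        else:
            duration += 1     # same status as last year (or first observed year)
        rows.append({
            "year": year,
            "democracy": status,
            "duration": duration,
            "changed": changed,
            "transitions_to_democracy": transitions_to_democracy,
            "democratic_breakdowns": democratic_breakdowns,
        })
        previous = status
    return rows


# A made-up five-year history: two autocratic years, two democratic, then a breakdown.
example = add_survival_fields(
    [2000, 2001, 2002, 2003, 2004],
    [False, False, True, True, False],
)
for row in example:
    print(row)
```

Each row then carries what a survival model needs for that year: how long the current episode has lasted, whether a change just happened, and how many events of each type the country has already experienced.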

As a kind of bonus, DAD also includes annual data on each country’s participation in a host of regional and global intergovernmental organizations and treaty regimes, data I used for this paper, which looks at the effects of international integration on prospects for transitions to and from democracy. Those data are also available as a standalone data set through ICPSR (link).

For other published or publicly available research I’ve done with DAD, see here, here, here, here, and here.

Based on my experience working with Polity, DD, and Freedom House’s Freedom in the World data, I can say a little bit about how the various sources compare to one another. In its calls on which regimes are democratic, DAD is closest to Freedom House’s annual list of electoral democracies. DD is generally more cautious than DAD, identifying as dictatorships some cases where DAD sees (usually short-lived) spells of democratic government that ended with a consolidation of incumbent advantage. Polity runs the opposite way, identifying as more democratic than autocratic many cases where DAD sees an autocracy (e.g. Russia and Armenia today).

At present, I am not planning to update DAD. Still, I hope it’s a useful resource and welcome comments and criticisms. Again, you can find the data set and supporting documentation on the Dataverse Network.

* This research was conducted for the Political Instability Task Force (PITF). The PITF is funded by the Central Intelligence Agency (CIA). The views expressed herein are the author’s alone and do not necessarily represent the views of the Task Force or the U.S. Government.
