Deriving a Fuzzy-Set Measure of Democracy from Several Dichotomous Data Sets

In a recent post, I described an ongoing project in which Shahryar Minhas, Mike Ward, and I are using text mining and machine learning to produce fuzzy-set measures of various political regime types for all countries of the world. As part of the NSF-funded MADCOW project,* our ultimate goal is to devise a process that routinely updates those data in near-real time at low cost. We’re not there yet, but our preliminary results are promising, and we plan to keep tinkering.

One of the crucial choices we had to make in our initial analysis was how to measure each regime type for the machine-learning phase of the process. This choice is important because our models are only going to be as good as the data from which they’re derived. If the targets in that machine-learning process don’t reliably represent the concepts we have in mind, then the resulting models will be looking for the wrong things.

For our first cut, we decided to use dichotomous measures of several regime types, and to base those dichotomous measures on stringent criteria. So, for example, we identified as democracies only those cases with a score of 10, the maximum, on Polity’s scalar measure of democracy. For military rule, we only coded as 1 those cases where two major data sets agreed that a regime was authoritarian and only military-led, with no hybrids or modifiers. Even though the targets of our machine-learning process were crisply bivalent, we could get fuzzy-set measures from our classifiers by looking at the probabilities of class membership they produce.
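
To make that last point concrete, here is a minimal sketch in R of the general idea: train a classifier on crisp 0/1 labels, then keep the predicted class-membership probabilities as fuzzy-set scores instead of the hard calls. The simulated data frame and feature names are hypothetical stand-ins, not our actual text-derived inputs, and the logit model is just a placeholder for whatever classifier gets used.

```r
# Hypothetical training data: 'democracy' is a strict 0/1 target;
# x1 and x2 stand in for text-derived features.
set.seed(1)
df <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
df$democracy <- rbinom(500, 1, plogis(1.5 * df$x1 - 0.5 * df$x2))

# Train on the crisp labels...
fit <- glm(democracy ~ x1 + x2, data = df, family = binomial)

# ...but keep the predicted probabilities as fuzzy-set membership scores
df$membership <- predict(fit, type = "response")
summary(df$membership)  # values spread across [0, 1], not just {0, 1}
```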

In future iterations, though, I’m hoping we’ll get a chance to experiment with targets that are themselves fuzzy or that just take advantage of a larger information set. Bayesian measurement error models offer a great way to generate those targets.

Imagine that you have a set of cases that may or may not belong in some category of interest—say, democracy. Now imagine that you’ve got a set of experts who vote yes (1) or no (0) on the status of each of those cases and don’t always agree. We can get a simple estimate of the probability that a given case is a democracy by averaging the experts’ votes, and that’s not necessarily a bad idea. If, however, we suspect that some experts are more error prone than others, and that the nature of those errors follows certain patterns, then we can do better with a model that gleans those patterns from the data and adjusts the averaging accordingly. That’s exactly what a Bayesian measurement error model does. Instead of an unweighted average of the experts’ votes, we get an inverse-error-rate-weighted average, which should be more reliable than the unweighted version if the assumption about predictable patterns in those errors is largely correct.
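
Here is a toy simulation in R of the weighting idea described above, with made-up error rates rather than estimates of any real sources' reliability. A proper Bayesian measurement model would estimate the raters' error rates and the latent status jointly (typically via MCMC), but even this shortcut version shows why down-weighting error-prone voters beats a simple average when the error rates are roughly known.

```r
set.seed(2)
n_cases <- 200
truth   <- rbinom(n_cases, 1, 0.5)          # latent democracy status
error   <- c(0.05, 0.10, 0.20, 0.30, 0.40)  # assumed per-rater error rates

# Each rater reports the truth, flipped with probability equal to their error rate
votes <- sapply(error, function(e) ifelse(rbinom(n_cases, 1, e) == 1, 1 - truth, truth))

unweighted <- rowMeans(votes)                  # simple average of the votes
w          <- 1 / error                        # inverse-error-rate weights
weighted   <- as.vector(votes %*% w / sum(w))  # weighted average

# The weighted average tracks the latent truth more closely (lower squared error)
c(unweighted = mean((unweighted - truth)^2),
  weighted   = mean((weighted - truth)^2))
```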

I’m not trained in Bayesian data analysis and don’t know my way around the software used to estimate these models, so I sought and received generous help on this task from Sean J. Taylor. I compiled yes/no measures of democracy from five country-year data sets that ostensibly use similar definitions and coding criteria:

  • Cheibub, Gandhi, and Vreeland’s Democracy and Dictatorship (DD) data set, 1946–2008 (here);
  • Boix, Miller, and Rosato’s dichotomous coding of democracy, 1800–2007 (here);
  • A binary indicator of democracy derived from Polity IV using the Political Instability Task Force’s coding rules, 1800–2013;
  • The lists of electoral democracies in Freedom House’s annual Freedom in the World reports, 1989–2013; and
  • My own Democracy/Autocracy data set, 1955–2010 (here).

Sean took those five columns of zeroes and ones and used them to estimate a model with no prior assumptions about the five sources’ relative reliability. James Melton, Stephen Meserve, and Daniel Pemstein use the same technique to produce the terrific Unified Democracy Scores. What we’re doing is a little different, though. Where their approach treats democracy as a scalar concept and estimates a composite index from several measures, we’re accepting the binary conceptualization underlying our five sources and estimating the probability that a country qualifies as a democracy. In fuzzy-set terms, this probability represents a case’s degree of membership in the democracy set, not how democratic it is.

The distinction between a country’s degree of membership in that set and its degree of democracy is subtle but potentially meaningful, and the former will sometimes be a better fit for an analytic task than the latter. For example, if you’re looking to distinguish categorically between democracies and autocracies in order to estimate the difference in some other quantity across the two sets, it makes more sense to base that split on a probabilistic measure of set membership than an arbitrarily chosen cut point on a scalar measure of democracy-ness. You would still need to choose a threshold, but “greater than 0.5” has a natural interpretation (“probably a democracy”) that suits the task in a way that an arbitrary cut point on an index doesn’t. And, of course, you could still perform a sensitivity analysis by moving the cut point around and seeing how much that choice affects your results.
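
A quick sketch in R of that sensitivity analysis, using a made-up data frame rather than our estimates: sweep the cut point across a range of values and re-estimate the quantity of interest for each split.

```r
# Hypothetical data: p_dem is the estimated probability of democracy,
# y is some other quantity we want to compare across the two sets.
set.seed(3)
dat <- data.frame(p_dem = runif(300))
dat$y <- 2 + 3 * (dat$p_dem > 0.5) + rnorm(300)

# Difference in mean y between "probably democracies" and the rest,
# re-estimated at several cut points
sapply(c(0.3, 0.4, 0.5, 0.6, 0.7), function(cut) {
  dem <- dat$p_dem > cut
  mean(dat$y[dem]) - mean(dat$y[!dem])
})
```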

So that’s the theory, anyway. What about the implementation?

I’m excited to report that the estimates from our initial measurement model of democracy look great to me. As someone who has spent a lot of hours wringing my hands over the need to make binary calls on many ambiguous regimes (Russia in the late 1990s? Venezuela under Hugo Chavez? Bangladesh between coups?), I think these estimates are accurately distinguishing the hazy cases from the rest and even doing a good job estimating the extent of that uncertainty.

As a first check, let’s take a look at the distribution of the estimated probabilities. The histogram below shows the estimates for the period 1989–2007, the only years for which we have inputs from all five of the source data sets. Voilà, the distribution has the expected shape. Most countries most of the time are readily identified as democracies or non-democracies, but the membership status of a sizable subset of country-years is more uncertain.

Estimated Probabilities of Democracy for All Countries Worldwide, 1989-2007

Of course, we can and should also look at the estimates for specific cases. I know a little more about countries that emerged from the collapse of the Soviet Union than I do about the rest of the world, so I like to start there when eyeballing regime data. The chart below compares scores for several of those countries that have exhibited more variation over the past 20+ years. Most of the rest of the post-Soviet states are slammed up against 1 (Estonia, Latvia, and Lithuania) or 0 (e.g., Uzbekistan, Turkmenistan, Tajikistan), so I left them off the chart. I also limited the range of years to the ones for which data are available from all five sources. By drawing strength from other years and countries, the model can produce estimates for cases with fewer or even no inputs. Still, the estimates will be less reliable for those cases, so I thought I would focus for now on the estimates based on a common set of “votes.”

Estimated Probability of Democracy for Selected Soviet Successor States, 1991-2007

Those estimates look about right to me. For example, Georgia’s status is ambiguous and trending less likely until the Rose Revolution of 2003, after which point it’s probably but not certainly a democracy, and the trend bends down again soon thereafter. Meanwhile, Russia is fairly confidently identified as a democracy after the constitutional crisis of 1993, but its status becomes uncertain around the passage of power from Yeltsin to Putin and then solidifies as most likely authoritarian by the mid-2000s. Finally, Armenia was one of the cases I found most difficult to code when building the Democracy/Autocracy data set for the Political Instability Task Force, so I’m gratified to see its probability of democracy oscillating around 0.5 throughout.

One nice feature of a Bayesian measurement error model is that, in addition to estimating the scores, we can also estimate confidence intervals to help quantify our uncertainty about those scores. The plot below shows Armenia’s trend line with the upper and lower bounds of a 90-percent confidence interval. Here, it’s even easier to see just how unclear this country’s democracy status has been since it regained independence. From 1991 until at least 2007, its 90-percent confidence interval straddled the toss-up line. How’s that for uncertain?
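
For readers who want to reproduce that kind of plot from their own models: once you have posterior draws for a given country-year, the point estimate and the 90-percent bounds are one line of R apiece. The `draws` vector below is simulated, not output from our actual model.

```r
# 'draws' stands in for posterior samples of one country-year's probability of democracy
set.seed(4)
draws <- rbeta(4000, 5, 5)

# Point estimate plus a 90-percent interval
c(estimate = mean(draws),
  quantile(draws, probs = c(0.05, 0.95)))
```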

Armenia's Estimated Probability of Democracy with 90% Confidence Interval

Sean and I are still talking about ways to tweak this process, but I think the data it’s producing are already useful and interesting. I’m considering using these estimates in a predictive model of coup attempts and seeing if and how the results differ from ones based on the Polity index and the Unified Democracy Scores. Meanwhile, the rest of the MADCOW crew and I are now talking about applying the same process to dichotomous indicators of military rule, one-party rule, personal rule, and monarchy and then experimenting with machine-learning processes that use the results as their targets. There are lots of moving parts in our regime data-making process, and this one isn’t necessarily the highest priority, but it would be great to get to follow this path and see where it leads.

* NSF Award 1259190, Collaborative Research: Automated Real-time Production of Political Indicators

The Worst World EVER…in the Past 5 or 10 Years

A couple of months ago, the head of the UN’s refugee agency announced that, in 2013, “the number of people displaced by violent conflict hit the highest level since World War II,” and he noted that the number was still growing in 2014.

A few days ago, under the headline “Countries in Crisis at Record High,” Foreign Policy‘s The Cable reported that the UN’s Inter-Agency Standing Committee for the first time ever had identified four situations worldwide—Syria, Iraq, South Sudan, and Central African Republic—as level 3 humanitarian emergencies, its highest (worst) designation.

Today, the Guardian reported that “last year was the most dangerous on record for humanitarian workers, with 155 killed, 171 seriously wounded and 134 kidnapped as they attempted to help others in some of the world’s most dangerous places.”

If you read those stories, you might infer that the world has become more insecure than ever, or at least the most insecure it’s been since the last world war. That would be reasonable, but probably also wrong. These press accounts of record-breaking trends often omit or underplay a crucial detail: the data series on which these claims rely don’t extend very far into the past.

In fact, we don’t know how the current number of displaced persons compares to all years since World War II, because the UN only has data on that since 1989. In absolute terms, the number of refugees worldwide is now the largest it’s been since record-keeping began 25 years ago. Measured as a share of global population, however, the number of displaced persons in 2013 had not yet matched the peak of the early 1990s (see the Addendum here).

The Cable accurately states that having four situations designated as level-3 humanitarian disasters by the UN is “unprecedented,” but we only learn late in the story that the system which makes these designations has only existed for a few years. In other words, unprecedented…since 2011.

Finally, while the Guardian correctly reports that 2013 was the most dangerous year on record for aid workers, it fails to note that those records only reach back to the late 1990s.

I don’t mean to make light of worrisome trends in the international system or any of the terrible conflicts driving them. From the measures I track—see here and here, for example, and here for an earlier post on causes—I’d say that global levels of instability and violent conflict are high and waxing, but they have not yet exceeded the peaks we saw in the early 1990s and probably the 1960s. Meanwhile, the share of states worldwide that are electoral democracies remains historically high, and the share of the world’s population living in poverty has declined dramatically in the past few decades. The financial crisis of 2008 set off a severe and persistent global recession, but that collapse could have been much worse, and institutions of global governance deserve some credit for helping to stave off an even deeper failure.

How can all of these things be true at the same time? It’s a bit like climate change. Just as one or even a few unusually cool years wouldn’t reverse or disprove the clear long-term trend toward a hotter planet, an extended phase of elevated disorder and violence doesn’t instantly undo the long-term trends toward a more peaceful and prosperous human society. We are currently witnessing (or suffering) a local upswing in disorder that includes numerous horrific crises, but in global historical terms, the world has not fallen apart.

Of course, if it’s a mistake to infer global collapse from these local trends, it’s also a mistake to infer that global collapse is impossible from the fact that it hasn’t occurred already. The war that is already consuming Syria and Iraq is responsible for a substantial share of the recent increase in refugee flows and casualties, and it could spread further and burn hotter for some time to come. Probably more worrisome to watchers of long-term trends in international relations, the crisis in Ukraine and recent spate of confrontations between China and its neighbors remind us that war between major powers could happen again, and this time those powers would both or all have nuclear weapons. Last but not least, climate change seems to be accelerating with consequences unknown.

Those are all important sources of elevated uncertainty, but uncertainty and breakdown are not the same thing. Although those press stories describing unprecedented crises are all covering important situations and trends, I think their historical perspective is too shallow. I’m forty-four years old. The global system is less orderly than it’s been in a while, but it’s still not worse than it’s ever been in my lifetime, and it’s still nowhere near as bad as it was when my parents were born. I won’t stop worrying or working on ways to try to make things a tiny bit better, but I will keep that frame of reference in mind.

Uncertainty About How Best to Convey Uncertainty

NPR News ran a series of stories this week under the header Risk and Reason, on “how well we understand and act on probabilities.” I thought the series nicely represented how uncertain we are about how best to convey forecasts to people who might want to use them. There really is no clear standard here, even though it is clear that the choices we make in presenting forecasts and other statistics on risks to their intended consumers strongly shape what they hear.

This uncertainty about how best to convey forecasts was on full display in the piece on how CIA analysts convey predictive assessments (here). Ken Pollack, a former analyst who now teaches intelligence analysis, tells NPR that, at CIA, “There was a real injunction that no one should ever use numbers to explain probability.” Asked why, he says that,

Assigning numerical probability suggests a much greater degree of certainty than you ever want to convey to a policymaker. What we are doing is inherently difficult. Some might even say it’s impossible. We’re trying to predict the future. And, you know, saying to someone that there’s a 67 percent chance that this is going to happen, that sounds really precise. And that makes it seem like we really know what’s going to happen. And the truth is that we really don’t.

In that same segment, though, Dartmouth professor Jeff Friedman, who studies decision-making about national security issues, says we should provide a numeric point estimate of an event’s likelihood, along with some information about our confidence in that estimate and how malleable it may be. (See this paper by Friedman and Richard Zeckhauser for a fuller treatment of this argument.) The U.S. Food and Drug Administration apparently agrees; according to the same NPR story, the FDA “prefers numbers and urges drug companies to give numerical values for risk—and to avoid using vague terms such as ‘rare, infrequent and frequent.'”

Instead of numbers, Pollack advocates for using words: “Almost certainly or highly likely or likely or very unlikely,” he tells NPR. As noted by one of the other stories in the series (here), however—on the use of probabilities in medical decision-making—words and phrases are ambiguous, too, and that ambiguity can be just as problematic.

Doctors, including Leigh Simmons, typically prefer words. Simmons is an internist and part of a group practice that provides primary care at Mass General. “As doctors we tend to often use words like, ‘very small risk,’ ‘very unlikely,’ ‘very rare,’ ‘very likely,’ ‘high risk,’ ” she says.

But those words can be unclear to a patient.

“People may hear ‘small risk,’ and what they hear is very different from what I’ve got in my mind,” she says. “Or what’s a very small risk to me, it’s a very big deal to you if it’s happened to a family member.”

Intelligence analysts have sometimes tried to remove that ambiguity by standardizing the language they use to convey likelihoods, most famously in Sherman Kent’s “Words of Estimative Probability.” It’s not clear to me, though, how effective this approach is. For one thing, consumers are often lazy about trying to understand just what information they’re being given, and templates like Kent’s don’t automatically solve that problem. This laziness came across most clearly in NPR’s Risk and Reason segment on meteorology (here). Many of us routinely consume probabilistic forecasts of rainfall and make decisions in response to them, but it turns out that few of us understand what those forecasts actually mean. With Kent’s words of estimative probability, I suspect that many readers of the products that use them haven’t memorized the table that spells out their meaning and don’t bother to consult it when they come across those phrases, even when it’s reproduced in the same document.

Equally important, a template that works well for some situations won’t necessarily work for all. I’m thinking in particular of forecasts on the kinds of low-probability, high-impact events that I usually analyze and that are essential to the CIA’s work, too. Here, what look like small differences in probability can sometimes be very meaningful. For example, imagine that it’s August 2001 and you’ve got three different assessments of the risk of a major terrorist attack on U.S. soil in the next few months. One pegs the risk at 1 in 1,000; another at 1 in 100; and another at 1 in 10. Using Kent’s table, all three of those assessments would get translated into a statement that the event is “almost certainly not” going to happen, but I imagine that most U.S. decision-makers would have felt very differently about risks of 0.1%, 1%, and 10% with a threat of that kind.
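
Here is a toy illustration of that flattening, written in R with made-up bins rather than Kent's exact cut points: map each numeric risk to an estimative phrase and watch three very different risks collapse into the same words.

```r
# Illustrative phrase bins (not Kent's exact cut points)
to_phrase <- function(p) {
  cut(p,
      breaks = c(0, 0.12, 0.35, 0.65, 0.88, 1),
      labels = c("almost certainly not", "probably not", "chances about even",
                 "probable", "almost certain"),
      include.lowest = TRUE)
}

# Three very different assessments of the same threat...
risks <- c(0.001, 0.01, 0.1)

# ...all map to the same phrase, even though the decisions they imply differ
data.frame(risk = risks, phrase = to_phrase(risks))
```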

There are lots of rare but important events that inhabit this corner of the probability space: nuclear accidents, extreme weather events, medical treatments, and mass atrocities, to name a few. We could create a separate lexicon for assessments in these areas, as the European Medicines Agency has done for adverse reactions to medical therapies (here, via NPR). I worry, though, that we ask too much of consumers of these and other forecasts if we expect them to remember multiple lexicons and to correctly code-switch between them. We also know that the relevant scale will differ across audiences, even on the same topic. For example, an individual patient considering a medical treatment might not care much about the difference between a mortality risk of 1 in 1,000 and 1 in 10,000, but a drug company and the regulators charged with overseeing them hopefully do.

If there’s a general lesson here, it’s that producers of probabilistic forecasts should think carefully about how best to convey their estimates to specific audiences. In practice, that means thinking about the nature of the decision processes those forecasts are meant to inform and, if possible, trying different approaches and checking to find out how each is being understood. Ideally, consumers of those forecasts should also take time to try to educate themselves on what they’re getting. I’m not optimistic that many will do that, but we should at least make it as easy as possible for them to do so.

In Applied Forecasting, Keep It Simple

One of the lessons I think I’ve learned from the nearly 15 years I’ve spent developing statistical models to forecast rare political events is: keep it simple unless and until you’re compelled to do otherwise.

The fact that the events we want to forecast emerge from extremely complex systems doesn’t mean that the models we build to forecast them need to be extremely complex as well. In a sense, the unintelligible complexity of the causal processes relieves us from the imperative to follow that path. We know our models can’t even begin to capture the true data-generating process. So, we can and usually should think instead about looking for measures that capture relevant concepts in a coarse way and then use simple model forms to combine those measures.

A few experiences and readings have especially shaped my thinking on this issue.

  • When I worked on the Political Instability Task Force (PITF), my colleagues and I found that a logistic regression model with just four variables did a pretty good job assessing relative risks of a few forms of major political crisis in countries worldwide (see here, or ungated here). In fact, one of the four variables in that model—an indicator that four or more bordering countries have ongoing major armed conflicts—has almost no variance, so it’s effectively a three-variable model. We tried adding a lot of other things that were suggested by a lot of smart people, but none of them really improved the model’s predictive power. (There were also a lot of things we couldn’t even try because the requisite data don’t exist, but that’s a different story.)
  • Toward the end of my time with PITF, we ran a “tournament of methods” to compare the predictive power of several statistical techniques that varied in their complexity, from logistic regression to Bayesian hierarchical models with spatial measures (see here for the write-up). We found that the more complex approaches usually didn’t outperform the simpler ones, and when they did, it wasn’t by much. What mattered most for predictive accuracy was finding the inputs with the most forecasting power. Once we had those, the functional form and complexity of the model didn’t make much difference.
  • As Andreas Graefe describes (here), models that assign equal weights to all predictors often forecast at least as accurately as multiple regression models that estimate weights from historical data. “Such findings have led researchers to conclude that the weighting of variables is secondary for the accuracy of forecasts,” Graefe writes. “Once the relevant variables are included and their directional impact on the criterion is specified, the magnitudes of effects are not very important.” A toy version of that comparison appears in the sketch after this list.
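
Here is that equal-weights comparison in R, run on simulated data rather than any real forecasting problem: with a smallish training sample, an unweighted sum of standardized predictors often tracks the outcome out of sample about as well as regression-estimated weights.

```r
set.seed(5)
n <- 60                                  # deliberately small training sample
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
test  <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000), x3 = rnorm(1000))
train$y <- with(train, x1 + 0.8 * x2 + 0.6 * x3 + rnorm(n, sd = 2))
test$y  <- with(test,  x1 + 0.8 * x2 + 0.6 * x3 + rnorm(1000, sd = 2))

# Regression-estimated weights vs. equal weights on standardized predictors
fit <- lm(y ~ x1 + x2 + x3, data = train)
pred_reg   <- predict(fit, newdata = test)
pred_equal <- rowSums(scale(test[, c("x1", "x2", "x3")]))

# Out-of-sample correlation with the outcome is often comparable
c(regression = cor(pred_reg, test$y),
  equal_weights = cor(pred_equal, test$y))
```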

Of course, there will be some situations in which complexity adds value, so it’s worth exploring those ideas when we have a theoretical rationale and the coding skills, data, and time needed to pursue them. In general, though, I am convinced that we should always try simpler forms first and only abandon them if and when we discover that more complex forms significantly increase forecasting power.

Importantly, the evidence for that judgment should come from out-of-sample validation—ideally, from forecasts made about events that hadn’t yet happened. Models with more variables and more complex forms will often score better than simpler ones when applied to the data from which they were derived, but this will usually turn out to be a result of overfitting. If the more complex approach isn’t significantly better at real-time forecasting, it should probably be set aside until it is.
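
A bare-bones version of that check in R, again on simulated data: fit a simpler and a more complex specification on a training set, then compare Brier scores on held-out cases rather than in-sample fit. The variables and functional forms here are arbitrary; the point is the procedure.

```r
set.seed(6)
n <- 2000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$event <- rbinom(n, 1, plogis(-3 + 1.2 * d$x1 + 0.8 * d$x2))  # rare-ish event

train <- d[1:1000, ]
test  <- d[1001:2000, ]

simple  <- glm(event ~ x1 + x2, data = train, family = binomial)
complex <- glm(event ~ poly(x1, 3) * poly(x2, 3) + x3 + x4,
               data = train, family = binomial)

# Brier score on the held-out cases: mean squared distance between
# predicted probabilities and observed outcomes
brier <- function(model) {
  p <- predict(model, newdata = test, type = "response")
  mean((p - test$event)^2)
}

# Lower is better; the extra complexity often buys little or nothing out of sample
c(simple = brier(simple), complex = brier(complex))
```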

Oh, and a corollary: if you have to choose between a) building more complex models, or even just applying lots of techniques to the same data, and b) testing other theoretically relevant variables for predictive power, do (b).

Ripple Effects from Thailand’s Coup

Thailand just had another coup, its first since 2006 but its twelfth since 1932. Here are a few things statistical analysis tells us about how that coup is likely to reverberate through Thailand’s economy and politics for the next few years.

1. Economic growth will probably suffer a bit more. Thailand’s economy was already struggling in 2014, thanks in part to the political instability to which the military leadership was reacting. Still, a statistical analysis I did a few years ago indicates that the coup itself will probably impose yet more drag on the economy. When we compare annual GDP growth rates from countries that suffered coups to similarly susceptible ones that didn’t, we see an average difference of about 2 percentage points in the year of the coup and another 1 percentage point the year after. (See this FiveThirtyEight post for a nice plot and discussion of those results.) Thailand might find its way to the “good” side of the distribution underlying those averages, but the central tendency suggests an additional knock on the country’s economy.

2. The risk of yet another coup will remain elevated for several years. The “coup trap” is real. Countries that have recently suffered successful or failed coup attempts are more likely to get hit again than ones that haven’t. This increase in risk seems to persist for several years, so Thailand will probably stick toward the top of the global watch list for these events until at least 2019.

3. Thailand’s risk of state-led mass killing has nearly tripled…but remains modest. The risk and occurrence of coups and the character of a country’s national political regime feature prominently in the multimodel ensemble we’re using in our atrocities early-warning project to assess risks of onsets of state-led mass killing. When I recently updated those assessments using data from year-end 2013—coming soon to a blog near you!—Thailand remained toward the bottom of the global distribution: 100th of 162 countries, with a predicted probability of just 0.3%. If I alter the inputs to that ensemble to capture the occurrence of this week’s coup and its effect on Thailand’s regime type, the predicted probability jumps to about 0.8%.

That’s a big change in relative risk, but it’s not enough of a change in absolute risk to push the country into the end of the global distribution where the vast majority of these events occur. In the latest assessments, a risk of 0.8% would have placed Thailand about 50th in the world, still essentially indistinguishable from the many other countries in that long, thin tail. Even with changes in these important risk factors and an ongoing insurgency in its southern provinces, Thailand remains in the vast bloc of countries where state-led mass killing is extremely unlikely, thanks (statistically speaking) to its relative wealth, the strength of its connection to the global economy, and the absence of certain markers of atrocities-prone regimes.

4. Democracy will probably be restored within the next few years… As Henk Goemans and Nikolay Marinov show in a paper published last year in the British Journal of Political Science, since the end of the Cold War, most coups have been followed within a few years by competitive elections. The pattern they observe is even stronger in countries that have at least seven years of democratic experience and have held at least two elections, as Thailand does and has. In a paper forthcoming in Foreign Policy Analysis that uses a different measure of coups, Jonathan Powell and Clayton Thyne see that same broad pattern. After the 2006 coup, it took Thailand a little over a year to get back to competitive elections for a civilian government under a new constitution. If anything, I would expect this junta to move a little faster, and I would be very surprised if the same junta was still ruling in 2016.

5. …but it could wind up right back here again after that. As implied by nos. 1 and 2 above, however, the resumption of democracy wouldn’t mean that Thailand won’t repeat the cycle again. Both statistical and game-theoretic models indicate that prospects for yet another democratic breakdown will stay relatively high as long as Thai politics remains sharply polarized. My knowledge of Thailand is shallow, but the people I read or follow who know the country much better skew pessimistic on the prospects for this polarization ending soon. From afar, I wonder if it’s ultimately a matter of generational change and suspect that Thailand will finally switch to a stable and less contentious equilibrium when today’s conservative leaders start retiring from their jobs in the military and bureaucracy and their supporters age out of street politics.

Forecasting Round-Up No. 6

The latest in a very occasional series.

1. The Boston Globe ran a story a few days ago about a company that’s developing algorithms to predict which patients in cardiac intensive care units are most likely to take a turn for the worse (here). The point of this exercise is to help doctors and nurses allocate their time and resources more efficiently and, ideally, to give them more lead time to try to stop those bad turns from happening.

The story suffers some rhetorical tics common to press reports on “predictive analytics.” For example, we never hear any specifics about the analytic techniques used or the predictive accuracy of the tool, and the descriptions of machine learning tilt toward the ingenuous (e.g., “The more data fed into the model, the more accurate the prediction becomes”). On the whole, though, I think this article does a nice job representing the promise and reality of this kind of work. The following passage especially resonated with me, because it describes a process for applying these predictions that sounds like the one I have in mind when building my own forecasting tools:

The unit’s medical director, Dr. Melvin Almodovar, uses [the prediction tool] to double-check his own clinical assessment of patients. Etiometry’s founders are careful to note that physicians will always be the ultimate bedside decision makers, using the Stability Index to confirm or inform their own diagnoses.

Butler said that an information-overload environment like the intensive care unit is ideal for a data-driven risk assessment tool, because the patients teeter between life and death. A predictive model can act as an early warning system, pointing out risky changes in multiple vital signs in a more sophisticated way than bedside alarms.

When our predictive models aren’t as accurate as we’d like or don’t yet have a clear track record, this hybrid approach—decisions are informed by the forecasts but not determined by them—is a prudent way to go. In the cardiac intensive care unit, doctors are already applying their own mental models to these data, so the idea of developing explicit algorithms to do the same isn’t a stretch (or shouldn’t be, but…). Unlike those doctors, though, statistical models won’t suffer from low blood sugar or distraction or become emotionally attached to some patients but not others. Also unlike the mental models doctors use now, statistical models will produce explicit forecasts that can be collected and assessed over time. The resulting feedback will give the stats guys many opportunities to improve their models, and the hospital staff a chance to get a feel for the models’ strengths and limitations. When you’re making such weighty decisions, why wouldn’t you want that additional information?

2. Lyle Ungar recently discussed forecasting with the Machine Intelligence Research Institute (here). The whole thing deserves a read, but I especially liked this framework for thinking about when different methods work best:

I think one can roughly characterize forecasting problems into categories—each requiring different forecasting methods—based, in part, on how much historical data is available.

Some problems, like the geo-political forecasting [the Good Judgment Project is] doing, require lots of collection of information and human thought. Prediction markets and team-based forecasts both work well for sifting through the conflicting information about international events. Computer models mostly don’t work as well here—there isn’t a long enough track record of, say, elections or coups in Mali to fit a good statistical model, and it isn’t obvious what other countries are ‘similar.’

Other problems, like predicting energy usage in a given city on a given day, are well suited to statistical models (including neural nets). We know the factors that matter (day of the week, holiday or not, weather, and overall trends), and we have thousands of days of historical observation. Human intuition is not going to beat computers on that problem.

Yet other classes of problems, like economic forecasting (what will the GDP of Germany be next year? What will unemployment in California be in two years?) are somewhere in the middle. One can build big econometric models, but there is still human judgement about the factors that go into them. (What if Merkel changes her mind or Greece suddenly adopts austerity measures?) We don’t have enough historical data to accurately predict economic decisions of politicians.

The bottom line is that if you have lots of data and the world isn’t changing too much, you can use statistical methods. For questions with more uncertainty, human experts become more important.

I might disagree on the particular problem of forecasting coups in Mali, but I think the basic framework that Lyle proposes is right.

3. Speaking of the Good Judgment Project (GJP), a bevy of its researchers, including Ungar, have an article in the March 2014 issue of Psychological Science (here) that shows how certain behavioral interventions can significantly boost the accuracy of forecasts derived from subjective judgments. Here’s the abstract:

Five university-based research groups competed to recruit forecasters, elicit their predictions, and aggregate those predictions to assign the most accurate probabilities to events in a 2-year geopolitical forecasting tournament. Our group tested and found support for three psychological drivers of accuracy: training, teaming, and tracking. Probability training corrected cognitive biases, encouraged forecasters to use reference classes, and provided forecasters with heuristics, such as averaging when multiple estimates were available. Teaming allowed forecasters to share information and discuss the rationales behind their beliefs. Tracking placed the highest performers (top 2% from Year 1) in elite teams that worked together. Results showed that probability training, team collaboration, and tracking improved both calibration and resolution. Forecasting is often viewed as a statistical problem, but forecasts can be improved with behavioral interventions. Training, teaming, and tracking are psychological interventions that dramatically increased the accuracy of forecasts. Statistical algorithms (reported elsewhere) improved the accuracy of the aggregation. Putting both statistics and psychology to work produced the best forecasts 2 years in a row.

The atrocities early-warning project on which I’m working is learning from GJP in real time, and we hope to implement some of these lessons in the opinion pool we’re running (see this conference paper for details).

Speaking of which: If you know something about conflict or atrocities risk or a particular part of the world and are interested in volunteering as a forecaster, please send an email to ewp@ushmm.org.

4. Finally, Daniel Little writes about the partial predictability of social upheaval on his terrific blog, Understanding Society (here). The whole post deserves reading, but here’s the nub (emphasis in the original):

Take unexpected moments of popular uprising—for example, the Arab Spring uprisings or the 2013 riots in Stockholm. Are these best understood as random events, the predictable result of long-running processes, or something else? My preferred answer is something else—in particular, conjunctural intersections of independent streams of causal processes (link). So riots in London or Stockholm are neither fully predictable nor chaotic and random.

This matches my sense of the problem and helps explain why predictive models of these events will never be as accurate as we might like but are still useful, as are properly elicited and combined forecasts from people using their noggins.

The Steep Slope of the Data Revolution’s Second Derivative

Most of the talk about a social science “data revolution” has emphasized rapid increases in the quantity of data available to us. Some of that talk has also focused on changes in the quality of those data, including new ideas about how to separate the wheat from the chaff in situations where there’s a lot of grain to thresh. So far, though, we seem to be talking much less about the rate of change in those changes, or what calculus calls the second derivative.

Lately, the slope of this second derivative has been pretty steep. It’s not just that we now have much more, and in some cases much better, data. The sources and content of those data sets are often fast-moving targets, too. The whole environment is growing and churning at an accelerating pace, and that’s simultaneously exhilarating and frustrating.

It’s frustrating because data sets that evolve as we use them create a number of analytical problems that we don’t get from stable measurement tools. Most important, evolving data sets make it hard to compare observations across time, and longitudinal analysis is the crux of most social-scientific research. Paul Pierson explains why in his terrific 2004 book, Politics in Time:

Why do social scientists need to focus on how processes unfold over significant stretches of time? First, because many social processes are path dependent, in which case the key causes are temporally removed from their continuing effects… Second, because sequencing—the temporal order of events or processes—can be a crucial determinant of important social outcomes. Third, because many important social causes and outcomes are slow-moving—they take place over quite extended periods of time and are only likely to be adequately explained (or in some cases even observed in the first place) if analysts are specifically attending to that possibility.

When our measurement systems evolve as we use them, changes in the data we receive might reflect shifts in the underlying phenomenon. They also might reflect changes in the methods and mechanisms by which we observe and record information about that phenomenon, however, and it’s often impossible to tease the one out from the other.

A recent study by David Lazer, Gary King, Ryan Kennedy, and Alessandro Vespignani on what Google Flu Trends (GFT) teaches us about “traps in Big Data analysis” offers a nice case in point. Developed in the late 2000s by Google engineers and researchers at the Centers for Disease Control and Prevention, GFT uses data on Google search queries to help detect flu epidemics (see this paper). As Lazer and his co-authors describe, GFT initially showed great promise as a forecasting tool, and its success spurred excitement about the power of new data streams to shed light on important social processes. For the past few years, though, the tool has worked poorly on its own, and Lazer & co. believe changes in Google’s search software are the reason. The problem—for researchers, anyway—is that

The Google search algorithm is not a static entity—the company is constantly testing and improving search. For example, the official Google search blog reported 86 changes in June and July 2012 alone (SM). Search patterns are the result of thousands of decisions made by the company’s programmers in various sub-units and by millions of consumers worldwide.

Google keeps tinkering with its search software because that’s what its business entails, but we can expect to see more frequent changes in some data sets specific to social science, too. One of the developments about which I’m most excited is the recent formation of the Open Event Data Alliance (OEDA) and the initial release of the machine-coded political event data it plans to start producing soon, hopefully this summer. As its name implies, OEDA plans to make not just its data but also its code freely available to the public in order to grow a community of users who can help improve and expand the software. That crowdsourcing will surely accelerate the development of the scraping and coding machinery, but it also ensures that the data OEDA produces will be a moving target for a while in ways that will complicate attempts to analyze it.

If these accelerated changes are challenging for basic researchers, they’re even tougher on applied researchers, who have to show and use their work in real time. So what’s an applied researcher to do when your data-gathering instruments are frequently changing, and often in opaque and unpredictable ways?

First, it seems prudent to build systems that are modular, so that a failure in one part of the system can be identified and corrected without having to rebuild the whole edifice. In the atrocities early-warning system I’m helping to build right now, we’re doing this by creating a few subsystems with some overlap in their functions. If one part doesn’t pan out or suddenly breaks, we can lean on the others while we repair or retool.

Second, it’s also a good idea to embed those technical systems in organizational procedures that emphasize frequent checking and fast adaptation. One way to do this is to share your data and code and to discuss your work often with outsiders as you go, so you can catch mistakes, spot alternatives, and see these changes coming before you get too far down any one path. Using open-source statistical software like R is also helpful in this regard, because it lets you take advantage of new features and crowd fixes as they bubble up.

Last and fuzziest, I think it helps to embrace the idea that your work doesn’t really belong to you or your organization but is just one tiny part of a larger ecosystem that you’re hoping to see evolve in a particular direction. What worked one month might not work the next, and you’ll never know exactly what effect you’re having, but that’s okay if you recognize that it’s not really supposed to be about you. Just keep up as best you can, don’t get too heavily invested in any one approach or idea, and try to enjoy the ride.

A Research Note on Updating Coup Forecasts

A new year is about to start, and that means it’s time for me to update my coup forecasts (see here and here for the 2013 and 2012 editions, respectively). The forecasts themselves aren’t quite ready yet—I need to wait until mid-January for updates from Freedom House to arrive—but I am making some changes to my forecasting process that I thought I would go ahead and describe now, because the thinking behind them illustrates some important dilemmas and new opportunities for predictions of many kinds of political events.

When it comes time to build a predictive statistical model of some rare political event, it’s usually not the model specification that gives me headaches. For many events of interest, I think we now have a pretty good understanding of which methods and variables are likely to produce more accurate forecasts.

Instead, it’s the data, or really the lack thereof, that sets me to pulling my hair out. As I discussed in a recent post, things we’d like to include in our models fall into a few general classes in this regard:

  • No data exist (fuggeddaboudit)
  • Data exist for some historical period, but they aren’t updated (“HA-ha!”)
  • Data exist and are updated, but they are patchy and not missing at random (so long, some countries)
  • Data exist and are updated, but not until many months or even years later (Spinning Pinwheel of Death)

In the past, I’ve set aside measures that fall into the first three of those sets but gone ahead and used some from the fourth, if I thought the feature was important enough. To generate forecasts before the original sources updated, I either a) pulled forward the last observed value for each case (if the measure was slow-changing, like a country’s infant mortality rate) or b) hand-coded my own updates (if the measure was liable to change from year to year, like a country’s political regime type).
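
For option (a), here is a minimal base-R sketch of carrying the last observed value forward within each country. The country-year panel below is hypothetical, and packages like zoo offer an equivalent na.locf function, but the logic is simple enough to write by hand.

```r
# Carry the last observed value forward within a vector
carry_forward <- function(x) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}

# Hypothetical country-year panel with a slow-changing, lagging indicator
panel <- data.frame(
  country = rep(c("A", "B"), each = 4),
  year    = rep(2010:2013, times = 2),
  infant_mortality = c(30, 29, NA, NA, 55, NA, 52, NA)
)

# Apply the carry-forward within each country
panel$infant_mortality <- ave(panel$infant_mortality, panel$country,
                              FUN = carry_forward)
panel
```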

Now, though, I’ve decided to get out of the “artisanal updating” business, too, for all but the most obvious and uncontroversial things, like which countries recently joined the WTO or held national elections. I’m quitting this business, in part, because it takes a lot of time and the results may be pretty noisy. More important, though, I’m also quitting because it’s not so necessary any more, thanks to timelier updates from some data providers and the arrival of some valuable new data sets.

This commitment to more efficient updating has led me to adopt the following rules of thumb for my 2014 forecasting work:

  • For structural features that don’t change much from year to year (e.g., population size or infant mortality), include the feature and use the last observed value.
  • For variables that can change from year to year in hard-to-predict ways, only include them if the data source is updated in near-real time or, if it’s updated annually, if those updates are delivered within the first few weeks of the new year.
  • In all cases, only use data that are publicly available, to facilitate replication and to encourage more data sharing.

And here are some of the results of applying those rules of thumb to the list of features I’d like to include in my coup forecasting models for 2014.

  • Use Powell and Thyne’s list of coup events instead of Monty Marshall’s. Powell and Thyne’s list is updated throughout the year as events occur, whereas the publicly available version of Marshall’s list is only updated annually, several months after the start of the year. That wouldn’t matter so much if coups were only the dependent variable, but recent coup activity is also an important predictor, so I need the last year’s updates ASAP.
  • Use Freedom House’s Freedom in the World (FIW) data instead of Polity IV to measure countries’ political regime type. Polity IV offers more granular measures of political regime type than Freedom in the World, but Polity updates aren’t posted until spring or summer of the following year, usually more than a third of the way into my annual forecasting window.
  • Use IMF data on economic growth instead of the World Bank’s. The Bank now updates its World Development Indicators a couple of times a year, and there’s a great R package that makes it easy to download the bits you need. That’s wonderful for slow-changing structural features, but it still doesn’t get me data on economic performance as fast as I’d like it. I work around that problem by using the IMF’s World Economic Outlook Database, which includes projections for years for which observed data aren’t yet available and forecasts for several years into the future.
  • Last but not least, use GDELT instead of UCDP/PRIO or Major Episodes of Political Violence (MEPV) to measure civil conflict. Knowing which countries have had civil unrest or violence in the recent past can help us predict coup attempts, but the major publicly available measures of these things are only updated well into the year. GDELT now represents a nice alternative. It covers the whole world, measures lots of different forms of political cooperation and conflict, and is updated daily, so country-year updates are available on January 2. GDELT’s period of observation starts in 1979, so it’s still a stretch to use it in models of super-rare events like mass-killing onsets, where the number of available examples since 1979 on which to train is still relatively small. For less-rare events like coup attempts, though, starting the analysis around 1980 is no problem. (Just don’t forget to normalize them! A sketch of that step follows this list.) With some help from John Beieler, I’m already experimenting with adding annual GDELT summaries to my coup forecasting process, and I’m finding that they do improve the model’s out-of-sample predictive power.
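
Here is a rough sketch of the normalization step flagged in that last item, assuming the daily GDELT records have already been pulled into a country-day data frame. The column names are stand-ins for however you store the counts; the key move is dividing conflict-event counts by total recorded events, so that growth in GDELT's source coverage over time doesn't masquerade as growing unrest.

```r
# Hypothetical country-day counts of GDELT events: conflict vs. everything else
set.seed(7)
gdelt <- data.frame(
  country  = rep(c("THA", "MLI"), each = 365),
  date     = rep(seq(as.Date("2013-01-01"), by = "day", length.out = 365), 2),
  conflict = rpois(730, lambda = 3),
  other    = rpois(730, lambda = 47)
)
gdelt$total <- gdelt$conflict + gdelt$other
gdelt$year  <- as.integer(format(gdelt$date, "%Y"))

# Annual summaries: normalize conflict counts by total recorded events
annual <- aggregate(cbind(total, conflict) ~ country + year, data = gdelt, FUN = sum)
annual$conflict_share <- annual$conflict / annual$total
annual
```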

In all of the forecasting work I do, my long-term goals are 1) to make the forecasts more dynamic by updating them more frequently (e.g., monthly, weekly, or even daily instead of yearly) and 2) to automate that updating process as much as possible. The changes I’m making to my coup forecasting process for 2014 don’t directly accomplish either of these things, but they do take me a few steps in both directions. For example, once GDELT is in the mix, it’s possible to start thinking about how to switch to monthly or even daily updates that rely on a sliding window of recent GDELT tallies. And once I’ve got a coup data set that updates in near-real time, I can imagine pinging that source each day to update the counts of coup attempts in the past several years. I’m still not where I’d like to be, but I think I’m finally stepping onto a path that can carry me there.

EVEN BETTER Animated Map of Coup Attempts Worldwide, 1946-2013

[Click here to go straight to the map]

A week ago, I posted an animated map of coup attempts worldwide since 1946 (here). Unfortunately, those maps were built from a country-year data set, so we couldn’t see multiple attempts within a single country over the course of a year. As it happens, though, the lists of coup attempts on which that animation was based do specify the dates of those events. So why toss out all that information?

To get a sharper picture of the distribution of coup attempts across space and time, I rebuilt my mashed-up list of coup attempts from the original sources (Powell & Thyne and Marshall), but now with the dates included. Where only a month was given, I pegged the event to the first day of that month. To avoid double-counting, I then deleted events that appeared to be duplicates (same outcome in the same country within a single week). Finally, to get the animation in CartoDB to give a proper sense of elapsed time, I embedded the results in a larger data frame of all dates over the 68-year period observed. You can find the daily data on my Google Drive (here).
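
The de-duplication step is easy to sketch in R. Assuming a merged data frame with one row per recorded coup event, sort by country, outcome, and date, then drop any row that repeats the same country-outcome pair within seven days of the previous record. The tiny example below is illustrative, not the actual merged list.

```r
# Hypothetical merged list of coup events: country, date, outcome (1 = success)
events <- data.frame(
  country = c("THA", "THA", "MLI", "MLI"),
  date    = as.Date(c("2006-09-19", "2006-09-20", "2012-03-21", "2012-04-30")),
  success = c(1, 1, 1, 0)
)

events <- events[order(events$country, events$success, events$date), ]

# Flag rows that repeat the same country-outcome pair within 7 days of the prior row
dup <- c(FALSE,
         with(events,
              country[-1] == country[-nrow(events)] &
              success[-1] == success[-nrow(events)] &
              as.numeric(date[-1] - date[-nrow(events)]) <= 7))

events[!dup, ]  # apparent duplicates removed
```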

WordPress won’t let me embed the results of my mapping directly in this post, but you can see and interact with the results at CartoDB (here). I think this version shows more clearly how much the rate of coup attempts has slowed in the past couple of decades, and it still does a good job of showing change over time in the geographic distribution of these events.

The two things I can’t figure out how to do so far are 1) to use color to differentiate between successful and failed attempts and 2) to show the year or month and year in the visualization so we know where we are in time. For differentiating by outcome, there’s a variable in the data set that does this, but it looks like the current implementation of the Torque option in CartoDB won’t let me show multiple layers or differentiate between the events by type. On showing the date, I have no clue. If anyone knows how to do either of these things, please let me know.

Singing the Missing-Data Blues

I’m currently in the throes of assembling data to use in forecasts on various forms of political change in countries worldwide for 2014. This labor-intensive process is the not-so-sexy side of “data science” that practitioners like to bang on about if you ask us, but I’m not going to do that here. Instead, I’m going to talk about how hard it is to find data sets that applied forecasters of rare events in international politics can even use in the first place. The steep data demands for predictive models mean that many of the things we’d like to include in our models get left out, and many of the data sets political scientists know and like aren’t useful to applied forecasters.

To see what I’m talking about, let’s assume we’re building a statistical model to forecast some rare event Y in countries worldwide, and we have reason to believe that some variable X should help us predict that Y. If we’re going to include X in our model, we’ll need data, but any old data won’t do. For a measure of X to be useful to an applied forecaster, it has to satisfy a few requirements. This Venn diagram summarizes the four I run into most often:

[Venn diagram: internal consistency, global coverage, historical depth, and timely updates]

First, that measure of X has to be internally consistent. Validity is much less of a concern than it is in hypothesis-testing research, since we’re usually not trying to make causal inferences or otherwise build theory. If our measure of X bounces around arbitrarily, though, it’s not going to provide much of a predictive signal, no matter how important the concept underlying X may be. Similarly, if the process by which that measure of X is generated keeps changing—say, national statistical agencies make idiosyncratic revisions to their accounting procedures, or coders keep changing their coding rules—then models based on the earlier versions will quickly break. If we know the source or shape of the variation, we might be able to adjust for it, but we aren’t always so lucky.

Second, to be useful in global forecasting, a data set has to offer global coverage, or something close to it. It’s really as simple as that. In the most commonly used statistical models, if a case is missing data on one or more of the inputs, it will be missing from the outputs, too. This is called listwise deletion, and it means we’ll get no forecast for cases that are missing values on any one of the predictor variables. Some machine-learning techniques can generate estimates in the face of missing data, and there are ways to work around listwise deletion in regression models, too (e.g., create categorical versions of continuous variables and treat missing values as another category). But those workarounds aren’t alchemy, and less information means less accurate forecasts.
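
Here is a toy demonstration in R of both the problem and the regression workaround mentioned above: with default settings, glm silently drops every row that is missing a predictor, while recoding the patchy variable as a categorical with an explicit "missing" level keeps those cases in the model (at the cost of coarsening the information).

```r
set.seed(8)
n <- 1000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-2 + d$x1 + 0.5 * d$x2))

# Make x2 patchy, with missingness that depends on another feature
d$x2[d$x1 < qnorm(1/3)] <- NA

# Listwise deletion: the cases with missing x2 get no fitted value at all
m1 <- glm(y ~ x1 + x2, data = d, family = binomial)
length(fitted(m1))   # fewer fitted values than rows in d

# Workaround: bin x2 and treat missingness as its own category
bins <- cut(d$x2, breaks = quantile(d$x2, na.rm = TRUE), include.lowest = TRUE)
d$x2_cat <- factor(ifelse(is.na(bins), "missing", as.character(bins)))
m2 <- glm(y ~ x1 + x2_cat, data = d, family = binomial)
length(fitted(m2))   # a forecast for every case
```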

Worse, the holes in our global data sets usually form a pattern, and that pattern is often correlated with the very things we’re trying to predict. For example, the poorest countries in the world are more likely to experience coups, but they are also more likely not to be able to afford the kind of bureaucracy that can produce high-quality economic statistics. Authoritarian regimes with frustrated citizens may be more likely to experience popular uprisings, but many autocrats won’t let survey research firms ask their citizens politically sensitive questions, and many citizens in those regimes would be reluctant to answer those questions candidly anyway. The fact that our data aren’t missing at random compounds the problem, leaving us without estimates for some cases and screwing up our estimates for the rest. Under these circumstances, it’s often best to omit the offending data set from our modeling process entirely, even if the X it’s measuring seems important.

Third and related to no. 2, if our events are rare, then our measure of X needs historical depth, too. To estimate the forecasting model, we want as rich a library of examples as we can get. For events as rare as onsets of violent rebellion or episodes of mass killing, which typically occur in just one or a few countries worldwide each year, we’ll usually need at least a few decades’ worth of data to start getting decent estimates on the things that differentiate the situations where the event occurs from the many others where it doesn’t. Without that historical depth, we run into the same missing-data problems I described in relation to global coverage.

I think this criterion is much tougher to satisfy than many people realize. In the past 10 or 20 years, statistical agencies, academic researchers, and non-governmental organizations have begun producing new or better data sets on all kinds of things that went unmeasured or poorly measured in the past—things like corruption or inflation or unemployment, to name a few that often come up in conversations about what predicts political instability and change. Those new data sets are great for expanding our view of the present, and they will be a boon to researchers of the future. Unfortunately, though, they can’t magically reconstruct the unobserved past, so they still aren’t very useful for predictive models of rare political events.

The fourth and final circle in that Venn diagram may be both the most important and the least appreciated by people who haven’t tried to produce statistical forecasts in real time: we need timely updates. If I can’t depend on the delivery of fresh data on X before or early in my forecasting window, then I can’t update my forecasts while they’re still relevant, and the model is effectively DOA. If X changes slowly, we can usually get away with using the last available observation until the newer stuff shows up. Population size and GDP per capita are a couple of variables for which this kind of extrapolation is generally fine. Likewise, if the variable changes predictably, we might use forecasts of X before the observed values become available. I sometimes do this with GDP growth rates. Observed data for one year aren’t available for many countries until deep into the next year, but the IMF produces decent forecasts of recent and future growth rates that can be used in the interim.

Maddeningly, though, this criterion alone renders many of the data sets scholars have painstakingly constructed for specific research projects useless for predictive modeling. For example, scholars in recent years have created numerous data sets to characterize countries’ national political regimes, a feature that scores of studies have associated with variation in the risk of many forms of political instability and change. Many of these “boutique” data sets on political regimes are based on careful research and coding procedures, cover the whole world, and reach at least several decades or more into the past. Only two of them, though—Polity IV and Freedom House’s Freedom in the World—are routinely updated. As much as I’d like to use unified democracy scores or measures of authoritarian regime type in my models, I can’t without painting myself into a forecasting corner, so I don’t.

As I hope this post has made clear, the set formed by the intersection of these four criteria is a tight little space. The practical requirements of applied forecasting mean that we have to leave out of our models many things that we believe might be useful predictors, no matter how important the relevant concepts might seem. They also mean that our predictive models on many different topics are often built from the same few dozen “usual suspects”—not because we want to, but because we don’t have much choice. Multiple imputation and certain machine-learning techniques can mitigate some of these problems, but they hardly eliminate them, and the missing information affects our forecasts either way. So the next time you’re reading about a global predictive model on international politics and wondering why it doesn’t include something “obvious” like unemployment or income inequality or survey results, know that these steep data requirements are probably the reason.
