To Realize the QDDR’s Early-Warning Goal, Invest in Data-Making

The U.S. Department of State dropped its second Quadrennial Diplomacy and Development Review, or QDDR, last week (here). Modeled on the Defense Department’s Quadrennial Defense Review, the QDDR lays out the department’s big-picture concerns and objectives so that—in theory—they can guide planning and shape day-to-day decision-making.

The new QDDR establishes four main goals, one of which is to “strengthen our ability to prevent and respond to internal conflict, atrocities, and fragility.” To help do that, the State Department plans to “increase [its] use of early warning analysis to drive early action on fragility and conflict.” Specifically, State says it will:

  1. Improve our use of tools for analyzing, tracking, and forecasting fragility and conflict, leveraging improvements in analytical capabilities;
  2. Provide more timely and accurate assessments to chiefs of mission and senior decision-makers;
  3. Increase use of early warning data and conflict and fragility assessments in our strategic planning and programming;
  4. Ensure that significant early warning shifts trigger senior-level review of the mission’s strategy and, if necessary, adjustments; and
  5. Train and deploy conflict-specific diplomatic expertise to support countries at risk of conflict or atrocities, including conflict negotiation and mediation expertise for use at posts.

Unsurprisingly, that plan sounds great to me. We can’t now and never will be able to predict precisely where and when violent conflict and atrocities will occur, but we can assess risks with enough accuracy and lead time to enable better strategic planning and programming. These forecasts don’t have to be perfect to be earlier, clearer, and more reliable than the traditional practices of deferring to individual country or regional analysts or just reacting to the news.

Of course, quite a bit of well-designed conflict forecasting is already happening, much of it paid for by the U.S. government. To name a few of the relevant efforts: The Political Instability Task Force (PITF) and the Worldwide Integrated Conflict Early Warning System (W-ICEWS) routinely update forecasts of various forms of political crisis for U.S. government customers. IARPA’s Open Source Indicators (OSI) and Aggregative Contingent Estimation (ACE) programs are simultaneously producing forecasts now and discovering ways to make future forecasts even better. Meanwhile, outside the U.S. government, the European Union has recently developed its own Global Conflict Risk Index (GCRI), and the Early Warning Project now assesses risks of mass atrocities in countries worldwide.

That so much thoughtful risk assessment is being done now doesn’t mean it’s a bad idea to start new projects. If there are any iron laws of forecasting hard-to-predict processes like political violence, one of them is that combinations of forecasts from numerous sources should be more accurate than forecasts from a single model or person or framework. Some of the existing projects already do this kind of combining themselves, but combinations of combinations will often be even better.

Still, if I had to channel the intention expressed in this part of the QDDR into a single activity, it would not be the construction of new models, at least not initially. Instead, it would be data-making. Social science is not Newtonian physics, but it’s not astrology, either. Smart people have been studying politics for a long time, and collectively they have developed a fair number of useful ideas about what causes or precedes violent conflict. But, if you can’t track the things those theorists tell you to track, then your forecasts are going to suffer. To improve significantly on the predictive models of political violence we have now, I think we need better inputs most of all.

When I say “better” inputs, I have a few things in mind. In some cases, we need to build data sets from scratch. When I was updating my coup forecasts earlier this year, a number of people wondered why I didn’t include measures of civil-military relations, which are obviously relevant to this particular risk. The answer was simple: because global data on that topic don’t exist. If we aren’t measuring it, we can’t use it in our forecasts, and the list of relevant features that falls into this set is surprisingly long.

In other cases, we need to revive data sets that have gone dormant. Social scientists often build “boutique” data sets for specific research projects, run the tests they want to run on them, and then move on to the next project. Sometimes, the tests they or others run suggest that some features captured in those data sets would make useful predictors. Those discoveries are great in principle, but if those data sets aren’t being updated, then applied forecasters can’t use that knowledge. To get better forecasts, we need to invest in picking up where those boutique data sets left off so we can incorporate their insights into our applications.

Finally and in almost all cases, we need to observe things more frequently. Most of the data available now to most conflict forecasters are updated only once each year, often on a several-month delay and sometimes as much as two years later (e.g., data describing 2014 become available in 2016). That schedule is fine for basic research, but it is crummy for applied forecasting. If we want to give assessments and warnings that are as current as possible to those “chiefs of mission and senior decision-makers” mentioned in the QDDR, then we need to build models with data that are updated as frequently as possible. Daily or weekly updates are ideal, but monthly updates would suffice in many cases and would mark a huge improvement over the status quo.

As I said at the start, we’re never going to get models that reliably tell us far in advance exactly where and when violent conflicts and mass atrocities will erupt. I am confident, however, that we can assess these risks even more accurately than we do now, but only if we start making more, and better, versions of the data our theories tell us we need.

I’ll end with a final plea to any public servants who might be reading this: if you do invest in developing better inputs, please make the results freely available to the public. When you share your data, you give the crowd a chance to help you spot and fix your mistakes, to experiment with various techniques, and to think about what else you might consider, all at no additional cost to you. What’s not to like about that?


A Bit More on Country-Month Modeling

My family is riding the flu carousel right now, and my turn came this week. So, in lieu of trying to write from scratch, I wanted to pick up where my last post—on moving from country-year to country-month modeling—left off.

As many of you know, this notion is hardly new. For at least the past decade, many political scientists who use statistical tools to study violent conflict have been advocating and sometimes implementing research designs that shrink their units of observation on various dimensions, including time. The Journal of Conflict Resolution published a special issue on “disaggregating civil war” in 2009. At the time, that publication felt (to me) more like the cresting of a wave of new work than the start of one, and it was motivated, in part, by frustration over all the questions that a preceding wave of country-year civil-war modeling had inevitably left unanswered. Over the past several years, Mike Ward and his WardLab collaborators at Duke have been using ICEWS and other higher-resolution data sets to develop predictive models of various kinds of political instability at the country-month level. Their work has used designs that deal thoughtfully with the many challenges this approach entails, including spatial and temporal interdependence and the rarity of the events of interest. So have others.

Meanwhile, sociologists who study protests and social movements have been pushing in this direction even longer. Scholars trying to use statistical methods to help understand the dynamic interplay between mobilization, activism, repression, and change recognized that those processes can take important turns in weeks, days, or even hours. So, researchers in that field started trying to build event data sets that recorded as exactly as possible when and where various actions occurred, and they often use event history models and other methods that “take time seriously” to analyze the results. (One of them sat on my dissertation committee and had a big influence on my work at the time.)

As far as I can tell, there are two main reasons that all research in these fields hasn’t stampeded in the direction of disaggregation, and one of them is a doozy. The first and lesser one is computing power. It’s no simple thing to estimate models of mutually causal processes occurring across many heterogeneous units observed at high frequency. We still aren’t great at it, but accelerating improvements in computational processing, storage, software—and co-evolving improvements in statistical methods—have made it more tractable than it was even five or 10 years ago.

The second, more important, and more persistent impediment to disaggregated analysis is data, or the lack thereof. Data sets used by statistically minded political scientists come in two basic flavors: global, and case- or region-specific. Almost all of the global data sets of which I’m aware have always used, and continue to use, country-years as their units of observation.

That’s partly a function of the research questions they were built to help answer, but it’s also a function of cost. Data sets were (and mostly still are) encoded by hand by people sifting through or poring over relevant documents. All that labor takes a lot of time and therefore costs a lot of money. One can make (or ask RAs to make) a reasonably reliable summary judgment about something like whether or not a civil war was occurring in a particular country during a particular year much more quickly than one can do that for each month of that year, or each district in that country, or both. This difficulty hasn’t stopped everyone from trying, but the exceptions have been few and often case-specific. In a better world, we could have patched together those case-specific sets to make a larger whole, but they often use idiosyncratic definitions and face different informational constraints, making cross-case comparison difficult.

That’s why I’ve been so excited about the launch of GDELT and Phoenix and now the public release of the ICEWS event data. These are, I think, the leading edge of efforts to solve those data-collection problems in an efficient and durable way. ICEWS data have been available for several years to researchers working on a few contracts, but they haven’t been accessible to most of us until now.  At first I thought GDELT had rendered that problem moot, but concerns about its reliability have encouraged me to keep looking. I think Phoenix’s open-source-software approach holds more promise for the long run, but, as its makers describe, it’s still in “beta release” and “under active development.” ICEWS is a more mature project that has tried carefully to solve some of the problems, like event duplication and errors in geolocation, that diminish GDELT’s utility. (Many millions of dollars help.) So, naturally, I and many others have been eager to start exploring it. And now we can. Hooray!

To really open up analysis at this level, though, we’re going to need comparable and publicly (or at least cheaply) available data sets on a lot more of the things our theories tell us to care about. As I said in the last post, we have a few of those now, but not many. Some of the work I’ve done over the past couple of years—this, especially—was meant to help fill those gaps, and I’m hoping that work will continue. But it’s just a drop in a leaky bucket. Here’s hoping for a hard turn of the spigot.

Down the Country-Month Rabbit Hole

Some big things happened in the world this week. Iran and the P5+1 agreed on a framework for a nuclear deal, and the agreement looks good. In a presidential election in Nigeria—the world’s seventh-most populous country, and one that few observers would have tagged as a democracy before last weekend—incumbent Goodluck Jonathan lost and then promptly and peacefully conceded defeat. The trickle of countries joining China’s new Asian Infrastructure Investment Bank turned into a torrent.

All of those things happened, but you won’t read more about them here, because I have spent the better part of the past week down a different rabbit hole. Last Friday, after years of almosts and any-time-nows, the event data produced for the Integrated Conflict Early Warning System (ICEWS) finally landed in the public domain, and I have been busy trying to figure out how to put them to use.

ICEWS isn’t the first publicly available trove of political event data, but it compares favorably to the field’s first mover, GDELT, and it currently covers a much longer time span than the other recent entrant, Phoenix.

The public release of ICEWS is exciting because it opens the door wider to dynamic modeling of world politics. Right now, nearly all of the data sets employed in statistical studies of politics around the globe use country-years as their units of observation. That’s not bad if you’re primarily interested in the effects or predictive power of structural features, but it’s pretty awful for explaining and anticipating faster-changing phenomena, like social unrest or violent conflict. GDELT broke the lock on that door, but its high noise-to-signal ratio and the opacity of its coding process have deterred me from investing too much time in developing monitoring or forecasting systems that depend on it.

With ICEWS on the Dataverse, that changes. I think we now have a critical mass of data sets in the public domain that: a) reliably cover important topics for the whole world over many years; b) are routinely updated; and, crucially, c) can be parsed to the month or even the week or day to reward investments in more dynamic modeling. Other suspects fitting this description include:

  • The spell-file version of Polity, which measures national patterns of political authority;
  • Lists of coup attempts maintained by Jonathan Powell and Clayton Thyne (here) and the Center for Systemic Peace (here); and
  • The PITF Worldwide Atrocities Event Dataset, which records information about events involving the deliberate killing of five or more noncombatant civilians (more on it here).

We also have high-quality data sets on national elections (here) and leadership changes (here, described here) that aren’t routinely updated by their sources but would be relatively easy to code by hand for applied forecasting.

With ICEWS, there is, of course, a catch. The public version of the project’s event data set will be updated monthly, but on a one-year delay. For example, when the archive was first posted in March, it ran through February 2014. On April 1, the Lockheed team added March 2014. This delay won’t matter much for scholars doing retrospective analyses, but it’s a critical flaw, if not a fatal one, for applied forecasters who can’t afford to pay—what, probably hundreds of thousands of dollars?—for a real-time subscription.

Fortunately, we might have a workaround. Phil Schrodt has played a huge role in the creation of the field of machine-coded political event data, including GDELT and ICEWS, and he is now part of the crew building Phoenix. In a blog post published the day ICEWS dropped, Phil suggested that Phoenix and ICEWS data will probably look enough alike to allow substituting the former for the latter, perhaps with some careful calibration. As Phil says, we won’t know for sure until we have a wider overlap between the two and can see how well this works in practice, but the possibility is promising enough for me to dig in.

And what does that mean? Well, a week has now passed since ICEWS hit the Dataverse, and so far I have:

  • Written an R function that creates a table of valid country-months for a user-specified time period, to use as scaffolding in the construction and agglomeration of country-month data sets (a minimal sketch of that idea appears just after this list);
  • Written scripts that call that function and some others to ingest and then parse or aggregate the other data sets I mentioned to the country-month level;
  • Worked out a strategy, and written the code, to partition the data into training and test sets for a project on predicting violence against civilians; and
  • Spent a lot of time staring at the screen thinking about, and a little time coding, ways to aggregate, reduce, and otherwise pre-process the ICEWS events and Polity data for that work on violence against civilians and beyond.
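For readers who want to tinker along, here is a minimal sketch of that scaffolding idea in R. It is not the function from my scripts: it assumes you already have a data frame of country codes with the years each country entered and left the international system, and it ignores the mid-year births and deaths of states that a careful version would have to handle.

```r
# Minimal sketch of a country-month scaffolding function (illustrative only).
# 'countries' is assumed to be a data frame with columns 'iso3c', 'start_year',
# and 'end_year' marking each country's presence in the international system.
country_months <- function(countries, start, end) {
  grid <- expand.grid(iso3c = countries$iso3c,
                      year = seq(start, end),
                      month = 1:12,
                      stringsAsFactors = FALSE)
  grid <- merge(grid, countries, by = "iso3c")
  # Keep only the months in which the country existed
  grid <- grid[grid$year >= grid$start_year & grid$year <= grid$end_year, ]
  grid[order(grid$iso3c, grid$year, grid$month), c("iso3c", "year", "month")]
}

# Example: a valid-case table for 1995-2014 onto which other country-month
# data sets can then be merged.
# scaffold <- country_months(my_country_list, 1995, 2014)
```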

What I haven’t done yet—T plus seven days and counting—is any modeling. How’s that for push-button, Big Data magic?

A Note on Trends in Armed Conflict

In a report released earlier this month, the Project for the Study of the 21st Century (PS21) observed that “the body count from the top twenty deadliest wars in 2014 was more than 28% higher than in the previous year.” They counted approximately 163 thousand deaths in 2014, up from 127 thousand in 2013. The report described that increase as “part of a broader multi-year trend” that began in 2007. The project’s executive director, Peter Apps, also appropriately noted that “assessing casualty figures in conflict is notoriously difficult and many of the figures we are looking at here are probably underestimates.”

This is solid work. I do not doubt the existence of the trend it identifies. That said, I would also encourage us to keep it in perspective:

That chart (source) ends in 2005. The Uppsala Conflict Data Program (UCDP) hasn’t yet updated its widely used data set on battle-related deaths to cover 2014, but from last year’s edition, we can see the tail end of that longer period, as well as the start of the recent upward trend PS21 identifies. In this chart—R script here—the solid line marks the annual, global sums of their best estimates, and the dotted lines show the sums of the high and low estimates:
Annual, global battle-related deaths, 1989-2013 (Data source: UCDP)

If we mentally tack that chart onto the end of the one before it, we can also see that the increase of the past few years has not yet broken the longer spell of relatively low numbers of battle deaths. Not even close. The peak around 2000 in the middle of the nearer chart is a modest bump in the farther one, and the upward trend we’ve seen since 2007 has not yet matched even that local maximum. This chart stops at the end of 2013, but if we used the data assembled by PS21 for the past year to project an increase in 2014, we’d see that we’re still in reasonably familiar territory.

Both of these things can be true. We could be—we are—seeing a short-term increase that does not mark the end of a longer-term ebb. The global economy has grown fantastically since the 1700s, and yet it still suffers serious crises and recessions. The planet has warmed significantly over the past century, but we still see some unusually cool summers and winters.

Lest this sound too sanguine at a time when armed conflict is waxing, let me add two caveats.

First, the picture from the recent past looks decidedly worse if we widen our aperture to include deliberate killings of civilians outside of battle. UCDP keeps a separate data set on that phenomenon—here—which they label “one-sided” violence. If we add the fatalities tallied in that data set to the battle-related ones summarized in the previous plot, here is what we get:

Annual, global battle-related deaths and deaths from one-sided violence, 1989-2013 (Data source: UCDP)

Note the difference in the scale of the y-axis; it is an order of magnitude larger than the one in the previous chart. At this scale, the peaks and valleys in battle-related deaths from the past 25 years get smoothed out, and a single peak—the Rwandan genocide—dominates the landscape. That peak is still much lower than the massifs marking the two World Wars in the first chart, but it is huge nonetheless. Hundreds of thousands of people were killed in a matter of months.

Second, the long persistence of this lower rate does not prove that the risk of violent conflict on the scale of the two World Wars has been reduced permanently. As Bear Braumoeller (here) and Nassim Nicholas Taleb (here; I link reluctantly, because I don’t care for the scornful and condescending tone) have both pointed out, a single war between great powers could end or even reverse this trend, and it is too soon to say with any confidence whether or not the risk of that happening is much lower than it used to be. Like many observers of international relations, I think we need to see how the system processes the (relative) rise of China and declines of Russia and the United States before updating our beliefs about the risk of major wars. As someone who grew up during the Cold War and was morbidly fascinated by the possibility of nuclear conflagration, I think we also need to remember how close we came to nuclear war on some occasions during that long spell, and to ponder how absurdly destructive and terrible that would be.

Strictly speaking, I’m not an academic, but I do a pretty good impersonation of one, so I’ll conclude with a footnote to that second caveat: I did not attribute the idea that the risk of major war is a thing of the past to Steven Pinker, as some do, because as Pinker points out in a written response to Taleb (here), he does not make precisely that claim, and his wider point about a long-term decline in human violence does not depend entirely on an ebb in warfare persisting. It’s hard to see how Pinker’s larger argument could survive a major war between nuclear powers, but then if that happened, who would care one way or another if it had?

“No One Stayed to Count the Bodies”

If you want to understand and appreciate why, even in the age of the Internet and satellites and near-ubiquitous mobile telephony, it remains impossible to measure even the coarsest manifestations of political violence with any precision, read this blog post by Phil Hazlewood, AFP’s bureau chief in Lagos. (Warning: graphic. H/t to Ryan Cummings on Twitter.)

Hazlewood’s post focuses on killings perpetrated by Boko Haram, but the same issues arise in measuring violence committed by states. Violence sometimes eliminates some people who might describe the acts involved, and it intentionally scares many others. If you hear or see details of what happened, that’s often because the killers or their rivals for power wanted you to hear or see those details. We cannot sharply distinguish between the communication of those facts and the political intentions expressed in the violence or the reactions to it. The conversation is the message, and the violence is part of the conversation.

When you see or hear things in spite of those efforts to conceal them, you have to wonder how selection effects limit or distort the information that gets through. North Korea’s gulag system apparently holds thousands of prisoners and kills untold numbers of them each year. Defectors are the outside world’s main source of information about that system, but those defectors are not a random sample of victims, nor are they mechanical recording devices. Instead, they are human beings who have somehow escaped that country and who are now seeking to draw attention to and destroy that system. I do not doubt the basic truth of the gulags’ existence and the horrible things done there, but as a social scientist, I have to consider how those selection processes and motivations shape what we think we know. In the United States, we lack reliable data on fatal encounters with police. That’s partly because different jurisdictions have different capabilities for recording and reporting these incidents, but it’s also partly because some people in that system do not want us to see what they do.

For a previous post of mine on this topic, see “The Fog of War Is Patchy”.

 

A Postscript on Measuring Change Over Time in Freedom in the World

After publishing yesterday’s post on Freedom House’s latest Freedom in the World report (here), I thought some more about better ways to measure what I think Freedom House implies it’s measuring with its annual counts of country-level gains and declines. The problem with those counts is that they don’t account for the magnitude of the changes they represent. That’s like keeping track of how a poker player is doing by counting bets won and bets lost without regard to their value. If we want to assess the current state of the system and compare it to earlier states, the size of those gains and declines matters, too.

With that in mind, my first idea was to sum the raw annual changes in countries’ “freedom” scores by year, where the freedom score is just the sum of those 7-point political rights and civil liberties indices. Let’s imagine a year in which three countries saw a 1-point decline in their freedom scores; one country saw a 1-point gain; and one country saw a 3-point gain. Using Freedom House’s measure, that would look like a bad year, with declines outnumbering gains 3 to 2. Using the sum of the raw changes, however, it would look like a good year, with a net change in freedom scores of +1.

Okay, so here’s a plot of those sums of raw annual changes in freedom scores since 1982, when Freedom House rejiggered the timing of its survey.[1] I’ve marked the nine-year period that Freedom House calls out in its report as an unbroken run of bad news, with declines outnumbering gains every year since 2006. As the plot shows, when we account for the magnitude of those gains and losses, things don’t look so grim. In most of those nine years, losses did outweigh gains, but the net loss was rarely large, and two of the nine years actually saw net gains by this measure.

Annual global sums of raw yearly changes in Freedom House freedom scores (inverted), 1983-2014

After I’d generated that plot, though, I worried that the sum of those raw annual changes still ignored another important dimension: population size. As I understand it, the big question Freedom House is trying to address with its annual report is: “How free is the world?” If we want to answer that question from a classical liberal perspective—and that’s where I think Freedom House is coming from—then individual people, not states, need to be our unit of observation.

Imagine a world with five countries where half the global population lives in one country and the other half is evenly divided between the other four. Now let’s imagine that the one really big country is maximally unfree while the other four countries are maximally free. If we compare scores (or changes in them) by country, things look great; 80 percent of the world is super-free! Meanwhile, though, half the world’s population lives under total dictatorship. An international relations theorist might care more about the distribution of states, but a liberal should care more about the distribution of people.

To take a look at things from this perspective, I decided to generate a scalar measure of freedom in the world system that sums country scores weighted by their share of the global population.[2] To make the result easier to interpret, I started by rescaling the country-level “freedom scores” from 14-2 to 0-10, with 10 indicating most free. A world in which all countries are fully free (according to Freedom House) would score a perfect 10 on this scale, and changes in large countries will move the index more than changes in small ones.

Okay, so here’s a plot of the results for the entire run of Freedom House’s data set, 1972–2014. (Again, 1981 is missing because that’s when Freedom House paused to align their reports with the calendar year.)  Things look pretty different than they do when we count gains and declines or even sum raw changes by country, don’t they?

A population-weighted annual scalar measure of freedom in the world, 1972-2014

The first things that jumped out at me were those sharp declines in the mid-1970s and again in the late 1980s and early 1990s. At first I thought I must have messed up the math, because everyone knows things got a lot better when Communism crumbled in Eastern Europe and the Soviet Union, right? It turns out, though, that those swings are driven by changes in China and India, which together account for approximately one-third of the global population. In 1989, after Tiananmen Square, China’s score dropped from a 6/6 (or 1.67 on my 10-point scalar version) to 7/7 (or 0). At the time, China contained nearly one-quarter of the world’s population, so that slump more than offset the (often-modest) gains made in the countries touched by the so-called fourth wave of democratic transitions. In 1998, China inched back up to 7/6 (0.83), and the global measure moved with it. Meanwhile, India dropped from 2/3 (7.5) to 3/4 (5.8) in 1991, and then again from 3/4 to 4/4 (5.0) in 1993, but it bumped back up to 2/4 (6.67) in 1996 and then 2/3 (7.5) in 1998. The global gains and losses produced by the shifts in those two countries don’t fully align with the conventional narrative about trends in democratization in the past few decades, but I think they do provide a more accurate measure of overall freedom in the world if we care about people instead of states, as liberalism encourages us to do.

Of course, the other thing that caught my eye in that second chart was the more-or-less flat line for the past decade. When we consider the distribution of the world’s population across all those countries where Freedom House tallies gains and declines, it’s hard to find evidence of the extended democratic recession they and others describe. In fact, the only notable downturn in that whole run comes in 2014, when the global score dropped from 5.2 to 5.1. To my mind, that recent downturn marks a worrying development, but it’s harder to notice it when we’ve been hearing cries of “Wolf!” for the eight years before.

NOTES

[1] For the #Rstats crowd: I used the slide function in the package DataCombine to get one-year lags of those indices by country; then I created a new variable representing the difference between the annual score for the current and previous year; then I used ddply from the plyr package to create a data frame with the annual global sums of those differences. Script on GitHub here.
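For those who would rather read code than prose, here is a minimal sketch of that sequence, assuming a country-year data frame called FH with columns ‘country’, ‘year’, and ‘total’ (the sum of the two indices). It is not the script linked above, just the gist of it.

```r
library(DataCombine)
library(plyr)

# Assumes 'FH' is a country-year data frame with columns 'country', 'year',
# and 'total' (the sum of the political rights and civil liberties indices).
FH <- slide(FH, Var = "total", GroupVar = "country", TimeVar = "year",
            NewVar = "total_lag", slideBy = -1)        # one-year lag by country
FH$change <- FH$total - FH$total_lag                   # raw annual change
net <- ddply(FH, .(year), summarise,
             net_change = sum(change, na.rm = TRUE))   # global sum by year
```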

[2] Here, I used the WDI package to get country-year data on population size; used ddply to calculate world population by year; merged those global sums back into the country-year data; used those sums as the denominator in a new variable indicating a country’s share of the global population; and then used ddply again to get a table with the sum of the products of those population weights and the freedom scores. Again, script on GitHub here (same one as before).
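And here is a rough sketch of the population-weighting step described in note [2], again assuming a country-year data frame FH with a ‘total’ column and an ‘iso2c’ country code to merge on; my actual script differs in the details, and matching country identifiers takes more care than this implies. The rescaling line converts the 2–14 freedom score to the 0–10 scale used above.

```r
library(WDI)
library(plyr)

# Country-year population from the World Bank (SP.POP.TOTL = total population)
pop <- WDI(country = "all", indicator = "SP.POP.TOTL",
           start = 1972, end = 2014, extra = TRUE)
pop <- pop[pop$region != "Aggregates", ]   # drop regional and income aggregates
world <- ddply(pop, .(year), summarise,
               world_pop = sum(SP.POP.TOTL, na.rm = TRUE))
pop <- merge(pop, world, by = "year")
pop$share <- pop$SP.POP.TOTL / pop$world_pop        # share of global population

# Assumes 'FH' has columns 'iso2c', 'year', and 'total' (the 2-14 freedom score)
FH <- merge(FH, pop[, c("iso2c", "year", "share")], by = c("iso2c", "year"))
FH$score10 <- 10 * (14 - FH$total) / 12             # rescale so 10 = most free
index <- ddply(FH, .(year), summarise,
               world_score = sum(share * score10, na.rm = TRUE))
```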

Why My Coup Risk Models Don’t Include Any Measures of National Militaries

For the past several years (here, here, here, and here), I’ve used statistical models estimated from country-year data to produce assessments of coup risk in countries worldwide. I rejigger the models a bit each time, but none of the models I’ve used so far has included specific features of countries’ militaries.

That omission strikes a lot of people as a curious one. When I shared this year’s assessments with the Conflict Research Group on Facebook, one group member posted this comment:

Why do none of the covariates feature any data on militaries? Seeing as militaries are the ones who stage the coups, any sort of predictive model that doesn’t account for the militaries themselves would seem incomplete.

I agree in principle. It’s the practical side that gets in the way. I don’t include features of national militaries in the models because I don’t have reliable measures of them with the coverage I need for this task.

To train and then apply these predictive models, I need fairly complete time series for all or nearly all countries of the world that extend back to at least the 1980s and have been updated recently enough to give me a useful input for the current assessment (see here for more on why that’s true). I looked again early this month and still can’t find anything like that on even the big stuff, like military budgets, size, and force structures. There are some series on this topic in the World Bank’s World Development Indicators (WDI) data set, but those series have a lot of gaps, and the presence of those gaps is correlated with other features of the models (e.g., regime type). Ditto for SIPRI. And, of course, those aren’t even the most interesting features for coup risk, like whether or not military promotions favor certain groups over others, or if there is a capable and purportedly loyal presidential guard.

But don’t take my word for it. Here’s what the Correlates of War Project says in the documentation for Version 4.0 of its widely-used data set (PDF) about its measure of military expenditures, one of two features of national militaries it tries to cover (the other is total personnel):

It was often difficult to identify and exclude civil expenditures from reported budgets of less developed nations. For many countries, including some major powers, published military budgets are a catch-all category for a variety of developmental and administrative expenses—public works, colonial administration, development of the merchant marine, construction, and improvement of harbor and navigational facilities, transportation of civilian personnel, and the delivery of mail—of dubious military relevance. Except when we were able to obtain finance ministry reports, it is impossible to make detailed breakdowns. Even when such reports were available, it proved difficult to delineate “purely” military outlays. For example, consider the case in which the military builds a road that facilitates troops movements, but which is used primarily by civilians. A related problem concerns those instances in which the reported military budget does not reflect all of the resources devoted to that sector. This usually happens when a nation tries to hide such expenditures from scrutiny; for instance, most Western scholars and military experts agree that officially reported post-1945 Soviet-bloc totals are unrealistically low, although they disagree on the appropriate adjustments.

And that’s just the part of the “Problems and Possible Errors” section about observing the numerator in a calculation that also requires a complicated denominator. And that’s for what is—in principle, at least—one of the most observable features of a country’s civil-military relations.

Okay, now let’s assume that problem magically disappears, and COW has nearly complete and reliable data on military expenditures. Now we want to use models trained on those data to estimate coup risk for 2015. Whoops: COW only runs through 2010! The World Bank and SIPRI get closer to the current year—observations through 2013 are available now—but there are missing values for lots of countries, and that missingness is caused by other predictors of coup risk, such as national wealth, armed conflict, and political regime type. For example, WDI has no data on military expenditures for Eritrea and North Korea ever, and the series for Central African Republic is patchy throughout and ends in 2010. If I wanted to include military expenditures in my predictive models, I could use multiple imputation to deal with these gaps in the training phase, but then how would I generate current forecasts for these important cases? I could make guesses, but how accurate could those guesses be for a case like Eritrea or North Korea, and then am I adding signal or noise to the resulting forecasts?
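If you want to see the scale of the problem for yourself, a quick look at the World Bank series makes the point. The sketch below uses the WDI package and the military-expenditure-as-share-of-GDP indicator; treat the details as illustrative rather than as a recipe.

```r
library(WDI)

# Military expenditure (% of GDP); MS.MIL.XPND.GD.ZS is the World Bank's code
milex <- WDI(country = "all", indicator = "MS.MIL.XPND.GD.ZS",
             start = 1980, end = 2013, extra = TRUE)
milex <- milex[milex$region != "Aggregates", ]   # drop regional aggregates

# Count missing country-years by country and eyeball the worst offenders
gaps <- aggregate(is.na(milex$MS.MIL.XPND.GD.ZS),
                  by = list(country = milex$country), FUN = sum)
names(gaps)[2] <- "missing_years"
head(gaps[order(-gaps$missing_years), ], 20)
```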

Of course, one of the luxuries of applied forecasting is that the models we use can lack important features and still “work.” I don’t need the model to be complete and its parameters to be true for the forecasts to be accurate enough to be useful. Still, I’ll admit that, as a social scientist by training, I find it frustrating to have to set aside so many intriguing ideas because we simply don’t have the data to try them.

Schrodinger’s Coup

You’ve heard of Schrödinger’s cat, right? This is the famous thought experiment proposed by Nobel Prize–winning physicist Erwin Schrödinger to underscore what he saw as the absurdity of quantum superposition—the idea that “an object in a physical system can simultaneously exist in all possible configurations, but observing the system forces the system to collapse and forces the object into just one of those possible states.”

Schrödinger designed his thought experiment to refute the idea that a physical object could simultaneously occupy multiple physical states. At the level of whole cats, anyway, I’m convinced.

When it comes to coups, though, I’m not so sure. Arguments over whether or not certain events were or were not coups or coup attempts usually involve reasonable disagreements over definitions, but fundamental uncertainty about the actions and intentions involved often plays a role, too—especially in failed attempts. Certain sets of events exist in a perpetual state of ambiguity, simultaneously coup and not-coup with no possibility of our ever observing the definitive empirical facts that would force the cases to collapse into a single, clear condition.

Two recent examples help show what I mean. The first is last week’s coup/not coup attempt in the Gambia. From initial reports, it seemed pretty clear that some disgruntled soldiers had tried and failed to seize power while the president was traveling. That’s a classic coup scenario, and all the elements present in most coup definitions were there: military or political insiders seeking to overthrow the government through the use or threat of force.

This week, though, we hear that the gunmen in question were diasporans who hatched the plot abroad without any help on the inside. As the New York Times reported,

According to the [Justice Department’s] complaint, filed in federal court in Minnesota, the plot to topple Mr. Jammeh was hatched in October. Roughly a dozen Gambians in the United States, Germany, Britain and Senegal were involved in the plot, the complaint said. The plotters apparently thought, mistakenly, that members of the Gambian armed forces would join their cause…

The plot went awry when State House guards overwhelmed the attackers with heavy fire, leaving many dead or wounded. Mr. Faal and Mr. Njie escaped and returned to the United States, where they were arrested, the complaint said.

So, were the putschists really just a cabal of outsiders, in which case this event would not qualify as a coup attempt under most academic definitions? Or did they have collaborators on the inside who were discovered or finked out at the crucial moment, making a coup attempt look like a botched raid? The Justice Department’s complaint implies the former, but we’ll never know for sure.

Lesotho offers a second recent example of coup-tum superposition. In late August, that country’s prime minister, Thomas Thabane, fled to neighboring South Africa and cried “Coup!” after soldiers shut down radio stations and surrounded his residence and police headquarters in the capital. But, as Kristen van Schie reported for Al Jazeera,

Not so, said the military. It claimed the police were planning on arming UTTA—the government-aligned youth movement accused of planning to disrupt Monday’s march [against Thabane]. It was not so much a coup as a preventative anti-terrorism operation, it said.

The prime minister and the South African government continued to describe the event as a coup attempt despite that denial, but other observers disagreed. As analyst Dimpho Motsamai told van Schie, “Can one call it a coup when the military haven’t declared they’ve taken over government?” Maybe this really was just a misunderstanding accelerated by the country’s persistent factional crisis.

This uncertainty is generic to the study of political behavior, where the determination of a case’s status depends, in part, on the actors’ intentions, which can never be firmly established. Did certain members of the military in the Gambia mean to cooperate with the diasporans who shot their way toward the State House last week, only to quit or get stopped before the decisive moment arrived? Was the commander of Lesotho’s armed forces planning to oust the prime minister when he ordered soldiers out of the barracks in late August, only to change his mind and tune after Thabane escaped the country?

To determine with certainty whether or not these events were coup attempts, we need clear answers to those questions, but we can’t get them. Instead, we can only see the related actions, and even those are incompletely and unreliably reported in most cases. Sometimes we get post hoc descriptions and explanations of those actions from the participants or close observers, but humans are notoriously unreliable reporters of their own intentions, especially in high-visibility, high-stakes situations like these.

Because this problem is fundamental to the study of political behavior, the best we can do is acknowledge it and adjust our estimations and inferences accordingly. When assembling data on coup attempts for comparative analysis, instead of just picking one source, we might use Bayesian measurement models to try to quantify this collective uncertainty (see here for a related example). Then, before reporting new findings on the causes or correlates of coup attempts, we might ask: which cases are more ambiguous than the others, and how would their removal from or addition to the sample alter our conclusions?

Deriving a Fuzzy-Set Measure of Democracy from Several Dichotomous Data Sets

In a recent post, I described an ongoing project in which Shahryar Minhas, Mike Ward, and I are using text mining and machine learning to produce fuzzy-set measures of various political regime types for all countries of the world. As part of the NSF-funded MADCOW project,* our ultimate goal is to devise a process that routinely updates those data in near-real time at low cost. We’re not there yet, but our preliminary results are promising, and we plan to keep tinkering.

One of the crucial choices we had to make in our initial analysis was how to measure each regime type for the machine-learning phase of the process. This choice is important because our models are only going to be as good as the data from which they’re derived. If the targets in that machine-learning process don’t reliably represent the concepts we have in mind, then the resulting models will be looking for the wrong things.

For our first cut, we decided to use dichotomous measures of several regime types, and to base those dichotomous measures on stringent criteria. So, for example, we identified as democracies only those cases with a score of 10, the maximum, on Polity’s scalar measure of democracy. For military rule, we only coded as 1 those cases where two major data sets agreed that a regime was authoritarian and only military-led, with no hybrids or modifiers. Even though the targets of our machine-learning process were crisply bivalent, we could get fuzzy-set measures from our classifiers by looking at the probabilities of class membership they produce.

In future iterations, though, I’m hoping we’ll get a chance to experiment with targets that are themselves fuzzy or that just take advantage of a larger information set. Bayesian measurement error models offer a great way to generate those targets.

Imagine that you have a set of cases that may or may not belong in some category of interest—say, democracy. Now imagine that you’ve got a set of experts who vote yes (1) or no (0) on the status of each of those cases and don’t always agree. We can get a simple estimate of the probability that a given case is a democracy by averaging the experts’ votes, and that’s not necessarily a bad idea. If, however, we suspect that some experts are more error prone than others, and that the nature of those errors follows certain patterns, then we can do better with a model that gleans those patterns from the data and adjusts the averaging accordingly. That’s exactly what a Bayesian measurement error model does. Instead of an unweighted average of the experts’ votes, we get an inverse-error-rate-weighted average, which should be more reliable than the unweighted version if the assumption about predictable patterns in those errors is largely correct.
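A toy version of that logic, assuming we somehow already knew each source’s error rates, might look like the snippet below; in the real model, those rates are estimated jointly with the case scores, but the weighting idea is the same. The function name and numbers are made up for illustration.

```r
# Toy illustration of error-rate-weighted averaging (not the actual model).
# 'votes' is a 0/1 vector of sources' calls on one case; 'sens' and 'spec' are
# each source's assumed sensitivity and specificity; 'prior' is Pr(democracy).
posterior_democracy <- function(votes, sens, spec, prior = 0.5) {
  like1 <- prod(ifelse(votes == 1, sens, 1 - sens))   # Pr(votes | democracy)
  like0 <- prod(ifelse(votes == 1, 1 - spec, spec))   # Pr(votes | not democracy)
  prior * like1 / (prior * like1 + (1 - prior) * like0)
}

# Three sources say yes, two say no; the dissenters are assumed less reliable,
# so the posterior leans heavily toward "democracy."
posterior_democracy(votes = c(1, 1, 1, 0, 0),
                    sens  = c(0.95, 0.9, 0.9, 0.7, 0.7),
                    spec  = c(0.95, 0.9, 0.9, 0.7, 0.7))
```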

I’m not trained in Bayesian data analysis and don’t know my way around the software used to estimate these models, so I sought and received generous help on this task from Sean J. Taylor. I compiled yes/no measures of democracy from five country-year data sets that ostensibly use similar definitions and coding criteria:

  • Cheibub, Gandhi, and Vreeland’s Democracy and Dictatorship (DD) data set, 1946–2008 (here);
  • Boix, Miller, and Rosato’s dichotomous coding of democracy, 1800–2007 (here);
  • A binary indicator of democracy derived from Polity IV using the Political Instability Task Force’s coding rules, 1800–2013;
  • The lists of electoral democracies in Freedom House’s annual Freedom in the World reports, 1989–2013; and
  • My own Democracy/Autocracy data set, 1955–2010 (here).

Sean took those five columns of zeroes and ones and used them to estimate a model with no prior assumptions about the five sources’ relative reliability. James Melton, Stephen Meserve, and Daniel Pemstein use the same technique to produce the terrific Unified Democracy Scores. What we’re doing is a little different, though. Where their approach treats democracy as a scalar concept and estimates a composite index from several measures, we’re accepting the binary conceptualization underlying our five sources and estimating the probability that a country qualifies as a democracy. In fuzzy-set terms, this probability represents a case’s degree of membership in the democracy set, not how democratic it is.

The distinction between a country’s degree of membership in that set and its degree of democracy is subtle but potentially meaningful, and the former will sometimes be a better fit for an analytic task than the latter. For example, if you’re looking to distinguish categorically between democracies and autocracies in order to estimate the difference in some other quantity across the two sets, it makes more sense to base that split on a probabilistic measure of set membership than an arbitrarily chosen cut point on a scalar measure of democracy-ness. You would still need to choose a threshold, but “greater than 0.5” has a natural interpretation (“probably a democracy”) that suits the task in a way that an arbitrary cut point on an index doesn’t. And, of course, you could still perform a sensitivity analysis by moving the cut point around and seeing how much that choice affects your results.
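As a trivial illustration of that sensitivity check, here is what the cut-point sweep might look like in R, assuming a hypothetical data frame df with the estimated membership probabilities and some outcome of interest:

```r
# 'df' is a hypothetical data frame with columns 'p_democracy' (estimated
# probability of membership in the democracy set) and 'outcome' (whatever
# quantity we want to compare across the two sets).
cuts <- c(0.3, 0.4, 0.5, 0.6, 0.7)
sapply(cuts, function(k) {
  dem <- df$p_democracy > k
  mean(df$outcome[dem], na.rm = TRUE) - mean(df$outcome[!dem], na.rm = TRUE)
})
```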

So that’s the theory, anyway. What about the implementation?

I’m excited to report that the estimates from our initial measurement model of democracy look great to me. As someone who has spent a lot of hours wringing my hands over the need to make binary calls on many ambiguous regimes (Russia in the late 1990s? Venezuela under Hugo Chavez? Bangladesh between coups?), I think these estimates are accurately distinguishing the hazy cases from the rest and even doing a good job estimating the extent of that uncertainty.

As a first check, let’s take a look at the distribution of the estimated probabilities. The histogram below shows the estimates for the period 1989–2007, the only years for which we have inputs from all five of the source data sets. Voilà, the distribution has the expected shape. Most countries most of the time are readily identified as democracies or non-democracies, but the membership status of a sizable subset of country-years is more uncertain.

Estimated Probabilities of Democracy for All Countries Worldwide, 1989-2007

Of course, we can and should also look at the estimates for specific cases. I know a little more about countries that emerged from the collapse of the Soviet Union than I do about the rest of the world, so I like to start there when eyeballing regime data. The chart below compares scores for several of those countries that have exhibited more variation over the past 20+ years. Most of the rest of the post-Soviet states are slammed up against 1 (Estonia, Latvia, and Lithuania) or 0 (e.g., Uzbekistan, Turkmenistan, Tajikistan), so I left them off the chart. I also limited the range of years to the ones for which data are available from all five sources. By drawing strength from other years and countries, the model can produce estimates for cases with fewer or even no inputs. Still, the estimates will be less reliable for those cases, so I thought I would focus for now on the estimates based on a common set of “votes.”

Estimated Probability of Democracy for Selected Soviet Successor States, 1991-2007

Those estimates look about right to me. For example, Georgia’s status is ambiguous and trending less likely until the Rose Revolution of 2003, after which point it’s probably but not certainly a democracy, and the trend bends down again soon thereafter. Meanwhile, Russia is fairly confidently identified as a democracy after the constitutional crisis of 1993, but its status becomes uncertain around the passage of power from Yeltsin to Putin and then solidifies as most likely authoritarian by the mid-2000s. Finally, Armenia was one of the cases I found most difficult to code when building the Democracy/Autocracy data set for the Political Instability Task Force, so I’m gratified to see its probability of democracy oscillating around 0.5 throughout.

One nice feature of a Bayesian measurement error model is that, in addition to estimating the scores, we can also estimate confidence intervals to help quantify our uncertainty about those scores. The plot below shows Armenia’s trend line with the upper and lower bounds of a 90-percent confidence interval. Here, it’s even easier to see just how unclear this country’s democracy status has been since it regained independence. From 1991 until at least 2007, its 90-percent confidence interval straddled the toss-up line. How’s that for uncertain?

Armenia’s Estimated Probability of Democracy with 90% Confidence Interval

Sean and I are still talking about ways to tweak this process, but I think the data it’s producing are already useful and interesting. I’m considering using these estimates in a predictive model of coup attempts and seeing if and how the results differ from ones based on the Polity index and the Unified Democracy Scores. Meanwhile, the rest of the MADCOW crew and I are now talking about applying the same process to dichotomous indicators of military rule, one-party rule, personal rule, and monarchy and then experimenting with machine-learning processes that use the results as their targets. There are lots of moving parts in our regime data-making process, and this one isn’t necessarily the highest priority, but it would be great to get to follow this path and see where it leads.

* NSF Award 1259190, Collaborative Research: Automated Real-time Production of Political Indicators

Mining Texts to Generate Fuzzy Measures of Political Regime Type at Low Cost

Political scientists use the term “regime type” to refer to the formal and informal structure of a country’s government. Of course, “government” entails a lot of things, so discussions of regime type focus more specifically on how rulers are selected and how their authority is organized and exercised. The chief distinction in contemporary work on regime type is between democracies and non-democracies, but there’s some really good work on variations of non-democracy as well (see here and here, for example).

Unfortunately, measuring regime type is hard, and conventional measures of regime type suffer from one or both of two crucial drawbacks.

First, many of the data sets we have now represent regime types or their components with bivalent categorical measures that sweep meaningful uncertainty under the rug. Specific countries at specific times are identified as fitting into one and only one category, even when researchers knowledgeable about those cases might be unsure or disagree about where they belong. For example, all of the data sets that distinguish categorically between democracies and non-democracies—like this one, this one, and this one—agree that Norway is the former and Saudi Arabia the latter, but they sometimes diverge on the classification of countries like Russia, Venezuela, and Pakistan, and rightly so.

Importantly, the degree of our uncertainty about where a case belongs may itself be correlated with many of the things that researchers use data on regime type to study. As a result, findings and forecasts derived from those data are likely to be sensitive to those bivalent calls in ways that are hard to understand when that uncertainty is ignored. In principle, it should be possible to make that uncertainty explicit by reporting the probability that a case belongs in a specific set instead of making a crisp yes/no decision, but that’s not what most of the data sets we have now do.

Second, virtually all of the existing measures are expensive to produce. These data sets are coded either by hand or through expert surveys, and routinely covering the world this way takes a lot of time and resources. (I say this from knowledge of the budgets for the production of some of these data sets, and from personal experience.) Partly because these data are so costly to make, many of these measures aren’t regularly updated. And, if the data aren’t regularly updated, we can’t use them to generate the real-time forecasts that offer the toughest test of our theories and are of practical value to some audiences.

As part of the NSF-funded MADCOW project*, Michael D. (Mike) Ward, Philip Schrodt, and I are exploring ways to use text mining and machine learning to generate measures of regime type that are fuzzier in a good way from a process that is mostly automated. These measures would explicitly represent uncertainty about where specific cases belong by reporting the probability that a certain case fits a certain regime type instead of forcing an either/or decision. Because the process of generating these measures would be mostly automated, they would be much cheaper to produce than the hand-coded or survey-based data sets we use now, and they could be updated in near-real time as relevant texts become available.

At this week’s annual meeting of the American Political Science Association, I’ll be presenting a paper—co-authored with Mike and Shahryar Minhas of Duke University’s WardLab—that describes preliminary results from this endeavor. Shahryar, Mike, and I started by selecting a corpus of familiar and well-structured texts describing politics and human-rights practices each year in all countries worldwide: the U.S. State Department’s Country Reports on Human Rights Practices, and Freedom House’s Freedom in the World. After pre-processing those texts in a few conventional ways, we dumped the two reports for each country-year into a single bag of words and used text mining to extract features from those bags in the form of vectorized tokens that may be grossly described as word counts. (See this recent post for some things I learned from that process.) Next, we used those vectorized tokens as inputs to a series of binary classification models representing a few different ideal-typical regime types as observed in few widely used, human-coded data sets. Finally, we applied those classification models to a test set of country-years held out at the start to assess the models’ ability to classify regime types in cases they had not previously “seen.” The picture below illustrates the process and shows how we hope eventually to develop models that can be applied to recent documents to generate new regime data in near-real time.

Overview of MADCOW Regime Classification Process
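For readers who want a feel for the mechanics, here is a heavily simplified sketch of that pipeline in R, using the tm and e1071 packages. Our actual workflow differs in many details—pre-processing choices, cross-validation, and so on—and the object names here (texts, democracy, train_idx, test_idx) are placeholders.

```r
library(tm)       # corpus construction and document-term matrices
library(e1071)    # support vector machines

# 'texts' is assumed to be a character vector with one combined report per
# country-year, and 'democracy' a 0/1 vector of training labels.
corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)          # vectorized tokens (word counts)
dtm <- removeSparseTerms(dtm, 0.99)        # drop terms that almost never appear
X <- as.matrix(dtm)

# Train on earlier years; hold out the most recent years as the test set
fit <- svm(x = X[train_idx, ], y = factor(democracy[train_idx]),
           kernel = "linear", probability = TRUE)
pred <- predict(fit, X[test_idx, ], probability = TRUE, decision.values = TRUE)
p_democracy <- attr(pred, "probabilities")[, "1"]   # probability of membership
conf_scores <- attr(pred, "decision.values")        # distance from the hyperplane
```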

Our initial results demonstrate that this strategy can work. Our classifiers perform well out of sample, achieving high or very high precision and recall scores in cross-validation on all four of the regime types we have tried to measure so far: democracy, monarchy, military rule, and one-party rule. The separation plots below are based on out-of-sample results from support vector machines trained on data from the 1990s and most of the 2000s and then applied to new data from the most recent few years available. When a classifier works perfectly, all of the red bars in the separation plot will appear to the right of all of the pink bars, and the black line denoting the probability of a “yes” case will jump from 0 to 1 at the point of separation. These classifiers aren’t perfect, but they seem to be working very well.

 

Separation plots for the out-of-sample SVM classifications: democracy, military rule, monarchy, and one-party rule

Of course, what most of us want to do when we find a new data set is to see how it characterizes cases we know. We can do that here with heat maps of the confidence scores from the support vector machines. The maps below show the values from the most recent year available for two of the four regime types: 2012 for democracy and 2010 for military rule. These SVM confidence scores indicate the distance and direction of each case from the hyperplane used to classify the set of observations into 0s and 1s. The probabilities used in the separation plots are derived from them, but we choose to map the raw confidence scores because they exhibit more variance than the probabilities and are therefore easier to visualize in this form.

World maps of SVM confidence scores: democracy (2012) and military rule (2010)

 

On the whole, cases fall out as we would expect them to. The democracy classifier confidently identifies Western Europe, Canada, Australia, and New Zealand as democracies; shows interesting variations in Eastern Europe and Latin America; and confidently identifies nearly all of the rest of the world as non-democracies (democracy having been defined for this task as a Polity score of 10). Meanwhile, the military rule classifier sees Myanmar, Pakistan, and (more surprisingly) Algeria as likely examples in 2010, and is less certain about the absence of military rule in several West African and Middle Eastern countries than in the rest of the world.

These preliminary results demonstrate that it is possible to generate probabilistic measures of regime type from publicly available texts at relatively low cost. That does not mean we’re fully satisfied with the output and ready to move to routine data production, however. For now, we’re looking at a couple of ways to improve the process.

First, the texts included in the relatively small corpus we have assembled so far only cover a narrow set of human-rights practices and political procedures. In future iterations, we plan to expand the corpus to include annual or occasional reports that discuss a broader range of features in each country’s national politics. Eventually, we hope to add news stories to the mix. If we can develop models that perform well on an amalgamation of occasional reports and news stories, we will be able to implement this process in near-real time, constantly updating probabilistic measures of regime type for all countries of the world at very low cost.

Second, the stringent criteria we used to observe each regime type in constructing the binary indicators on which the classifiers are trained also appear to be shaping the results in undesirable ways. We started this project with a belief that membership in these regime categories is inherently fuzzy, and we are trying to build a process that uses text mining to estimate degrees of membership in those fuzzy sets. If set membership is inherently ambiguous in a fair number of cases, then our approximation of a membership function should be bimodal, but not too neatly so. Most cases most of the time can be placed confidently at one end of the range of degrees of membership or the other, but there is considerable uncertainty at any moment in time about a non-trivial number of cases, and our estimates should reflect that fact.

If that’s right, then our initial estimates are probably too tidy, and we suspect that the stringent operationalization of each regime type in the training data is partly to blame. In future iterations, we plan to experiment with less stringent criteria—for example, by identifying a case as military rule if any of our sources tags it as such. With help from Sean J. Taylor, we’re also looking at ways we might use Bayesian measurement error models to derive fuzzy measures of regime type from multiple categorical data sets, and then use that fuzzy measure as the target in our machine-learning process.

So, stay tuned for more, and if you’ll be at APSA this week, please come to our Friday-morning panel and let us know what you think.

* NSF Award 1259190, Collaborative Research: Automated Real-time Production of Political Indicators
