Finding the Right Statistic

Earlier this week, Think Progress reported that at least five black women have died in police custody in the United States since mid-July. The author of that post, Carimah Townes, wrote that those deaths “[shine] an even brighter spotlight on the plight of black women in the criminal justice system and [fuel] the Black Lives Matter movement.” I saw the story on Facebook, where the friend who posted it inferred that “a disproportionate percentage of those who died in jail are from certain ethnic minorities.”

As a citizen, I strongly support efforts to draw attention to implicit and explicit racism in the U.S. criminal justice system, and in the laws that system is supposed to enforce. The inequality of American justice across racial and ethnic groups is a matter of fact, not opinion, and its personal and social costs are steep.

As a social scientist, though, I wondered how much the number in that Think Progress post — five — really tells us. To infer bias, we need to make comparisons to other groups. How many white women died in police custody during that same time? What about black men and white men? And so on for other subsets of interest.

Answering those questions would still get us only partway there, however. To make the comparisons fair, we would also need to know how many people from each of those groups passed through police custody during that time. In epidemiological jargon, what we want are incidence rates for each group: the number of cases from some period divided by the size of the population during that period. Here, cases are deaths, and the population of interest is the number of people from that group who spent time in police custody.
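To make the counts-versus-rates point concrete, here is a minimal sketch in Python with entirely hypothetical numbers (these are not real DOJ figures):

```python
def incidence_rate(cases, population, per=100_000):
    """Cases per `per` people in the population over the same period."""
    return cases / population * per

# Hypothetical numbers, for illustration only: group A has more deaths
# in raw-count terms, but group B has the higher mortality rate.
rate_a = incidence_rate(8, 1_000_000)  # 8 deaths among 1,000,000 in custody
rate_b = incidence_rate(5, 250_000)    # 5 deaths among 250,000 in custody
print(rate_a, rate_b)  # group B's rate is 2.5 times group A's
```

The raw counts alone would point in exactly the wrong direction here, which is the whole problem with comparing counts across groups of different sizes.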

I don’t have those data for the United States for the second half of July, and I doubt that they exist in aggregate at this point. What we do have now, however, is a U.S. Department of Justice report from October 2014 on mortality in local jails and state prisons (PDF). This isn’t exactly what we’re after, but it’s close.

So what do those data say? Here’s an excerpt from Table 6, which reports the “mortality rate per 100,000 local jail inmates by selected decedent characteristics, 2000–2012”:

                    2008     2009     2010     2011     2012
By Sex
Male                 123      129      125      123      129
Female               120      120      124      122      123

Race/Hispanic Origin
White                185      202      202      212      220
Black/Afr. Am.       109      100      102       94      109
Hispanic/Latino       70       71       58       67       60
Other                 41       53       36       28       31

Given what we know about the inequality of American justice, these figures surprised me. According to data assembled by the DOJ, the mortality rate of blacks in local jails in those recent years was about half the rate for whites. For Latinos, it was about one-third the rate for whites.

That table got me wondering why those rates were so different from what I’d expected. Table 8 in the same report offers some clues. It provides death rates by cause for each of those same subgroups for the whole 13-year period. According to that table, white inmates committed suicide in local jails at a much higher rate than blacks and Latinos: 80 per 100,000 versus 14 and 25, respectively. Those figures jibe with ones on suicide rates for the general population. White inmates also died from heart disease and drug and alcohol intoxication at a higher rate than their black and Latino counterparts. In short, it looks like whites are more likely than blacks or Latinos to die while in local jails, mostly because they are much more likely to commit suicide there.

These statistics tell us nothing about whether or not racism or malfeasance played a role in the deaths of any of those five black women mentioned in the Think Progress post. They also provide a woefully incomplete picture of the treatment of different racial and ethnic groups by police and the U.S. criminal justice system. For example, as FiveThirtyEight reported just a few days ago, DOJ statistics also show that the rate of arrest-related deaths by homicide is almost twice as high for blacks as for whites — 3.4 per 100,000 compared to 1.8. In many parts of the U.S., blacks convicted of murder are more likely than their white counterparts to get the death penalty, even when controlling for similarities in the crimes involved and especially when the victims were white (see here). A 2013 Pew Research Center study found that, in 2010, black men were six times as likely as white men to be incarcerated in federal, state, and local jails.

Bearing all of that in mind, what I hope those figures do is serve as a simple reminder that, when mustering evidence of a pattern, it’s important to consider the right statistic for the question. Raw counts will rarely be that statistic. If we want to make comparisons across groups, we need to think about differences in group size and other factors that might affect group exposure, too.

In China, Don’t Mistake the Trees for the Forest

Anyone who pays much attention to news of the world knows that China’s economy is cooling a bit. Official statistics—which probably aren’t true but may still be useful—show annual growth slowing from over 7.5 to around 7 percent or lower and staying there for a while.

For economists, the big question seems to be whether or not policy-makers can control the descent and avoid a hard landing or crash. Meanwhile, political scientists and sociologists wonder whether or not that economic slowdown will spur social unrest that could produce a national political crisis or reform. Most of what I remember reading on the topic has suggested that the risk of large-scale social unrest will remain low as long as China avoids the worst-case economic scenarios. GDP growth in the 6–7 percent range would be a letdown, but it’s still pretty solid compared to most places and is hardly a crisis.

I don’t know enough about economics to wade into that field’s debate, but I do wonder if an ecological fallacy might be leading many political scientists to underestimate the likelihood of significant social unrest in China in response to this economic slowdown. We commit an ecological fallacy when we assume that the characteristics of individuals in a group match the central tendencies of that group—for example, assuming that a kid you meet from a wealthy, high-performing high school is rich and will score well on the SAT. Put another way, an ecological fallacy involves mistakenly assuming that each tree shares the characteristic features of the forest those trees make up.

Now consider the chart below, from a recent article in the Financial Times about the uneven distribution of economic malaise across China’s provinces. As the story notes, “The slowdown has affected some areas far worse than others. Perhaps predictably, the worst-hit places are those that can least afford it.”

The chart reminds us that China is a large and heterogeneous country—and, as it happens, social unrest isn’t a national referendum. You don’t need a majority vote from a whole country to get popular protest that can threaten to reorder national politics; you just need to reach a critical point, and that point can often be reached with a very small fraction of the total population. So, instead of looking at national tendencies to infer national risk, we should look at the tails of the relevant distributions to see if they’re getting thicker or longer. The people and places at the wrong ends of those distributions represent pockets of potential unrest; other things being equal, the more of them there are, the greater the cumulative probability of relevant action.

So how do things look in that thickening tail? Here again is that recent story in the FT:

Last month more than 30 provincial taxi drivers drank poison and collapsed together on the busiest shopping street in Beijing in a dramatic protest against economic and working conditions in their home town.

The drivers, who the police say all survived, were from Suifenhe, a city on the Russian border in the northeastern province of Heilongjiang…

Heilongjiang is among the poorest performers. While national nominal growth slipped to 5.8 per cent in the first quarter compared with a year earlier — its lowest level since the global financial crisis — the province’s nominal GDP actually contracted, by 3.2 per cent.

In the provincial capital of Harbin, signs of economic malaise are everywhere.

The relatively small, ritual protest described at the start of that block quote wouldn’t seem to pose much threat to Communist Party rule, but then neither did Mohamed Bouazizi’s self-immolation in Tunisia in December 2010.

Meanwhile, as the chart below shows, data collected by China Labor Bulletin show that the incidence of strikes and other forms of labor unrest has increased in China in the past year. Each such incident is arguably another roll of the dice that could blow up into a larger and longer episode. Any one event is extremely unlikely to catalyze a larger campaign that might reshape national politics in a significant way, but the more trials run, the higher the cumulative probability.
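The "more trials" intuition can be made concrete with the standard independence approximation. The probabilities below are purely illustrative, not estimates for China:

```python
def prob_at_least_one(p, n):
    """Probability that at least one of n independent incidents, each with
    per-incident escalation probability p, blows up into a larger episode."""
    return 1 - (1 - p) ** n

# Even a tiny per-incident probability compounds as incidents multiply:
print(round(prob_at_least_one(0.001, 100), 3))   # roughly 0.095
print(round(prob_at_least_one(0.001, 1000), 3))  # roughly 0.632
```

Holding the per-incident risk fixed, a tenfold increase in the number of incidents moves the cumulative probability from under 10 percent to nearly two-thirds, which is why the rising incident counts matter even if each one looks trivial.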

Monthly counts of labor incidents in China, January 2012-May 2015 (data source: China Labor Bulletin)

The point of this post is to remind myself and anyone bothering to read it that statistics describing the national economy in the aggregate aren’t a reliable guide to the likelihood of those individual events, and thus of a larger and more disruptive episode, because they conceal important variation in the distribution they summarize. I suspect that most China experts already think in these terms, but I think most generalists (like me) do not. I also suspect that this sub-national variation is one reason why statistical models using country-year data generally find weak association between things like economic growth and inflation on the one hand and demonstrations and strikes on the other. Maybe with better data in the future, we’ll find stronger affirmation of the belief many of us hold that economic distress has a strong effect on the likelihood of social unrest, because we won’t be forced into an ecological fallacy by the limits of available information.

Oh, and by the way: the same goes for Russia.

About That Apparent Decline in Violent Conflict…

Is violent conflict declining, or isn’t it? I’ve written here and elsewhere about evidence that warfare and mass atrocities have waned significantly in recent decades, at least when measured by the number of people killed in those episodes. Not everyone sees the world the same way, though. Bear Braumoeller asserts that, to understand how war-prone the world is, we should look at how likely countries are to use force against politically relevant rivals, and by this measure the rate of warfare has held pretty steady over the past two centuries. Tanisha Fazal argues that wars have become less lethal without becoming less frequent because of medical advances that help keep more people in war zones alive. Where I have emphasized war’s lethal consequences, these two authors emphasize war’s likelihood, but their arguments suggest that violent conflict hasn’t really waned the way I’ve alleged it has.

This week, we got another important contribution to the wider debate in which my shallow contributions are situated. In an updated working paper, Pasquale Cirillo and Nassim Nicholas Taleb claim to show that

Violence is much more severe than it seems from conventional analyses and the prevailing “long peace” theory which claims that violence has declined… Contrary to current discussions…1) the risk of violent conflict has not been decreasing, but is rather underestimated by techniques relying on naive year-on-year changes in the mean, or using sample mean as an estimator of the true mean of an extremely fat-tailed phenomenon; 2) armed conflicts have memoryless inter-arrival times, thus incompatible with the idea of a time trend.

Let me say up front that I only have a weak understanding of the extreme value theory (EVT) models used in Cirillo and Taleb’s paper. I’m a political scientist who uses statistical methods, not a statistician, and I have neither studied nor tried to use the specific techniques they employ.

Bearing that in mind, I think the paper successfully undercuts the most optimistic view about the future of violent conflict—that violent conflict has inexorably and permanently declined—but then I don’t know many people who actually hold that view. Most of the work on this topic distinguishes between the observed fact of a substantial decline in the rate of deaths from political violence and the underlying risk of those deaths and the conflicts that produce them. We can (partly) see the former, but we can’t see the latter; instead, we have to try to infer it from the conflicts that occur. Observed history is, in a sense, a single sample drawn from a distribution of many possible histories, and, like all samples, this one is only a jittery snapshot of the deeper data-generating process in which we’re really interested. What Cirillo and Taleb purport to show is that long sequences of relative peace like the one we have seen in recent history are wholly consistent with a data-generating process in which the risk of war and death from it have not really changed at all.
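One way to see that point without any EVT machinery is a toy simulation: if big wars arrive as a memoryless (Poisson) process with a constant rate, long peaceful stretches still turn up by chance in many simulated histories. Everything below is illustrative and not calibrated to real conflict data:

```python
import random

random.seed(42)

def longest_quiet_spell(rate, horizon):
    """Simulate exponential inter-arrival times over `horizon` years and
    return the longest stretch with no event (truncated at the horizon)."""
    t, longest = 0.0, 0.0
    while True:
        gap = random.expovariate(rate)
        if t + gap >= horizon:
            return max(longest, horizon - t)
        longest = max(longest, gap)
        t += gap

# One 'big war' every 25 years on average, over 200-year histories:
histories = [longest_quiet_spell(1 / 25, 200) for _ in range(2000)]
share = sum(spell > 50 for spell in histories) / len(histories)
print(round(share, 2))  # a large share of trendless histories still
                        # contain a 50+ year peace
```

In other words, observing a multi-decade lull is weak evidence on its own that the underlying data-generating process has changed.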

Of course, the fact that a decades-long decline in violent conflict like the one we’ve seen since World War II could happen by chance doesn’t necessarily mean that it is happening by chance. The situation is not dissimilar to one we see in sports when a batter or shooter seems to go cold for a while. Oftentimes that cold streak will turn out to be part of the normal variation in performance, and the athlete will eventually regress to the mean—but not every time. Sometimes, athletes really do get and stay worse, maybe because of aging or an injury or some other life change, and the cold streak we see is the leading edge of that sustained decline. The hard part is telling in real time which process is happening. To try to do that, we might look for evidence of those plausible causes, but humans are notoriously good at spotting patterns where there are none, and at telling ourselves stories about why those patterns are occurring that turn out to be bunk.

The same logic applies to thinking about trends in violent conflict. Maybe the downward trend in observed death rates is just a chance occurrence in an unchanged system, but maybe it isn’t. And, as Andrew Gelman told Zach Beauchamp, the statistics alone can’t answer this question. Cirillo and Taleb’s analysis, and Braumoeller’s before it, imply that the history we’ve seen in the recent past is about as likely as any other, but that fact isn’t proof of its randomness. Just as rare events sometimes happen, so do systemic changes.

Claims that “This time really is different” are usually wrong, so I think the onus is on people who believe the underlying risk of war is declining to make a compelling argument about why that’s true. When I say “compelling,” I mean an argument that a) identifies specific causal mechanisms and b) musters evidence of change over time in the presence or prevalence of those mechanisms. That’s what Steven Pinker tries at great length to do in The Better Angels of Our Nature, and what Joshua Goldstein did in Winning the War on War.

My own thinking about this issue connects the observed decline in the intensity of violent conflict to the rapid increase in the past 100+ years in the size and complexity of the global economy and the changes in political and social institutions that have co-occurred with it. No, globalization is not new, and it certainly didn’t stop the last two world wars. Still, I wonder if the profound changes of the past two centuries are accumulating into a global systemic transformation akin to the one that occurred locally in now-wealthy societies in which organized violent conflict has become exceptionally rare. Proponents of democratic peace theory see a similar pattern in the recent evidence, but I think they are too quick to give credit for that pattern to one particular stream of change that may be as much consequence as cause of the deeper systemic transformation. I also realize that this systemic transformation is producing negative externalities—climate change and heightened risks of global pandemics, to name two—that could offset the positive externalities or even lead to sharp breaks in other directions.

It’s impossible to say which, if any, of these versions is “true,” but the key point is that we can find real-world evidence of mechanisms that could be driving down the underlying risk of violent conflict. That evidence, in turn, might strengthen our confidence in the belief that the observed pattern has meaning, even if it doesn’t and can’t prove that meaning or any of the specific explanations for it.

Finally, without deeply understanding the models Cirillo and Taleb used, I also wondered when I first read their new paper if their findings weren’t partly an artifact of those models, or maybe some assumptions the authors made when specifying them. The next day, David Roodman wrote something that strengthened this source of uncertainty. According to Roodman, the extreme value theory (EVT) models employed by Cirillo and Taleb can be used to test for time trends, but the ones described in this new paper don’t. Instead, Cirillo and Taleb specify their models in a way that assumes there is no time trend and then use them to confirm that there isn’t. “It seems to me,” Roodman writes, “that if Cirillo and Taleb want to rule out a time trend according to their own standard of evidence, then they should introduce one in their EVT models and test whether it is statistically distinguishable from zero.”

If Roodman is correct on this point, and if Cirillo and Taleb were to do what he recommends and still find no evidence of a time trend, I would update my beliefs accordingly. In other words, I would worry a little more than I do now about the risk of much larger and deadlier wars occurring again in my expected lifetime.

The Stacked-Label Column Plot

Most of the statistical work I do involves events that occur rarely in places over time. One of the best ways to get or give a feel for the structure of data like that is with a plot that shows variation in counts of those events across sequential, evenly sized slices of time. For me, that usually means a sequence of annual, global counts of those events, like the one below for successful and failed coup attempts over the past several decades (see here for the R script that generated that plot and a few others and here for the data):

Annual, global counts of successful and failed coup attempts per the Cline Center’s SPEED Project, 1946-2005

One thing I don’t like about those plots, though, is the loss of information that comes from converting events to counts. Sometimes we want to know not just how many events occurred in a particular year but also where they occurred, and we don’t want to have to query the database or look at a separate table to find out.

I try to do both in one go with a type of column chart I’ll call the stacked-label column plot. Instead of building columns from bricks of identical color, I use blocks of text that describe another attribute of each unit—usually country names in my work, but it could be lots of things. In order for those blocks to have comparable visual weight, they need to be equally sized, which usually means using labels of uniform length (e.g., two- or three-letter country codes) and a fixed-width font like Courier New.

I started making these kinds of plots in the 1990s, using Excel spreadsheets or tables in Microsoft Word to plot things like protest events and transitions to and from democracy. A couple decades later, I’m finally trying to figure out how to make them in R. Here is my first reasonably successful attempt, using data I just finished updating on when countries joined the World Trade Organization (WTO) or its predecessor, the General Agreement on Tariffs and Trade (GATT).
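The underlying logic is simple enough to sketch without any graphics library: bin the events by time slice, then stack fixed-width labels within each bin from the bottom up. Here is a text-mode version of that idea in Python with made-up three-letter codes (my actual plots are built in R):

```python
from collections import defaultdict

def stacked_label_columns(events, width=4):
    """Build a text-mode stacked-label column plot: one column per time
    slice, one fixed-width label per event, stacked from the bottom up."""
    cols = defaultdict(list)
    for year, label in events:
        cols[year].append(label)
    years = sorted(cols)
    height = max(len(stack) for stack in cols.values())
    lines = []
    for row in range(height - 1, -1, -1):  # top of the tallest stack first
        lines.append(" ".join(
            (cols[y][row] if row < len(cols[y]) else "").ljust(width)
            for y in years))
    lines.append(" ".join(str(y).ljust(width) for y in years))  # x axis
    return lines

# Hypothetical events (year, three-letter code), illustrative only:
events = [(1961, "ARG"), (1961, "GHA"), (1961, "KOR"),
          (1962, "ARG"), (1963, "POL")]
print("\n".join(stacked_label_columns(events)))
```

The column heights still show the counts, but the labels preserve the case identities that a plain bar chart throws away.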

Note: Because the Wordpress template I use crams blog-post content into a column that’s only half as wide as the screen, you might have trouble reading the text labels in some browsers. If you can’t make out the letters, try clicking on the plot, then increasing the zoom if needed.

Annual, global counts of countries joining the global free-trade regime, 1960-2014

Without bothering to read the labels, you can see the time trend fine. Since 1960, there have been two waves of countries joining the global free-trade regime: one in the early 1960s, and another in the early 1990s. Those two waves correspond to two spates of state creation, so without the labels, many of us might infer that those stacks are composed mostly or entirely of new states joining.

When we scan the labels, though, we discover a different story. As expected, the wave in the early 1960s does include a lot of newly independent African states, but it also includes a couple of Communist countries (Yugoslavia and Poland) and some middle-income cases from other parts of the world (e.g., Argentina and South Korea). Meanwhile, the wave of the early 1990s turns out to include very few post-Communist countries, most of which didn’t join until the end of that decade or early in the next one. Instead, we see a second wave of “developing” countries joining on the eve of the transition from GATT to the WTO, which officially happened on January 1, 1995. I’m sure people who really know the politics of the global free-trade regime, or of specific cases or regions, can spot some other interesting stories in there, too. The point, though, is that we can’t discover those stories if we can’t see the case labels.

Here’s another one that shows which countries had any coup attempts each year between 1960 and 2014, according to Jonathan Powell and Clayton Thyne’s running list. In this case, color tells us the outcomes of those coup attempts: red if any succeeded, dark grey if they all failed.

Countries with any coup attempts per Powell and Thyne, 1960-2014

One story that immediately catches my eye in this plot is Argentina’s (ARG) remarkable propensity for coups in the early 1960s. It shows up in each of the first four columns, although only in 1962 are any of those attempts successful. Again, this is information we lose when we only plot the counts without identifying the cases.

The way I’m doing it now, this kind of chart requires data to be stored in (or converted to) event-file format, not the time-series cross-sectional format that many of us usually use. Instead of one row per unit–time slice, you want one row for each event. Each row should have at least two columns: one with the case label and one with the time slice in which the event occurred.
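That reshaping is mechanical; here is a minimal sketch of it in Python with made-up records:

```python
# Toy unit-year (country, year, event count) records, illustrative only:
tscs = [("ARG", 1962, 2),
        ("ARG", 1963, 0),
        ("GHA", 1962, 1)]

# Event-file format: one row per event, keeping the case label and the
# time slice in which the event occurred. Zero-count years simply
# contribute no rows.
event_file = [(country, year)
              for country, year, n in tscs
              for _ in range(n)]
print(event_file)  # [('ARG', 1962), ('ARG', 1962), ('GHA', 1962)]
```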

If you’re interested in playing around with these types of plots, you can find the R script I used to generate the ones above here. Perhaps some enterprising soul will take it upon him- or herself to write a function that makes it easy to produce this kind of chart across a variety of data structures.

It would be especially nice to have a function that worked properly when the same label appears more than once in a given time slice. Right now, I’m using the function ‘match’ to assign y values that evenly stack the events within each bin. That doesn’t work for the second or third or nth match, though, because the ‘match’ function always returns the position of the first match in the relevant vector. So, for example, if I try to plot all coup attempts each year instead of all countries with any coup attempts each year, the second or later events in the same country get placed in the same position as the first, which ultimately means they show up as blank spaces in the columns. Sadly, I haven’t figured out yet how to identify location in that vector in a more general way to fix this problem.
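The general fix is to replace the first-match lookup with a running within-bin count of each label, so the nth occurrence of a label gets the nth position in the stack. Here is a sketch of that counter in Python; the same logic would need to be translated into R to slot into my script:

```python
from collections import defaultdict

def stack_positions(labels):
    """Y position for each event in a bin: a running count of how many
    times its label has appeared so far, so repeated labels stack rather
    than all landing on the first match's position."""
    seen = defaultdict(int)
    positions = []
    for label in labels:
        seen[label] += 1
        positions.append(seen[label])
    return positions

print(stack_positions(["ARG", "ARG", "PER", "ARG"]))  # [1, 2, 1, 3]
```

Unlike a first-match lookup, every repeat gets its own slot, so plotting all coup attempts per year rather than all countries with attempts would no longer leave blank spaces in the columns.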

Demography and Democracy Revisited

Last spring on this blog, I used Richard Cincotta’s work on age structure to take another look at the relationship between democracy and “development” (here). In his predictive models of democratization, Rich uses variation in median age as a proxy for a syndrome of socioeconomic changes we sometimes call “modernization” and argues that “a country’s chances for meaningful democracy increase as its population ages.” Rich’s models have produced some unconventional predictions that have turned out well, and if you buy the scientific method, this apparent predictive power implies that the underlying theory holds some water.

Over the weekend, Rich sent me a spreadsheet with his annual estimates of median age for all countries from 1972 to 2015, so I decided to take my own look at the relationship between those estimates and the occurrence of democratic transitions. For the latter, I used a data set I constructed for PITF (here) that covers 1955–2010, giving me a period of observation running from 1972 to 2010. In this initial exploration, I focused specifically on switches from authoritarian rule to democracy, which are observed with a binary variable that covers all country-years where an autocracy was in place on January 1. That variable (rgjtdem) is coded 1 if a democratic regime came into being at some point during that calendar year and 0 otherwise. Between 1972 and 2010, 94 of those switches occurred worldwide. The data set also includes, among other things, a “clock” counting consecutive years of authoritarian rule and an indicator for whether or not the country has ever had a democratic regime before.
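Those last two bookkeeping variables, the autocracy clock and the prior-democracy indicator, can be sketched from a toy regime series like this (the function name and coding are hypothetical, for illustration only):

```python
def add_clocks(regimes):
    """For a yearly series of regime codes (1 = democracy, 0 = autocracy),
    return (consecutive years of autocracy, ever democratic before) pairs."""
    clock, ever_dem, out = 0, 0, []
    for dem in regimes:
        if dem:
            clock, ever_dem = 0, 1
        else:
            clock += 1
        out.append((clock, ever_dem))
    return out

# The clock resets when a democracy appears; the flag sticks afterward:
print(add_clocks([0, 0, 1, 0, 0]))
# [(1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```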

To assess the predictive power of median age and compare it to other measures of socioeconomic development, I used the base and caret packages in R to run 10 iterations of five-fold cross-validation on the following series of discrete-time hazard (logistic regression) models:

  • Base model. Any prior democracy (0/1), duration of autocracy (logged), and the product of the two.
  • GDP per capita. Base model plus the Maddison Project’s estimates of GDP per capita in 1990 Geary-Khamis dollars (here), logged.
  • Infant mortality. Base model plus the U.S. Census Bureau’s estimates of deaths under age 1 per 1,000 live births (here), logged.
  • Median age. Base model plus Cincotta’s estimates of median age, untransformed.
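The validation scheme amounts to something like the following sketch, written here in Python with scikit-learn even though the original analysis used the base and caret packages in R. The data are synthetic stand-ins, with a made-up positive effect of median age on transition risk:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n = 2000
median_age = rng.uniform(15, 45, n)
duration = rng.exponential(15, n)
# Rare binary outcome whose odds rise with median age (illustrative only):
p = 1 / (1 + np.exp(-(-5 + 0.08 * median_age)))
y = rng.binomial(1, p)
X = np.column_stack([median_age, np.log(duration + 1)])

aucs = []
for it in range(10):                        # 10 iterations ...
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=it)
    for train, test in cv.split(X, y):      # ... of five-fold CV
        fit = LogisticRegression().fit(X[train], y[train])
        pred = fit.predict_proba(X[test])[:, 1]
        aucs.append(roc_auc_score(y[test], pred))

print(round(float(np.mean(aucs)), 2))  # mean out-of-fold AUC over 50 folds
```

Scoring only out-of-fold predictions, and repeating the fold assignment ten times, is what lets the density plots below show the spread of AUC as well as its average.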

The chart below shows density plots and averages of the AUC scores (computed with ‘roc.area’ from the verification package) for each of those models across the 10 iterations of five-fold CV. Contrary to the conventional assumption that GDP per capita is a useful predictor of democratic transitions—How many papers have you read that tossed this measure into the model as a matter of course?—I find that the model with the Maddison Project measure actually makes slightly less accurate predictions than the one with duration and prior democracy alone. More relevant to this post, though, the two demographic measures clearly improve the predictions of democratic transitions relative to the base model, and median age adds a smidgen more predictive signal than infant mortality.

Of course, all of these things—national wealth, infant mortality rates, and age structures—have also been changing pretty steadily in a single direction for decades, so it’s hard to untangle the effects of the covariates from other features of the world system that are also trending over time. To try to address that issue and to check for nonlinearity in the relationship, I used Simon Wood’s mgcv package in R to estimate a semiparametric logistic regression model with smoothing splines for year and median age alongside the indicator of prior democracy and regime duration. Plots of the marginal effects of year and median age estimated from that model are shown below. As the left-hand plot shows, the time effect is really a hump in risk that started in the late 1980s and peaked sharply in the early 1990s; it is not the across-the-board post–Cold War increase that we often see captured in models with a dummy variable for years after 1991. More germane to this post, though, we still see a marginal effect from median age, even when accounting for those generic effects of time. Consistent with Cincotta’s argument and other things being equal, countries with higher median age are more likely to transition to democracy than countries with younger populations.


I read these results as a partial affirmation of modernization theory—not the whole teleological and normative package, but the narrower empirical conjecture about a bundle of socioeconomic transformations that often co-occur and are associated with a higher likelihood of attempting and sustaining democratic government. Statistical studies of this idea (including my own) have produced varied results, but the analysis I’m describing here suggests that some of the null results may stem from the authors’ choice of measures. GDP per capita is actually a poor proxy for modernization; there are a number of ways countries can get richer, and not all of them foster (or are fostered by) the socioeconomic transformations that form the kernel of modernization theory (cf. Equatorial Guinea). By contrast, demographic measures like infant mortality rates and median age are more tightly coupled to those broader changes about which Seymour Martin Lipset originally wrote. And, according to my analysis, those demographic measures are also associated with a country’s propensity for democratic transition.

Shifting to the applied forecasting side, I think these results confirm that median age is a useful addition to models of regime transitions, and it seems to capture more information about those propensities than GDP (by a lot) and infant mortality (by a little). Like all slow-changing structural indicators, though, median age is a blunt instrument. Annual forecasts based on it alone would be pretty clunky, and longer-term forecasts would do well to consider other domestic and international forces that also shape (and are shaped by) these changes.

PS. If you aren’t already familiar with modernization theory and want more background, this ungated piece by Sheri Berman for Foreign Affairs is pretty good: “What to Read on Modernization Theory.”

PPS. The code I used for this analysis is now on GitHub, here. It includes a link to the folder on my Google Drive with all of the required data sets.

Statistical Assessments of Coup Risk for 2015

Which countries around the world are more likely to see coup attempts in 2015?

For the fourth year in a row, I’ve used statistical models to generate one answer to that question, where a coup is defined more or less as a forceful seizure of national political authority by military or political insiders. (I say “more or less” because I’m blending data from two sources with slightly different definitions; see below for details.) A coup doesn’t need to succeed to count as an attempt, but it does need to involve public action; alleged plots and rumors of plots don’t qualify. Neither do insurgencies or foreign invasions, which by definition involve military or political outsiders. The heat map below shows variation in estimated coup risk for 2015, with countries colored by quintiles (fifths).


The dot plot below shows the estimates and their 90-percent confidence intervals (CIs) for the 40 countries with the highest estimated risk. The estimates are the unweighted average of forecasts from two logistic regression models; more on those in a sec. To get CIs for estimates from those two models, I took a cue from a forthcoming article by Lyon, Wintle, and Burgman (fourth publication listed here; the version I downloaded last year has apparently been taken down, and I can’t find another) and just averaged the CIs from the two models.


I’ve consistently used simple two- or three-model ensembles to generate these coup forecasts, usually pairing a logistic regression model with an implementation of Random Forests on the same or similar data. This year, I decided to use only a pair of logistic regression models representing somewhat different ideas about coup risk. Consistent with findings from other work in which I’ve been involved (here), k-fold cross-validation told me that Random Forests wasn’t really boosting forecast accuracy, and sticking to logistic regression makes it possible to get and average those CIs. The first model matches one I used last year, and it includes the following covariates:

  • Infant mortality rate. Deaths of children under age 1 per 1,000 live births, relative to the annual global median, logged. This measure primarily reflects national wealth but is also sensitive to variations in quality of life produced by things like corruption and inequality. (Source: U.S. Census Bureau)
  • Recent coup activity. A yes/no indicator of whether or not there have been any coup attempts in that country in the past five years. I’ve tried logged event counts and longer windows, but this simple version contains as much predictive signal as any other. (Sources: Center for Systemic Peace and Powell and Thyne)
  • Political regime type. Following Fearon and Laitin (here), a categorical measure differentiating between autocracies, anocracies, democracies, and other forms. (Source: Center for Systemic Peace, with hard-coded updates for 2014)
  • Regime durability. The “number of years since the last substantive change in authority characteristics (defined as a 3-point change in the POLITY score).” (Source: Center for Systemic Peace, with hard-coded updates for 2014)
  • Election year. A yes/no indicator for whether or not any national elections (executive, legislative, or general) are scheduled to take place during the forecast year. (Source: NELDA, with hard-coded updates for 2011–2015)
  • Economic growth. The previous year’s annual GDP growth rate. To dampen the effects of extreme values on the model estimates, I take the square root of the absolute value and then multiply that by -1 for cases where the raw value is less than 0. (Source: IMF)
  • Political salience of elite ethnicity. A yes/no indicator for whether or not the ethnic identity of national leaders is politically salient. (Source: PITF, with hard-coded updates for 2014)
  • Violent civil conflict. A yes/no indicator for whether or not any major armed civil or ethnic conflict is occurring in the country. (Source: Center for Systemic Peace, with hard-coded updates for 2014)
  • Country age. Years since country creation or independence, logged. (Source: me)
  • Coup-tagion. Two variables representing (logged) counts of coup attempts during the previous year in other countries around the world and in the same geographic region. (Source: me)
  • Post–Cold War period. A binary variable marking years after the disintegration of the USSR in 1991.
  • Colonial heritage. Three separate binary indicators identifying countries that were last colonized by Great Britain, France, or Spain. (Source: me)
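The growth transformation described in that list (square root of the absolute value, with the sign restored for contractions) is easy to sketch in Python:

```python
import math

def dampen_growth(rate):
    """Signed square root: shrinks extreme GDP growth values
    while preserving the sign of the original rate."""
    transformed = math.sqrt(abs(rate))
    return -transformed if rate < 0 else transformed

# e.g., a 9-percent contraction maps to -3.0, a 4-percent expansion to 2.0
dampen_growth(-9.0)  # -3.0
dampen_growth(4.0)   # 2.0
```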

The second model takes advantage of new data from Geddes, Wright, and Frantz on autocratic regime types (here) to consider how qualitative differences in political authority structures and leadership might shape coup risk—both directly, and indirectly by mitigating or amplifying the effects of other things. Here’s the full list of covariates in this one:

  • Infant mortality rate. Deaths of children under age 1 per 1,000 live births, relative to the annual global median, logged. This measure primarily reflects national wealth but is also sensitive to variations in quality of life produced by things like corruption and inequality. (Source: U.S. Census Bureau)
  • Recent coup activity. A yes/no indicator of whether or not there have been any coup attempts in that country in the past five years. I’ve tried logged event counts and longer windows, but this simple version contains as much predictive signal as any other. (Sources: Center for Systemic Peace and Powell and Thyne)
  • Regime type. Using the binary indicators included in the aforementioned data from Geddes, Wright, and Frantz with hard-coded updates for the period 2011–2014, a series of variables differentiating between the following:
    • Democracies
    • Military autocracies
    • One-party autocracies
    • Personalist autocracies
    • Monarchies
  • Regime duration. Number of years since the last change in political regime type, logged. (Source: Geddes, Wright, and Frantz, with hard-coded updates for the period 2011–2014)
  • Regime type * regime duration. Interactions to condition the effect of regime duration on regime type.
  • Leader’s tenure. Number of years the current chief executive has held that office, logged. (Source: PITF, with hard-coded updates for 2014)
  • Regime type * leader’s tenure. Interactions to condition the effect of leader’s tenure on regime type.
  • Election year. A yes/no indicator for whether or not any national elections (executive, legislative, or general) are scheduled to take place during the forecast year. (Source: NELDA, with hard-coded updates for 2011–2015)
  • Regime type * election year. Interactions to condition the effect of election years on regime type.
  • Economic growth. The previous year’s annual GDP growth rate. To dampen the effects of extreme values on the model estimates, I take the square root of the absolute value and then multiply that by -1 for cases where the raw value is less than 0. (Source: IMF)
  • Regime type * economic growth. Interactions to condition the effect of economic growth on regime type.
  • Post–Cold War period. A binary variable marking years after the disintegration of the USSR in 1991.

As I’ve done for the past couple of years, I used event lists from two sources—the Center for Systemic Peace (about halfway down the page here) and Jonathan Powell and Clayton Thyne (Dataset 3 here)—to generate the historical data on which those models were trained. Country-years are the unit of observation in this analysis, so a country-year is scored 1 if either CSP or P&T saw any coup attempts there during those 12 months and 0 otherwise. The plot below shows annual counts of successful and failed coup attempts in countries worldwide from 1946 through 2014 according to the two data sources. There is a fair amount of variance in the annual counts and the specific events that comprise them, but the basic trend over time is the same. The incidence of coup attempts rose in the 1950s; spiked in the early 1960s; remained relatively high throughout the rest of the Cold War; declined in the 1990s, after the Cold War ended; and has remained relatively low throughout the 2000s and 2010s.
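That outcome coding is just a logical OR across the two event lists. Here is a minimal Python sketch with made-up data (the actual work was done in R, and the country-year values below are hypothetical):

```python
import pandas as pd

# Hypothetical country-year coup indicators from the two event lists
csp = pd.DataFrame({"country": ["A", "A", "B", "B"],
                    "year": [2013, 2014, 2013, 2014],
                    "csp_coup": [1, 0, 0, 0]})
pt = pd.DataFrame({"country": ["A", "A", "B", "B"],
                   "year": [2013, 2014, 2013, 2014],
                   "pt_coup": [0, 0, 0, 1]})

merged = csp.merge(pt, on=["country", "year"], how="outer").fillna(0)
# Score 1 if either CSP or P&T saw any coup attempt that year, else 0
merged["any_coup"] = ((merged["csp_coup"] == 1) | (merged["pt_coup"] == 1)).astype(int)
```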

Annual counts of coup events worldwide from two data sources, 1946-2014

I’ve been posting annual statistical assessments of coup risk on this blog since early 2012; see here, here, and here for the previous three iterations. I have rejiggered the modeling a bit each year, but the basic process (and the person designing and implementing it) has remained the same. So, how accurate have these forecasts been?

The table below reports areas under the ROC curve (AUC) and Brier scores (the 0–1 version) for the forecasts from each of those years and their averages, using the two coup event data sources alone and together as different versions of the observed ground truth. Focusing on the “either” columns, because that’s what I’m usually using when estimating the models, we can see that the average accuracy—AUC in the low 0.80s and a Brier score of about 0.03—is comparable to what we see in many other country-year forecasts of rare political events using a variety of modeling techniques (see here). With the AUC, we can also see a downward trend over time. With so few events involved, though, three years is too few to confidently deduce a trend, and those averages are consistent with what I typically see in k-fold cross-validation. So, at this point, I suspect those swings are just normal variation.
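Both accuracy statistics are easy to compute from scratch. Here is a hedged Python sketch with hypothetical forecasts and outcomes, not the actual ones:

```python
import numpy as np

def auc_score(y, p):
    """Probability that a randomly chosen event received a higher
    forecast than a randomly chosen non-event (ties count half)."""
    y, p = np.asarray(y), np.asarray(p)
    pos, neg = p[y == 1], p[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def brier_score(y, p):
    """Mean squared difference between forecast probabilities and outcomes."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return float(np.mean((p - y) ** 2))

# Hypothetical forecasts and outcomes for ten country-years
p_hat = [0.02, 0.40, 0.08, 0.15, 0.01, 0.55, 0.03, 0.10, 0.06, 0.25]
y_obs = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
```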

AUC and Brier scores for coup forecasts posted on Dart-Throwing Chimp, 2012-2014, by coup event data source

The separation plot designed by Greenhill, Ward, and Sacks (here) offers a nice way to visualize the accuracy of these forecasts. The ones below show the three annual slices using the “either” version of the outcome, and they reinforce the story told in the table: the forecasts have correctly identified most of the countries that saw coup attempts in the past three years as relatively high-risk cases, but the accuracy has declined over time. Let’s define a surprise as a case that fell outside the top 30 of the ordered forecasts but still saw a coup attempt. In 2012, just one of four countries that saw coup attempts was a surprise: Papua New Guinea, ranked 48. In 2013, that number increased to two of five (Eritrea at 51 and Egypt at 58), and in 2014 it rose to three of five (Burkina Faso at 42, Ukraine at 57, and the Gambia at 68). Again, though, the average accuracy across the three years is consistent with what I typically see in k-fold cross-validation of these kinds of models in the historical data, so I don’t think we should make too much of that apparent time trend just yet.

Separation plots for the 2012, 2013, and 2014 coup forecasts

This year, for the first time, I am also running an experiment in crowdsourcing coup risk assessments by way of a pairwise wiki survey (survey here, blog post explaining it here, and preliminary results discussed here). My long-term goal is to repeat this process numerous times on this topic and some others (for example, onsets of state-led mass killing episodes) to see how the accuracy of the two approaches compares and how their output might be combined. Statistical forecasts are usually much more accurate than human judgment, but that advantage may be reduced or eliminated when we aggregate judgments from large and diverse crowds, or when we don’t have data on important features to use in those statistical models. Models that use annual data also suffer in comparison to crowdsourcing processes that can update continuously, as that wiki survey does (albeit with a lot of inertia).

We can’t incorporate the output from that wiki survey into the statistical ensemble, because the survey doesn’t generate predicted probabilities; it only assesses relative risk. We can, however, compare the rank orderings the two methods produce. The plot below juxtaposes the rankings produced by the statistical models (left) with the ones from the wiki survey (right). About 500 votes have been cast since I wrote up the preliminary results, but I’m going to keep things simple for now and use the preliminary survey results I already wrote up. The colored arrows identify cases ranked at least 10 spots higher (red) or lower (blue) by the crowd than the statistical models. As the plot shows, there are many differences between the two, even toward the top of the rankings where the differences in statistical estimates are bigger and therefore more meaningful. For example, the crowd sees Nigeria, Libya, and Venezuela as top 10 risks while the statistical models do not; of those three, only Nigeria ranks in the top 30 on the statistical forecasts. Meanwhile, the crowd pushes Niger and Guinea-Bissau out of the top 10 down to the 20s, and it sees Madagascar, Afghanistan, Egypt, and Ivory Coast as much lower risks than the models do. Come 2016, it will be interesting to see which version was more accurate.
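The red and blue flags in that plot come from a simple rank comparison. Here is a Python sketch with hypothetical ranks, not the actual model or survey output:

```python
# Hypothetical rank orderings from the two methods (lower = riskier)
model_rank = {"Nigeria": 25, "Libya": 40, "Niger": 8, "Egypt": 15}
crowd_rank = {"Nigeria": 9, "Libya": 10, "Niger": 22, "Egypt": 44}

moves = {}
for country in model_rank:
    diff = model_rank[country] - crowd_rank[country]
    if diff >= 10:
        moves[country] = "higher"  # crowd ranks it at least 10 spots riskier
    elif diff <= -10:
        moves[country] = "lower"   # crowd ranks it at least 10 spots safer
```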


If you are interested in getting hold of the data or R scripts used to produce these forecasts and figures, please send me an email at ulfelder at gmail dot com.

A Few Rules of Thumb for Data Munging in Political Science

1. However hard you think it will be to assemble a data set for a particular analysis, it will be exponentially harder, with the size of the exponent determined by the scope and scale of the required data.

  • Corollary: If the data you need would cover the world (or just poor countries), they probably don’t exist.
  • Corollary: If the data you need would extend very far back in time, they probably don’t exist.
  • Corollary: If the data you need are politically sensitive, they probably don’t exist. If they do exist, you probably can’t get them. If you can get them, you probably shouldn’t trust them.

2. However reliable you think your data are, they probably aren’t.

  • Corollary: A couple of digits after the decimal point is plenty. With data this noisy, what do those thousandths really mean, anyway?

3. Just because a data transformation works doesn’t mean it’s doing what you meant it to do.

4. The only really reliable way to make sure that your analysis is replicable is to have someone previously unfamiliar with the work try to replicate it. Unfortunately, a person’s incentive to replicate someone else’s work is inversely correlated with his or her level of prior involvement in the project. Ergo, this will rarely happen until after you have posted your results.

5. If your replication materials will include random parts (e.g., sampling) and you’re using R, don’t forget to set the seed for random number generation at the start. (Alas, I am living this mistake today.)
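The same rule applies outside R, and the fix is equally short. In Python, for instance (the seed value here is arbitrary):

```python
import random

import numpy as np

# Fix the generators once, at the top of the script, so any sampling
# steps reproduce exactly when someone re-runs the analysis.
random.seed(20150106)
np.random.seed(20150106)

draw = random.sample(range(100), 5)  # now reproducible across runs
```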

Please use the Comments to suggest additions, corrections, or modifications.

Post Mortem on 2014 Preseason NFL Forecasts

Let’s end the year with a whimper, shall we?

Back in September (here), I used a wiki survey to generate a preseason measure of pro-football team strength and then ran that measure through a statistical model and some simulations to gin up forecasts for all 256 games of the 2014 regular season. That season ended on Sunday, so now we can see how those forecasts turned out.

The short answer: not awful, but not so great, either.

To assess the data and model’s predictive power, I’m going to focus on predicted win totals. Based on my game-level forecasts, how many contests was each team expected to win? Those totals nicely summarize the game-level predictions, and they are the focus of StatsbyLopez’s excellent post-season predictive review, here, against which I can compare my results.

StatsbyLopez used two statistics to assess predictive accuracy: mean absolute error (MAE) and mean squared error (MSE). The first is the average of the distance between each team’s projected and observed win totals. The second is the average of the square of those distances. MAE is a little easier to interpret—on average, how far off was each team’s projected win total?—while MSE punishes larger errors more than MAE does, which is nice if you care about how noisy your predictions are. StatsbyLopez used those stats to compare five sets of statistical predictions to the preseason betting line (Vegas) and a couple of simple benchmarks: last year’s win totals and a naive forecast of eight wins for everyone, which is what you’d expect to get if you just flipped a coin to pick winners.
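For readers who want the formulas spelled out, here is a quick Python sketch of both statistics, using made-up win totals:

```python
# Hypothetical projected and observed win totals for four teams
projected = [10.5, 7.0, 9.0, 5.5]
observed = [12, 6, 9, 9]

errors = [p - o for p, o in zip(projected, observed)]
mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)  # mean squared error
```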

Lopez’s post includes some nice column charts comparing those stats across sources, but it doesn’t include the stats themselves, so I’m going to have to eyeball his numbers and do the comparison in prose.

I summarized my forecasts two ways: 1) counts of the games each team had a better-than-even chance of winning, and 2) sums of each team’s predicted probabilities of winning.

  • The MAE for my whole-game counts was 2.48—only a little bit better than the ultra-naive eight-wins-for-everyone prediction and worse than everything else, including just using last year’s win totals. The MSE for those counts was 8.89, still worse than everything except the simple eights. For comparison, the MAE and MSE for the Vegas predictions were roughly 2.0 and 6.0, respectively.
  • The MAE for my sums was 2.31—about as good as the worst of the five “statsheads” Lopez considered, but still a shade worse than just carrying forward the 2013 win totals. The MSE for those summed win probabilities, however, was 7.05. That’s better than one of the sources Lopez considered and pretty close to two others, and it handily beats the two naive benchmarks.
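Those two summaries are easy to reproduce. Here is a Python sketch using hypothetical game-level win probabilities, not my actual forecasts:

```python
# Hypothetical predicted win probabilities for one team's 16 games
p_win = [0.75, 0.40, 0.62, 0.55, 0.30, 0.81, 0.48, 0.66,
         0.52, 0.25, 0.70, 0.45, 0.58, 0.35, 0.90, 0.60]

# 1) Whole-game counts: tally the games with a better-than-even chance
count_wins = sum(1 for p in p_win if p > 0.5)
# 2) Summed probabilities: the expected number of wins
expected_wins = sum(p_win)
```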

To get a better sense of how large the errors in my forecasts were and how they were distributed, I also plotted the predicted and observed win totals by team. In the charts below, the black dots are the predictions, and the red dots are the observed results. The first plot uses the whole-game counts; the second the summed win probabilities. Teams are ordered from left to right according to their rank in the preseason wiki survey.

Predicted (black) and observed (red) 2014 regular-season win totals by team using whole-game counts

Predicted (black) and observed (red) 2014 regular-season win totals by team using summed win probabilities

Substantively, those charts spotlight some things most football fans could already tell you: Dallas and Arizona were the biggest positive surprises of the 2014 regular season, while San Francisco, New Orleans, and Chicago were probably the biggest disappointments. Detroit and Buffalo also exceeded expectations, although only one of them made it to the postseason, while Tampa Bay, Tennessee, the NY Giants, and the Washington football team under-performed.

Statistically, it’s interesting but not surprising that the summed win probabilities do markedly better than the whole-game counts. Pro football is a noisy game, and we throw out a lot of information about the uncertainty of each contest’s outcome when we convert those probabilities into binary win/lose calls. In essence, those binary calls are inherently overconfident, so the win counts they produce are, predictably, much noisier than the ones we get by summing the underlying probabilities.

In spite of its modest performance in 2014, I plan to repeat this exercise next year. The linear regression model I use to convert the survey results into game-level forecasts has home-field advantage and the survey scores as its only inputs. The 2014 version of that model was estimated from just a single prior season’s data, so doubling the size of the historical sample to 512 games will probably help a little. Like all survey results, my team-strength score depends on the pool of respondents, and I keep hoping to get a bigger and better-informed crowd to participate in that part of the exercise. And, most important, it’s fun!

Is Algorithmic Judgment Creepy or Wonderful?

For the Nieman Lab’s Predictions for Journalism 2015, Zeynep Tufekci writes that

We’re seeing the birth of a new era, the era of judging machines: machines that calculate not just how to quickly sort a database, or perform a mathematical calculation, but to decide what is “best,” “relevant,” “appropriate,” or “harmful.”

Tufekci believes we’re increasingly “creeped out” by this trend, and she thinks that’s appropriate. It’s not the algorithms themselves that bother her so much as the noiselessness of their presence. Decisions are constantly being made for us without our even realizing it, and those decisions could reshape our lives.

Or, in some cases, save them. At FiveThirtyEight, Andrew Flowers reports on the U.S. Army’s efforts to apply machine-learning techniques to large data sets to develop a predictive tool—an algorithm—that can accurately identify soldiers at highest risk of attempting suicide. The Army has a serious suicide problem, and an algorithm that can help clinicians decide which individuals require additional interventions could help mitigate that problem. The early results are promising:

The model’s predictive abilities were impressive. Those soldiers who were rated in the top 5 percent of risk were responsible for 52 percent of all suicides — they were the needles, and the Army was starting to find them.

So which is it? Are algorithmic interventions creepy or wonderful?

I’ve been designing and hawking algorithms to help people assess risks for more than 15 years, so it won’t surprise anyone to hear that I tilt toward the “wonderful” camp. Maybe it’s just the paychecks talking, but consciously, at least, my defense of algorithms starts from the fact that we humans consistently overestimate the power of our intuition. As researchers like Paul Meehl and Phil Tetlock keep showing, we’re not nearly as good at compiling and assessing information as we think we are. So, the baseline condition—unassisted human judgment—is often much worse than we recognize, and there’s lots of room to improve.

Flowers’ story on the Army’s suicide risk-detection efforts offers a case in point. As Flowers notes, “The Army is constructing a high-tech weapon to fight suicide because it’s losing the battle against it.” The status quo, in which clinicians make judgments about these risks without the benefit of explicit predictive modeling, is failing to stem the increase in soldiers’ suicide rates. Under those conditions, the risk-assessing algorithm doesn’t have to work perfectly to have some positive effect. Moving the needle even a little bit in the right direction could save dozens of soldiers’ lives each year.

Where I agree strongly with Tufekci is on the importance of transparency. I want to have algorithms helping me decide what’s most relevant and what the best course of action might be, but I also want to know where and when those algorithms are operating. As someone who builds these kinds of tools, I also want to be able to poke around under the hood. The latter won’t always be possible in the commercial world—algorithms are a form of trade knowledge, and I understand the need for corporations (and freelancers!) to protect their comparative advantages—but informed consent should be a given.

Wisdom of Crowds FTW

I’m a cyclist who rides indoors a fair amount, especially in cold or wet weather. A couple of months ago, I bought an indoor cycle with a flywheel and a power meter. For the past several years, I’d been using the kind of trainer you attach to the back wheel of your bike for basement rides. Now, though, my younger son races, so I wanted something we could both use without too much fuss, and his coach wants to see power data from his home workouts.

To train properly with a power meter, I need to benchmark my current fitness. The conventional benchmark is Functional Threshold Power (FTP), which you can estimate from your average power output over a 20-minute test. To get the best estimate, you need to go as hard as you can for the full 20 minutes. To do that, you need to pace yourself. Go out too hard and you’ll blow up partway through. Go out too easy and you’ll probably end up lowballing yourself.
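For context, the usual convention (not spelled out above) pegs FTP at roughly 95 percent of your 20-minute average power. A quick sketch, with that multiplier flagged as a rule of thumb rather than anything from my test:

```python
def estimate_ftp(avg_power_20min, scale=0.95):
    """Estimate Functional Threshold Power from a 20-minute test.
    The 0.95 multiplier is the common rule of thumb, not a law."""
    return avg_power_20min * scale

estimate_ftp(314)  # about 298 watts
```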

Once you have an estimate of your FTP, that pacing is easy to do: just ride at the wattage you expect to average. But what do you do when you’re taking the test for the first time?

I decided to solve that problem by appealing to the wisdom of the crowd. When I ride outdoors, I often ride with the same group, and many of those guys train with power meters. That means they know me and they know power data. Basically, I had my own little panel of experts.

Early this week, I emailed that group, told them how much I weigh (about 155 lbs), and asked them to send me estimates of the wattage they thought I could hold for 20 minutes. Weight matters because power covaries with it. What the other guys observe is my speed, which is a function of power relative to weight. So, to estimate power based on observed speed, they need to know my weight, too.

I got five responses that ranged from 300 to 350. Based on findings from the Good Judgment Project, I decided to use the median of those five guesses—314—as my best estimate.
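The aggregation step is as simple as it sounds. Here it is in Python, with hypothetical guesses, since I only reported the range and the median above:

```python
import statistics

# Hypothetical guesses chosen to match the reported range (300-350)
# and median (314); the actual five responses aren't listed here.
guesses = [300, 310, 314, 330, 350]
best_estimate = statistics.median(guesses)
```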

I did the test on Tuesday. After 15 minutes of easy spinning, I did 3 x 30 sec at about 300W with 30 sec easy in between, then another 2 min easy, then 3 min steady above 300W, then 7 min easy, and then I hit it. Following emailed advice from Dave Guttenplan, who sometimes rides with our group, I started out a little below my target, then ramped up my effort after about 5 min. At the halfway point, I peeked at my interval data and saw that I was averaging 310W. With 5 min to go, I tried to up the pace a bit more. With 1 min to go, I tried to dial up again and found I couldn’t go much harder. No finish-line sprint for me. When the 20-minute mark finally arrived, I hit the “interval” button, dialed the resistance down, and spent the next minute or so trying not to barf—a good sign that I’d given it just about all I had.

And guess what the final average was: 314!

Now, you might be thinking I tried to hit that number because it makes for a good story. Of course I was using the number as a guideline, but I’m as competitive as the next guy, so I was actually pretty motivated to outperform the group’s expectations. Over the last few minutes of the test, I was getting a bit cross-eyed, too, and I don’t remember checking the output very often.

This result is also partly coincidence. Even the best power meters have a margin of error of about 2 percent, and that’s assuming they’re properly calibrated. So the best I can say is that my average output from that test was probably around 314W, give or take several watts.
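That 2-percent margin translates into a band of several watts around the reading:

```python
reading = 314  # watts: 20-minute average from the test
margin = 0.02  # roughly 2 percent meter error
low = reading * (1 - margin)
high = reading * (1 + margin)
# roughly 308 to 320 watts, before any calibration error
```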

Still, as an applied stats guy who regularly works with “wisdom of crowds” systems, I thought this was a great illustration of those methods’ utility. In this case, the remarkable accuracy of the crowd-based estimate surely had a lot to do with the crowd’s expertise. I only got five guesses, but they came from people who know a lot about me as a rider and whose experience training with power and looking at other riders’ numbers has given them a strong feel for the distribution of these stats. If I’d asked a much bigger crowd who didn’t know me or the data, I suspect the estimate would have missed badly (like this one). Instead, I got just what I needed.

