An Applied Forecaster’s Bad Dream

This is the sort of thing that freaks me out every time I’m getting ready to deliver or post a new set of forecasts:

In its 2015 States of Fragility report, the Organization for Economic Co-operation and Development (OECD) decided to complicate its usual one-dimensional list of fragile states by assessing five dimensions of fragility: Violence, Justice, Institutions, Economic Foundations and Resilience…

Unfortunately, something went wrong during the calculations. In my attempts to replicate the assessment, I found that the OECD misclassified a large number of states.

That’s from a Monkey Cage post by Thomas Leo Scherer, published today. Here, per Scherer, is why those errors matter:

Recent research by Judith Kelley and Beth Simmons shows that international indicators are an influential policy tool. Indicators focus international attention on low performers to positive and negative effect. They cause governments in poorly ranked countries to take action to raise their scores when they realize they are being monitored or as domestic actors mobilize and demand change after learning how they rate versus other countries. Given their potential reach, indicators should be handled with care.

For individuals or organizations involved in scientific or public endeavors, the best way to mitigate that risk is transparency. We can and should argue about concepts, measures, and model choices, but given a particular set of those elements, we should all get essentially the same results. When one or more of those elements is hidden, we can’t fully understand what the reported results represent, and researchers who want to improve the design by critiquing and perhaps extending it are forced to shadow-box. Also, individuals and organizations can double- and triple-check their own work, but errors are almost inevitable. When getting the best possible answers matters more than the risk of being seen making mistakes, then transparency is the way to go. This is why the Early Warning Project shares the data and code used to produce its statistical risk assessments in a public repository, and why Reinhart and Rogoff probably (hopefully?) wish they’d done something similar.

Of course, even though transparency improves the probability of catching errors and improving on our designs, it doesn’t automatically produce those goods. What’s more, we can know that we’re doing the right thing and still dread the public discovery of an error. Add to that risk the near-certainty of other researchers scoffing at your terrible code, and it’s easy to see why even the best practices won’t keep you from breaking out in a cold sweat each time you hit “Send” or “Publish” on a new piece of work.

 

The Myth of Comprehensive Data

“What about using Twitter sentiment?”

That suggestion came to me from someone at a recent Data Science DC meetup, after I’d given a short talk on assessing risks of mass atrocities for the Early Warning Project, and as the next speaker started his presentation on predicting social unrest. I had devoted the first half of my presentation to a digression of sorts, talking about how the persistent scarcity of relevant public data still makes it impossible to produce global forecasts of rare political crises—things like coups, insurgencies, regime breakdowns, and mass atrocities—that are as sharp and dynamic as we would like.

The meetup wasn’t the first time I’d heard that suggestion, and I think all of the well-intentioned people who have made it to me have believed that data derived from Twitter would escape or overcome those constraints. In fact, the Twitter stream embodies them. Over the past two decades, technological, economic, and political changes have produced an astonishing surge in the amount of information available from and about the world, but that surge has not occurred evenly around the globe.

Think of the availability of data as plant life in a rugged landscape, where dry peaks are places of data scarcity and fertile valleys represent data-rich environments. The technological developments of the past 20 years are like a weather pattern that keeps dumping more and more rain on that topography. That rain falls unevenly across the landscape, however, and it doesn’t have the same effect everywhere it lands. As a result, plants still struggle to grow on many of those rocky peaks, and much of the new growth occurs where water already collected and flora were already flourishing.

The Twitter stream exemplifies this uneven distribution of data in a couple of important ways. Take a look at the map below, a screenshot I took after letting Tweetping run for about 16 hours spanning May 6–7, 2015. The brighter the glow, the more Twitter activity Tweetping saw.

[Figure: Tweetping map of global Twitter activity, 15:30 May 6 to 08:05 May 7, 2015]

Some of the spatial variation in that map reflects differences in the distribution of human populations, but not all of it. Here’s a map of population density, produced by Daysleeper using data from CIESIN (source). If you compare this one to the map of Twitter usage, you’ll see that they align pretty well in Europe, the Americas, and some parts of Asia. In Africa and other parts of Asia, though, not so much. If it were just a matter of population density, then India and eastern China should burn brightest, but they—and especially China—are relatively dark compared to “the West.” Meanwhile, in Africa, we see pockets of activity, but there are whole swathes of the continent that are populated as densely as, or more densely than, the brighter parts of South America but from which we see virtually no Twitter activity.

[Figure: map of world population density]

So why are some pockets of human settlement less visible than others? Two forces stand out: wealth and politics.

First and most obvious, access to Twitter depends on electricity and telecommunications infrastructure and gadgets and literacy and health and time, all of which are much scarcer in poorer parts of the world than they are in richer places. The map below shows lights at night, as seen from space by U.S. satellites 20 years ago and then mapped by NASA (source). These light patterns are sometimes used as a proxy for economic development (e.g., here).

[Figure: composite satellite image of Earth’s lights at night (NASA)]

This view of the world helps explain some of the holes in our map of Twitter activity, but not all of them. For example, many of the densely populated parts of Africa don’t light up much at night, just as they don’t on Tweetping, because they lack the relevant infrastructure and power production. Even 20 years ago, though, India and China looked much brighter through this lens than they do on our Twitter usage map.

So what else is going on? The intensity and character of Twitter usage also depends on freedoms of information and speech—the ability and desire to access the platform and to speak openly on it—and this political layer keeps other areas in the dark in that Tweetping map. China, North Korea, Cuba, Ethiopia, Eritrea—if you’re trying to anticipate important political crises, these are all countries you would want to track closely, but Twitter is barely used or unavailable in all of them as a direct or indirect consequence of public policy. And, of course, there are also many places where Twitter is accessible and used but censorship distorts the content of the stream. For example, Saudi Arabia lights up pretty well on the Twitter-usage map, but it’s hard to imagine people speaking freely on it when a tweet can land you in prison.

Clearly, wealth and political constraints still strongly shape the view of the world we can get from new data sources like Twitter. Contrary to the heavily marketed myth of “comprehensive data,” poverty and repression continue to hide large swathes of the world from our digital sight, or to distort the glimpses we get of them.

Unfortunately for efforts to forecast rare political crises, those two structural features that so strongly shape the production and quality of data also correlate with the risks we want to anticipate. The map below shows the Early Warning Project’s most recent statistical assessments of the risk of onsets of state-led mass-killing episodes. Now flash back to the visualization of Twitter usage above, and you’ll see that many of the countries colored most brightly on this map are among the darkest on that one. Even in 2015, the places about which we most need more information to sharpen our forecasts of rare political crises are the ones that are still hardest to see.

[Figure: Early Warning Project statistical risk assessments of state-led mass-killing onset, 2014]

Statistically, this is the second-worst of all possible worlds, the worst one being the total absence of information. Data are missing not at random, and the processes producing those gaps are the same ones that put places at greater risk of mass atrocities and other political calamities. This association means that models we estimate with those data will often be misleading. There are ways to mitigate these problems, but they aren’t necessarily simple, cheap, or effective, and that’s before we even start in on the challenges of extracting useful measures from something as heterogeneous and complex as the Twitter stream.

So that’s what I see when I hear people suggest that social media or Google Trends or other forms of “digital exhaust” have mooted the data problems about which I so often complain. Lots of organizations are spending a lot of money trying to overcome these problems, but the political and economic topography producing them does not readily yield. The Internet is part of this complex adaptive system, not a space outside it, and its power to transform that system is neither as strong nor as fast-acting as many of us—especially in the richer and freer parts of the world—presume.

To Realize the QDDR’s Early-Warning Goal, Invest in Data-Making

The U.S. Department of State dropped its second Quadrennial Diplomacy and Development Review, or QDDR, last week (here). Modeled on the Defense Department’s Quadrennial Defense Review, the QDDR lays out the department’s big-picture concerns and objectives so that—in theory—they can guide planning and shape day-to-day decision-making.

The new QDDR establishes four main goals, one of which is to “strengthen our ability to prevent and respond to internal conflict, atrocities, and fragility.” To help do that, the State Department plans to “increase [its] use of early warning analysis to drive early action on fragility and conflict.” Specifically, State says it will:

  1. Improve our use of tools for analyzing, tracking, and forecasting fragility and conflict, leveraging improvements in analytical capabilities;
  2. Provide more timely and accurate assessments to chiefs of mission and senior decision-makers;
  3. Increase use of early warning data and conflict and fragility assessments in our strategic planning and programming;
  4. Ensure that significant early warning shifts trigger senior-level review of the mission’s strategy and, if necessary, adjustments; and
  5. Train and deploy conflict-specific diplomatic expertise to support countries at risk of conflict or atrocities, including conflict negotiation and mediation expertise for use at posts.

Unsurprisingly, that plan sounds great to me. We can’t now and never will be able to predict precisely where and when violent conflict and atrocities will occur, but we can assess risks with enough accuracy and lead time to enable better strategic planning and programming. These forecasts don’t have to be perfect to be earlier, clearer, and more reliable than the traditional practices of deferring to individual country or regional analysts or just reacting to the news.

Of course, quite a bit of well-designed conflict forecasting is already happening, much of it paid for by the U.S. government. To name a few of the relevant efforts: The Political Instability Task Force (PITF) and the Worldwide Integrated Conflict Early Warning System (W-ICEWS) routinely update forecasts of various forms of political crisis for U.S. government customers. IARPA’s Open Source Indicators (OSI) and Aggregative Contingent Estimation (ACE) programs are simultaneously producing forecasts now and discovering ways to make future forecasts even better. Meanwhile, outside the U.S. government, the European Union has recently developed its own Global Conflict Risk Index (GCRI), and the Early Warning Project now assesses risks of mass atrocities in countries worldwide.

That so much thoughtful risk assessment is being done now doesn’t mean it’s a bad idea to start new projects. If there are any iron laws of forecasting hard-to-predict processes like political violence, one of them is that combinations of forecasts from numerous sources should be more accurate than forecasts from a single model or person or framework. Some of the existing projects already do this kind of combining themselves, but combinations of combinations will often be even better.

Still, if I had to channel the intention expressed in this part of the QDDR into a single activity, it would not be the construction of new models, at least not initially. Instead, it would be data-making. Social science is not Newtonian physics, but it’s not astrology, either. Smart people have been studying politics for a long time, and collectively they have developed a fair number of useful ideas about what causes or precedes violent conflict. But, if you can’t track the things those theorists tell you to track, then your forecasts are going to suffer. To improve significantly on the predictive models of political violence we have now, I think we need better inputs most of all.

When I say “better” inputs, I have a few things in mind. In some cases, we need to build data sets from scratch. When I was updating my coup forecasts earlier this year, a number of people wondered why I didn’t include measures of civil-military relations, which are obviously relevant to this particular risk. The answer was simple: because global data on that topic don’t exist. If we aren’t measuring it, we can’t use it in our forecasts, and the list of relevant features that falls into this set is surprisingly long.

In other cases, we need to revive data sets that have gone dormant. Social scientists often build “boutique” data sets for specific research projects, run the tests they want to run on them, and then move on to the next project. Sometimes, the tests they or others run suggest that some features captured in those data sets would make useful predictors. Those discoveries are great in principle, but if those data sets aren’t being updated, then applied forecasters can’t use that knowledge. To get better forecasts, we need to invest in picking up where those boutique data sets left off so we can incorporate their insights into our applications.

Finally, and in almost all cases, we need to observe things more frequently. Most of the data available now to most conflict forecasters are only updated once each year, often on a several-month delay and sometimes as much as two years later (e.g., data describing 2014 become available in 2016). That schedule is fine for basic research, but it is crummy for applied forecasting. If we want to be able to give assessments and warnings that are as current as possible to those “chiefs of mission and senior decision-makers” mentioned in the QDDR, then we need to build models with data that are updated as frequently as possible. Daily or weekly updates would be ideal, but monthly ones would suffice in many cases and would mark a huge improvement over the status quo.

As I said at the start, we’re never going to get models that reliably tell us far in advance exactly where and when violent conflicts and mass atrocities will erupt. I am confident, however, that we can assess these risks even more accurately than we do now, but only if we start making more, and better versions, of the data our theories tell us we need.

I’ll end with a final plea to any public servants who might be reading this: if you do invest in developing better inputs, please make the results freely available to the public. When you share your data, you give the crowd a chance to help you spot and fix your mistakes, to experiment with various techniques, and to think about what else you might consider, all at no additional cost to you. What’s not to like about that?

Waiting for Data-dot

A suburban house. A desk cluttered with papers, headphones, stray cables, and a pair of socks. Dawn.

Jay, sitting at the desk, opens a browser tab and clicks on a favorited site to see if a data set he needs to produce forecasts has been updated yet. It has not. He pauses, slurps coffee from a large mug, and tries another site. As before.

JAY: (giving up again). Nothing to be done.

 

[Apologies to Samuel Beckett.]

 

 

Down the Country-Month Rabbit Hole

Some big things happened in the world this week. Iran and the P5+1 agreed on a framework for a nuclear deal, and the agreement looks good. In a presidential election in Nigeria—the world’s seventh–most populous country, and one that few observers would have tagged as a democracy before last weekend—incumbent Goodluck Jonathan lost and then promptly and peacefully conceded defeat. The trickle of countries joining China’s new Asian Infrastructure Investment Bank turned into a torrent.

All of those things happened, but you won’t read more about them here, because I have spent the better part of the past week down a different rabbit hole. Last Friday, after years of almosts and any-time-nows, the event data produced for the Integrated Conflict Early Warning System (ICEWS) finally landed in the public domain, and I have been busy trying to figure out how to put them to use.

ICEWS isn’t the first publicly available trove of political event data, but it compares favorably to the field’s first mover, GDELT, and it currently covers a much longer time span than the other recent entrant, Phoenix.

The public release of ICEWS is exciting because it opens the door wider to dynamic modeling of world politics. Right now, nearly all of the data sets employed in statistical studies of politics around the globe use country-years as their units of observation. That’s not bad if you’re primarily interested in the effects or predictive power of structural features, but it’s pretty awful for explaining and anticipating faster-changing phenomena, like social unrest or violent conflict. GDELT broke the lock on that door, but its high noise-to-signal ratio and the opacity of its coding process have deterred me from investing too much time in developing monitoring or forecasting systems that depend on it.

With ICEWS on the Dataverse, that changes. I think we now have a critical mass of data sets in the public domain that: a) reliably cover important topics for the whole world over many years; b) are routinely updated; and, crucially, c) can be parsed to the month or even the week or day to reward investments in more dynamic modeling. Other suspects fitting this description include:

  • The spell-file version of Polity, which measures national patterns of political authority;
  • Lists of coup attempts maintained by Jonathan Powell and Clayton Thyne (here) and the Center for Systemic Peace (here); and
  • The PITF Worldwide Atrocities Event Dataset, which records information about events involving the deliberate killing of five or more noncombatant civilians (more on it here).

We also have high-quality data sets on national elections (here) and leadership changes (here, described here) that aren’t routinely updated by their sources but would be relatively easy to code by hand for applied forecasting.

With ICEWS, there is, of course, a catch. The public version of the project’s event data set will be updated monthly, but on a one-year delay. For example, when the archive was first posted in March, it ran through February 2014. On April 1, the Lockheed team added March 2014. This delay won’t matter much for scholars doing retrospective analyses, but it’s a critical flaw, if not a fatal one, for applied forecasters who can’t afford to pay—what, probably hundreds of thousands of dollars?—for a real-time subscription.

Fortunately, we might have a workaround. Phil Schrodt has played a huge role in the creation of the field of machine-coded political event data, including GDELT and ICEWS, and he is now part of the crew building Phoenix. In a blog post published the day ICEWS dropped, Phil suggested that Phoenix and ICEWS data will probably look enough alike to allow substituting the former for the latter, perhaps with some careful calibration. As Phil says, we won’t know for sure until we have a wider overlap between the two and can see how well this works in practice, but the possibility is promising enough for me to dig in.

And what does that mean? Well, a week has now passed since ICEWS hit the Dataverse, and so far I have:

  • Written an R function that creates a table of valid country-months for a user-specified time period, to use as scaffolding in the construction and agglomeration of country-month data sets (a rough sketch of that idea appears after this list);
  • Written scripts that call that function and some others to ingest and then parse or aggregate the other data sets I mentioned to the country-month level;
  • Worked out a strategy, and written the code, to partition the data into training and test sets for a project on predicting violence against civilians; and
  • Spent a lot of time staring at the screen thinking about, and a little time coding, ways to aggregate, reduce, and otherwise pre-process the ICEWS events and Polity data for that work on violence against civilians and beyond.
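In case it’s useful to anyone else headed down this path, here’s that rough sketch of the scaffolding idea in base R. The function and object names are just placeholders, and my actual version has to deal with complications like countries entering and leaving the international system, but the core is not much more than an expand.grid call:

# A rough sketch of a country-month scaffolding function (hypothetical names;
# my actual version also handles countries entering and leaving the system)
country.month.frame <- function(countries, start, end) {
  months <- seq(as.Date(start), as.Date(end), by = "month")
  grid <- expand.grid(country = countries, month = months, stringsAsFactors = FALSE)
  grid[order(grid$country, grid$month), ]
}

# Example: a three-country scaffold for 2013-2014
head(country.month.frame(c("AFG", "IRQ", "SYR"), "2013-01-01", "2014-12-01"))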

What I haven’t done yet—T plus seven days and counting—is any modeling. How’s that for push-button, Big Data magic?

On Evaluating and Presenting Forecasts

On Saturday afternoon, at the International Studies Association‘s annual conference in New Orleans, I’m slated to participate in a round-table discussion with Patrick Brandt, Kristian Gleditsch, and Håvard Hegre on “assessing forecasts of (rare) international events.” Our chair, Andy Halterman, gave us three questions he’d like to discuss:

  1. What are the quantitative assessments of forecasting accuracy that people should use when they publish forecasts?
  2. What’s the process that should be used to determine whether a gain in a scoring metric represents an actual improvement in the model?
  3. How should model uncertainty and past model performance be conveyed to government or non-academic users of forecasts?

As I did for a Friday panel on alternative academic careers (here), I thought I’d use the blog to organize my thoughts ahead of the event and share them with folks who are interested but can’t attend the round table. So, here goes:

When assessing predictive power, we use perfection as the default benchmark. We ask, “Was she right?” or “How close to the true value did he get?”

In fields where predictive accuracy is already very good, this approach seems reasonable. When the objects of the forecasts are rare international events, however, I think this is a mistake, or at least misleading. It implies that perfection is attainable, and that distance from perfection is what we care about. In fact, approximating perfection is not a realistic goal in many fields, and what we really care about in those situations is distance from the available alternatives. In other words, I think we should always assess accuracy in comparative terms, not absolute ones. So, the question becomes: “Compared to what?”

I can think of two situations in which we’d want to forecast international events, and the ways we assess and describe the accuracy of the results will differ across the two. First, there is basic research, where the goal is to develop and test theory. This is what most scholars are doing most of the time, and here the benchmark should be other relevant theories. We want to compare predictive power across nested models or models representing competing hypotheses to see which version does a better job anticipating real-world behavior—and, by implication, explaining it.

Then, of course, there is applied research, where the goal is to support or improve some decision process. Policy-making and investing are probably the two most common ones. Here, the benchmark should be the status quo. What we want to know is: “How much does the proposed forecasting process improve on the one(s) used now?” If the status quo is unclear, that already tells you something important about the state of forecasting in that field—namely, that it probably isn’t great. Even in that case, though, I think it’s still important to pick a benchmark that’s more realistic than perfection. Depending on the rarity of the event in question, that will usually mean either random guessing (for frequent events) or base rates (for rare ones).

How we communicate our findings on predictive power will also differ across basic and applied research, or at least I think it should. This has less to do with the goals of the work than it does with the audiences at which they’re usually aimed. When the audience is other scholars, I think it’s reasonable to expect them to understand the statistics and, so, to use those. For frequent events, Brier or logarithmic scores are often best, whereas for rare events I find that AUC scores are usually more informative, and I know a lot of people like to use F-1 scores in this context, too.
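To make those metrics a little more concrete, here is a minimal sketch in R of how one might compute the 0–1 Brier score and an AUC for probabilistic forecasts of a rare binary event. The obs and pred vectors are made up for illustration, and the AUC comes from ‘roc.area’ in the verification package:

library(verification)  # for roc.area()

# Hypothetical forecasts (pred) and observed outcomes (obs) for ten cases
obs  <- c(0, 0, 1, 0, 0, 0, 0, 1, 0, 0)
pred <- c(0.05, 0.10, 0.60, 0.20, 0.02, 0.15, 0.08, 0.35, 0.25, 0.04)

brier <- mean((pred - obs)^2)   # 0-1 version; lower is better
auc <- roc.area(obs, pred)$A    # area under the ROC curve; higher is better
brier; auc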

In applied settings, though, we’re usually doing the work as a service to someone else who probably doesn’t know the mechanics of the relevant statistics and shouldn’t be expected to. In my experience, it’s a bad idea in these settings to try to educate your customer on the spot about things like Brier or AUC scores. They don’t need to know those statistics, and you’re liable to come across as aloof or even condescending if you presume to spend time teaching them. Instead, I’d recommend using the practical problem they’re asking you to help solve to frame your representation of your predictive power. Propose a realistic decision process—or, if you can, take the one they’re already using—and describe the results you’d get if you plugged your forecasts into it.

In applied contexts, people often will also want to know how your process performed on crucial cases and what events would have surprised it, so it’s good to be prepared to talk about those as well. These topics are germane to basic research, too, but crucial cases will be defined differently in the two contexts. For scholars, crucial cases are usually understood as most-likely and least-likely ones in relation to the theory being tested. For policy-makers and other applied audiences, the crucial cases are usually understood as the ones where surprise was or would have been costliest.

So that’s how I think about assessing and describing the accuracy of forecasts of the kinds of (rare) international events a lot of us study. Now, if you’ll indulge me, I’d like to close with a pedantic plea: Can we please reserve the terms “forecast” and “prediction” for statements about things that haven’t happened and not apply them to estimates we generate for cases with known outcomes?

This might seem like a petty concern, but it’s actually tied to the philosophy of knowledge that underpins science, or my understanding of it, anyway. Making predictions about things that haven’t already happened is a crucial part of the scientific method. To learn from prediction, we assume that a model’s forecasting power tells us something about its proximity to the “true” data-generating process. This assumption won’t always be right, but it’s proven pretty useful over the past few centuries, so I’m okay sticking with it for now. For obvious reasons, it’s much easier to make accurate “predictions” about cases with known outcomes than unknown ones, so the scientific value of the two endeavors is very different. In light of that fact, I think we should be as clear and honest with ourselves and our audiences as we can about which one we’re doing, and therefore how much we’re learning.

When we’re doing this stuff in practice, there are three basic modes: 1) in-sample fitting, 2) cross-validation (CV), and 3) forecasting. In-sample fitting is the least informative of the three and, in my opinion, really should only be used in exploratory analysis and should not be reported in finished work. It tells us a lot more about the method than the phenomenon of interest.

CV is usually more informative than in-sample fitting, but not always. Each iteration of CV on the same data set moves you a little closer to in-sample fitting, because you effectively train to the idiosyncrasies of your chosen test set. Using multiple iterations of CV may ameliorate this problem, but it doesn’t always eliminate it. And on topics where the available data have already been worked to death—as they have on many problems of interest to scholars of international affairs—cross-validation really isn’t much more informative than in-sample fitting unless you’ve got a brand-new data series you can throw at the task and are focused on learning about it.

True forecasting—making clear statements about things that haven’t happened yet and then seeing how they turn out—is uniquely informative in this regard, so I think it’s important to reserve that term for the situations where that’s actually what we’re doing. When we describe in-sample and cross-validation estimates as forecasts, we confuse our readers, and we risk confusing ourselves about how much we’re really learning.

Of course, that’s easier for some phenomena than it is for others. If your theory concerns the risk of interstate wars, for example, you’re probably (and thankfully) not going to get a lot of opportunities to test it through prediction. Rather than sweeping those issues under the rug, though, I think we should recognize them for what they are. They are not an excuse to elide the huge differences between prediction and fitting models to history. Instead, they are a big haymaker of a reminder that social science is especially hard—not because humans are uniquely unpredictable, but rather because we only have the one grand and always-running experiment to observe, and we and our work are part of it.

Demography and Democracy Revisited

Last spring on this blog, I used Richard Cincotta’s work on age structure to take another look at the relationship between democracy and “development” (here). In his predictive models of democratization, Rich uses variation in median age as a proxy for a syndrome of socioeconomic changes we sometimes call “modernization” and argues that “a country’s chances for meaningful democracy increase as its population ages.” Rich’s models have produced some unconventional predictions that have turned out well, and if you buy the scientific method, this apparent predictive power implies that the underlying theory holds some water.

Over the weekend, Rich sent me a spreadsheet with his annual estimates of median age for all countries from 1972 to 2015, so I decided to take my own look at the relationship between those estimates and the occurrence of democratic transitions. For the latter, I used a data set I constructed for PITF (here) that covers 1955–2010, giving me a period of observation running from 1972 to 2010. In this initial exploration, I focused specifically on switches from authoritarian rule to democracy, which are observed with a binary variable that covers all country-years where an autocracy was in place on January 1. That variable (rgjtdem) is coded 1 if a democratic regime came into being at some point during that calendar year and 0 otherwise. Between 1972 and 2010, 94 of those switches occurred worldwide. The data set also includes, among other things, a “clock” counting consecutive years of authoritarian rule and an indicator for whether or not the country has ever had a democratic regime before.

To assess the predictive power of median age and compare it to other measures of socioeconomic development, I used the base and caret packages in R to run 10 iterations of five-fold cross-validation on the following series of discrete-time hazard (logistic regression) models (a minimal sketch of the procedure appears after the list):

  • Base model. Any prior democracy (0/1), duration of autocracy (logged), and the product of the two.
  • GDP per capita. Base model plus the Maddison Project’s estimates of GDP per capita in 1990 Geary-Khamis dollars (here), logged.
  • Infant mortality. Base model plus the U.S. Census Bureau’s estimates of deaths under age 1 per 1,000 live births (here), logged.
  • Median age. Base model plus Cincotta’s estimates of median age, untransformed.
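Here is that minimal sketch of the cross-validation procedure. The data frame dat and the covariate names other than rgjtdem are hypothetical stand-ins, and my actual scripts differ in the details, but the basic loop looks something like this:

library(caret)         # for createMultiFolds()
library(verification)  # for roc.area()

# 10 iterations of five-fold CV for one specification ('dat' and the covariate
# names besides rgjtdem are placeholders)
set.seed(20150413)
folds <- createMultiFolds(dat$rgjtdem, k = 5, times = 10)
auc <- sapply(folds, function(train.idx) {
  fit <- glm(rgjtdem ~ anyprior * log(duration) + median.age,
             family = binomial, data = dat[train.idx, ])
  test <- dat[-train.idx, ]
  pred <- predict(fit, newdata = test, type = "response")
  roc.area(test$rgjtdem, pred)$A
})
mean(auc)  # average AUC across the 50 training/test splits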

The chart below shows density plots and averages of the AUC scores (computed with ‘roc.area’ from the verification package) for each of those models across the 10 iterations of five-fold CV. Contrary to the conventional assumption that GDP per capita is a useful predictor of democratic transitions—How many papers have you read that tossed this measure into the model as a matter of course?—I find that the model with the Maddison Project measure actually makes slightly less accurate predictions than the one with duration and prior democracy alone. More relevant to this post, though, the two demographic measures clearly improve the predictions of democratic transitions relative to the base model, and median age adds a smidgen more predictive signal than infant mortality.

[Figure: density plots of AUC scores, by model, across 10 iterations of five-fold cross-validation]

Of course, all of these things—national wealth, infant mortality rates, and age structures—have also been changing pretty steadily in a single direction for decades, so it’s hard to untangle the effects of the covariates from other features of the world system that are also trending over time. To try to address that issue and to check for nonlinearity in the relationship, I used Simon Wood’s mgcv package in R to estimate a semiparametric logistic regression model with smoothing splines for year and median age alongside the indicator of prior democracy and regime duration. Plots of the marginal effects of year and median age estimated from that model are shown below. As the left-hand plot shows, the time effect is really a hump in risk that started in the late 1980s and peaked sharply in the early 1990s; it is not the across-the-board post–Cold War increase that we often see covered in models with a dummy variable for years after 1991. More germane to this post, though, we still see a marginal effect from median age, even when accounting for those generic effects of time. Consistent with Cincotta’s argument and other things being equal, countries with higher median age are more likely to transition to democracy than countries with younger populations.

[Figure: estimated marginal effects of year and median age from the semiparametric model]
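For readers who want to see roughly what that semiparametric model looks like in code, here is a sketch using mgcv. As before, the data frame and variable names are placeholders rather than my actual objects:

library(mgcv)  # for gam() with smoothing splines

# Discrete-time hazard model with smooths for year and median age (a sketch;
# 'dat' and the variable names are hypothetical)
gam.mod <- gam(rgjtdem ~ anyprior + log(duration) + s(year) + s(median.age),
               family = binomial, data = dat)
plot(gam.mod, pages = 1)  # marginal smooths for year and median age
summary(gam.mod)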

I read these results as a partial affirmation of modernization theory—not the whole teleological and normative package, but the narrower empirical conjecture about a bundle of socioeconomic transformations that often co-occur and are associated with a higher likelihood of attempting and sustaining democratic government. Statistical studies of this idea (including my own) have produced varied results, but the analysis I’m describing here suggests that some of the null results may stem from the authors’ choice of measures. GDP per capita is actually a poor proxy for modernization; there are a number of ways countries can get richer, and not all of them foster (or are fostered by) the socioeconomic transformations that form the kernel of modernization theory (cf. Equatorial Guinea). By contrast, demographic measures like infant mortality rates and median age are more tightly coupled to those broader changes about which Seymour Martin Lipset originally wrote. And, according to my analysis, those demographic measures are also associated with a country’s propensity for democratic transition.

Shifting to the applied forecasting side, I think these results confirm that median age is a useful addition to models of regime transitions, and it seems to capture more information about those propensities than GDP (by a lot) and infant mortality (by a little). Like all slow-changing structural indicators, though, median age is a blunt instrument. Annual forecasts based on it alone would be pretty clunky, and longer-term forecasts would do well to consider other domestic and international forces that also shape (and are shaped by) these changes.

PS. If you aren’t already familiar with modernization theory and want more background, this ungated piece by Sheri Berman for Foreign Affairs is pretty good: “What to Read on Modernization Theory.”

PPS. The code I used for this analysis is now on GitHub, here. It includes a link to the folder on my Google Drive with all of the required data sets.

Why My Coup Risk Models Don’t Include Any Measures of National Militaries

For the past several years (here, here, here, and here), I’ve used statistical models estimated from country-year data to produce assessments of coup risk in countries worldwide. I rejigger the models a bit each time, but none of the models I’ve used so far has included specific features of countries’ militaries.

That omission strikes a lot of people as a curious one. When I shared this year’s assessments with the Conflict Research Group on Facebook, one group member posted this comment:

Why do none of the covariates feature any data on militaries? Seeing as militaries are the ones who stage the coups, any sort of predictive model that doesn’t account for the militaries themselves would seem incomplete.

I agree in principle. It’s the practical side that gets in the way. I don’t include features of national militaries in the models because I don’t have reliable measures of them with the coverage I need for this task.

To train and then apply these predictive models, I need fairly complete time series for all or nearly all countries of the world that extend back to at least the 1980s and have been updated recently enough to give me a useful input for the current assessment (see here for more on why that’s true). I looked again early this month and still can’t find anything like that on even the big stuff, like military budgets, size, and force structures. There are some series on this topic in the World Bank’s World Development Indicators (WDI) data set, but those series have a lot of gaps, and the presence of those gaps is correlated with other features of the models (e.g., regime type). Ditto for SIPRI. And, of course, those aren’t even the most interesting features for coup risk, like whether or not military promotions favor certain groups over others, or if there is a capable and purportedly loyal presidential guard.

But don’t take my word for it. Here’s what the Correlates of War Project says in the documentation for Version 4.0 of its widely-used data set (PDF) about its measure of military expenditures, one of two features of national militaries it tries to cover (the other is total personnel):

It was often difficult to identify and exclude civil expenditures from reported budgets of less developed nations. For many countries, including some major powers, published military budgets are a catch-all category for a variety of developmental and administrative expenses—public works, colonial administration, development of the merchant marine, construction, and improvement of harbor and navigational facilities, transportation of civilian personnel, and the delivery of mail—of dubious military relevance. Except when we were able to obtain finance ministry reports, it is impossible to make detailed breakdowns. Even when such reports were available, it proved difficult to delineate “purely” military outlays. For example, consider the case in which the military builds a road that facilitates troops movements, but which is used primarily by civilians. A related problem concerns those instances in which the reported military budget does not reflect all of the resources devoted to that sector. This usually happens when a nation tries to hide such expenditures from scrutiny; for instance, most Western scholars and military experts agree that officially reported post-1945 Soviet-bloc totals are unrealistically low, although they disagree on the appropriate adjustments.

And that’s just the part of the “Problems and Possible Errors” section about observing the numerator in a calculation that also requires a complicated denominator. And that’s for what is—in principle, at least—one of the most observable features of a country’s civil-military relations.

Okay, now let’s assume that problem magically disappears, and COW has nearly complete and reliable data on military expenditures. Now we want to use models trained on those data to estimate coup risk for 2015. Whoops: COW only runs through 2010! The World Bank and SIPRI get closer to the current year—observations through 2013 are available now—but there are missing values for lots of countries, and that missingness is caused by other predictors of coup risk, such as national wealth, armed conflict, and political regime type. For example, WDI has no data on military expenditures for Eritrea and North Korea ever, and the series for Central African Republic is patchy throughout and ends in 2010. If I wanted to include military expenditures in my predictive models, I could use multiple imputation to deal with these gaps in the training phase, but then how would I generate current forecasts for these important cases? I could make guesses, but how accurate could those guesses be for a case like Eritrea or North Korea, and then am I adding signal or noise to the resulting forecasts?

Of course, one of the luxuries of applied forecasting is that the models we use can lack important features and still “work.” I don’t need the model to be complete and its parameters to be true for the forecasts to be accurate enough to be useful. Still, I’ll admit that, as a social scientist by training, I find it frustrating to have to set aside so many intriguing ideas because we simply don’t have the data to try them.

Estimating NFL Team-Specific Home-Field Advantage

This morning, I tinkered a bit with my pro-football preseason team strength survey data from 2013 and 2014 to see what other simple things I might do to improve the accuracy of forecasts derived from future versions of them.

My first idea is to go beyond a generic estimate of home-field advantage—about 3 points, according to my and everyone else’s estimates—with team-specific versions of that quantity. The intuition is that some venues confer a bigger advantage than others. For example, I would guess that Denver enjoys a bigger home-field edge than most teams because their stadium is at high altitude. The Broncos live there, so they’re used to it, but visiting teams have to adapt, and that process supposedly takes about a day for every 1,000 feet over 3,000. Some venues are louder than others, and that noise is often dialed up when visiting teams would prefer some quiet. And so on.

To explore this idea, I’m using a simple hierarchical linear model to estimate team-specific intercepts after taking preseason estimates of relative team strength into account. The line of R code used to estimate the model requires the lme4 package and looks like this:

mod1 <- lmer(score.raw ~ wiki.diff + (1 | home_team), results)  # random intercept for each home team (lme4)

Where

score.raw = home_score - visitor_score
wiki.diff = home_wiki - visitor_wiki

Those wiki vectors are the team strength scores estimated from preseason pairwise wiki surveys. The ‘results’ data frame includes scores for all regular and postseason games from those two years so far, courtesy of devstopfix’s NFL results repository on GitHub (here). Because the net game and strength scores are both ordered home to visitor, we can read those random intercepts for each home team as estimates of team-specific home advantage. There are probably other sources of team-specific bias in my data, so those estimates are going to be pretty noisy, but I think it’s a reasonable starting point.

My initial results are shown in the plot below, which I get with these two lines of code, the second of which requires the lattice package:

ha1 <- ranef(mod1, condVar=TRUE)  # extract random intercepts and their conditional variances
dotplot(ha1)                      # dot plot of the team-specific intercepts (lattice)

Bear in mind that the generic (fixed) intercept is 2.7, so the estimated home-field advantage for each team is what’s shown in the plot plus that number. For example, these estimates imply that my Ravens enjoy a net advantage of about 3 points when they play in Baltimore, while their division-rival Bengals are closer to 6.

[Figure: estimated team-specific home-field advantages (random intercepts with conditional variances)]
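To turn those deviations into team-specific totals, you can add the fixed intercept back in. Here is a short sketch that builds on the mod1 object from above:

# Team-specific home-field advantage = generic (fixed) intercept + random intercept
generic <- fixef(mod1)["(Intercept)"]                 # about 2.7 points
team.dev <- ranef(mod1)$home_team[["(Intercept)"]]
home.adv <- data.frame(team = rownames(ranef(mod1)$home_team),
                       advantage = generic + team.dev)
home.adv[order(-home.adv$advantage), ]                # e.g., New England near the top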

In light of DeflateGate, I guess I shouldn’t be surprised to see the Pats at the top of the chart, almost a whole point higher than the second-highest team. Maybe their insanely low home fumble rate has something to do with it.* I’m also heartened to see relatively high estimates for Denver, given the intuition that started this exercise, and Seattle, which is often said to enjoy an unusually large home-field edge. At the same time, I honestly don’t know what to make of the exceptionally low estimates for DC and Jacksonville, who appear from these estimates to suffer a net home-field disadvantage. That strikes me as odd and undercuts my confidence in the results.

In any case, that’s how far my tinkering took me today. If I get really motivated, I might try re-estimating the model with just the 2013 data and then running the 2014 preseason survey scores through that model to generate “forecasts” that I can compare to the ones I got from the simple linear model with just the generic intercept (here). The point of the exercise was to try to get more accurate forecasts from simple models, and the only way to test that is to do it. I’m also trying to decide if I need to cross these team-specific effects with season-specific effects to try to control for differences across years in the biases in the wiki survey results when estimating these team-specific intercepts. But I’m not there yet.

* After I published this post, Michael Lopez helpfully pointed me toward a better take on the Patriots’ fumble rate (here), and Mo Patel observed that teams manage their own footballs on the road, too, so that particular tweak—if it really happened—wouldn’t have a home-field-specific effect.

Statistical Assessments of Coup Risk for 2015

Which countries around the world are more likely to see coup attempts in 2015?

For the fourth year in a row, I’ve used statistical models to generate one answer to that question, where a coup is defined more or less as a forceful seizure of national political authority by military or political insiders. (I say “more or less” because I’m blending data from two sources with slightly different definitions; see below for details.) A coup doesn’t need to succeed to count as an attempt, but it does need to involve public action; alleged plots and rumors of plots don’t qualify. Neither do insurgencies or foreign invasions, which by definition involve military or political outsiders. The heat map below shows variation in estimated coup risk for 2015, with countries colored by quintiles (fifths).

[Figure: heat map of estimated coup risk for 2015, by quintile]

The dot plot below shows the estimates and their 90-percent confidence intervals (CIs) for the 40 countries with the highest estimated risk. The estimates are the unweighted average of forecasts from two logistic regression models; more on those in a sec. To get CIs for estimates from those two models, I took a cue from a forthcoming article by Lyon, Wintle, and Burgman (fourth publication listed here; the version I downloaded last year has apparently been taken down, and I can’t find another) and just averaged the CIs from the two models.

[Figure: dot plot of estimated coup risk for 2015 with 90-percent confidence intervals, top 40 countries]
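For anyone curious about the mechanics of that averaging, here is a minimal sketch in R. It assumes two fitted logistic regression models (mod1 and mod2) and a data frame newdat of current covariate values, all of which are placeholders for my actual objects; the 90-percent intervals are computed on the link scale and back-transformed before averaging:

# Unweighted average of forecasts and approximate 90-percent CIs from two
# logistic regression models ('mod1', 'mod2', and 'newdat' are hypothetical)
ci90 <- function(mod, newdat) {
  p <- predict(mod, newdata = newdat, type = "link", se.fit = TRUE)
  data.frame(est = plogis(p$fit),
             lower = plogis(p$fit - 1.645 * p$se.fit),
             upper = plogis(p$fit + 1.645 * p$se.fit))
}
f1 <- ci90(mod1, newdat)
f2 <- ci90(mod2, newdat)
forecast <- (f1 + f2) / 2  # average the estimates and the interval bounds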

I’ve consistently used simple two- or three-model ensembles to generate these coup forecasts, usually pairing a logistic regression model with an implementation of Random Forests on the same or similar data. This year, I decided to use only a pair of logistic regression models representing somewhat different ideas about coup risk. Consistent with findings from other work in which I’ve been involved (here), k-fold cross-validation told me that Random Forests wasn’t really boosting forecast accuracy, and sticking to logistic regression makes it possible to get and average those CIs. The first model matches one I used last year, and it includes the following covariates:

  • Infant mortality rate. Deaths of children under age 1 per 1,000 live births, relative to the annual global median, logged. This measure primarily reflects national wealth but is also sensitive to variations in quality of life produced by things like corruption and inequality. (Source: U.S. Census Bureau)
  • Recent coup activity. A yes/no indicator of whether or not there have been any coup attempts in that country in the past five years. I’ve tried logged event counts and longer windows, but this simple version contains as much predictive signal as any other. (Sources: Center for Systemic Peace and Powell and Thyne)
  • Political regime type. Following Fearon and Laitin (here), a categorical measure differentiating between autocracies, anocracies, democracies, and other forms. (Source: Center for Systemic Peace, with hard-coded updates for 2014)
  • Regime durability. The “number of years since the last substantive change in authority characteristics (defined as a 3-point change in the POLITY score).” (Source: Center for Systemic Peace, with hard-coded updates for 2014)
  • Election year. A yes/no indicator for whether or not any national elections (executive, legislative, or general) are scheduled to take place during the forecast year. (Source: NELDA, with hard-coded updates for 2011–2015)
  • Economic growth. The previous year’s annual GDP growth rate. To dampen the effects of extreme values on the model estimates, I take the square root of the absolute value and then multiply that by -1 for cases where the raw value is less than 0; a small sketch of this transformation appears after this list. (Source: IMF)
  • Political salience of elite ethnicity. A yes/no indicator for whether or not the ethnic identity of national leaders is politically salient. (Source: PITF, with hard-coded updates for 2014)
  • Violent civil conflict. A yes/no indicator for whether or not any major armed civil or ethnic conflict is occurring in the country. (Source: Center for Systemic Peace, with hard-coded updates for 2014)
  • Country age. Years since country creation or independence, logged. (Source: me)
  • Coup-tagion. Two variables representing (logged) counts of coup attempts during the previous year in other countries around the world and in the same geographic region. (Source: me)
  • Post–Cold War period. A binary variable marking years after the disintegration of the USSR in 1991.
  • Colonial heritage. Three separate binary indicators identifying countries that were last colonized by Great Britain, France, or Spain. (Source: me)
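Here is the small sketch of that growth transformation promised above, using made-up values:

# Signed square root of GDP growth, to dampen the influence of extreme values
growth <- c(-12.4, -2.1, 0, 3.5, 26.0)         # hypothetical annual growth rates
growth.tr <- sign(growth) * sqrt(abs(growth))  # equivalent to the rule described above
growth.tr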

The second model takes advantage of new data from Geddes, Wright, and Frantz on autocratic regime types (here) to consider how qualitative differences in political authority structures and leadership might shape coup risk—both directly, and indirectly by mitigating or amplifying the effects of other things. Here’s the full list of covariates in this one:

  • Infant mortality rate. Deaths of children under age 1 per 1,000 live births, relative to the annual global median, logged. This measure primarily reflects national wealth but is also sensitive to variations in quality of life produced by things like corruption and inequality. (Source: U.S. Census Bureau)
  • Recent coup activity. A yes/no indicator of whether or not there have been any coup attempts in that country in the past five years. I’ve tried logged event counts and longer windows, but this simple version contains as much predictive signal as any other. (Sources: Center for Systemic Peace and Powell and Thyne)
  • Regime type. Using the binary indicators included in the aforementioned data from Geddes, Wright, and Frantz with hard-coded updates for the period 2011–2014, a series of variables differentiating between the following:
    • Democracies
    • Military autocracies
    • One-party autocracies
    • Personalist autocracies
    • Monarchies
  • Regime duration. Number of years since the last change in political regime type, logged. (Source: Geddes, Wright, and Frantz, with hard-coded updates for the period 2011–2014)
  • Regime type * regime duration. Interactions to condition the effect of regime duration on regime type.
  • Leader’s tenure. Number of years the current chief executive has held that office, logged. (Source: PITF, with hard-coded updates for 2014)
  • Regime type * leader’s tenure. Interactions to condition the effect of leader’s tenure on regime type.
  • Election year. A yes/no indicator for whether or not any national elections (executive, legislative, or general) are scheduled to take place during the forecast year. (Source: NELDA, with hard-coded updates for 2011–2015)
  • Regime type * election year. Interactions to condition the effect of election years on regime type.
  • Economic growth. The previous year’s annual GDP growth rate. To dampen the effects of extreme values on the model estimates, I take the square root of the absolute value and then multiply that by -1 for cases where the raw value is less than 0. (Source: IMF)
  • Regime type * economic growth. Interactions to condition the effect of economic growth on regime type.
  • Post–Cold War period. A binary variable marking years after the disintegration of the USSR in 1991.

As I’ve done for the past couple of years, I used event lists from two sources—the Center for Systemic Peace (about halfway down the page here) and Jonathan Powell and Clayton Thyne (Dataset 3 here)—to generate the historical data on which those models were trained. Country-years are the unit of observation in this analysis, so a country-year is scored 1 if either CSP or P&T saw any coup attempts there during those 12 months and 0 otherwise. The plot below shows annual counts of successful and failed coup attempts in countries worldwide from 1946 through 2014 according to the two data sources. There is a fair amount of variance in the annual counts and the specific events that comprise them, but the basic trend over time is the same. The incidence of coup attempts rose in the 1950s; spiked in the early 1960s; remained relatively high throughout the rest of the Cold War; declined in the 1990s, after the Cold War ended; and has remained relatively low throughout the 2000s and 2010s.

Annual counts of coup events worldwide from two data sources, 1946-2014
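As a small illustration of how that ground truth gets built, here is a sketch of the scoring step. The csp and pt objects are hypothetical data frames of coup events with country and year columns; my actual scripts also have to reconcile country names and other messiness:

# Score a country-year 1 if either source recorded a coup attempt there
countries <- sort(union(csp$country, pt$country))
cy <- expand.grid(country = countries, year = 1946:2014, stringsAsFactors = FALSE)
cy$any.coup <- as.integer(paste(cy$country, cy$year) %in% paste(csp$country, csp$year) |
                          paste(cy$country, cy$year) %in% paste(pt$country, pt$year))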

I’ve been posting annual statistical assessments of coup risk on this blog since early 2012; see here, here, and here for the previous three iterations. I have rejiggered the modeling a bit each year, but the basic process (and the person designing and implementing it) has remained the same. So, how accurate have these forecasts been?

The table below reports areas under the ROC curve (AUC) and Brier scores (the 0–1 version) for the forecasts from each of those years and their averages, using the two coup event data sources alone and together as different versions of the observed ground truth. Focusing on the “either” columns, because that’s what I’m usually using when estimating the models, we can see that the average accuracy—AUC in the low 0.80s and Brier score of about 0.03—is comparable to what we see in many other country-year forecasts of rare political events using a variety of modeling techniques (see here). With the AUC, we can also see a downward trend over time. With so few events involved, though, three years is too few to confidently deduce a trend, and those averages are consistent with what I typically see in k-fold cross-validation. So, at this point, I suspect those swings are just normal variation.

AUC and Brier scores for coup forecasts posted on Dart-Throwing Chimp, 2012-2014, by coup event data source

The separation plot designed by Greenhill, Ward, and Sacks (here) offers a nice way to visualize the accuracy of these forecasts. The ones below show the three annual slices using the “either” version of the outcome, and they reinforce the story told in the table: the forecasts have correctly identified most of the countries that saw coup attempts in the past three years as relatively high-risk cases, but the accuracy has declined over time. Let’s define a surprise as a case that fell outside the top 30 of the ordered forecasts but still saw a coup attempt. In 2012, just one of four countries that saw coup attempts was a surprise: Papua New Guinea, ranked 48. In 2013, that number increased to two of five (Eritrea at 51 and Egypt at 58), and in 2014 it rose to three of five (Burkina Faso at 42, Ukraine at 57, and the Gambia at 68). Again, though, the average accuracy across the three years is consistent with what I typically see in k-fold cross-validation of these kinds of models in the historical data, so I don’t think we should make too much of that apparent time trend just yet.

[Figures: separation plots for the 2012, 2013, and 2014 coup forecasts]

This year, for the first time, I am also running an experiment in crowdsourcing coup risk assessments by way of a pairwise wiki survey (survey here, blog post explaining it here, and preliminary results discussed here). My long-term goal is to repeat this process numerous times on this topic and some others (for example, onsets of state-led mass killing episodes) to see how the accuracy of the two approaches compares and how their output might be combined. Statistical forecasts are usually much more accurate than human judgment, but that advantage may be reduced or eliminated when we aggregate judgments from large and diverse crowds, or when we don’t have data on important features to use in those statistical models. Models that use annual data also suffer in comparison to crowdsourcing processes that can update continuously, as that wiki survey does (albeit with a lot of inertia).

We can’t incorporate the output from that wiki survey into the statistical ensemble, because the survey doesn’t generate predicted probabilities; it only assesses relative risk. We can, however, compare the rank orderings the two methods produce. The plot below juxtaposes the rankings produced by the statistical models (left) with the ones from the wiki survey (right). About 500 votes have been cast since I wrote up the preliminary results, but I’m going to keep things simple for now and use the preliminary survey results I already wrote up. The colored arrows identify cases ranked at least 10 spots higher (red) or lower (blue) by the crowd than the statistical models. As the plot shows, there are many differences between the two, even toward the top of the rankings where the differences in statistical estimates are bigger and therefore more meaningful. For example, the crowd sees Nigeria, Libya, and Venezuela as top 10 risks while the statistical models do not; of those three, only Nigeria ranks in the top 30 on the statistical forecasts. Meanwhile, the crowd pushes Niger and Guinea-Bissau out of the top 10 down to the 20s, and it sees Madagascar, Afghanistan, Egypt, and Ivory Coast as much lower risks than the models do. Come 2016, it will be interesting to see which version was more accurate.

[Figure: comparison of 2015 coup risk rankings from the statistical models and the wiki survey]

If you are interested in getting hold of the data or R scripts used to produce these forecasts and figures, please send me an email at ulfelder at gmail dot com.
