A Plea for More Prediction

The second Annual Bank Conference on Africa happened in Berkeley, CA, earlier this week, and the World Bank’s Development Impact blog has an outstanding summary of the 50-odd papers presented there. If you have to pick between reading this post and that one, go there.

One paper on that roster that caught my eye revisits the choice of statistical models for the study of civil wars. As authors John Paul Dunne and Nan Tian describe, the default choice is logistic regression, although probit gets a little playing time, too. They argue, however, that a zero-inflated Poisson (ZIP) model matches the data-generating process better than either of these traditional picks, and they show that this choice affects what we learn about the causes of civil conflict.

Having worked on statistical models of civil conflict for nearly 20 years, I have some opinions on that model-choice issue, but those aren’t what I want to discuss right now. Instead, I want to wonder aloud why more researchers don’t use prediction as the yardstick—or at least one of the yardsticks—for adjudicating these model comparisons.

In their paper, Dunne and Tian stake their claim about the superiority of ZIP to logit and probit on comparisons of Akaike information criteria (AIC) and Vuong tests. Okay, but if their goal is to see if ZIP fits the underlying data-generating process better than those other choices, what better way to find out than by comparing out-of-sample predictive power?

Prediction is fundamental to the accumulation of scientific knowledge. The better we understand why and how something happens, the more accurate our predictions of it should be. When we estimate models from observational data and only look at how well our models fit the data from which they were estimated, we learn some things about the structure of that data set, but we don’t learn how well those things generalize to other relevant data sets. If we believe that the world isn’t deterministic—that the observed data are just one of many possible realizations of the world—then we need to care about that ability to generalize, because that generalization and the discovery of its current limits is the heart of the scientific enterprise.

From a scientific standpoint, the ideal world would be one in which we could estimate models representing rival theories, then compare the accuracy of the predictions they generate across a large number of relevant “trials” as they unfold in real time. That’s difficult for scholars studying big but rare events like civil wars and wars between states, though; a lot of time has to pass before we’ll see enough new examples to make a statistically powerful comparison across models.

But, hey, there’s an app for that: cross-validation! Instead of using all the data in the initial estimation, hold some out to use as a test set for the models we get from the rest. Better yet, split the data into several equally sized folds and then iterate the training and testing so that each fold serves once as the test set while the others are used for training (k-fold cross-validation). Even better, repeat that process a bunch of times with different random splits and compare distributions of the resulting statistics.
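
For readers who want to see the mechanics, here is a minimal sketch of repeated k-fold cross-validation in plain Python. Everything in it is an assumption made for illustration: the data are simulated, the two toy "models" (a base-rate model and a grouped-rate model) stand in for rival specifications, and the function names are hypothetical. It is not the procedure used in any of the papers mentioned here.

```python
import random
from statistics import mean


def k_fold_splits(n, k, rng):
    """Shuffle indices 0..n-1 and yield (train, test) index lists for k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def brier(probs, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return mean((p - y) ** 2 for p, y in zip(probs, outcomes))


def repeated_cv(x, y, fit, k=5, repeats=20, seed=1):
    """Return one out-of-sample Brier score per repeat, pooled over the k folds."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        probs = [None] * len(y)
        for train, test in k_fold_splits(len(y), k, rng):
            predict = fit([x[j] for j in train], [y[j] for j in train])
            for j in test:
                probs[j] = predict(x[j])  # every case is predicted out of sample
        scores.append(brier(probs, y))
    return scores


def fit_base_rate(xs, ys):
    """Toy model 1: ignore x and predict the training base rate."""
    rate = mean(ys)
    return lambda x: rate


def fit_by_group(xs, ys):
    """Toy model 2: predict the training event rate within each value of x."""
    overall = mean(ys)
    rates = {g: mean(yy for xx, yy in zip(xs, ys) if xx == g) for g in set(xs)}
    return lambda x: rates.get(x, overall)


# Simulated data (an assumption): a binary x shifts the risk from 0.05 to 0.30.
rng = random.Random(42)
x = [rng.randint(0, 1) for _ in range(400)]
y = [1 if rng.random() < (0.30 if xi else 0.05) else 0 for xi in x]

base_scores = repeated_cv(x, y, fit_base_rate)
group_scores = repeated_cv(x, y, fit_by_group)
print(round(mean(base_scores), 4), round(mean(group_scores), 4))
```

Each list holds 20 out-of-sample Brier scores, one per repeat, so the two models can be compared on distributions rather than on a single split. Because x really does drive risk in the simulated data, the grouped model should come out with the lower (better) scores.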

Prediction is the gold standard in most scientific fields, and cross-validation is standard practice in many areas of applied forecasting, because they are more informative than in-sample tests. For some reason, political science still mostly eschews both.* Here’s hoping that changes soon.

* For some recent exceptions to this rule on topics in world politics, see Ward, Greenhill, and Bakke and Blair, Blattman, and Hartman on predicting civil conflict; Chadefaux on warning signs of interstate war; Hill and Jones on state repression; and Chenoweth and me on the onset of nonviolent campaigns.

An Applied Forecaster’s Bad Dream

This is the sort of thing that freaks me out every time I’m getting ready to deliver or post a new set of forecasts:

In its 2015 States of Fragility report, the Organization for Economic Co-operation and Development (OECD) decided to complicate its usual one-dimensional list of fragile states by assessing five dimensions of fragility: Violence, Justice, Institutions, Economic Foundations and Resilience…

Unfortunately, something went wrong during the calculations. In my attempts to replicate the assessment, I found that the OECD misclassified a large number of states.

That’s from a Monkey Cage post by Thomas Leo Scherer, published today. Here, per Scherer, is why those errors matter:

Recent research by Judith Kelley and Beth Simmons shows that international indicators are an influential policy tool. Indicators focus international attention on low performers to positive and negative effect. They cause governments in poorly ranked countries to take action to raise their scores when they realize they are being monitored or as domestic actors mobilize and demand change after learning how they rate versus other countries. Given their potential reach, indicators should be handled with care.

For individuals or organizations involved in scientific or public endeavors, the best way to mitigate that risk is transparency. We can and should argue about concepts, measures, and model choices, but given a particular set of those elements, we should all get essentially the same results. When one or more of those elements is hidden, we can’t fully understand what the reported results represent, and researchers who want to improve the design by critiquing and perhaps extending it are forced to box shadows. Also, individuals and organizations can double- and triple-check their own work, but errors are almost inevitable. When getting the best possible answers matters more than the risk of being seen making mistakes, then transparency is the way to go. This is why the Early Warning Project shares the data and code used to produce its statistical risk assessments in a public repository, and why Reinhart and Rogoff probably (hopefully?) wish they’d done something similar.

Of course, even though transparency improves the probability of catching errors and improving on our designs, it doesn’t automatically produce those goods. What’s more, we can know that we’re doing the right thing and still dread the public discovery of an error. Add to that risk the near-certainty of other researchers scoffing at your terrible code, and it’s easy to see why even the best practices won’t keep you from breaking out in a cold sweat each time you hit “Send” or “Publish” on a new piece of work.

 

The Myth of Comprehensive Data

“What about using Twitter sentiment?”

That suggestion came to me from someone at a recent Data Science DC meetup, after I’d given a short talk on assessing risks of mass atrocities for the Early Warning Project, and as the next speaker started his presentation on predicting social unrest. I had devoted the first half of my presentation to a digression of sorts, talking about how the persistent scarcity of relevant public data still makes it impossible to produce global forecasts of rare political crises—things like coups, insurgencies, regime breakdowns, and mass atrocities—that are as sharp and dynamic as we would like.

The meetup wasn’t the first time I’d heard that suggestion, and I think all of the well-intentioned people who have made it to me have believed that data derived from Twitter would escape or overcome those constraints. In fact, the Twitter stream embodies them. Over the past two decades, technological, economic, and political changes have produced an astonishing surge in the amount of information available from and about the world, but that surge has not occurred evenly around the globe.

Think of the availability of data as plant life in a rugged landscape, where dry peaks are places of data scarcity and fertile valleys represent data-rich environments. The technological developments of the past 20 years are like a weather pattern that keeps dumping more and more rain on that topography. That rain falls unevenly across the landscape, however, and it doesn’t have the same effect everywhere it lands. As a result, plants still struggle to grow on many of those rocky peaks, and much of the new growth occurs where water already collected and flora were already flourishing.

The Twitter stream exemplifies this uneven distribution of data in a couple of important ways. Take a look at the map below, a screenshot I took after letting Tweetping run for about 16 hours spanning May 6–7, 2015. The brighter the glow, the more Twitter activity Tweetping saw.

[Figure: Tweetping map of Twitter activity, 15:30 on May 6 to 08:05 on May 7, 2015]

Some of the spatial variation in that map reflects differences in the distribution of human populations, but not all of it. Here’s a map of population density, produced by Daysleeper using data from CIESIN (source). If you compare this one to the map of Twitter usage, you’ll see that they align pretty well in Europe, the Americas, and some parts of Asia. In Africa and other parts of Asia, though, not so much. If it were just a matter of population density, then India and eastern China should burn brightest, but they, and especially China, are relatively dark compared to “the West.” Meanwhile, in Africa, we see pockets of activity, but there are whole swathes of the continent that are populated as densely as or more densely than the brighter parts of South America, but from which we see virtually no Twitter activity.

[Figure: world population density map]

So why are some pockets of human settlement less visible than others? Two forces stand out: wealth and politics.

First and most obvious, access to Twitter depends on electricity and telecommunications infrastructure and gadgets and literacy and health and time, all of which are much scarcer in poorer parts of the world than they are in richer places. The map below shows lights at night, as seen from space by U.S. satellites 20 years ago and then mapped by NASA (source). These light patterns are sometimes used as a proxy for economic development (e.g., here).

[Figure: NASA map of Earth’s lights at night]

This view of the world helps explain some of the holes in our map of Twitter activity, but not all of them. For example, many of the densely populated parts of Africa don’t light up much at night, just as they don’t on Tweetping, because they lack the relevant infrastructure and power production. Even 20 years ago, though, India and China looked much brighter through this lens than they do on our Twitter usage map.

So what else is going on? The intensity and character of Twitter usage also depend on freedoms of information and speech (the ability and desire to access the platform and to speak openly on it), and this political layer keeps other areas in the dark in that Tweetping map. China, North Korea, Cuba, Ethiopia, Eritrea: if you’re trying to anticipate important political crises, these are all countries you would want to track closely, but Twitter is barely used or unavailable in all of them as a direct or indirect consequence of public policy. And, of course, there are also many places where Twitter is accessible and used but censorship distorts the content of the stream. For example, Saudi Arabia lights up pretty well on the Twitter-usage map, but it’s hard to imagine people speaking freely on it when a tweet can land you in prison.

Clearly, wealth and political constraints still strongly shape the view of the world we can get from new data sources like Twitter. Contrary to the heavily marketed myth of “comprehensive data,” poverty and repression continue to keep large swathes of the world out of our digital sight, or to distort the glimpses we get of them.

Unfortunately for efforts to forecast rare political crises, those two structural features that so strongly shape the production and quality of data also correlate with the risks we want to anticipate. The map below shows the Early Warning Project’s most recent statistical assessments of the risk of onsets of state-led mass-killing episodes. Now flash back to the visualization of Twitter usage above, and you’ll see that many of the countries colored most brightly on this map are among the darkest on that one. Even in 2015, the places about which we most need more information to sharpen our forecasts of rare political crises are the ones that are still hardest to see.

[Figure: Early Warning Project statistical risk assessments, world map, 2014]

Statistically, this is the second-worst of all possible worlds, the worst one being the total absence of information. Data are missing not at random, and the processes producing those gaps are the same ones that put places at greater risk of mass atrocities and other political calamities. This association means that models we estimate with those data will often be misleading. There are ways to mitigate these problems, but they aren’t necessarily simple, cheap, or effective, and that’s before we even start in on the challenges of extracting useful measures from something as heterogeneous and complex as the Twitter stream.
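
A small simulation can make the problem concrete. The sketch below is a stylized illustration under loudly stated assumptions: the "repression" index, the risk function, and the visibility function are all invented, and none of the numbers are estimates of anything real. The only point is that when the same factor raises risk and lowers the chance of being observed, the event rate among visible cases understates the true rate.

```python
import random
from statistics import mean

rng = random.Random(7)
cases = []
for _ in range(50000):
    repression = rng.random()  # hypothetical 0-1 index, uniform by assumption
    # Assumed risk function: event probability rises with repression.
    event = int(rng.random() < 0.02 + 0.10 * repression)
    # Assumed visibility function: the chance we observe a case falls with it.
    observed = rng.random() < 1.0 - 0.9 * repression
    cases.append((event, observed))

true_rate = mean(e for e, _ in cases)        # what we would like to know
seen_rate = mean(e for e, o in cases if o)   # what the visible data suggest

print(round(true_rate, 4), round(seen_rate, 4))
```

Under these assumptions the true rate is about 0.07, while the rate among observed cases is noticeably lower, because the highest-risk cases are the ones most likely to be missing. A model trained only on the visible cases inherits that distortion.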

So that’s what I see when I hear people suggest that social media or Google Trends or other forms of “digital exhaust” have mooted the data problems about which I so often complain. Lots of organizations are spending a lot of money trying to overcome these problems, but the political and economic topography producing them does not readily yield. The Internet is part of this complex adaptive system, not a space outside it, and its power to transform that system is neither as strong nor as fast-acting as many of us—especially in the richer and freer parts of the world—presume.

To Realize the QDDR’s Early-Warning Goal, Invest in Data-Making

The U.S. Department of State dropped its second Quadrennial Diplomacy and Development Review, or QDDR, last week (here). Modeled on the Defense Department’s Quadrennial Defense Review, the QDDR lays out the department’s big-picture concerns and objectives so that—in theory—they can guide planning and shape day-to-day decision-making.

The new QDDR establishes four main goals, one of which is to “strengthen our ability to prevent and respond to internal conflict, atrocities, and fragility.” To help do that, the State Department plans to “increase [its] use of early warning analysis to drive early action on fragility and conflict.” Specifically, State says it will:

  1. Improve our use of tools for analyzing, tracking, and forecasting fragility and conflict, leveraging improvements in analytical capabilities;
  2. Provide more timely and accurate assessments to chiefs of mission and senior decision-makers;
  3. Increase use of early warning data and conflict and fragility assessments in our strategic planning and programming;
  4. Ensure that significant early warning shifts trigger senior-level review of the mission’s strategy and, if necessary, adjustments; and
  5. Train and deploy conflict-specific diplomatic expertise to support countries at risk of conflict or atrocities, including conflict negotiation and mediation expertise for use at posts.

Unsurprisingly, that plan sounds great to me. We can’t now and never will be able to predict precisely where and when violent conflict and atrocities will occur, but we can assess risks with enough accuracy and lead time to enable better strategic planning and programming. These forecasts don’t have to be perfect to be earlier, clearer, and more reliable than the traditional practices of deferring to individual country or regional analysts or just reacting to the news.

Of course, quite a bit of well-designed conflict forecasting is already happening, much of it paid for by the U.S. government. To name a few of the relevant efforts: The Political Instability Task Force (PITF) and the Worldwide Integrated Conflict Early Warning System (W-ICEWS) routinely update forecasts of various forms of political crisis for U.S. government customers. IARPA’s Open Source Indicators (OSI) and Aggregative Contingent Estimation (ACE) programs are simultaneously producing forecasts now and discovering ways to make future forecasts even better. Meanwhile, outside the U.S. government, the European Union has recently developed its own Global Conflict Risk Index (GCRI), and the Early Warning Project now assesses risks of mass atrocities in countries worldwide.

That so much thoughtful risk assessment is being done now doesn’t mean it’s a bad idea to start new projects. If there are any iron laws of forecasting hard-to-predict processes like political violence, one of them is that combinations of forecasts from numerous sources should be more accurate than forecasts from a single model or person or framework. Some of the existing projects already do this kind of combining themselves, but combinations of combinations will often be even better.
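
The arithmetic behind that claim can be sketched with a simple unweighted average. The forecasts and outcomes below are made up for illustration, and the names (model_a, model_b, analyst) are hypothetical. With the Brier score, the average of several forecasts can never score worse than the average of their individual scores, because squared error is convex.

```python
from statistics import mean


def brier(probs, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return mean((p - y) ** 2 for p, y in zip(probs, outcomes))


# Invented outcomes and forecasts for eight cases (illustrative only).
outcomes = [0, 0, 1, 0, 1, 0, 0, 0]
forecasts = {
    "model_a": [0.10, 0.30, 0.60, 0.05, 0.40, 0.20, 0.10, 0.50],
    "model_b": [0.20, 0.10, 0.40, 0.30, 0.70, 0.05, 0.40, 0.10],
    "analyst": [0.05, 0.20, 0.80, 0.10, 0.30, 0.40, 0.20, 0.30],
}

# Unweighted average of the three forecasts, case by case.
combined = [mean(ps) for ps in zip(*forecasts.values())]

individual = {name: brier(ps, outcomes) for name, ps in forecasts.items()}
combined_score = brier(combined, outcomes)
average_individual = mean(list(individual.values()))

print(individual)
print(combined_score, average_individual)
```

The guarantee here is only that the combination does at least as well as the typical member. It is not guaranteed to beat the best single source, although in practice it often comes close without requiring us to know in advance which source that is.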

Still, if I had to channel the intention expressed in this part of the QDDR into a single activity, it would not be the construction of new models, at least not initially. Instead, it would be data-making. Social science is not Newtonian physics, but it’s not astrology, either. Smart people have been studying politics for a long time, and collectively they have developed a fair number of useful ideas about what causes or precedes violent conflict. But, if you can’t track the things those theorists tell you to track, then your forecasts are going to suffer. To improve significantly on the predictive models of political violence we have now, I think we need better inputs most of all.

When I say “better” inputs, I have a few things in mind. In some cases, we need to build data sets from scratch. When I was updating my coup forecasts earlier this year, a number of people wondered why I didn’t include measures of civil-military relations, which are obviously relevant to this particular risk. The answer was simple: because global data on that topic don’t exist. If we aren’t measuring it, we can’t use it in our forecasts, and the list of relevant features that falls into this set is surprisingly long.

In other cases, we need to revive them. Social scientists often build “boutique” data sets for specific research projects, run the tests they want to run on them, and then move on to the next project. Sometimes, the tests they or others run suggest that some features captured in those data sets would make useful predictors. Those discoveries are great in principle, but if those data sets aren’t being updated, then applied forecasters can’t use that knowledge. To get better forecasts, we need to invest in picking up where those boutique data sets left off so we can incorporate their insights into our applications.

Finally and in almost all cases, we need to observe things more frequently. Most of the data available now to most conflict forecasters is only updated once each year, often on a several-month delay and sometimes as much as two years later (e.g., data describing 2014 becomes available in 2016). That schedule is fine for basic research, but it is crummy for applied forecasting. If we want to be able to give assessments and warnings that are as current as possible to those “chiefs of mission and senior decision-makers” mentioned in the QDDR, then we need to build models with data that are updated as frequently as possible. Daily or weekly are ideal, but monthly updates would suffice in many cases and would mark a huge improvement over the status quo.

As I said at the start, we’re never going to get models that reliably tell us far in advance exactly where and when violent conflicts and mass atrocities will erupt. I am confident, however, that we can assess these risks even more accurately than we do now, but only if we start making more, and better versions, of the data our theories tell us we need.

I’ll end with a final plea to any public servants who might be reading this: if you do invest in developing better inputs, please make the results freely available to the public. When you share your data, you give the crowd a chance to help you spot and fix your mistakes, to experiment with various techniques, and to think about what else you might consider, all at no additional cost to you. What’s not to like about that?

Waiting for Data-dot

A suburban house. A desk cluttered with papers, headphones, stray cables, and a pair of socks. Dawn.

Jay, sitting at the desk, opens a browser tab and clicks on a favorited site to see if a data set he needs to produce forecasts has been updated yet. It has not. He pauses, slurps coffee from a large mug, and tries another site. As before.

JAY: (giving up again). Nothing to be done.

 

[Apologies to Samuel Beckett.]

 

 

Down the Country-Month Rabbit Hole

Some big things happened in the world this week. Iran and the P5+1 agreed on a framework for a nuclear deal, and the agreement looks good. In a presidential election in Nigeria (the world’s seventh-most populous country, and one that few observers would have tagged as a democracy before last weekend), incumbent Goodluck Jonathan lost and then promptly and peacefully conceded defeat. The trickle of countries joining China’s new Asian Infrastructure Investment Bank turned into a torrent.

All of those things happened, but you won’t read more about them here, because I have spent the better part of the past week down a different rabbit hole. Last Friday, after years of almosts and any-time-nows, the event data produced for the Integrated Conflict Early Warning System (ICEWS) finally landed in the public domain, and I have been busy trying to figure out how to put them to use.

ICEWS isn’t the first publicly available trove of political event data, but it compares favorably to the field’s first mover, GDELT, and it currently covers a much longer time span than the other recent entrant, Phoenix.

The public release of ICEWS is exciting because it opens the door wider to dynamic modeling of world politics. Right now, nearly all of the data sets employed in statistical studies of politics around the globe use country-years as their units of observation. That’s not bad if you’re primarily interested in the effects or predictive power of structural features, but it’s pretty awful for explaining and anticipating faster-changing phenomena, like social unrest or violent conflict. GDELT broke the lock on that door, but its high noise-to-signal ratio and the opacity of its coding process have deterred me from investing too much time in developing monitoring or forecasting systems that depend on it.

With ICEWS on the Dataverse, that changes. I think we now have a critical mass of data sets in the public domain that: a) reliably cover important topics for the whole world over many years; b) are routinely updated; and, crucially, c) can be parsed to the month or even the week or day to reward investments in more dynamic modeling. Other suspects fitting this description include:

  • The spell-file version of Polity, which measures national patterns of political authority;
  • Lists of coup attempts maintained by Jonathan Powell and Clayton Thyne (here) and the Center for Systemic Peace (here); and
  • The PITF Worldwide Atrocities Event Dataset, which records information about events involving the deliberate killing of five or more noncombatant civilians (more on it here).

We also have high-quality data sets on national elections (here) and leadership changes (here, described here) that aren’t routinely updated by their sources but would be relatively easy to code by hand for applied forecasting.

With ICEWS, there is, of course, a catch. The public version of the project’s event data set will be updated monthly, but on a one-year delay. For example, when the archive was first posted in March, it ran through February 2014. On April 1, the Lockheed team added March 2014. This delay won’t matter much for scholars doing retrospective analyses, but it’s a critical flaw, if not a fatal one, for applied forecasters who can’t afford to pay—what, probably hundreds of thousands of dollars?—for a real-time subscription.

Fortunately, we might have a workaround. Phil Schrodt has played a huge role in the creation of the field of machine-coded political event data, including GDELT and ICEWS, and he is now part of the crew building Phoenix. In a blog post published the day ICEWS dropped, Phil suggested that Phoenix and ICEWS data will probably look enough alike to allow substituting the former for the latter, perhaps with some careful calibration. As Phil says, we won’t know for sure until we have a wider overlap between the two and can see how well this works in practice, but the possibility is promising enough for me to dig in.

And what does that mean? Well, a week has now passed since ICEWS hit the Dataverse, and so far I have:

  • Written an R function that creates a table of valid country-months for a user-specified time period, to use as scaffolding in the construction and agglomeration of country-month data sets;
  • Written scripts that call that function and some others to ingest and then parse or aggregate the other data sets I mentioned to the country-month level;
  • Worked out a strategy, and written the code, to partition the data into training and test sets for a project on predicting violence against civilians; and
  • Spent a lot of time staring at the screen thinking about, and a little time coding, ways to aggregate, reduce, and otherwise pre-process the ICEWS events and Polity data for that work on violence against civilians and beyond.
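
The first item on that list, the country-month scaffold, might look something like the sketch below. The original function was written in R and is not shown here, so this is a Python stand-in with hypothetical names, a hypothetical input format, and a three-country example chosen only to show how entry and exit dates trim the table.

```python
def month_range(start, end):
    """Yield (year, month) tuples from start to end, inclusive."""
    y, m = start
    while (y, m) <= end:
        yield y, m
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)


def country_months(countries, start, end):
    """Build rows (code, year, month) for every month a country existed.

    countries maps a code to (first, last), each a (year, month) tuple;
    last is None for countries that still exist. This layout is an assumption.
    """
    rows = []
    for code, (born, died) in sorted(countries.items()):
        lo = max(born, start)
        hi = end if died is None else min(died, end)
        rows.extend((code, y, m) for y, m in month_range(lo, hi))
    return rows


# Illustrative entries only; a real table would come from a state-system list.
countries = {
    "USA": ((1776, 7), None),
    "SSD": ((2011, 7), None),         # South Sudan, independent July 2011
    "SUN": ((1922, 12), (1991, 12)),  # Soviet Union, dissolved December 1991
}

rows = country_months(countries, start=(2011, 1), end=(2011, 12))
print(len(rows), rows[0], rows[-1])
```

For 2011, this yields twelve rows for the United States, six for South Sudan, and none for the Soviet Union. Other data sets can then be merged onto those rows by country code, year, and month.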

What I haven’t done yet—T plus seven days and counting—is any modeling. How’s that for push-button, Big Data magic?

On Evaluating and Presenting Forecasts

On Saturday afternoon, at the International Studies Association’s annual conference in New Orleans, I’m slated to participate in a round-table discussion with Patrick Brandt, Kristian Gleditsch, and Håvard Hegre on “assessing forecasts of (rare) international events.” Our chair, Andy Halterman, gave us three questions he’d like to discuss:

  1. What are the quantitative assessments of forecasting accuracy that people should use when they publish forecasts?
  2. What’s the process that should be used to determine whether a gain in a scoring metric represents an actual improvement in the model?
  3. How should model uncertainty and past model performance be conveyed to government or non-academic users of forecasts?

As I did for a Friday panel on alternative academic careers (here), I thought I’d use the blog to organize my thoughts ahead of the event and share them with folks who are interested but can’t attend the round table. So, here goes:

When assessing predictive power, we use perfection as the default benchmark. We ask, “Was she right?” or “How close to the true value did he get?”

In fields where predictive accuracy is already very good, this approach seems reasonable. When the objects of the forecasts are rare international events, however, I think this is a mistake, or at least misleading. It implies that perfection is attainable, and that distance from perfection is what we care about. In fact, approximating perfection is not a realistic goal in many fields, and what we really care about in those situations is distance from the available alternatives. In other words, I think we should always assess accuracy in comparative terms, not absolute ones. So, the question becomes: “Compared to what?”

I can think of two situations in which we’d want to forecast international events, and the ways we assess and describe the accuracy of the results will differ across the two. First, there is basic research, where the goal is to develop and test theory. This is what most scholars are doing most of the time, and here the benchmark should be other relevant theories. We want to compare predictive power across nested models or models representing competing hypotheses to see which version does a better job anticipating real-world behavior and, by implication, explaining it.

Then, of course, there is applied research, where the goal is to support or improve some decision process. Policy-making and investing are probably the two most common ones. Here, the benchmark should be the status quo. What we want to know is: “How much does the proposed forecasting process improve on the one(s) used now?” If the status quo is unclear, that already tells you something important about the state of forecasting in that field—namely, that it probably isn’t great. Even in that case, though, I think it’s still important to pick a benchmark that’s more realistic than perfection. Depending on the rarity of the event in question, that will usually mean either random guessing (for frequent events) or base rates (for rare ones).
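
One common way to express "compared to what?" as a number is a skill score, which rescales a model's score against a chosen benchmark. The sketch below uses invented forecasts for twenty cases with two events and an assumed historical base rate of 0.10. The function names are hypothetical.

```python
from statistics import mean


def brier(probs, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return mean((p - y) ** 2 for p, y in zip(probs, outcomes))


def skill_score(probs, benchmark_probs, outcomes):
    """1 - BS_model / BS_benchmark: 0 matches the benchmark, 1 is perfect."""
    return 1 - brier(probs, outcomes) / brier(benchmark_probs, outcomes)


# Invented data: 18 non-events followed by 2 events.
outcomes = [0] * 18 + [1] * 2
model = [0.05] * 16 + [0.30] * 2 + [0.40, 0.50]

base_rate = [0.10] * 20  # assumed historical base rate
coin_flip = [0.50] * 20  # random guessing

print(round(skill_score(model, base_rate, outcomes), 3))  # → 0.539
print(round(skill_score(model, coin_flip, outcomes), 3))  # → 0.834
```

The same forecasts look far more impressive against a coin flip than against the base rate, which is why the choice of benchmark matters so much for rare events.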

How we communicate our findings on predictive power will also differ across basic and applied research, or at least I think it should. This has less to do with the goals of the work than it does with the audiences at which they’re usually aimed. When the audience is other scholars, I think it’s reasonable to expect them to understand the statistics and, so, to use those. For frequent events, Brier or logarithmic scores are often best, whereas for rare events I find that AUC scores are usually more informative, and I know a lot of people like to use F1 scores in this context, too.
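
For reference, here is a minimal sketch of three of those statistics in plain Python, using their standard definitions. The six forecasts in the example are invented, and in real work a tested library implementation would be the safer choice.

```python
import math


def brier(probs, outcomes):
    """Mean squared error of the probabilities; lower is better."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def log_score(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood; lower is better."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += math.log(p if y else 1 - p)
    return -total / len(outcomes)


def auc(probs, outcomes):
    """Share of (event, non-event) pairs in which the event got the higher
    forecast, with ties counted as half. Needs at least one of each."""
    pos = [p for p, y in zip(probs, outcomes) if y]
    neg = [p for p, y in zip(probs, outcomes) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Invented example: six cases, two events.
outcomes = [0, 0, 1, 0, 1, 0]
probs = [0.10, 0.40, 0.35, 0.20, 0.80, 0.05]

print(auc(probs, outcomes))  # → 0.875
print(round(brier(probs, outcomes), 4), round(log_score(probs, outcomes), 4))
```

AUC depends only on how the cases are ranked, so it is unaffected by how small the probabilities are, which is part of its appeal for rare events. Brier and log scores also reward calibration, that is, getting the probabilities themselves right.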

In applied settings, though, we’re usually doing the work as a service to someone else who probably doesn’t know the mechanics of the relevant statistics and shouldn’t be expected to. In my experience, it’s a bad idea in these settings to try to educate your customer on the spot about things like Brier or AUC scores. They don’t need to know those statistics, and you’re liable to come across as aloof or even condescending if you presume to spend time teaching them. Instead, I’d recommend using the practical problem they’re asking you to help solve to frame your representation of your predictive power. Propose a realistic decision process—or, if you can, take the one they’re already using—and describe the results you’d get if you plugged your forecasts into it.
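
As one illustration of that framing, suppose the customer can only afford to watch a handful of countries closely each year. The sketch below plugs forecasts into that decision rule and reports the result in the customer's own terms. The country labels, risk values, and onsets are all invented, and the function name is hypothetical.

```python
def watch_list_hits(risk, onsets, k):
    """Rank countries by forecast risk and count onsets among the top k."""
    ranked = sorted(risk, key=risk.get, reverse=True)
    watch_list = ranked[:k]
    hits = sum(1 for country in watch_list if country in onsets)
    return hits, watch_list


# Invented forecasts for ten hypothetical countries and the onsets that followed.
risk = {
    "A": 0.02, "B": 0.15, "C": 0.22, "D": 0.01, "E": 0.08,
    "F": 0.03, "G": 0.05, "H": 0.12, "I": 0.04, "J": 0.01,
}
onsets = {"C", "H"}

hits, watch_list = watch_list_hits(risk, onsets, k=3)
print(f"A watch list of {len(watch_list)} of {len(risk)} countries "
      f"would have included {hits} of {len(onsets)} onsets: {watch_list}")
```

A sentence like "watching our top three would have caught both onsets" is usually easier for a non-technical audience to weigh than an AUC score. It also invites the natural follow-up of running the same count on whatever list the customer uses now.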

In applied contexts, people often will also want to know how your process performed on crucial cases and what events would have surprised it, so it’s good to be prepared to talk about those as well. These topics are germane to basic research, too, but crucial cases will be defined differently in the two contexts. For scholars, crucial cases are usually understood as most-likely and least-likely ones in relation to the theory being tested. For policy-makers and other applied audiences, the crucial cases are usually understood as the ones where surprise was or would have been costliest.

So that’s how I think about assessing and describing the accuracy of forecasts of the kinds of (rare) international events a lot of us study. Now, if you’ll indulge me, I’d like to close with a pedantic plea: Can we please reserve the terms “forecast” and “prediction” for statements about things that haven’t happened and not apply them to estimates we generate for cases with known outcomes?

This might seem like a petty concern, but it’s actually tied to the philosophy of knowledge that underpins science, or my understanding of it, anyway. Making predictions about things that haven’t already happened is a crucial part of the scientific method. To learn from prediction, we assume that a model’s forecasting power tells us something about its proximity to the “true” data-generating process. This assumption won’t always be right, but it’s proven pretty useful over the past few centuries, so I’m okay sticking with it for now. For obvious reasons, it’s much easier to make accurate “predictions” about cases with known outcomes than unknown ones, so the scientific value of the two endeavors is very different. In light of that fact, I think we should be as clear and honest with ourselves and our audiences as we can about which one we’re doing, and therefore how much we’re learning.

When we’re doing this stuff in practice, there are three basic modes: 1) in-sample fitting, 2) cross-validation (CV), and 3) forecasting. In-sample fitting is the least informative of the three and, in my opinion, really should only be used in exploratory analysis and should not be reported in finished work. It tells us a lot more about the method than the phenomenon of interest.

CV is usually more informative than in-sample fitting, but not always. Each iteration of CV on the same data set moves you a little closer to in-sample fitting, because you effectively train to the idiosyncrasies of your chosen test set. Using multiple iterations of CV may ameliorate this problem, but it doesn’t always eliminate it. And on topics where the available data have already been worked to death—as they have on many problems of interest to scholars of international affairs—cross-validation really isn’t much more informative than in-sample fitting unless you’ve got a brand-new data series you can throw at the task and are focused on learning about it.

True forecasting—making clear statements about things that haven’t happened yet and then seeing how they turn out—is uniquely informative in this regard, so I think it’s important to reserve that term for the situations where that’s actually what we’re doing. When we describe in-sample and cross-validation estimates as forecasts, we confuse our readers, and we risk confusing ourselves about how much we’re really learning.

Of course, that’s easier for some phenomena than it is for others. If your theory concerns the risk of interstate wars, for example, you’re probably (and thankfully) not going to get a lot of opportunities to test it through prediction. Rather than sweeping those issues under the rug, though, I think we should recognize them for what they are. They are not an excuse to elide the huge differences between prediction and fitting models to history. Instead, they are a big haymaker of a reminder that social science is especially hard—not because humans are uniquely unpredictable, but rather because we only have the one grand and always-running experiment to observe, and we and our work are part of it.

Demography and Democracy Revisited

Last spring on this blog, I used Richard Cincotta’s work on age structure to take another look at the relationship between democracy and “development” (here). In his predictive models of democratization, Rich uses variation in median age as a proxy for a syndrome of socioeconomic changes we sometimes call “modernization” and argues that “a country’s chances for meaningful democracy increase as its population ages.” Rich’s models have produced some unconventional predictions that have turned out well, and if you buy the scientific method, this apparent predictive power implies that the underlying theory holds some water.

Over the weekend, Rich sent me a spreadsheet with his annual estimates of median age for all countries from 1972 to 2015, so I decided to take my own look at the relationship between those estimates and the occurrence of democratic transitions. For the latter, I used a data set I constructed for PITF (here) that covers 1955–2010, giving me a period of observation running from 1972 to 2010. In this initial exploration, I focused specifically on switches from authoritarian rule to democracy, which are observed with a binary variable that covers all country-years where an autocracy was in place on January 1. That variable (rgjtdem) is coded 1 if a democratic regime came into being at some point during that calendar year and 0 otherwise. Between 1972 and 2010, 94 of those switches occurred worldwide. The data set also includes, among other things, a “clock” counting consecutive years of authoritarian rule and an indicator for whether or not the country has ever had a democratic regime before.

To assess the predictive power of median age and compare it to other measures of socioeconomic development, I used the base and caret packages in R to run 10 iterations of five-fold cross-validation on the following series of discrete-time hazard (logistic regression) models:

  • Base model. Any prior democracy (0/1), duration of autocracy (logged), and the product of the two.
  • GDP per capita. Base model plus the Maddison Project’s estimates of GDP per capita in 1990 Geary-Khamis dollars (here), logged.
  • Infant mortality. Base model plus the U.S. Census Bureau’s estimates of deaths under age 1 per 1,000 live births (here), logged.
  • Median age. Base model plus Cincotta’s estimates of median age, untransformed.

The chart below shows density plots and averages of the AUC scores (computed with ‘roc.area’ from the verification package) for each of those models across the 10 iterations of five-fold CV. Contrary to the conventional assumption that GDP per capita is a useful predictor of democratic transitions—How many papers have you read that tossed this measure into the model as a matter of course?—I find that the model with the Maddison Project measure actually makes slightly less accurate predictions than the one with duration and prior democracy alone. More relevant to this post, though, the two demographic measures clearly improve the predictions of democratic transitions relative to the base model, and median age adds a smidgen more predictive signal than infant mortality.

[Figure: density plots and average AUC scores for each model across the 10 iterations of five-fold CV]
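For readers who want to try this kind of comparison themselves, here is a minimal sketch of repeated five-fold cross-validation in base R. It is not the code behind the chart above: the data are simulated, the column names (prior_dem, duration, median_age, transition) are hypothetical, and the AUC is computed with a small hand-rolled rank-sum function rather than with caret or 'roc.area'. It also pools the out-of-fold predictions within each repetition and returns one AUC per repetition, which is only one of several reasonable ways to score the folds.

```r
# Sketch only: simulated country-year data stand in for the real file.
set.seed(20150101)
n <- 4000
d <- data.frame(
  prior_dem  = rbinom(n, 1, 0.3),               # any prior democracy (0/1)
  duration   = sample(1:40, n, replace = TRUE), # consecutive years of autocracy
  median_age = runif(n, 15, 40)                 # hypothetical covariate values
)
# Simulated rare binary outcome; the coefficients are arbitrary assumptions.
eta <- -5 + 0.5 * d$prior_dem + 0.08 * d$median_age
d$transition <- rbinom(n, 1, plogis(eta))

# AUC via the rank-sum (Mann-Whitney) identity; ties get average ranks.
auc <- function(y, p) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Repeated k-fold CV: returns one pooled out-of-fold AUC per repetition.
cv_auc <- function(formula, data, k = 5, reps = 10) {
  y <- data[[all.vars(formula)[1]]]
  sapply(seq_len(reps), function(i) {
    fold <- sample(rep(seq_len(k), length.out = nrow(data)))
    p <- numeric(nrow(data))
    for (j in seq_len(k)) {
      fit <- glm(formula, family = binomial, data = data[fold != j, ])
      p[fold == j] <- predict(fit, newdata = data[fold == j, ],
                              type = "response")
    }
    auc(y, p)
  })
}

f_base <- transition ~ prior_dem * log(duration)
f_age  <- update(f_base, . ~ . + median_age)

auc_base <- cv_auc(f_base, d)
auc_age  <- cv_auc(f_age, d)
c(base = mean(auc_base), median_age = mean(auc_age))
```

Swapping in other covariates is a matter of writing another formula with update() and calling cv_auc() again; the vectors it returns can be passed to density() to draw plots like the ones described above.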

Of course, all of these things—national wealth, infant mortality rates, and age structures—have also been changing pretty steadily in a single direction for decades, so it’s hard to untangle the effects of the covariates from other features of the world system that are also trending over time. To try to address that issue and to check for nonlinearity in the relationship, I used Simon Wood’s mgcv package in R to estimate a semiparametric logistic regression model with smoothing splines for year and median age alongside the indicator of prior democracy and regime duration. Plots of the marginal effects of year and median age estimated from that model are shown below. As the left-hand plot shows, the time effect is really a hump in risk that started in the late 1980s and peaked sharply in the early 1990s; it is not the across-the-board post–Cold War increase that we often see covered in models with a dummy variable for years after 1991. More germane to this post, though, we still see a marginal effect from median age, even when accounting for those generic effects of time. Consistent with Cincotta’s argument and other things being equal, countries with higher median age are more likely to transition to democracy than countries with younger populations.

[Figure: marginal effects of year (left-hand plot) and median age from the semiparametric model]
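A model of this general shape can be sketched with mgcv as follows. This is an illustration under stated assumptions, not the specification behind the plots above: the data are simulated, the variable names are hypothetical, and the choices of default smoothing bases, REML smoothness selection, and logged duration are mine.

```r
library(mgcv)  # a recommended package that ships with standard R installs

# Simulated country-year data; every value here is made up for illustration.
set.seed(42)
n <- 3000
d <- data.frame(
  year       = sample(1972:2010, n, replace = TRUE),
  median_age = runif(n, 15, 40),
  prior_dem  = rbinom(n, 1, 0.3),
  duration   = sample(1:40, n, replace = TRUE)
)
# Outcome with a hump in risk around 1991 plus a linear age effect (assumed).
eta <- -4 + 1.5 * exp(-((d$year - 1991)^2) / 18) + 0.06 * (d$median_age - 25)
d$transition <- rbinom(n, 1, plogis(eta))

# Logistic GAM: smooth terms for year and median age, parametric terms
# for prior democracy and (logged) duration of autocracy.
fit <- gam(transition ~ s(year) + s(median_age) + prior_dem + log(duration),
           family = binomial, data = d, method = "REML")
summary(fit)
```

Calling `plot(fit, pages = 1, shade = TRUE)` on the fitted object draws the estimated smooths on the logit scale, which is one way to get panels like the ones described above.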

I read these results as a partial affirmation of modernization theory—not the whole teleological and normative package, but the narrower empirical conjecture about a bundle of socioeconomic transformations that often co-occur and are associated with a higher likelihood of attempting and sustaining democratic government. Statistical studies of this idea (including my own) have produced varied results, but the analysis I’m describing here suggests that some of the null results may stem from the authors’ choice of measures. GDP per capita is actually a poor proxy for modernization; there are a number of ways countries can get richer, and not all of them foster (or are fostered by) the socioeconomic transformations that form the kernel of modernization theory (cf. Equatorial Guinea). By contrast, demographic measures like infant mortality rates and median age are more tightly coupled to those broader changes about which Seymour Martin Lipset originally wrote. And, according to my analysis, those demographic measures are also associated with a country’s propensity for democratic transition.

Shifting to the applied forecasting side, I think these results confirm that median age is a useful addition to models of regime transitions, and it seems to capture more information about those propensities than GDP (by a lot) and infant mortality (by a little). Like all slow-changing structural indicators, though, median age is a blunt instrument. Annual forecasts based on it alone would be pretty clunky, and longer-term forecasts would do well to consider other domestic and international forces that also shape (and are shaped by) these changes.

PS. If you aren’t already familiar with modernization theory and want more background, this ungated piece by Sheri Berman for Foreign Affairs is pretty good: “What to Read on Modernization Theory.”

PPS. The code I used for this analysis is now on GitHub, here. It includes a link to the folder on my Google Drive with all of the required data sets.

Why My Coup Risk Models Don’t Include Any Measures of National Militaries

For the past several years (here, here, here, and here), I’ve used statistical models estimated from country-year data to produce assessments of coup risk in countries worldwide. I rejigger the models a bit each time, but none of the models I’ve used so far has included specific features of countries’ militaries.

That omission strikes a lot of people as a curious one. When I shared this year’s assessments with the Conflict Research Group on Facebook, one group member posted this comment:

Why do none of the covariates feature any data on militaries? Seeing as militaries are the ones who stage the coups, any sort of predictive model that doesn’t account for the militaries themselves would seem incomplete.

I agree in principle. It’s the practical side that gets in the way. I don’t include features of national militaries in the models because I don’t have reliable measures of them with the coverage I need for this task.

To train and then apply these predictive models, I need fairly complete time series for all or nearly all countries of the world that extend back to at least the 1980s and have been updated recently enough to give me a useful input for the current assessment (see here for more on why that’s true). I looked again early this month and still can’t find anything like that on even the big stuff, like military budgets, size, and force structures. There are some series on this topic in the World Bank’s World Development Indicators (WDI) data set, but those series have a lot of gaps, and the presence of those gaps is correlated with other features of the models (e.g., regime type). Ditto for SIPRI. And, of course, those aren’t even the most interesting features for coup risk, like whether or not military promotions favor certain groups over others, or if there is a capable and purportedly loyal presidential guard.

But don’t take my word for it. Here’s what the Correlates of War Project says in the documentation for Version 4.0 of its widely used data set (PDF) about its measure of military expenditures, one of two features of national militaries it tries to cover (the other is total personnel):

It was often difficult to identify and exclude civil expenditures from reported budgets of less developed nations. For many countries, including some major powers, published military budgets are a catch-all category for a variety of developmental and administrative expenses—public works, colonial administration, development of the merchant marine, construction, and improvement of harbor and navigational facilities, transportation of civilian personnel, and the delivery of mail—of dubious military relevance. Except when we were able to obtain finance ministry reports, it is impossible to make detailed breakdowns. Even when such reports were available, it proved difficult to delineate “purely” military outlays. For example, consider the case in which the military builds a road that facilitates troops movements, but which is used primarily by civilians. A related problem concerns those instances in which the reported military budget does not reflect all of the resources devoted to that sector. This usually happens when a nation tries to hide such expenditures from scrutiny; for instance, most Western scholars and military experts agree that officially reported post-1945 Soviet-bloc totals are unrealistically low, although they disagree on the appropriate adjustments.

And that’s just the part of the “Problems and Possible Errors” section about observing the numerator in a calculation that also requires a complicated denominator. And that’s for what is—in principle, at least—one of the most observable features of a country’s civil-military relations.

Okay, now let’s assume that problem magically disappears, and COW has nearly complete and reliable data on military expenditures. Now we want to use models trained on those data to estimate coup risk for 2015. Whoops: COW only runs through 2010! The World Bank and SIPRI get closer to the current year (observations through 2013 are available now), but there are missing values for lots of countries, and that missingness is caused by other predictors of coup risk, such as national wealth, armed conflict, and political regime type. For example, WDI has no data on military expenditures for Eritrea and North Korea ever, and the series for Central African Republic is patchy throughout and ends in 2010. If I wanted to include military expenditures in my predictive models, I could use multiple imputation to deal with these gaps in the training phase, but then how would I generate current forecasts for these important cases? I could make guesses, but how accurate could those guesses be for a case like Eritrea or North Korea, and then am I adding signal or noise to the resulting forecasts?

Of course, one of the luxuries of applied forecasting is that the models we use can lack important features and still “work.” I don’t need the model to be complete and its parameters to be true for the forecasts to be accurate enough to be useful. Still, I’ll admit that, as a social scientist by training, I find it frustrating to have to set aside so many intriguing ideas because we simply don’t have the data to try them.

Estimating NFL Team-Specific Home-Field Advantage

This morning, I tinkered a bit with my pro-football preseason team strength survey data from 2013 and 2014 to see what other simple things I might do to improve the accuracy of forecasts derived from future versions of them.

My first idea is to go beyond a generic estimate of home-field advantage—about 3 points, according to my and everyone else’s estimates—with team-specific versions of that quantity. The intuition is that some venues confer a bigger advantage than others. For example, I would guess that Denver enjoys a bigger home-field edge than most teams because their stadium is at high altitude. The Broncos live there, so they’re used to it, but visiting teams have to adapt, and that process supposedly takes about a day for every 1,000 feet over 3,000. Some venues are louder than others, and that noise is often dialed up when visiting teams would prefer some quiet. And so on.

To explore this idea, I’m using a simple hierarchical linear model to estimate team-specific intercepts after taking preseason estimates of relative team strength into account. The line of R code used to estimate the model requires the lme4 package and looks like this:

mod1 <- lmer(score.raw ~ wiki.diff + (1 | home_team), results)

Where

score.raw = home_score - visitor_score
wiki.diff = home_wiki - visitor_wiki

Those wiki vectors are the team strength scores estimated from preseason pairwise wiki surveys. The ‘results’ data frame includes scores for all regular and postseason games from those two years so far, courtesy of devstopfix’s NFL results repository on GitHub (here). Because the net game and strength scores are both ordered home to visitor, we can read those random intercepts for each home team as estimates of team-specific home advantage. There are probably other sources of team-specific bias in my data, so those estimates are going to be pretty noisy, but I think it’s a reasonable starting point.

My initial results are shown in the plot below, which I get with these two lines of code, the second of which requires the lattice package:

ha1 <- ranef(mod1, condVar=TRUE)
dotplot(ha1)

Bear in mind that the generic (fixed) intercept is 2.7, so the estimated home-field advantage for each team is what’s shown in the plot plus that number. For example, these estimates imply that my Ravens enjoy a net advantage of about 3 points when they play in Baltimore, while their division-rival Bengals are closer to 6.

[Figure: estimated team-specific home-field advantage (random intercepts by home team)]
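The arithmetic of adding the generic intercept to each team's random intercept can be sketched in a few lines. The example below fits the same model formula to simulated games, so the team labels, the 2.7-point baseline used to generate the scores, and all of the resulting numbers are assumptions for illustration and not the estimates in the plot.

```r
library(lme4)

# Simulated stand-in for the 'results' data frame: 32 teams, 16 home games each.
set.seed(7)
teams <- paste0("team_", sprintf("%02d", 1:32))
results <- data.frame(
  home_team = rep(teams, each = 16),
  wiki.diff = rnorm(32 * 16, 0, 20)   # home minus visitor strength score
)
# Assumed data-generating process: generic home edge, a team-specific
# deviation, a strength effect, and a lot of game-level noise.
team_dev <- setNames(rnorm(32, 0, 3), teams)
results$score.raw <- 2.7 + unname(team_dev[results$home_team]) +
  0.2 * results$wiki.diff + rnorm(nrow(results), 0, 13)

mod1 <- lmer(score.raw ~ wiki.diff + (1 | home_team), results)

# Net home advantage per team = fixed intercept + that team's random intercept.
generic  <- fixef(mod1)[["(Intercept)"]]
team_adj <- ranef(mod1)$home_team
home_adv <- setNames(generic + team_adj[, "(Intercept)"], rownames(team_adj))
head(sort(home_adv, decreasing = TRUE))

# coef() returns the same fixed-plus-random sums directly.
all.equal(unname(home_adv), coef(mod1)$home_team[, "(Intercept)"])
```

With only 16 home games per team and game-level noise this large, the random intercepts are shrunk heavily toward zero, which is worth keeping in mind when reading team-by-team differences.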

In light of DeflateGate, I guess I shouldn’t be surprised to see the Pats at the top of the chart, almost a whole point higher than the second-highest team. Maybe their insanely low home fumble rate has something to do with it.* I’m also heartened to see relatively high estimates for Denver, given the intuition that started this exercise, and Seattle, which I’ve heard said enjoys an unusually large home-field edge. At the same time, I honestly don’t know what to make of the exceptionally low estimates for DC and Jacksonville, who appear from these estimates to suffer a net home-field disadvantage. That strikes me as odd and undercuts my confidence in the results.

In any case, that’s how far my tinkering took me today. If I get really motivated, I might try re-estimating the model with just the 2013 data and then running the 2014 preseason survey scores through that model to generate “forecasts” that I can compare to the ones I got from the simple linear model with just the generic intercept (here). The point of the exercise was to try to get more accurate forecasts from simple models, and the only way to test that is to do it. I’m also trying to decide if I need to cross these team-specific effects with season-specific effects to try to control for differences across years in the biases in the wiki survey results when estimating these team-specific intercepts. But I’m not there yet.
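That train-on-2013, test-on-2014 comparison might look something like the sketch below. Everything in it is an assumption for illustration: the games are simulated, the season column and team labels are hypothetical, and RMSE is just one possible yardstick. Because the 2014 outcomes are already known, the numbers it produces are out-of-sample estimates, not forecasts in the strict sense.

```r
library(lme4)

# Simulated two-season data set: 32 teams, 8 home games per team per season.
set.seed(11)
teams <- paste0("team_", sprintf("%02d", 1:32))
games <- expand.grid(home_team = teams, game = 1:8, season = c(2013, 2014),
                     stringsAsFactors = FALSE)
team_dev <- setNames(rnorm(32, 0, 3), teams)   # assumed team-specific edge
games$wiki.diff <- rnorm(nrow(games), 0, 20)   # home minus visitor strength
games$score.raw <- 2.7 + unname(team_dev[games$home_team]) +
  0.2 * games$wiki.diff + rnorm(nrow(games), 0, 13)

train <- subset(games, season == 2013)
test  <- subset(games, season == 2014)

# Generic-intercept model versus team-specific intercepts, both fit to 2013 only.
m_generic <- lm(score.raw ~ wiki.diff, data = train)
m_team    <- lmer(score.raw ~ wiki.diff + (1 | home_team), data = train)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Score both models on the held-out 2014 games.
out <- c(
  generic = rmse(test$score.raw, predict(m_generic, newdata = test)),
  team    = rmse(test$score.raw, predict(m_team, newdata = test))
)
out
```

If the team-specific version does not beat the generic one on the held-out season, that is a sign the random intercepts are mostly picking up noise.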

* After I published this post, Michael Lopez helpfully pointed me toward a better take on the Patriots’ fumble rate (here), and Mo Patel observed that teams manage their own footballs on the road, too, so that particular tweak—if it really happened—wouldn’t have a home-field-specific effect.
