On Evaluating and Presenting Forecasts

On Saturday afternoon, at the International Studies Association’s annual conference in New Orleans, I’m slated to participate in a round-table discussion with Patrick Brandt, Kristian Gleditsch, and Håvard Hegre on “assessing forecasts of (rare) international events.” Our chair, Andy Halterman, gave us three questions he’d like to discuss:

  1. What are the quantitative assessments of forecasting accuracy that people should use when they publish forecasts?
  2. What’s the process that should be used to determine whether a gain in a scoring metric represents an actual improvement in the model?
  3. How should model uncertainty and past model performance be conveyed to government or non-academic users of forecasts?

As I did for a Friday panel on alternative academic careers (here), I thought I’d use the blog to organize my thoughts ahead of the event and share them with folks who are interested but can’t attend the round table. So, here goes:

When assessing predictive power, we use perfection as the default benchmark. We ask, “Was she right?” or “How close to the true value did he get?”

In fields where predictive accuracy is already very good, this approach seems reasonable. When the objects of the forecasts are rare international events, however, I think this is a mistake, or at least misleading. It implies that perfection is attainable, and that distance from perfection is what we care about. In fact, approximating perfection is not a realistic goal in many fields, and what we really care about in those situations is distance from the available alternatives. In other words, I think we should always assess accuracy in comparative terms, not absolute ones. So, the question becomes: “Compared to what?”

I can think of two situations in which we’d want to forecast international events, and the ways we assess and describe the accuracy of the results will differ across the two. First, there is basic research, where the goal is to develop and test theory. This is what most scholars are doing most of the time, and here the benchmark should be other relevant theories. We want to compare predictive power across nested models or models representing competing hypotheses to see which version does a better job anticipating real-world behavior—and, by implication, explaining it.

Then, of course, there is applied research, where the goal is to support or improve some decision process. Policy-making and investing are probably the two most common ones. Here, the benchmark should be the status quo. What we want to know is: “How much does the proposed forecasting process improve on the one(s) used now?” If the status quo is unclear, that already tells you something important about the state of forecasting in that field—namely, that it probably isn’t great. Even in that case, though, I think it’s still important to pick a benchmark that’s more realistic than perfection. Depending on the rarity of the event in question, that will usually mean either random guessing (for frequent events) or base rates (for rare ones).

How we communicate our findings on predictive power will also differ across basic and applied research, or at least I think it should. This has less to do with the goals of the work than with the audiences at which it’s usually aimed. When the audience is other scholars, I think it’s reasonable to expect them to understand the statistics and, so, to use them. For frequent events, Brier or logarithmic scores are often best, whereas for rare events I find that AUC scores are usually more informative, and I know a lot of people like to use F1 scores in this context, too.
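For anyone who wants to see exactly what those scores measure, here is a minimal sketch in Python (rather than the R I normally use), computing Brier and AUC scores by hand so the definitions are explicit; the labels and probabilities are made up for illustration:

```python
# Toy binary outcomes and forecast probabilities (made up for illustration).
y = [0, 0, 0, 1, 1]
p = [0.10, 0.40, 0.35, 0.80, 0.30]

def brier_score(y, p):
    """Mean squared error of probabilistic forecasts; lower is better."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def auc(y, p):
    """Probability that a randomly drawn positive case is ranked above a
    randomly drawn negative one (ties count half); higher is better."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

print(round(brier_score(y, p), 4))  # 0.1645
print(round(auc(y, p), 4))          # 0.6667
```

Note that the Brier score rewards calibration while AUC rewards only rank order, which is one reason the two can disagree about which of two models is "better."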

In applied settings, though, we’re usually doing the work as a service to someone else who probably doesn’t know the mechanics of the relevant statistics and shouldn’t be expected to. In my experience, it’s a bad idea in these settings to try to educate your customer on the spot about things like Brier or AUC scores. They don’t need to know those statistics, and you’re liable to come across as aloof or even condescending if you presume to spend time teaching them. Instead, I’d recommend using the practical problem they’re asking you to help solve to frame your presentation of your predictive power. Propose a realistic decision process—or, if you can, take the one they’re already using—and describe the results you’d get if you plugged your forecasts into it.

In applied contexts, people often will also want to know how your process performed on crucial cases and what events would have surprised it, so it’s good to be prepared to talk about those as well. These topics are germane to basic research, too, but crucial cases will be defined differently in the two contexts. For scholars, crucial cases are usually understood as most-likely and least-likely ones in relation to the theory being tested. For policy-makers and other applied audiences, the crucial cases are usually understood as the ones where surprise was or would have been costliest.

So that’s how I think about assessing and describing the accuracy of forecasts of the kinds of (rare) international events a lot of us study. Now, if you’ll indulge me, I’d like to close with a pedantic plea: Can we please reserve the terms “forecast” and “prediction” for statements about things that haven’t happened and not apply them to estimates we generate for cases with known outcomes?

This might seem like a petty concern, but it’s actually tied to the philosophy of knowledge that underpins science, or my understanding of it, anyway. Making predictions about things that haven’t already happened is a crucial part of the scientific method. To learn from prediction, we assume that a model’s forecasting power tells us something about its proximity to the “true” data-generating process. This assumption won’t always be right, but it’s proven pretty useful over the past few centuries, so I’m okay sticking with it for now. For obvious reasons, it’s much easier to make accurate “predictions” about cases with known outcomes than unknown ones, so the scientific value of the two endeavors is very different. In light of that fact, I think we should be as clear and honest with ourselves and our audiences as we can about which one we’re doing, and therefore how much we’re learning.

When we’re doing this stuff in practice, there are three basic modes: 1) in-sample fitting, 2) cross-validation (CV), and 3) forecasting. In-sample fitting is the least informative of the three and, in my opinion, really should only be used in exploratory analysis and should not be reported in finished work. It tells us a lot more about the method than the phenomenon of interest.

CV is usually more informative than in-sample fitting, but not always. Each iteration of CV on the same data set moves you a little closer to in-sample fitting, because you effectively train to the idiosyncrasies of your chosen test set. Using multiple iterations of CV may ameliorate this problem, but it doesn’t always eliminate it. And on topics where the available data have already been worked to death—as they have on many problems of interest to scholars of international affairs—cross-validation really isn’t much more informative than in-sample fitting unless you’ve got a brand-new data series you can throw at the task and are focused on learning about it.
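To make the mechanics of that procedure concrete, here is a minimal sketch (in Python rather than R, with a made-up sample size) of the repeated k-fold splitting being discussed; each iteration reshuffles the observations and partitions them into k held-out folds:

```python
import random

def repeated_kfold(n, k, iterations, seed=1):
    """Yield (iteration, fold) pairs, where each fold is a list of
    held-out indices for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for it in range(iterations):
        idx = list(range(n))
        rng.shuffle(idx)           # fresh random partition each iteration
        for f in range(k):
            yield it, idx[f::k]    # every k-th index forms one fold

# E.g., 10 iterations of five-fold CV over 100 observations:
splits = list(repeated_kfold(100, 5, 10))
print(len(splits))  # 50 train/test splits in total
```

The point of the prose above is that all 50 of those test sets are drawn from the same fixed data, so decisions you make after looking at the results still leak information across iterations.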

True forecasting—making clear statements about things that haven’t happened yet and then seeing how they turn out—is uniquely informative in this regard, so I think it’s important to reserve that term for the situations where that’s actually what we’re doing. When we describe in-sample and cross-validation estimates as forecasts, we confuse our readers, and we risk confusing ourselves about how much we’re really learning.

Of course, that’s easier for some phenomena than it is for others. If your theory concerns the risk of interstate wars, for example, you’re probably (and thankfully) not going to get a lot of opportunities to test it through prediction. Rather than sweeping those issues under the rug, though, I think we should recognize them for what they are. They are not an excuse to elide the huge differences between prediction and fitting models to history. Instead, they are a big haymaker of a reminder that social science is especially hard—not because humans are uniquely unpredictable, but rather because we only have the one grand and always-running experiment to observe, and we and our work are part of it.

“No One Stayed to Count the Bodies”

If you want to understand and appreciate why, even in the age of the Internet and satellites and near-ubiquitous mobile telephony, it remains impossible to measure even the coarsest manifestations of political violence with any precision, read this blog post by Phil Hazlewood, AFP’s bureau chief in Lagos. (Warning: graphic. H/t to Ryan Cummings on Twitter.)

Hazlewood’s post focuses on killings perpetrated by Boko Haram, but the same issues arise in measuring violence committed by states. Violence eliminates some of the people who might describe the acts involved, and it intentionally scares many others. If you hear or see details of what happened, that’s often because the killers or their rivals for power wanted you to hear or see those details. We cannot sharply distinguish between the communication of those facts and the political intentions expressed in the violence or the reactions to it. The conversation is the message, and the violence is part of the conversation.

When you see or hear things in spite of those efforts to conceal them, you have to wonder how selection effects limit or distort the information that gets through. North Korea’s gulag system apparently holds thousands of prisoners and kills untold numbers of them each year. Defectors are the outside world’s main source of information about that system, but those defectors are not a random sample of victims, nor are they mechanical recording devices. Instead, they are human beings who have somehow escaped that country and who are now seeking to draw attention to and destroy that system. I do not doubt the basic truth of the gulags’ existence and the horrible things done there, but as a social scientist, I have to consider how those selection processes and motivations shape what we think we know. In the United States, we lack reliable data on fatal encounters with police. That’s partly because different jurisdictions have different capabilities for recording and reporting these incidents, but it’s also partly because some people in that system do not want us to see what they do.

For a previous post of mine on this topic, see “The Fog of War Is Patchy”.


Some Thoughts on “Alternative Academic Careers”

I’m headed to New Orleans next week for the annual convention of the International Studies Association (ISA), and while there I’m scheduled to speak on a panel on “alternative academic careers” (Friday at 1:45 PM). To help organize my views on the subject and to share them with people who are curious but can’t attend the panel, I thought I would turn them into a blog post. So:

Let me start with some caveats. I am a white, male U.S. citizen who grew up in a family that wasn’t poor and who married a woman while in grad school. I would prefer to deal in statistics, but on this topic, I can only really speak from experience, and that experience has been conditioned by those personal characteristics. In other words, I know my view is narrow and biased, but on this topic, that view is all I’ve got, so take it for whatever you think it’s worth.

My own career trajectory has been, I think, unusual. I got my Ph.D. from Stanford in the spring of 1997. At the time, I had no academic job offers; I had a spouse who wanted to go to art school; we had two dogs and couldn’t find a place we could afford that would rent to us in the SF Bay Area (this was the leading edge of the dot-com boom); and we both had family in the DC metro area. So, we packed up and moved in with my mother-in-law in Baltimore, and I started looking for work in and around Washington.

It took me a few months to land a job as an analyst in a little branch of a big forensic accounting firm, basically writing short pieces on political risk for a newsletter that went out to the firm’s corporate clients. We moved to the DC suburbs and I did that for about a year until I got a job with a small government contractor that did research projects for the U.S. “intelligence community” and the Department of Defense. After a couple of years of that and the birth of our first son, I decided I needed a change, so I took a job writing book-length research reports on telecom firms and industry segments for a trade-news outfit. At the time, I was also doing some freelance feature writing on whatever I could successfully pitch, and I thought a full-time writing job would help me move faster in that direction.

After a couple of years of that and no serious traction on the writing front, I got a line through one of my dissertation committee members on a part-time consulting thing with a big government contractor, SAIC. I was offered and took that job, which soon evolved into a full-time, salaried position as research director for the Political Instability Task Force. I did that for 10 years, then left it at the end of 2011 to try freelancing as a social scientist. Most of my work time since then has been devoted to the Early Warning Project and the Good Judgment Project, with dribs and drabs on other things and enough free time to write this blog.

So that’s where I’m coming from. Now, here is what I think I’ve learned from those experiences, and from watching and talking to others with similar training who do related things.

First, freelancing—what I’m doing now, and what usually gets fancified as “independent consulting”—is not a realistic option for most social scientists early in their careers, and probably not for most people, period. I would not have landed either of the large assignments I’ve gotten in the past few years without the professional network and reputation I had accumulated over the previous ten. This blog and my activity on social media have helped me expand that network, but nearly all of the paid jobs I’ve done in the past few years have come through those earlier connections. Best I can tell, there are no realistic shortcuts to this kind of role.

Think tanks aren’t really an option for new Ph.D.s, either. Most of the jobs in that world are for recent undergraduates or MAs on the one hand and established scholars and practitioners on the other. There are some opportunities for new Ph.D.s looking to do straight-up analysis at places like RAND and the Congressional Research Service, but the niche is tiny, and those jobs will be very hard to land.

That brings me to the segment I know best, namely, government contracting. Budget cuts mean that jobs in that market are scarcer than they were a few years ago, but there are still lots of them. In that world, though, hiring is often linked to specific roles in contracts that have already been awarded. You’re not paid to pursue your own interests or write policy briefs; you’re usually hired to do tasks X and Y on contract Z, with the expectation that you’ll also be able to apply those skills to other, similar contracts after Z runs out. So, to land these jobs, you need to have abilities that can plug into clearly-defined contract tasks, but that also make you a reasonably good risk overall. Nowadays, programming and statistical (“data science”) skills are in demand, but so are others, including fluency in foreign languages and area expertise. If you want to get a feel for what’s valued in that world, spend some time looking at job listings from the big contracting firms—places like Booz Allen Hamilton, Lockheed Martin, SAIC, and Leidos—and see what they’re asking for.

If you do that, you’ll quickly discover that having a security clearance is a huge plus. If you think you want to do government-contract work, you can give yourself a leg up by finding some way to get that clearance while you’re still in graduate school. If you can’t do that, you might consider looking for work that isn’t your first choice on substance but will lead to a clearance in hopes of making an upward or lateral move later on.

All of that said, it’s important to think carefully about the down sides of a security clearance before you pursue one. A clearance opens some doors, but it closes others, some permanently. The process can take a long time, and it can be uncomfortable. When you get a clearance, you accept some constraints on your speech, and those never entirely fall away, even if you leave that world. The fact that you currently or once held a clearance and worked with agencies that required one will also make a certain impression on some people, and that never fully goes away, either. So it’s worth thinking carefully about those long-term implications before you jump in.

If you do go to work on either side of that fence—for a contractor, or directly for the government—you need to be prepared to work on topics on which you’re not already an expert. It’s often your analytic skills they’re after, not your topical expertise, so you should be prepared to stretch that way. Maybe you’ll occasionally or eventually work back to your original research interests, maybe you’ll discover new ones, or maybe you’ll get bored or frustrated and restart your job search. In all cases, the process will go better if you’re prepared for any of the above.

Last but not least, if you’re not going into academia, I think it can help to know that your first job after grad school does not make or break your career. I don’t have data to support this claim, but my own experience and my observation of others tells me that nonacademic careers are not nearly as path dependent as academic ones. That can be scarier in some ways, but it’s also potentially liberating. Within the nontrivial constraints of the market, you can keep reinventing yourself, and I happen to think that’s great.

A Tale of Normal Failure

When I blog about my own research, I usually describe work I’ve already completed and focus on the results. This post is about a recent effort that ended in frustration, and it focuses on the process. In writing about this aborted project, I have two hopes: 1) to reassure other researchers (and myself) that this kind of failure is normal, and 2) if I’m lucky, to get some help with this task.

This particular ball got rolling a couple of days ago when I read a blog post by Dan Drezner about one aspect of the Obama administration’s new National Security Strategy (NSS) report. A few words in the bits Dan quoted got me thinking about the worldview they represented, and how we might use natural-language processing (NLP) to study that:

At first, I was just going to drop that awkwardly numbered tweetstorm and leave it there. I had some time that afternoon, though, and I’ve been looking for opportunities to learn text mining, so I decided to see what I could do. The NSS reports only became a thing in 1987, so there are still just 16 of them, and they all try to answer the same basic questions: What threats and opportunities does the US face in the world, and what should the government do to meet them? As such, they seemed like the kind of manageable and coherent corpus that would make for a nice training exercise.

I started by checking to see if anyone had already done with earlier reports what I was hoping to do with the latest one. It turned out that someone had, and to good effect:

I promptly emailed the corresponding author to ask if they had replication materials, or even just clean versions of the texts for all previous years. I got an autoreply informing me that the author was on sabbatical and would only intermittently be reading his email. (He replied the next day to say that he would put the question to his co-authors, but that still didn’t solve my problem, and by then I’d moved on anyway.)

Without those materials, I would need to start by getting the documents in the proper format. A little Googling led me to the National Security Strategy Archive, which at the time had PDFs of all but the newest report, and that one was easy enough to find on the White House’s web site. Another search led me to a site that converts PDFs to plain text online for free. I spent the next hour or so running those reports through the converter (and playing a little Crossy Road on my phone while I waited for the jobs to finish). Once I had the reports as .txt files, I figured I could organize my work better and do other researchers a solid by putting them all in a public repository, so I set one up on GitHub (here) and cloned it to my hard drive.

At that point, I was getting excited, thinking: “Hey, this isn’t so hard after all.” In most of the work I do, getting the data is the toughest part, and I already had all the documents I wanted in the format I needed. I was just a few lines of code away from the statistics and plots that would confirm or disconfirm my conjectures.

From another recent collaboration, I knew that the next step would be to use some software to ingest those .txt files, scrub them a few different ways, and then generate some word counts and maybe do some topic modeling to explore changes over time in the reports’ contents. I’d heard several people say that Python is really good at these tasks, but I’m an R guy, so I followed the lead on the CRAN Task View for natural language processing and installed and loaded the ‘tm’ package for text mining.

And that’s where the wheels started to come off of my rickety little wagon. Using the package developers’ vignette and an article they published in the Journal of Statistical Software, I started tinkering with some code. After a couple of false starts, I found that I could create a corpus and run some common preprocessing tasks on it without too much trouble, but I couldn’t get the analytical functions to run on the results. Instead, I kept getting this error message:

Error: inherits(doc, "TextDocument") is not TRUE

By then it was dinner time, so I called it a day and went to listen to my sons holler at each other across the table for a while.

When I picked the task back up the next morning, I inspected a few of the scrubbed documents and saw some strange character strings—things like ir1 instead of in and ’ where an apostrophe should be. That got me wondering if the problem lay in the encoding of those .txt files. Unfortunately, neither the files themselves nor the site that produced them tells me which encoding they use. I ran through a bunch of options, but none of them fixed the problem.
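That “â€™” pattern, for what it’s worth, is a classic symptom of mojibake: UTF-8 bytes decoded as Windows-1252. A quick sketch in Python (not the R workflow I was using, but the mechanics are the same) shows how the round trip garbles, and un-garbles, an apostrophe:

```python
# The right single quote (U+2019) is three bytes in UTF-8.
good = "\u2019"
raw = good.encode("utf-8")      # b'\xe2\x80\x99'

# Decoding those UTF-8 bytes as Windows-1252 produces the garbage I saw.
garbled = raw.decode("cp1252")
print(garbled)                  # â€™

# The repair reverses the round trip: re-encode as cp1252, decode as UTF-8.
fixed = garbled.encode("cp1252").decode("utf-8")
print(fixed == good)            # True
```

Knowing which direction the damage runs is half the battle; the other half is figuring out where in a multi-step PDF-to-text pipeline it happened.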

“Okay, no worries,” I thought. “I’ll use gsub() to replace those funky bits in the strings by hand.” The commands ran without a hiccup, but the text didn’t change. Stranger still, when I tried to inspect documents in the R terminal, the same command wouldn’t always produce the same result. Sometimes I’d get the head, and sometimes the tail. I tried moving back a step in the process and installed a PDF converter that I could run from R, but R couldn’t find the converter, and my attempts to fix that failed.

At this point, I was about ready to quit, and I tweeted some of that frustration. Igor Brigadir quickly replied to suggest a solution, but it involved another programming language, Python, that I don’t know:

To go that route, I would need to start learning Python. That’s probably a good idea for the long run, but it wasn’t going to happen this week. Then Ken Benoit pointed me toward a new R package he’s developing and even offered to help me:

That sounded promising, so I opened R again and followed the clear instructions on the README at Ken’s repository to install the package. Of course the installation failed, probably because I’m still using R Version 3.1.1 and the package is, I suspect, written for the latest release, 3.1.2.

And that’s where I finally quit—for now. I’d hit a wall, and all my usual strategies for working through or around it had either failed or led to solutions that would require a lot more work. If I were getting paid and on deadline, I’d keep hacking away, but this was supposed to be a “fun” project for my own edification. What seemed at first like a tidy exercise had turned into a tar baby, and I needed to move on.

This cycle of frustration → problem-solving → frustration might seem like a distraction from the real business of social science, but in my experience, it is the real business. Unless I’m performing a variation on a familiar task with familiar data, this is normal. It might be boring to read, but then most of the day-to-day work of social science probably is, or at least looks that way to the people who aren’t doing it and therefore can’t see how all those little steps fit into the bigger picture.

So that’s my tale of minor woe. Now, if anyone who actually knows how to do text-mining in R is inspired to help me figure out what I’m doing wrong on that National Security Strategy project, please take a look at that GitHub repo and the script posted there and let me know what you see.

Demography and Democracy Revisited

Last spring on this blog, I used Richard Cincotta’s work on age structure to take another look at the relationship between democracy and “development” (here). In his predictive models of democratization, Rich uses variation in median age as a proxy for a syndrome of socioeconomic changes we sometimes call “modernization” and argues that “a country’s chances for meaningful democracy increase as its population ages.” Rich’s models have produced some unconventional predictions that have turned out well, and if you buy the scientific method, this apparent predictive power implies that the underlying theory holds some water.

Over the weekend, Rich sent me a spreadsheet with his annual estimates of median age for all countries from 1972 to 2015, so I decided to take my own look at the relationship between those estimates and the occurrence of democratic transitions. For the latter, I used a data set I constructed for PITF (here) that covers 1955–2010, giving me a period of observation running from 1972 to 2010. In this initial exploration, I focused specifically on switches from authoritarian rule to democracy, which are observed with a binary variable that covers all country-years where an autocracy was in place on January 1. That variable (rgjtdem) is coded 1 if a democratic regime came into being at some point during that calendar year and 0 otherwise. Between 1972 and 2010, 94 of those switches occurred worldwide. The data set also includes, among other things, a “clock” counting consecutive years of authoritarian rule and an indicator for whether or not the country has ever had a democratic regime before.

To assess the predictive power of median age and compare it to other measures of socioeconomic development, I used the base and caret packages in R to run 10 iterations of five-fold cross-validation on the following series of discrete-time hazard (logistic regression) models:

  • Base model. Any prior democracy (0/1), duration of autocracy (logged), and the product of the two.
  • GDP per capita. Base model plus the Maddison Project’s estimates of GDP per capita in 1990 Geary-Khamis dollars (here), logged.
  • Infant mortality. Base model plus the U.S. Census Bureau’s estimates of deaths under age 1 per 1,000 live births (here), logged.
  • Median age. Base model plus Cincotta’s estimates of median age, untransformed.

The chart below shows density plots and averages of the AUC scores (computed with ‘roc.area’ from the verification package) for each of those models across the 10 iterations of five-fold CV. Contrary to the conventional assumption that GDP per capita is a useful predictor of democratic transitions—How many papers have you read that tossed this measure into the model as a matter of course?—I find that the model with the Maddison Project measure actually makes slightly less accurate predictions than the one with duration and prior democracy alone. More relevant to this post, though, the two demographic measures clearly improve the predictions of democratic transitions relative to the base model, and median age adds a smidgen more predictive signal than infant mortality.


Of course, all of these things—national wealth, infant mortality rates, and age structures—have also been changing pretty steadily in a single direction for decades, so it’s hard to untangle the effects of the covariates from other features of the world system that are also trending over time. To try to address that issue and to check for nonlinearity in the relationship, I used Simon Wood’s mgcv package in R to estimate a semiparametric logistic regression model with smoothing splines for year and median age alongside the indicator of prior democracy and regime duration. Plots of the marginal effects of year and median age estimated from that model are shown below. As the left-hand plot shows, the time effect is really a hump in risk that started in the late 1980s and peaked sharply in the early 1990s; it is not the across-the-board post–Cold War increase that we often see covered in models with a dummy variable for years after 1991. More germane to this post, though, we still see a marginal effect from median age, even when accounting for those generic effects of time. Consistent with Cincotta’s argument and other things being equal, countries with higher median age are more likely to transition to democracy than countries with younger populations.


I read these results as a partial affirmation of modernization theory—not the whole teleological and normative package, but the narrower empirical conjecture about a bundle of socioeconomic transformations that often co-occur and are associated with a higher likelihood of attempting and sustaining democratic government. Statistical studies of this idea (including my own) have produced varied results, but the analysis I’m describing here suggests that some of the null results may stem from the authors’ choice of measures. GDP per capita is actually a poor proxy for modernization; there are a number of ways countries can get richer, and not all of them foster (or are fostered by) the socioeconomic transformations that form the kernel of modernization theory (cf. Equatorial Guinea). By contrast, demographic measures like infant mortality rates and median age are more tightly coupled to those broader changes about which Seymour Martin Lipset originally wrote. And, according to my analysis, those demographic measures are also associated with a country’s propensity for democratic transition.

Shifting to the applied forecasting side, I think these results confirm that median age is a useful addition to models of regime transitions, and it seems to capture more information about those propensities than GDP (by a lot) or infant mortality (by a little). Like all slow-changing structural indicators, though, median age is a blunt instrument. Annual forecasts based on it alone would be pretty clunky, and longer-term forecasts would do well to consider other domestic and international forces that also shape (and are shaped by) these changes.

PS. If you aren’t already familiar with modernization theory and want more background, this ungated piece by Sheri Berman for Foreign Affairs is pretty good: “What to Read on Modernization Theory.”

PPS. The code I used for this analysis is now on GitHub, here. It includes a link to the folder on my Google Drive with all of the required data sets.

A Postscript on Measuring Change Over Time in Freedom in the World

After publishing yesterday’s post on Freedom House’s latest Freedom in the World report (here), I thought some more about better ways to measure what I think Freedom House implies it’s measuring with its annual counts of country-level gains and declines. The problem with those counts is that they don’t account for the magnitude of the changes they represent. That’s like keeping track of how a poker player is doing by counting bets won and bets lost without regard to their value. If we want to assess the current state of the system and compare it to earlier states, the size of those gains and declines matters, too.

With that in mind, my first idea was to sum the raw annual changes in countries’ “freedom” scores by year, where the freedom score is just the sum of those 7-point political rights and civil liberties indices. Let’s imagine a year in which three countries saw a 1-point decline in their freedom scores; one country saw a 1-point gain; and one country saw a 3-point gain. Using Freedom House’s measure, that would look like a bad year, with declines outnumbering gains 3 to 2. Using the sum of the raw changes, however, it would look like a good year, with a net change in freedom scores of +1.
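To make the bookkeeping concrete, here's that hypothetical year as a few lines of R (the country labels are invented for illustration):

```r
# Hypothetical one-year changes in combined freedom scores for five countries
changes <- c(A = -1, B = -1, C = -1, D = 1, E = 3)

# Freedom House-style tally: decliners outnumber gainers 3 to 2
declines <- sum(changes < 0)
gains    <- sum(changes > 0)

# Magnitude-aware alternative: the net sum of the raw changes is +1
net <- sum(changes)
```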

Okay, so here’s a plot of those sums of raw annual changes in freedom scores since 1982, when Freedom House rejiggered the timing of its survey.[1] I’ve marked the nine-year period that Freedom House calls out in its report as an unbroken run of bad news, with declines outnumbering gains every year since 2006. As the plot shows, when we account for the magnitude of those gains and losses, things don’t look so grim. In most of those nine years, losses did outweigh gains, but the net loss was rarely large, and two of the nine years actually saw net gains by this measure.

Annual global sums of raw yearly changes in Freedom House freedom scores (inverted), 1983-2014


After I’d generated that plot, though, I worried that the sum of those raw annual changes still ignored another important dimension: population size. As I understand it, the big question Freedom House is trying to address with its annual report is: “How free is the world?” If we want to answer that question from a classical liberal perspective—and that’s where I think Freedom House is coming from—then individual people, not states, need to be our unit of observation.

Imagine a world with five countries where half the global population lives in one country and the other half is evenly divided between the other four. Now let’s imagine that the one really big country is maximally unfree while the other four countries are maximally free. If we compare scores (or changes in them) by country, things look great; 80 percent of the world is super-free! Meanwhile, though, half the world’s population lives under total dictatorship. An international relations theorist might care more about the distribution of states, but a liberal should care more about the distribution of people.

To take a look at things from this perspective, I decided to generate a scalar measure of freedom in the world system that sums country scores weighted by their share of the global population.[2] To make the result easier to interpret, I started by rescaling the country-level “freedom scores” from 14-2 to 0-10, with 10 indicating most free. A world in which all countries are fully free (according to Freedom House) would score a perfect 10 on this scale, and changes in large countries will move the index more than changes in small ones.
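The rescaling I'm assuming here is linear; in R it would look something like this (the function name is mine):

```r
# Map raw combined scores (2 = most free ... 14 = least free) onto a
# 0-10 scale with 10 indicating most free
rescale <- function(raw) (14 - raw) * 10 / 12

rescale(2)   # a fully free country scores 10
rescale(14)  # a fully unfree country scores 0
```

That transformation reproduces the country-level values quoted later in the post; for example, a 6/6 country (raw combined score of 12) lands at 1.67.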

Okay, so here’s a plot of the results for the entire run of Freedom House’s data set, 1972–2014. (Again, 1981 is missing because that’s when Freedom House paused to align their reports with the calendar year.)  Things look pretty different than they do when we count gains and declines or even sum raw changes by country, don’t they?

A population-weighted annual scalar measure of freedom in the world, 1972-2014


The first thing that jumped out at me was those sharp declines in the mid-1970s and again in the late 1980s and early 1990s. At first I thought I must have messed up the math, because everyone knows things got a lot better when Communism crumbled in Eastern Europe and the Soviet Union, right? It turns out, though, that those swings are driven by changes in China and India, which together account for approximately one-third of the global population. In 1989, after Tiananmen Square, China’s score dropped from a 6/6 (or 1.67 on my 10-point scalar version) to 7/7 (or 0). At the time, China contained nearly one-quarter of the world’s population, so that slump more than offset the (often-modest) gains made in the countries touched by the so-called fourth wave of democratic transitions. In 1998, China inched back up to 7/6 (0.83), and the global measure moved with it. Meanwhile, India dropped from 2/3 (7.5) to 3/4 (5.8) in 1991, and then again from 3/4 to 4/4 (5.0) in 1993, but it bumped back up to 2/4 (6.67) in 1996 and then 2/3 (7.5) in 1998. The global gains and losses produced by the shifts in those two countries don’t fully align with the conventional narrative about trends in democratization in the past few decades, but I think they do provide a more accurate measure of overall freedom in the world if we care about people instead of states, as liberalism encourages us to do.

Of course, the other thing that caught my eye in that second chart was the more-or-less flat line for the past decade. When we consider the distribution of the world’s population across all those countries where Freedom House tallies gains and declines, it’s hard to find evidence of the extended democratic recession they and others describe. In fact, the only notable downturn in that whole run comes in 2014, when the global score dropped from 5.2 to 5.1. To my mind, that recent downturn marks a worrying development, but it’s harder to notice it when we’ve been hearing cries of “Wolf!” for the eight years before.


[1] For the #Rstats crowd: I used the slide function in the package DataCombine to get one-year lags of those indices by country; then I created a new variable representing the difference between the annual score for the current and previous year; then I used ddply from the plyr package to create a data frame with the annual global sums of those differences. Script on GitHub here.
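For readers without those packages installed, the same lag-and-sum pipeline can be sketched in base R (toy data here; the real script works on the full Freedom House panel):

```r
# Toy country-year panel of combined freedom scores
df <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year    = rep(2001:2003, times = 2),
  score   = c(5, 6, 6, 10, 9, 7)
)

# One-year lag within each country (the job DataCombine's slide does),
# then the year-over-year difference
df <- df[order(df$country, df$year), ]
df$lag    <- ave(df$score, df$country, FUN = function(x) c(NA, head(x, -1)))
df$change <- df$score - df$lag

# Annual global sum of those differences (the job ddply does)
net_by_year <- tapply(df$change, df$year, sum, na.rm = TRUE)
```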

[2] Here, I used the WDI package to get country-year data on population size; used ddply to calculate world population by year; merged those global sums back into the country-year data; used those sums as the denominator in a new variable indicating a country’s share of the global population; and then used ddply again to get a table with the sum of the products of those population weights and the freedom scores. Again, script on GitHub here (same one as before).
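The weighting step can likewise be sketched without the WDI or plyr packages; this toy panel reuses the five-country thought experiment from above, with invented population figures:

```r
# One big, maximally unfree country holds half the world's population;
# four maximally free countries split the rest (scores already on 0-10)
df <- data.frame(
  country = c("Big", "W", "X", "Y", "Z"),
  year    = 2014,
  free10  = c(0, 10, 10, 10, 10),
  pop     = c(4e9, 1e9, 1e9, 1e9, 1e9)
)

# World population by year becomes the denominator for each country's share
world <- aggregate(pop ~ year, data = df, FUN = sum)
names(world)[2] <- "world_pop"
df <- merge(df, world, by = "year")
df$share <- df$pop / df$world_pop

# Population-weighted global freedom score, by year
index <- tapply(df$share * df$free10, df$year, sum)
```

Counting countries, 80 percent of this imaginary world is fully free; weighting by people, the global index sits at 5 out of 10.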

No, Democracy Has Not Been Discarded

Freedom House released the latest iteration of its annual Freedom in the World report yesterday (PDF) with the not-so-subtle subtitle “Discarding Democracy: Return to the Iron Fist.” The report starts like this (emphasis in original):

In a year marked by an explosion of terrorist violence, autocrats’ use of more brutal tactics, and Russia’s invasion and annexation of a neighboring country’s territory, the state of freedom in 2014 worsened significantly in nearly every part of the world.

For the ninth consecutive year, Freedom in the World, Freedom House’s annual report on the condition of global political rights and civil liberties, showed an overall decline. Indeed, acceptance of democracy as the world’s dominant form of government—and of an international system built on democratic ideals—is under greater threat than at any point in the last 25 years.

Even after such a long period of mounting pressure on democracy, developments in 2014 were exceptionally grim. The report’s findings show that nearly twice as many countries suffered declines as registered gains, 61 to 33, with the number of gains hitting its lowest point since the nine-year erosion began.

Once again, I’m going to respond to the report’s release by arguing that things aren’t nearly as bad as Freedom House describes them, even according to their own data. As I see it, the nine-year trend of “mounting pressure on democracy” isn’t really much of a thing, and it certainly hasn’t brought the return of the “iron fist.”

Freedom House measures freedom on two dimensions: political rights and civil liberties. Both are measured on a seven-point scale, with lower numbers indicating more freedom. (You can read more about the methodology here.) I like to use heat maps to visualize change over time in the global distribution of countries on these scales. In these heat maps, the quantities being visualized are the proportion of countries worldwide landing in each cell of the 7 x 7 grid defined by juxtaposing those two scales. The darker the color, the higher the proportion.
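Here's a hedged sketch of how such a heat map can be built in R, with synthetic ratings and lattice's levelplot() standing in for whatever graphics code produced the originals:

```r
set.seed(1)

# Synthetic political rights (pr) and civil liberties (cl) ratings, 1-7;
# the two indices tend to move together, as they do in the real data
pr <- sample(1:7, 150, replace = TRUE)
cl <- pmin(pmax(pr + sample(-1:1, 150, replace = TRUE), 1), 7)

# Proportion of countries landing in each cell of the 7 x 7 grid
grid <- prop.table(table(pr = factor(pr, 1:7), cl = factor(cl, 1:7)))

# Darker cells = higher proportions
library(lattice)
m <- matrix(grid, 7, 7, dimnames = dimnames(grid))
levelplot(m, col.regions = grey(seq(1, 0, length.out = 100)),
          xlab = "Political rights", ylab = "Civil liberties")
```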

Let’s start with one for 1972, the first year Freedom House observes and just before the start of the so-called third wave of democratic transitions, to establish a baseline. At this point, the landscape has two distinct peaks, in the most- and least-free corners, and authoritarian regimes actually outnumber democracies.


Okay, now let’s jump ahead to 1986, the eve of what some observers describe as the fourth wave of democratic transitions that swept the Warsaw Pact countries and sub-Saharan Africa. At this point, the world doesn’t look much different. The “least free” peak (lower left) has spread a bit, but there are still an awful lot of countries on that side of the midline, and the “most free” peak (upper right) is basically unchanged.


A decade later, though, things look pretty different. By 1995, the “most free” peak has gotten taller and broader, the “least free” peak has eroded further, and there’s now a third peak of sorts in the middle, centered on 4/4.


Jump ahead another 10 years, to 2005, and the landscape has tilted decisively toward democracy. The “least free” peak is no more, the bump in the middle has shifted up and right, and the “most free” peak now dominates the landscape.


That last image comes not long before the run of nine years of consecutive declines in freedom described in the 2015 report. From the report’s subtitle and narrative, you might think the landscape had clearly shifted—maybe not all the way back to the one we saw in the 1970s or even the 1980s, but perhaps to something more like the mid-1990s, when we still had a clear authoritarian peak and a lot of countries seemed to be sliding between the two poles. Well, here’s the actual image:


That landscape looks a lot more like 2005 than 1995, and it looks nothing like the 1970s or 1980s. The liberal democratic peak still dominates, and the authoritarian peak is still gone. The clumps at 3/3 and 6/5 seem to have slumped a little in the wrong direction, but there are no significant new accretions anywhere. In a better world, we would have seen continued migration toward the “most free” corner over the past decade, but the absence of further improvement is hardly the kind of rollback that phrases like “discarding democracy” and “return to the iron fist” seem to imply.

Freedom House’s topline message is also belied by the trend over time in its count of electoral democracies—that is, countries that hold mostly free and fair elections for the offices that actually make policy. By Freedom House’s own count, the number of electoral democracies around the world actually increased by three in 2014 to an all-time high of 125, or more than two-thirds of all countries. Here’s the online version of their chart of that trend (from this page):


Again, I’m having a hard time seeing democracy being “discarded” in that plot.

So how can both of these things be true? How can the number of electoral democracies grow over a period when annual declines in freedom scores outnumber annual gains?

The answer is that those declines are often occurring in countries that are already governed by authoritarian regimes, and they are often small in size. Meanwhile, some countries are still making jumps from autocracy to democracy that are usually larger in scale than the incremental declines and thus mostly offset the losses in the global tally. So, while those declines are surely bad for the citizens suffering through them, they rarely move countries from one side of the ledger to the other, and they have only a modest effect on the overall level of “freedom” in the system.

This year’s update on the Middle East shows what I mean. In its report, Freedom House identifies only one country in that region that made significant gains in freedom in 2014—Tunisia—against seven that saw declines: Bahrain, Egypt, Iraq, Lebanon, Libya, Syria, and Yemen. All seven of those decliners were already on the authoritarian side of the ledger going into 2014, however, and only four of the declines were large enough to move a country’s rating on one or both of the relevant indices. Meanwhile, Tunisia jumped up two points on political rights in one year, and since 2011 its combined score (political rights + civil liberties) has improved by eight points, from 12 to 4 (remember, lower numbers mean more freedom). We see similar patterns in the declines in Eurasia, where nearly all countries already clustered around the “least free” pole, and sub-Saharan Africa, where only one country moved down into Freedom House’s Not Free category (Uganda) and two returned to the set of electoral democracies after holding reasonably fair and competitive elections (Guinea-Bissau and Madagascar).

In short, I continue to believe that Freedom House’s presentation of trends over time in political rights and civil liberties is much gloomier than the world its own data portray. Part of me feels like a jerk for saying so, because I recognize that Freedom House’s messaging is meant to be advocacy, not science, and I support the goals that advocacy is meant to achieve. As a social scientist, though, I also think it’s important that our analyses and decisions be informed by as accurate a sketch of the world as we can draw, so I will keep mumbling into this particular gale.

PS. If you want Freedom House’s data in .csv format, I’ve posted a version of them—including the 2014 updates, which I entered by hand this morning—on my Google Drive, here.

PPS. If you’re curious where I think these trends might be headed in the next 10 years, see this recent post.

PPPS. The day after I ran this post, I published another in which I tried to think of better ways to measure what Freedom House purports to describe in its annual reports. You can read it here.

Why My Coup Risk Models Don’t Include Any Measures of National Militaries

For the past several years (here, here, here, and here), I’ve used statistical models estimated from country-year data to produce assessments of coup risk in countries worldwide. I rejigger the models a bit each time, but none of the models I’ve used so far has included specific features of countries’ militaries.

That omission strikes a lot of people as a curious one. When I shared this year’s assessments with the Conflict Research Group on Facebook, one group member posted this comment:

Why do none of the covariates feature any data on militaries? Seeing as militaries are the ones who stage the coups, any sort of predictive model that doesn’t account for the militaries themselves would seem incomplete.

I agree in principle. It’s the practical side that gets in the way. I don’t include features of national militaries in the models because I don’t have reliable measures of them with the coverage I need for this task.

To train and then apply these predictive models, I need fairly complete time series for all or nearly all countries of the world that extend back to at least the 1980s and have been updated recently enough to give me a useful input for the current assessment (see here for more on why that’s true). I looked again early this month and still can’t find anything like that on even the big stuff, like military budgets, size, and force structures. There are some series on this topic in the World Bank’s World Development Indicators (WDI) data set, but those series have a lot of gaps, and the presence of those gaps is correlated with other features of the models (e.g., regime type). Ditto for SIPRI. And, of course, those aren’t even the most interesting features for coup risk, like whether or not military promotions favor certain groups over others, or if there is a capable and purportedly loyal presidential guard.

But don’t take my word for it. Here’s what the Correlates of War Project says in the documentation for Version 4.0 of its widely-used data set (PDF) about its measure of military expenditures, one of two features of national militaries it tries to cover (the other is total personnel):

It was often difficult to identify and exclude civil expenditures from reported budgets of less developed nations. For many countries, including some major powers, published military budgets are a catch-all category for a variety of developmental and administrative expenses—public works, colonial administration, development of the merchant marine, construction, and improvement of harbor and navigational facilities, transportation of civilian personnel, and the delivery of mail—of dubious military relevance. Except when we were able to obtain finance ministry reports, it is impossible to make detailed breakdowns. Even when such reports were available, it proved difficult to delineate “purely” military outlays. For example, consider the case in which the military builds a road that facilitates troops movements, but which is used primarily by civilians. A related problem concerns those instances in which the reported military budget does not reflect all of the resources devoted to that sector. This usually happens when a nation tries to hide such expenditures from scrutiny; for instance, most Western scholars and military experts agree that officially reported post-1945 Soviet-bloc totals are unrealistically low, although they disagree on the appropriate adjustments.

And that’s just the part of the “Problems and Possible Errors” section about observing the numerator in a calculation that also requires a complicated denominator. And that’s for what is—in principle, at least—one of the most observable features of a country’s civil-military relations.

Okay, now let’s assume that problem magically disappears, and COW has nearly complete and reliable data on military expenditures. Now we want to use models trained on those data to estimate coup risk for 2015. Whoops: COW only runs through 2010! The World Bank and SIPRI get closer to the current year—observations through 2013 are available now—but there are missing values for lots of countries, and that missingness is caused by other predictors of coup risk, such as national wealth, armed conflict, and political regime type. For example, WDI has no data on military expenditures for Eritrea and North Korea ever, and the series for Central African Republic is patchy throughout and ends in 2010. If I wanted to include military expenditures in my predictive models, I could use multiple imputation to deal with these gaps in the training phase, but then how would I generate current forecasts for these important cases? I could make guesses, but how accurate could those guesses be for a case like Eritrea or North Korea, and then am I adding signal or noise to the resulting forecasts?
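To illustrate the kind of missingness audit I mean: the WDI package's military-expenditure series is “MS.MIL.XPND.GD.ZS” (spending as a share of GDP), and the check might look like the sketch below, with an invented stand-in data frame in place of the live download:

```r
# In practice: wdi <- WDI::WDI(indicator = "MS.MIL.XPND.GD.ZS",
#                              start = 1990, end = 2013)
# Stand-in for that result, with gaps like the ones described above:
wdi <- data.frame(
  country = rep(c("Eritrea", "North Korea", "Norway"), each = 3),
  year    = rep(2011:2013, times = 3),
  milex   = c(NA, NA, NA, NA, NA, NA, 1.5, 1.4, 1.4)
)

# Missing observations per country; countries with no recent values
# can't be scored by any model that uses this feature
miss <- tapply(is.na(wdi$milex), wdi$country, sum)
```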

Of course, one of the luxuries of applied forecasting is that the models we use can lack important features and still “work.” I don’t need the model to be complete and its parameters to be true for the forecasts to be accurate enough to be useful. Still, I’ll admit that, as a social scientist by training, I find it frustrating to have to set aside so many intriguing ideas because we simply don’t have the data to try them.

Estimating NFL Team-Specific Home-Field Advantage

This morning, I tinkered a bit with my pro-football preseason team strength survey data from 2013 and 2014 to see what other simple things I might do to improve the accuracy of forecasts derived from future versions of them.

My first idea is to go beyond a generic estimate of home-field advantage—about 3 points, according to my and everyone else’s estimates—with team-specific versions of that quantity. The intuition is that some venues confer a bigger advantage than others. For example, I would guess that Denver enjoys a bigger home-field edge than most teams because their stadium is at high altitude. The Broncos live there, so they’re used to it, but visiting teams have to adapt, and that process supposedly takes about a day for every 1,000 feet over 3,000. Some venues are louder than others, and that noise is often dialed up when visiting teams would prefer some quiet. And so on.

To explore this idea, I’m using a simple hierarchical linear model to estimate team-specific intercepts after taking preseason estimates of relative team strength into account. The line of R code used to estimate the model requires the lme4 package and looks like this:

library(lme4)
mod1 <- lmer(score.raw ~ wiki.diff + (1 | home_team), data = results)


score.raw = home_score - visitor_score
wiki.diff = home_wiki - visitor_wiki

Those wiki vectors are the team strength scores estimated from preseason pairwise wiki surveys. The ‘results’ data frame includes scores for all regular and postseason games from those two years so far, courtesy of devstopfix’s NFL results repository on GitHub (here). Because the net game and strength scores are both ordered home to visitor, we can read those random intercepts for each home team as estimates of team-specific home advantage. There are probably other sources of team-specific bias in my data, so those estimates are going to be pretty noisy, but I think it’s a reasonable starting point.

My initial results are shown in the plot below, which I get with these two lines of code, the second of which requires the lattice package:

ha1 <- ranef(mod1, condVar=TRUE)
dotplot(ha1)  # dotplot() comes from the lattice package

Bear in mind that the generic (fixed) intercept is 2.7, so the estimated home-field advantage for each team is what’s shown in the plot plus that number. For example, these estimates imply that my Ravens enjoy a net advantage of about 3 points when they play in Baltimore, while their division-rival Bengals are closer to 6.


In light of DeflateGate, I guess I shouldn’t be surprised to see the Pats at the top of the chart, almost a whole point higher than the second-highest team. Maybe their insanely low home fumble rate has something to do with it.* I’m also heartened to see relatively high estimates for Denver, given the intuition that started this exercise, and Seattle, which I’ve heard said enjoys an unusually large home-field edge. At the same time, I honestly don’t know what to make of the exceptionally low estimates for DC and Jacksonville, who appear from these estimates to suffer a net home-field disadvantage. That strikes me as odd and undercuts my confidence in the results.

In any case, that’s how far my tinkering took me today. If I get really motivated, I might try re-estimating the model with just the 2013 data and then running the 2014 preseason survey scores through that model to generate “forecasts” that I can compare to the ones I got from the simple linear model with just the generic intercept (here). The point of the exercise was to try to get more accurate forecasts from simple models, and the only way to test that is to do it. I’m also trying to decide if I need to cross these team-specific effects with season-specific effects to try to control for differences across years in the biases in the wiki survey results when estimating these team-specific intercepts. But I’m not there yet.

* After I published this post, Michael Lopez helpfully pointed me toward a better take on the Patriots’ fumble rate (here), and Mo Patel observed that teams manage their own footballs on the road, too, so that particular tweak—if it really happened—wouldn’t have a home-field-specific effect.

The State of the Art in the Production of Political Event Data

Peter Nardulli, Scott Althaus, and Matthew Hayes have a piece forthcoming in Sociological Methodology (PDF) that describes what I now see as the cutting edge in the production of political event data: machine-human hybrid systems.

If you have ever participated in the production of political event data, you know that having people find, read, and code data from news stories and other texts takes a tremendous amount of work. Even boutique data sets on narrowly defined topics for short time periods in single cases usually require hundreds or thousands of person-hours to create, and the results still aren’t as pristine as we’d like or often believe.

Contrary to my premature congratulation on GDELT a couple of years ago, however, fully automated systems are not quite ready to take over the task, either. Once a machine-coding system has been built, the data come fast and cheap, but those data are, inevitably, still pretty noisy. (On that point, see here for some of my own experiences with GDELT and here, here, here, here, and here for other relevant discussions.)

I’m now convinced that the best current solution is one that borrows strength from both approaches—in other words, a hybrid. As Nardulli, Althaus, and Hayes argue in their forthcoming article, “Machine coding is no simple substitute for human coding.”

Until fully automated approaches can match the flexibility and contextual richness of human coding, the best option for generating near-term advances in social science research lies in hybrid systems that rely on both machines and humans for extracting information from unstructured texts.

As you might expect, Nardulli & co. have built and are operating such a system—the Social, Political, and Economic Event Database (SPEED)—to code data on a bunch of interesting things, including coups and civil unrest. Their hybrid process goes beyond supervised learning, where an algorithm gets trained on a data set carefully constructed by human coders and then put in the traces to make new data from fresh material. Instead, they adopt a “progressive supervised-learning system,” which basically means two things:

  1. They keep humans in the loop for all steps where the error rate from their machine-only process remains intolerably high, making the results as reliable as possible; and
  2. They use those humans’ coding decisions as new training sets to continually check and refine their algorithms, gradually shrinking the load borne by the humans and mitigating the substantial risk of concept drift that attaches to any attempt to automate the extraction of data from a constantly evolving news-media ecosystem.
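SPEED's code isn't public, so what follows is only a generic sketch of the loop those two points describe, in R with synthetic data: the model keeps its confident calls, uncertain cases get routed to "human coders" (here, an oracle supplying the true labels), and those decisions are folded back into the training set before refitting. The 0.25-0.75 confidence band and the batch size are arbitrary choices of mine.

```r
set.seed(42)

# Synthetic "documents": one feature and a binary label
# (e.g., reports a coup attempt or not)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(2 * x))

labeled   <- 1:50     # small human-coded seed set
unlabeled <- 51:500

for (round in 1:5) {
  # Refit the classifier on everything humans have coded so far
  fit <- glm(y[labeled] ~ x[labeled], family = binomial)
  p   <- plogis(coef(fit)[1] + coef(fit)[2] * x[unlabeled])

  # Machine keeps the confident cases; uncertain ones go to human coders
  uncertain <- unlabeled[p > 0.25 & p < 0.75]
  to_code   <- head(uncertain, 25)

  # Human decisions (true labels stand in here) become new training data
  labeled   <- c(labeled, to_code)
  unlabeled <- setdiff(unlabeled, to_code)
}
```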

I think SPEED exemplifies the state of the art in a couple of big ways. The first is the process itself. Machine-learning processes have made tremendous gains in the past several years (see here, h/t Steve Mills), but we still haven’t arrived at the point where we can write algorithms that reliably recognize and extract the information we want from the torrent of news stories coursing through the Internet. As long as that’s the case—and I expect it will be for at least another several years—we’re going to need to keep humans in the loop to get data sets we really trust and understand. (And, of course, even then the results will still suffer from biases that even a perfect coding process can’t avoid; see here for Will Moore’s thoughtful discussion of that point.)

The second way in which SPEED exemplifies the state of the art is what Nardulli, Althaus, and Hayes’ paper explicitly and implicitly tells us about the cost and data-sharing constraints that come with building and running a system of this kind on that scale. Nardulli & co. don’t report exactly how much money has been spent on SPEED so far and how much it costs to keep running it, but they do say this:

The Cline Center began assembling its news archive and developing SPEED’s workflow system in 2006, but lacked an operational cyberinfrastructure until 2009. Seven years and well over a million dollars later, the Cline Center released its first SPEED data set.

Partly because of those high costs and partly because of legal issues attached to data coded from many news stories, the data SPEED produces are not freely available to the public. The project shares some historical data sets on its web site, but the content of those sets is limited, and the near-real-time data coveted by applied researchers like me are not made public. Here’s how the authors describe their situation:

While agreements with commercial vendors and intellectual property rights prohibit the Center from distributing its news archive, efforts are being made to provide non-consumptive public access to the Center’s holdings. This access will allow researchers to evaluate the utility of the Center’s digital archive for their needs and construct a research design to realize those needs. Based on that design, researchers can utilize the Center’s various subcenters of expertise (document classification, training, coding, etc.) to implement it.

I’m not happy about those constraints, but as someone who has managed large and costly social-science research projects, I certainly understand them. I also don’t expect them to go away any time soon, for SPEED or for any similar undertaking.

So that’s the state of the art in the production of political event data: Thanks to the growth of the Internet and advances in computing hardware and software, we can now produce political event data on a scale and at a pace that would have had us drooling a decade ago, but the task still can’t be fully automated without making sacrifices in data quality that most social scientists should be uncomfortable making. The best systems we can build right now blend machine learning and automation with routine human involvement and oversight. Those systems are still expensive to build and run, and partly because of that, we should not expect their output to stream onto our virtual desktops for free, like manna raining down from digital heaven.

