If At First You Don’t Succeed

A couple of weeks ago, I blogged about a failed attempt to do some exploratory text-mining on the US National Security Strategy reports (here). That project was supposed to give me a fun way to learn the basics of text mining in R, something I’ve been eager to do of late. In writing the blog post, I had two motives: 1) to help normalize the experience of getting stuck and failing in social science and data science, and 2) to appeal for help from more experienced coders who could help get me unstuck on this particular task.

The post succeeded on both counts. I won’t pepper you with evidence on the commiseration front, but I am excited to share the results of the coding improvements. In addition to learning how to text-mine, I have also been trying to learn how to use RStudio and Shiny to build interactive apps, and this project seemed like a good one to do both. So, I’ve created an app that lets users explore this corpus in three ways:

  • Plot word counts over time to see how the use of certain terms has waxed and waned over the 28 years the reports span.
  • Generate word clouds showing the 50 most common words in each of the 16 reports.
  • Explore associations between terms by picking one and seeing which 10 others are most closely correlated with it in the entire corpus.
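For readers curious about the mechanics, here is a minimal sketch in base R of the first and third ideas in that list: relative word frequency by report, and term associations measured as correlations across reports. It is not the app's actual code. The three tiny "reports" and every object name below are made up for illustration.

```r
# Toy stand-in for the corpus: three tiny "reports" keyed by year (made-up text).
docs <- c(
  "1987" = "soviet threat soviet power nuclear deterrence",
  "2002" = "terror threat terror networks nuclear terror",
  "2010" = "climate change terror economy climate"
)

# Split each document into lowercase word tokens.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Build a document-by-term count matrix over the full vocabulary.
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(w) tabulate(match(w, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
dimnames(dtm) <- list(names(docs), vocab)

# Relative frequency: each count divided by the total words in its document.
rel_freq <- dtm / rowSums(dtm)
terror_trend <- rel_freq[, "terror"]

# Association: correlation of every term's counts with the chosen term's counts.
assoc <- cor(dtm)["terror", ]
assoc <- sort(assoc[names(assoc) != "terror"], decreasing = TRUE)
top_terms <- head(assoc, 3)
```

Plotting terror_trend against the years gives a time trend like the ones shown in this post, and top_terms lists the words whose counts move most closely with the chosen one. With only three documents the correlations are not meaningful; the point is just the shape of the calculation.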

For example, here’s a plot of change over time in the relative frequency of the term ‘terror’. Its usage spikes after 9/11 and then falls sharply when Barack Obama replaces George W. Bush as president.

[Figure: NSS terror time trend]

That pattern contrasts sharply with references to climate, which rarely gets mentioned until the Obama presidency, when its usage spikes upward. (Note, though, that the y-axis has been rescaled from the previous chart, so this large increase still has ‘climat’ only appearing about half as often as ‘terror’.)

[Figure: NSS climat time trend]

And here’s a word cloud of the 50 most common terms from the first US National Security Strategy, published in 1987. Surprise! The Soviet Union dominates the monologue.

[Figure: NSS 1987 word cloud]

When I built an initial version of the app a couple of Sundays ago, I promptly launched it on shinyapps.io to try to show it off. Unfortunately, the Shiny server only gives you 25 hours of free usage per billing cycle, and when I tweeted about the app, it got so much attention that those hours disappeared in a little over a day!

I don’t have my own server to host this thing, and I’m not sure when Shiny’s billing cycle refreshes. So, for the moment, I can’t link to a permanently working version of the app. If anyone reading this post is interested in hosting the app on a semi-permanent basis, please drop me a line at ulfelder <at> gmail. Meanwhile, R users can launch the app from their terminals with these two lines of code, assuming the ‘shiny’ package is already installed:

library(shiny)
runGitHub("national-security-strategy", "ulfelder")

You can also find all of the texts and code used in the app and some other stuff (e.g., the nss.explore.R script also implements topic modeling) in that GitHub repository, here.

A Tale of Normal Failure

When I blog about my own research, I usually describe work I’ve already completed and focus on the results. This post is about a recent effort that ended in frustration, and it focuses on the process. In writing about this aborted project, I have two hopes: 1) to reassure other researchers (and myself) that this kind of failure is normal, and 2) if I’m lucky, to get some help with this task.

This particular ball got rolling a couple of days ago when I read a blog post by Dan Drezner about one aspect of the Obama administration’s new National Security Strategy (NSS) report. A few words in the bits Dan quoted got me thinking about the worldview they represented, and how we might use natural-language processing (NLP) to study that:

At first, I was just going to drop that awkwardly numbered tweetstorm and leave it there. I had some time that afternoon, though, and I’ve been looking for opportunities to learn text mining, so I decided to see what I could do. The NSS reports only became a thing in 1987, so there are still just 16 of them, and they all try to answer the same basic questions: What threats and opportunities does the US face in the world, and what should the government do to meet them? As such, they seemed like the kind of manageable and coherent corpus that would make for a nice training exercise.

I started by checking to see if anyone had already done with earlier reports what I was hoping to do with the latest one. It turned out that someone had, and to good effect:

I promptly emailed the corresponding author to ask if they had replication materials, or even just clean versions of the texts for all previous years. I got an autoreply informing me that the author was on sabbatical and would only intermittently be reading his email. (He replied the next day to say that he would put the question to his co-authors, but that still didn’t solve my problem, and by then I’d moved on anyway.)

Without those materials, I would need to start by getting the documents in the proper format. A little Googling led me to the National Security Strategy Archive, which at the time had PDFs of all but the newest report, and that one was easy enough to find on the White House’s web site. Another search led me to a site that converts PDFs to plain text online for free. I spent the next hour or so running those reports through the converter (and playing a little Crossy Road on my phone while I waited for the jobs to finish). Once I had the reports as .txt files, I figured I could organize my work better and do other researchers a solid by putting them all in a public repository, so I set one up on GitHub (here) and cloned it to my hard drive.

At that point, I was getting excited, thinking: “Hey, this isn’t so hard after all.” In most of the work I do, getting the data is the toughest part, and I already had all the documents I wanted in the format I needed. I was just a few lines of code away from the statistics and plots that would confirm or infirm my conjectures.

From another recent collaboration, I knew that the next step would be to use some software to ingest those .txt files, scrub them a few different ways, and then generate some word counts and maybe do some topic modeling to explore changes over time in the reports’ contents. I’d heard several people say that Python is really good at these tasks, but I’m an R guy, so I followed the lead on the CRAN Task View for natural language processing and installed and loaded the ‘tm’ package for text mining.

And that’s where the wheels started to come off of my rickety little wagon. Using the package developers’ vignette and an article they published in the Journal of Statistical Software, I started tinkering with some code. After a couple of false starts, I found that I could create a corpus and run some common preprocessing tasks on it without too much trouble, but I couldn’t get the analytical functions to run on the results. Instead, I kept getting this error message:

Error: inherits(doc, "TextDocument") is not TRUE

By then it was dinner time, so I called it a day and went to listen to my sons holler at each other across the table for a while.

When I picked the task back up the next morning, I inspected a few of the scrubbed documents and saw some strange character strings, things like ir1 instead of in and garbled characters where an apostrophe should be. That got me wondering if the problem lay in the encoding of those .txt files. Unfortunately, neither the files themselves nor the site that produced them told me which encoding they used. I ran through a bunch of options, but none of them fixed the problem.

“Okay, no worries,” I thought. “I’ll use gsub() to replace those funky bits in the strings by hand.” The commands ran without a hiccup, but the text didn’t change. Stranger, when I tried to inspect documents in the R terminal, the same command wouldn’t always produce the same result. Sometimes I’d get the head, and sometimes the tail. I tried moving back a step in the process and installed a PDF converter that I could run from R, but R couldn’t find the converter, and my attempts to fix that failed.
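One common reason a gsub() call can run cleanly and still leave the text unchanged is that gsub() returns a modified copy rather than editing its input in place. I can't say that this is what went wrong in my session; the sketch below only illustrates that one possibility, using a made-up string that contains a curly apostrophe.

```r
# A made-up line of text containing a Unicode right single quotation mark.
raw_text <- "the nation\u2019s security"

# gsub() returns a new string; calling it without assignment changes nothing.
gsub("\u2019", "'", raw_text, fixed = TRUE)
still_curly <- grepl("\u2019", raw_text, fixed = TRUE)  # TRUE: raw_text is untouched

# Assign the result to keep the change. fixed = TRUE treats the pattern as
# literal text rather than as a regular expression.
cleaned <- gsub("\u2019", "'", raw_text, fixed = TRUE)
now_plain <- !grepl("\u2019", cleaned, fixed = TRUE)    # TRUE: apostrophe replaced
```

The same logic applies when the strings live inside a larger object such as a corpus: the cleaned result has to be written back to the object, or the original stays as it was.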

At this point, I was about ready to quit, and I tweeted some of that frustration. Igor Brigadir quickly replied to suggest a solution, but it involved another programming language, Python, that I don’t know:

To go that route, I would need to start learning Python. That’s probably a good idea for the long run, but it wasn’t going to happen this week. Then Ken Benoit pointed me toward a new R package he’s developing and even offered to help me:

That sounded promising, so I opened R again and followed the clear instructions on the README at Ken’s repository to install the package. Of course the installation failed, probably because I’m still using R Version 3.1.1 and the package is, I suspect, written for the latest release, 3.1.2.

And that’s where I finally quit—for now. I’d hit a wall, and all my usual strategies for working through or around it had either failed or led to solutions that would require a lot more work. If I were getting paid and on deadline, I’d keep hacking away, but this was supposed to be a “fun” project for my own edification. What seemed at first like a tidy exercise had turned into a tar baby, and I needed to move on.

This cycle of frustration –> problem-solving –> frustration might seem like a distraction from the real business of social science, but in my experience, it is the real business. Unless I’m performing a variation on a familiar task with familiar data, this is normal. It might be boring to read, but then most of the day-to-day work of social science probably is, or at least looks that way to the people who aren’t doing it and therefore can’t see how all those little steps fit into the bigger picture.

So that’s my tale of minor woe. Now, if anyone who actually knows how to do text-mining in R is inspired to help me figure out what I’m doing wrong on that National Security Strategy project, please take a look at that GitHub repo and the script posted there and let me know what you see.

Why My Coup Risk Models Don’t Include Any Measures of National Militaries

For the past several years (here, here, here, and here), I’ve used statistical models estimated from country-year data to produce assessments of coup risk in countries worldwide. I rejigger the models a bit each time, but none of the models I’ve used so far has included specific features of countries’ militaries.

That omission strikes a lot of people as a curious one. When I shared this year’s assessments with the Conflict Research Group on Facebook, one group member posted this comment:

Why do none of the covariates feature any data on militaries? Seeing as militaries are the ones who stage the coups, any sort of predictive model that doesn’t account for the militaries themselves would seem incomplete.

I agree in principle. It’s the practical side that gets in the way. I don’t include features of national militaries in the models because I don’t have reliable measures of them with the coverage I need for this task.

To train and then apply these predictive models, I need fairly complete time series for all or nearly all countries of the world that extend back to at least the 1980s and have been updated recently enough to give me a useful input for the current assessment (see here for more on why that’s true). I looked again early this month and still can’t find anything like that on even the big stuff, like military budgets, size, and force structures. There are some series on this topic in the World Bank’s World Development Indicators (WDI) data set, but those series have a lot of gaps, and the presence of those gaps is correlated with other features of the models (e.g., regime type). Ditto for SIPRI. And, of course, those aren’t even the most interesting features for coup risk, like whether or not military promotions favor certain groups over others, or if there is a capable and purportedly loyal presidential guard.

But don’t take my word for it. Here’s what the Correlates of War Project says in the documentation for Version 4.0 of its widely-used data set (PDF) about its measure of military expenditures, one of two features of national militaries it tries to cover (the other is total personnel):

It was often difficult to identify and exclude civil expenditures from reported budgets of less developed nations. For many countries, including some major powers, published military budgets are a catch-all category for a variety of developmental and administrative expenses—public works, colonial administration, development of the merchant marine, construction, and improvement of harbor and navigational facilities, transportation of civilian personnel, and the delivery of mail—of dubious military relevance. Except when we were able to obtain finance ministry reports, it is impossible to make detailed breakdowns. Even when such reports were available, it proved difficult to delineate “purely” military outlays. For example, consider the case in which the military builds a road that facilitates troops movements, but which is used primarily by civilians. A related problem concerns those instances in which the reported military budget does not reflect all of the resources devoted to that sector. This usually happens when a nation tries to hide such expenditures from scrutiny; for instance, most Western scholars and military experts agree that officially reported post-1945 Soviet-bloc totals are unrealistically low, although they disagree on the appropriate adjustments.

And that’s just the part of the “Problems and Possible Errors” section about observing the numerator in a calculation that also requires a complicated denominator. And that’s for what is—in principle, at least—one of the most observable features of a country’s civil-military relations.

Okay, now let’s assume that problem magically disappears, and COW has nearly complete and reliable data on military expenditures. Now we want to use models trained on those data to estimate coup risk for 2015. Whoops: COW only runs through 2010! The World Bank and SIPRI get closer to the current year (observations through 2013 are available now), but there are missing values for lots of countries, and that missingness is caused by other predictors of coup risk, such as national wealth, armed conflict, and political regime type. For example, WDI has no data on military expenditures for Eritrea and North Korea ever, and the series for Central African Republic is patchy throughout and ends in 2010. If I wanted to include military expenditures in my predictive models, I could use multiple imputation to deal with these gaps in the training phase, but then how would I generate current forecasts for these important cases? I could make guesses, but how accurate could those guesses be for a case like Eritrea or North Korea, and then am I adding signal or noise to the resulting forecasts?
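To see why gaps that track other predictors are more troubling than gaps that occur at random, here is a small simulation in base R. Everything in it is invented for illustration: the variable names, the coefficients, and the assumption that poorer cases are less likely to report. It is not drawn from WDI, SIPRI, or COW.

```r
set.seed(42)  # arbitrary seed so the made-up numbers are reproducible

n <- 5000
wealth <- rnorm(n)                                  # hypothetical standardized wealth
mil_spend <- 2 + 0.8 * wealth + rnorm(n, sd = 0.5)  # hypothetical spending tied to wealth

# Assumption: poorer cases are more likely to have no reported value.
p_missing <- plogis(-1 - 1.5 * wealth)
observed <- runif(n) > p_missing

mean_all <- mean(mil_spend)            # what we would see with complete data
mean_obs <- mean(mil_spend[observed])  # what we see after the gaps
share_missing <- mean(!observed)

# Missingness is itself correlated with the other predictor.
miss_wealth_cor <- cor(wealth, as.numeric(!observed))
```

In this toy setup, the observed cases overstate typical spending because the cases that drop out are systematically poorer. Any model fit only to the observed rows inherits that distortion, and the cases we most want forecasts for are the ones with nothing to plug in.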

Of course, one of the luxuries of applied forecasting is that the models we use can lack important features and still “work.” I don’t need the model to be complete and its parameters to be true for the forecasts to be accurate enough to be useful. Still, I’ll admit that, as a social scientist by training, I find it frustrating to have to set aside so many intriguing ideas because we simply don’t have the data to try them.

The State of the Art in the Production of Political Event Data

Peter Nardulli, Scott Althaus, and Matthew Hayes have a piece forthcoming in Sociological Methodology (PDF) that describes what I now see as the cutting edge in the production of political event data: machine-human hybrid systems.

If you have ever participated in the production of political event data, you know that having people find, read, and code data from news stories and other texts takes a tremendous amount of work. Even boutique data sets on narrowly defined topics for short time periods in single cases usually require hundreds or thousands of person-hours to create, and the results still aren’t as pristine as we’d like or often believe.

Contrary to my premature congratulation on GDELT a couple of years ago, however, fully automated systems are not quite ready to take over the task, either. Once a machine-coding system has been built, the data come fast and cheap, but those data are, inevitably, still pretty noisy. (On that point, see here for some of my own experiences with GDELT and here, here, here, here, and here for other relevant discussions.)

I’m now convinced that the best current solution is one that borrows strength from both approaches—in other words, a hybrid. As Nardulli, Althaus, and Hayes argue in their forthcoming article, “Machine coding is no simple substitute for human coding.”

Until fully automated approaches can match the flexibility and contextual richness of human coding, the best option for generating near-term advances in social science research lies in hybrid systems that rely on both machines and humans for extracting information from unstructured texts.

As you might expect, Nardulli & co. have built and are operating such a system, called the Social, Political, and Economic Event Database (SPEED), to code data on a bunch of interesting things, including coups and civil unrest. Their hybrid process goes beyond supervised learning, where an algorithm gets trained on a data set carefully constructed by human coders and then put in the traces to make new data from fresh material. Instead, they adopt a “progressive supervised-learning system,” which basically means two things:

  1. They keep humans in the loop for all steps where the error rate from their machine-only process remains intolerably high, making the results as reliable as possible; and
  2. They use those humans’ coding decisions as new training sets to continually check and refine their algorithms, gradually shrinking the load borne by the humans and mitigating the substantial risk of concept drift that attaches to any attempt to automate the extraction of data from a constantly evolving news-media ecosystem.
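The two steps above can be sketched as a simple routing rule. To be clear, this is my own toy illustration in base R, not SPEED's actual pipeline: the stories, labels, confidence scores, and the 0.80 threshold are all hypothetical, and the "human" decisions are just a hard-coded vector standing in for real coders.

```r
# Hypothetical batch of machine-coded stories with classifier confidence scores.
stories <- data.frame(
  id = 1:6,
  machine_label = c("coup", "protest", "none", "coup", "none", "protest"),
  confidence = c(0.97, 0.55, 0.91, 0.62, 0.88, 0.40),
  stringsAsFactors = FALSE
)

# Step 1: send anything below an assumed confidence threshold to human coders.
threshold <- 0.80
needs_human <- stories$confidence < threshold

# Stand-in for the human coders' decisions (NA where no review was needed).
human_label <- c(NA, "protest", NA, "none", NA, "riot")

# Keep the machine's label where it was trusted, the human's label elsewhere.
stories$final_label <- ifelse(needs_human, human_label, stories$machine_label)

# Step 2: the human-reviewed rows become new training data for the next round.
new_training <- stories[needs_human, c("id", "final_label")]
n_corrected <- sum(stories$final_label != stories$machine_label)
```

As the algorithm is retrained on rows like those in new_training, fewer stories should fall below the threshold, which is the sense in which the human workload shrinks over time.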

I think SPEED exemplifies the state of the art in a couple of big ways. The first is the process itself. Machine-learning processes have made tremendous gains in the past several years (see here, h/t Steve Mills), but we still haven’t arrived at the point where we can write algorithms that reliably recognize and extract the information we want from the torrent of news stories coursing through the Internet. As long as that’s the case—and I expect it will be for at least another several years—we’re going to need to keep humans in the loop to get data sets we really trust and understand. (And, of course, even then the results will still suffer from biases that even a perfect coding process can’t avoid; see here for Will Moore’s thoughtful discussion of that point.)

The second way in which SPEED exemplifies the state of the art is what Nardulli, Althaus, and Hayes’ paper explicitly and implicitly tells us about the cost and data-sharing constraints that come with building and running a system of this kind on that scale. Nardulli & co. don’t report exactly how much money has been spent on SPEED so far and how much it costs to keep running it, but they do say this:

The Cline Center began assembling its news archive and developing SPEED’s workflow system in 2006, but lacked an operational cyberinfrastructure until 2009. Seven years and well over a million dollars later, the Cline Center released its first SPEED data set.

Partly because of those high costs and partly because of legal issues attached to data coded from many news stories, the data SPEED produces are not freely available to the public. The project shares some historical data sets on its web site, but the content of those sets is limited, and the near-real-time data coveted by applied researchers like me are not made public. Here’s how the authors describe their situation:

While agreements with commercial vendors and intellectual property rights prohibit the Center from distributing its news archive, efforts are being made to provide non-consumptive public access to the Center’s holdings. This access will allow researchers to evaluate the utility of the Center’s digital archive for their needs and construct a research design to realize those needs. Based on that design, researchers can utilize the Center’s various subcenters of expertise (document classification, training, coding, etc.) to implement it.

I’m not happy about those constraints, but as someone who has managed large and costly social-science research projects, I certainly understand them. I also don’t expect them to go away any time soon, for SPEED or for any similar undertaking.

So that’s the state of the art in the production of political event data: Thanks to the growth of the Internet and advances in computing hardware and software, we can now produce political event data on a scale and at a pace that would have had us drooling a decade ago, but the task still can’t be fully automated without making sacrifices in data quality that most social scientists should be uncomfortable making. The best systems we can build right now blend machine learning and automation with routine human involvement and oversight. Those systems are still expensive to build and run, and partly because of that, we should not expect their output to stream onto our virtual desktops for free, like manna raining down from digital heaven.

A Few Rules of Thumb for Data Munging in Political Science

1. However hard you think it will be to assemble a data set for a particular analysis, it will be exponentially harder, with the size of the exponent determined by the scope and scale of the required data.

  • Corollary: If the data you need would cover the world (or just poor countries), they probably don’t exist.
  • Corollary: If the data you need would extend very far back in time, they probably don’t exist.
  • Corollary: If the data you need are politically sensitive, they probably don’t exist. If they do exist, you probably can’t get them. If you can get them, you probably shouldn’t trust them.

2. However reliable you think your data are, they probably aren’t.

  • Corollary: A couple of digits after the decimal point is plenty. With data this noisy, what do those thousandths really mean, anyway?

3. Just because a data transformation works doesn’t mean it’s doing what you meant it to do.

4. The only really reliable way to make sure that your analysis is replicable is to have someone previously unfamiliar with the work try to replicate it. Unfortunately, a person’s incentive to replicate someone else’s work is inversely correlated with his or her level of prior involvement in the project. Ergo, this will rarely happen until after you have posted your results.

5. If your replication materials will include random parts (e.g., sampling) and you’re using R, don’t forget to set the seed for random number generation at the start. (Alas, I am living this mistake today.)
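Two of these rules are easy to show in a few lines of R. The sketch below uses made-up values and an arbitrary seed. The first part illustrates rule 3 with a classic case: converting a factor straight to numeric runs without complaint but returns the internal level codes. The second part illustrates rule 5.

```r
# Rule 3: a transformation that "works" but does not do what you meant.
f <- factor(c("10", "20", "5"))
codes <- as.numeric(f)                  # internal level codes, not the values
values <- as.numeric(as.character(f))   # the values you actually wanted

# Rule 5: set the seed before any random step so others can reproduce it.
set.seed(20150220)  # arbitrary seed chosen for this example
draw1 <- sample(1:100, 5)

set.seed(20150220)  # same seed, same draw
draw2 <- sample(1:100, 5)

same_draw <- identical(draw1, draw2)
```

Here codes comes back as 1, 2, 3 while values comes back as 10, 20, 5, and same_draw is TRUE only because the seed was reset before the second draw.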

Please use the Comments to suggest additions, corrections, or modifications.

The Ghosts of Wu Chunming’s Past, Present, and Future

On a blogged recommendation from Chris Blattman, I’m now reading Factory Girls. Written by Leslie T. Chang and published in 2008, it’s a non-fiction book about the young migrant women whose labor has stoked the furnaces of China’s economic growth over the past 30 years.

One of the book’s implicit “findings” is that this migration, and the larger socioeconomic transformation of which it is a part, is a difficult but ultimately rewarding process for many. Chang writes (p. 13, emphasis in the original):

Migration is emptying villages of young people. Across the Chinese countryside, those plowing and harvesting in the fields are elderly men and women, charged with running the farm and caring for the younger children who are still in school. Money sent home by migrants is already the biggest source of wealth accumulation in rural China. Yet earning money isn’t the only reason people migrate. In surveys, migrants rank ‘seeing the world,’ ‘developing myself,’ and ‘learning new skills’ as important as increasing their incomes. In many cases, it is not crippling poverty that drives migrants out from home, but idleness. Plots of land are small and easily farmed by parents; nearby towns offer few job opportunities. There was nothing to do at home, so I went out.

That idea fits my priors, and I think there is plenty of system-level evidence to support it. Economic development carries many individual and collective costs, but the available alternatives are generally worse.

Still, as I read, I can’t help but wonder how much the impressions I take away from the book are shaped by selection bias. Like most non-fiction books written for a wide audience, Factory Girls blends reporting on specific cases (here, the experiences of certain women who have made the jump from small towns to big cities in search of paid work) with macro-level data on the systemic trends in which those cases are situated. The cases are carefully and artfully reported, and it’s clear that Chang worked on and cared deeply about this project for many years.

No matter how hard the author tried, though, there’s a hitch in her research design that’s virtually impossible to overcome. Chang can only tell the stories of migrants who shared their stories with her, and these sources are not a random sample of all migrants. Even worse for attempts to generalize from those sources, there may be a correlation between the ability and desire to tell your story to a foreign reporter and the traits that make some migrants more successful than others. We don’t hear from young women who are too ashamed or humble or uninterested to tell their stories to a stranger who wants to share them with the world. We certainly can’t hear from women who have died or been successfully hidden from the reporter’s view for one reason or another. If the few sources who open up to Chang aren’t representative of the pool of young women whose lives she aims to portray, then their stories won’t be, either.

An anecdote from Wu Chunming, one of the two young women on whom the book focuses, stuck in my mind as a metaphor for the selection process that might skew our view of the process Chang means to describe. On pp. 46-47, Chang writes:

Guangdong in 1993 was even more chaotic than it is today. Migrants from the countryside flooded the streets looking for work, sleeping in bus stations and under bridges. The only way to find a job was to knock on factory doors, and Chunming and her friends were turned away from many doors before they were hired at the Guotong toy factory. Ordinary workers there made one hundred yuan a month, or about twelve dollars; to stave off hunger, they bought giant bags of instant noodles and added salt and boiling water. ‘We thought if we ever made two hundred yuan a month,’ Chunming said later, ‘we would be perfectly happy.’

After four months, Chunming jumped to another factory, but left soon after a fellow worker said her cousin knew of better jobs in Shenzhen. Chunming and a few friends traveled there, spent the night under a highway overpass, and met the girl’s cousin the next morning. He brought them to a hair salon and took them upstairs, where a heavily made-up young woman sat on a massage bed waiting for customers. Chunming was terrified at the sight. ‘I was raised very traditionally,’ she said. ‘I thought everyone in that place was bad and wanted me to be a prostitute. I thought that once I went in there, I would turn bad too.’

The girls were told that they should stay and take showers in a communal stall, but Chunming refused. She walked back down the stairs, looked out the front door, and ran, abandoning her friends and the suitcase that contained her money, a government-issued identity card, and a photograph of her mother…

‘Did you ever find out what happened to the friends you left behind in the hair salon?’ I asked.

‘No,’ she said. ‘I don’t know if it was a truly bad place or just a place where you could work as a massage girl if you wanted. But it was frightening that they would not let us leave.’

In that example, we hear Wu’s side of this story and the success that followed. What we don’t hear are the stories of the other young women who didn’t run away that day. Maybe the courage or just impulsiveness Chunming showed in that moment is something that helped her become more successful afterwards, and that also made her more likely to encounter and open up to a reporter.

Chang implicitly flags this issue for us at the end of that excerpt, and she explicitly addresses it in a “conversation” with the author that follows the text in my paperback edition. Still, Chang can’t tell us the versions of the story that she doesn’t hear. In social-scientific jargon, those other young women left behind at the hair salon are the unobserved counterfactuals to the optimistic narrative we get from Chunming. A more literary soul might describe those other girls as the ghosts of Wu Chunming’s past, present, and future. Unlike Dickens’ phantoms, though, these other lives actually happened, and yet we still can’t see them.

In a recent blog post, sociologist Zeynep Tufekci wrote about the relationship between a project’s research design and the inferences we can draw from it:

Research methods, a topic that is seemingly so dry, are the heart and soul of knowledge. Most data supports more than one theory. This does NOT mean all data supports all theories: rather, multiple explanations can fit one set of findings. Choosing the right underlying theory, an iterative process that always builds upon itself, requires thinking hard on how data selection impacts findings, and how presentation of findings lends itself to multiple theories, and how theories fit with existing worldviews, and how better research design can help us distinguish between competing explanation.

A good research project consciously grapples with these.

Like the video Tufekci critiques in her essay, Chang’s book is a research project. Factory Girls is a terrific piece of work and writing, but those of us who read it with an eye toward understanding the wider processes its stories are meant to represent should do so with caution, especially if it confirms our prior beliefs. I hope that economic development is mostly improving the lives of young women and men in China, and there is ample macro-level evidence that it is. The stories Chang relates seem to confirm that view, but a little thinking about selection effects suggests that we should expect them to do that. To really test those beliefs, we would need to trace the life courses of a wider sample of young women. As often happens in social science, though, the cases most important to testing our mental models are also the hardest to see.

Why political scientists should predict

Last week, Hans Noel wrote a post for Mischiefs of Faction provocatively titled “Stop trying to predict the future.” I say provocatively because, if I read the post correctly, Noel’s argument deliberately refutes his own headline. Noel wasn’t making a case against forecasting. Rather, he was arguing in favor of forecasting, as long as it’s done in service of social-scientific objectives.

If that’s right, then I largely agree with Noel’s argument and would restate it as follows. Political scientists shouldn’t get sucked into bickering with their colleagues over small differences in forecast accuracy around single events, because those differences will rarely contain enough information for us to learn much from them. Instead, we should take prediction seriously as a means of testing competing theories by doing two things.

First, we should build forecasting models that clearly represent contrasting sets of beliefs about the causes and precursors of the things we’re trying to predict. In Noel’s example, U.S. election forecasts are only scientifically interesting in so far as they come from models that instantiate different beliefs about why Americans vote like they do. If, for example, a model that incorporates information about trends in unemployment consistently produces more accurate forecasts than a very similar model that doesn’t, then we can strengthen our confidence that trends in unemployment shape voter behavior. If all the predictive models use only the same inputs—polls, for example—we don’t leave ourselves much room to learn about theories from them.

In my work for the Early Warning Project, I have tried to follow this principle by organizing our multi-model ensemble around a pair of models that represent overlapping but distinct ideas about the origins of state-led mass killing. One model focuses on the characteristics of the political regimes that might perpetrate this kind of violence, while another focuses on the circumstances in which those regimes might find themselves. These models embody competing claims about why states kill, so a comparison of their predictive accuracy will give us a chance to learn something about the relative explanatory power of those competing claims. Most of the current work on forecasting U.S. elections follows this principle too, by the way, even if that’s not what gets emphasized in media coverage of their work.

Second, we should only really compare the predictive power of those models across multiple events or a longer time span, where we can be more confident that observed differences in accuracy are meaningful. This is basic statistics. The smaller the sample, the less confident we can be that it is representative of the underlying distribution(s) from which it was drawn. If we declare victory or failure in response to just one or a few bits of feedback, we risk “correcting” for an unlikely draw that dimly reflects the processes that really interest us. Instead, we should let the models run for a while before chucking or tweaking them, or at least leave the initial version running while trying out alternatives.
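The small-sample point can be made concrete with a quick simulation. The Python sketch below is purely illustrative: the 30-percent event rate, the two constant forecasts, and the function names are assumptions made up for this example, not anything drawn from real forecasting models. It counts how often a deliberately worse forecast earns a better Brier score than a correct one, first over a handful of events and then over several hundred.

```python
import random


def brier(forecast, outcomes):
    """Mean squared error of a constant probability forecast against 0/1 outcomes."""
    return sum((forecast - y) ** 2 for y in outcomes) / len(outcomes)


def share_of_upsets(n_events, n_trials=1000, true_rate=0.3, seed=42):
    """Share of trials in which a wrong forecast (0.5) beats the right one (true_rate).

    All settings here are hypothetical; they only illustrate sampling noise.
    """
    rng = random.Random(seed)
    upsets = 0
    for _ in range(n_trials):
        # Draw one sample of events from the assumed underlying process.
        outcomes = [1 if rng.random() < true_rate else 0 for _ in range(n_events)]
        if brier(0.5, outcomes) < brier(true_rate, outcomes):
            upsets += 1
    return upsets / n_trials


upsets_small = share_of_upsets(n_events=6)
upsets_large = share_of_upsets(n_events=600)
print("worse model wins with 6 events:  ", upsets_small)
print("worse model wins with 600 events:", upsets_large)
```

With only six events, the worse forecast should come out ahead in a sizable minority of trials; over 600 events, that should almost never happen.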

Admittedly, this can be hard to do in practice, especially when the events of interest are rare. All of the applied forecasters I know—myself included—are tinkerers by nature, so it’s difficult for us to find the patience that second step requires. With U.S. elections, forecasters also know that they only get one shot every two or four years, and that most people won’t hear anything about their work beyond a topline summary that reads like a racing form from the horse track. If you’re at all competitive—and anyone doing this work probably is—it’s hard not to respond to that incentive. With the Early Warning Project, I worry about having a salient “miss” early in the system’s lifespan that encourages doubters to dismiss the work before we’ve really had a chance to assess its reliability and value. We can be patient, but if our intended audiences aren’t too, then the system could fail to get the traction it deserves.

Difficult doesn’t mean impossible, however, and I’m optimistic that political scientists will increasingly use forecasting in service of their search for more useful and more powerful theories. Journal articles that take this idea seriously are still rare birds, especially on things other than U.S. elections, but you occasionally spot them (Exhibit A and B). As Drew Linzer tweeted in response to Noel’s post, “Arguing over [predictive] models is arguing over assumptions, which is arguing over theories. This is exactly what [political science] should be doing.”

Machine learning our way to better early warning on mass atrocities

For the past couple of years, I’ve been helping build a system that uses statistics and expert crowds to assess and track risks of mass atrocities around the world. Recently dubbed the Early Warning Project (EWP), this effort already has a blog up and running (here), and the EWP should finally be able to launch a more extensive public website within the next several weeks.

One of the first things I did for the project, back in 2012, was to develop a set of statistical models that assess risks of onsets of state-led mass killing in countries worldwide, the type of mass atrocities for which we have the most theory and data. Consistent with the idea that the EWP will strive to keep improving on what it does as new data, methods, and ideas become available, that piece of the system has continued to evolve over the ensuing couple of years.

You can find the first two versions of that statistical tool here and here. The latest iteration—recently codified in new-and-improved replication materials—has performed pretty well, correctly identifying the few countries that have seen onsets of state-led mass killing in the past couple of years as relatively high-risk cases before those onsets occurred. It’s not nearly as precise as we’d like—I usually apply the phrase “relatively high-risk” to the Top 30, and we’ll only see one or two events in most years—but that level of imprecision is par for the course when forecasting rare and complex political crises like these.

Of course, a solid performance so far doesn’t mean that we can’t or shouldn’t try to do even better. Last week, I finally got around to applying a couple of widely used machine learning techniques to our data to see how those techniques might perform relative to the set of models we’re using now. Our statistical risk assessments come not from a single model but from a small collection of them—a “multi-model ensemble” in applied forecasting jargon—because these collections of models usually produce more accurate forecasts than any single one can. Our current ensemble mixes two logistic regression models, each representing a different line of thinking about the origins of mass killing, with one machine-learning algorithm—Random Forests—that gets applied to all of the variables used by those theory-specific models. In cross-validation, the Random Forests forecasts handily beat the two logistic regression models, but, as is often the case, the average of the forecasts from all three does even better.

Inspired by the success of Random Forests in our current risk assessments and by the power of machine learning in another project on which I’m working, I decided last week to apply two more machine learning methods to this task: support vector machines (SVM) and the k-nearest neighbors (KNN) algorithm. I won’t explain the two techniques in any detail here; you can find good explanations elsewhere on the internet (see here and here, for example), and, frankly, I don’t understand the methods deeply enough to explain them any better.

What I will happily report is that one of the two techniques, SVM, appears to perform our forecasting task about as well as Random Forests. In five-fold cross-validation, SVM and Random Forests both produced areas under the ROC curve (a.k.a. AUC scores) in the mid-0.80s. AUC scores can in principle range from 0 to 1, with 0.5 being what we would expect from random guessing, and a score in the mid-0.80s is pretty good for out-of-sample accuracy on this kind of forecasting problem. What’s more, when I averaged the estimates for each case from SVM and Random Forests, I got AUC scores in the mid- to upper 0.80s. That’s several points better than our current ensemble, which combines Random Forests with those logistic regression models.

By contrast, KNN did quite poorly, hovering close to the 0.5 mark that we would get with randomly generated probabilities. Still, success in one of the two experiments is pretty exciting. We don’t have a lot of forecasts to combine right now, so adding even a single high-quality model to the mix could produce real gains.
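For readers who want to see the mechanics of an AUC score and of averaging forecasts, here is a minimal Python sketch. The work described in this post was done in R, so this is not the code behind the results above; the function names and the six-case toy data set are invented for illustration. The AUC is computed as the share of positive/negative pairs in which the positive case gets the higher score, and the "ensemble" is just the unweighted mean of two models' predicted probabilities.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case gets a higher
    score than a randomly chosen negative case (ties count as one half).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_forecasts(*forecast_lists):
    """Unweighted ensemble: mean of the models' predicted probabilities per case."""
    return [sum(case) / len(case) for case in zip(*forecast_lists)]


# Made-up toy data: 1 = onset occurred, 0 = no onset.
labels = [1, 0, 0, 1, 0, 0]
model_a = [0.9, 0.1, 0.2, 0.3, 0.6, 0.1]  # hypothetical forecasts from one model
model_b = [0.4, 0.2, 0.5, 0.8, 0.1, 0.3]  # hypothetical forecasts from another
ensemble = average_forecasts(model_a, model_b)

print(auc(model_a, labels), auc(model_b, labels), auc(ensemble, labels))
```

In this contrived example each model makes a different mistake, so the average ranks the cases better than either model does alone. That is the intuition behind ensembles, not a guarantee that averaging always helps.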

Mind you, this wasn’t a push-button operation. For one thing, I had to rework my code to handle missing data in a different way—not because SVM handles missing data differently from Random Forests, but because the functions I was using to implement the techniques do. (N.B. All of this work was done in R. I used ‘ksvm’ from the kernlab package for SVM and ‘knn3’ from the caret package for KNN.) I also got poor results from SVM in my initial implementation, which used the default settings for all of the relevant parameters. It took some iterating to discover that the Laplacian kernel significantly improved the algorithm’s performance, and that tinkering with the other flexible parameters (sigma and C for the Laplacian kernel in ksvm) had no effect or made things worse.

I also suspect that the performance of KNN would improve with more effort. To keep the comparison simple, I gave all three algorithms the same set of features and observations. As it happens, though, Random Forests and SVMs are less prone to over-fitting than KNN, which has a harder time separating the signal from the noise when irrelevant features are included. The feature set I chose probably includes some things that don’t add any predictive power, and their inclusion may be obscuring the patterns that do lie in those data. In the next go-round, I would start the KNN algorithm with the small set of features in whose predictive power I’m most confident, see if that works better, and try expanding from there. I would also experiment with different values of k, which I locked in at 5 for this exercise.
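To illustrate why irrelevant features can hurt KNN in particular, here is a small Python sketch of a k-nearest-neighbors probability estimate with k fixed at 5. It is a toy written for this explanation, not the knn3 implementation from caret, and every number in it is made up. The first feature cleanly separates the two classes; the second is pure noise on a larger scale.

```python
import math


def knn_probability(train_points, train_labels, query, k=5):
    """Share of the k nearest training cases (Euclidean distance) labeled 1."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest = by_distance[:k]
    return sum(label for _, label in nearest) / k


# Hypothetical data: one informative feature and one irrelevant, noisy one.
informative = [0.0, 0.1, 0.2, 0.3, 0.4, 1.0, 1.1, 1.2, 1.3, 1.4]
noise = [5.0, -3.0, 0.2, 4.0, -0.1, -4.0, 0.1, 6.0, -5.0, 3.0]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

one_feature = [(x,) for x in informative]
two_features = list(zip(informative, noise))

p_clean = knn_probability(one_feature, labels, (1.2,))
p_noisy = knn_probability(two_features, labels, (1.2, 0.0))
print(p_clean, p_noisy)
```

With only the informative feature, all five neighbors of the query point belong to class 1. Once the noisy feature is included, distance is dominated by noise and the neighborhood fills up with cases from the other class.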

It’s tempting to spin the story of this exercise as a human vs. machine parable in which newfangled software and Big Data outdo models hand-crafted by scholars wedded to overly simple stories about the origins of mass atrocities. It’s tempting, but it would also be wrong on a couple of crucial points.

First, this is still small data. Machine learning refers to a class of analytic methods, not the amount of data involved. Here, I am working with the same country-year data set covering the world from the 1940s to the present that I have used in previous iterations of this exercise. This data set contains fewer than 10,000 observations on scores of variables and takes up about as much space on my hard drive as a Beethoven symphony. In the future, I’d like to experiment with newer and larger data sets at different levels of aggregation, but that’s not what I’m doing now, mostly because those newer and larger data sets still don’t cover enough time and space to be useful in the analysis of such rare events.

Second and more important, theory still pervades this process. Scholars’ beliefs about what causes and presages mass killing have guided my decisions about what variables to include in this analysis and, in many cases, shaped how those variables were originally measured and whether data on them exist at all. Those data-generating and variable-selection processes, and all of the expertise they encapsulate, are essential to these models’ forecasting power. In principle, machine learning could be applied to a much wider set of features, and perhaps we’ll try that some time, too. With events as rare as onsets of state-led mass killing, however, I would not have much confidence that results from a theoretically agnostic search would add real forecasting power and not just result in over-fitting.

In any case, based on these results, I will probably incorporate SVM into the next iteration of the Early Warning Project’s statistical risk assessments. Those are due out early in the spring of 2015, when all of the requisite inputs will have been updated (we hope). I think we’ll also need to think carefully about whether or not to keep those logistic regression models in the mix, and what else we might borrow from the world of machine learning. In the meantime, I’ve enjoyed getting to try out some new techniques on data I know well, where it’s a lot easier to tell if things are going wonky, and it’s encouraging to see that we can continue to get better at this hard task if we keep trying.

Deriving a Fuzzy-Set Measure of Democracy from Several Dichotomous Data Sets

In a recent post, I described an ongoing project in which Shahryar Minhas, Mike Ward, and I are using text mining and machine learning to produce fuzzy-set measures of various political regime types for all countries of the world. As part of the NSF-funded MADCOW project,* our ultimate goal is to devise a process that routinely updates those data in near-real time at low cost. We’re not there yet, but our preliminary results are promising, and we plan to keep tinkering.

One of the crucial choices we had to make in our initial analysis was how to measure each regime type for the machine-learning phase of the process. This choice is important because our models are only going to be as good as the data from which they’re derived. If the targets in that machine-learning process don’t reliably represent the concepts we have in mind, then the resulting models will be looking for the wrong things.

For our first cut, we decided to use dichotomous measures of several regime types, and to base those dichotomous measures on stringent criteria. So, for example, we identified as democracies only those cases with a score of 10, the maximum, on Polity’s scalar measure of democracy. For military rule, we only coded as 1 those cases where two major data sets agreed that a regime was authoritarian and only military-led, with no hybrids or modifiers. Even though the targets of our machine-learning process were crisply bivalent, we could get fuzzy-set measures from our classifiers by looking at the probabilities of class membership they produce.

In future iterations, though, I’m hoping we’ll get a chance to experiment with targets that are themselves fuzzy or that just take advantage of a larger information set. Bayesian measurement error models offer a great way to generate those targets.

Imagine that you have a set of cases that may or may not belong in some category of interest—say, democracy. Now imagine that you’ve got a set of experts who vote yes (1) or no (0) on the status of each of those cases and don’t always agree. We can get a simple estimate of the probability that a given case is a democracy by averaging the experts’ votes, and that’s not necessarily a bad idea. If, however, we suspect that some experts are more error prone than others, and that the nature of those errors follows certain patterns, then we can do better with a model that gleans those patterns from the data and adjusts the averaging accordingly. That’s exactly what a Bayesian measurement error model does. Instead of an unweighted average of the experts’ votes, we get an inverse-error-rate-weighted average, which should be more reliable than the unweighted version if the assumption about predictable patterns in those errors is largely correct.
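The difference between a simple average of votes and an error-aware combination can be sketched in a few lines of Python. This is a deliberate simplification and not the model described in this post: here each expert's error rates are assumed to be known in advance, whereas a Bayesian measurement error model estimates them from the data. The rates, the votes, and the function names are all hypothetical.

```python
def unweighted_share(votes):
    """Simple average of yes (1) / no (0) votes."""
    return sum(votes) / len(votes)


def posterior_democracy(votes, hit_rates, false_alarm_rates, prior=0.5):
    """Probability that a case is a democracy given the votes, by Bayes' rule.

    hit_rates[i]         = assumed P(expert i votes 1 | case is a democracy)
    false_alarm_rates[i] = assumed P(expert i votes 1 | case is not a democracy)
    Votes are treated as independent given the true status (an assumption).
    """
    like_yes = prior
    like_no = 1.0 - prior
    for vote, hit, false_alarm in zip(votes, hit_rates, false_alarm_rates):
        like_yes *= hit if vote == 1 else (1.0 - hit)
        like_no *= false_alarm if vote == 1 else (1.0 - false_alarm)
    return like_yes / (like_yes + like_no)


# Hypothetical case: the three more reliable experts say yes, the two noisier ones say no.
votes = [1, 1, 0, 1, 0]
hit_rates = [0.95, 0.95, 0.70, 0.90, 0.70]
false_alarm_rates = [0.05, 0.05, 0.30, 0.10, 0.30]

print(unweighted_share(votes))
print(posterior_democracy(votes, hit_rates, false_alarm_rates))
```

The unweighted average treats a 3-to-2 split as a near toss-up, while the error-aware version leans heavily toward the experts assumed to be more reliable.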

I’m not trained in Bayesian data analysis and don’t know my way around the software used to estimate these models, so I sought and received generous help on this task from Sean J. Taylor. I compiled yes/no measures of democracy from five country-year data sets that ostensibly use similar definitions and coding criteria:

  • Cheibub, Gandhi, and Vreeland’s Democracy and Dictatorship (DD) data set, 1946–2008 (here);
  • Boix, Miller, and Rosato’s dichotomous coding of democracy, 1800–2007 (here);
  • A binary indicator of democracy derived from Polity IV using the Political Instability Task Force’s coding rules, 1800–2013;
  • The lists of electoral democracies in Freedom House’s annual Freedom in the World reports, 1989–2013; and
  • My own Democracy/Autocracy data set, 1955–2010 (here).

Sean took those five columns of zeroes and ones and used them to estimate a model with no prior assumptions about the five sources’ relative reliability. James Melton, Stephen Meserve, and Daniel Pemstein use the same technique to produce the terrific Unified Democracy Scores. What we’re doing is a little different, though. Where their approach treats democracy as a scalar concept and estimates a composite index from several measures, we’re accepting the binary conceptualization underlying our five sources and estimating the probability that a country qualifies as a democracy. In fuzzy-set terms, this probability represents a case’s degree of membership in the democracy set, not how democratic it is.

The distinction between a country’s degree of membership in that set and its degree of democracy is subtle but potentially meaningful, and the former will sometimes be a better fit for an analytic task than the latter. For example, if you’re looking to distinguish categorically between democracies and autocracies in order to estimate the difference in some other quantity across the two sets, it makes more sense to base that split on a probabilistic measure of set membership than an arbitrarily chosen cut point on a scalar measure of democracy-ness. You would still need to choose a threshold, but “greater than 0.5” has a natural interpretation (“probably a democracy”) that suits the task in a way that an arbitrary cut point on an index doesn’t. And, of course, you could still perform a sensitivity analysis by moving the cut point around and seeing how much that choice affects your results.
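A sensitivity analysis of that kind could look something like the following Python sketch. The membership probabilities and outcome values are invented for illustration, and the function name is hypothetical; the point is only to show the split at 0.5 and what happens when the cut point moves.

```python
def difference_in_means(membership, outcome, cut=0.5):
    """Mean outcome among cases above the cut minus mean among the rest."""
    inside = [y for p, y in zip(membership, outcome) if p > cut]
    outside = [y for p, y in zip(membership, outcome) if p <= cut]
    if not inside or not outside:
        raise ValueError("cut point leaves one group empty")
    return sum(inside) / len(inside) - sum(outside) / len(outside)


# Made-up data: estimated probability of democracy and some other quantity of interest.
membership = [0.05, 0.10, 0.35, 0.45, 0.55, 0.70, 0.90, 0.98]
outcome = [2.0, 3.0, 4.0, 6.0, 5.0, 7.0, 8.0, 9.0]

for cut in (0.3, 0.5, 0.7):
    print(cut, difference_in_means(membership, outcome, cut))
```

If the estimated difference swings widely as the cut point moves, the result depends on how the hazy cases are classified, which is worth knowing before drawing conclusions.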

So that’s the theory, anyway. What about the implementation?

I’m excited to report that the estimates from our initial measurement model of democracy look great to me. As someone who has spent a lot of hours wringing my hands over the need to make binary calls on many ambiguous regimes (Russia in the late 1990s? Venezuela under Hugo Chavez? Bangladesh between coups?), I think these estimates are accurately distinguishing the hazy cases from the rest and even doing a good job estimating the extent of that uncertainty.

As a first check, let’s take a look at the distribution of the estimated probabilities. The histogram below shows the estimates for the period 1989–2007, the only years for which we have inputs from all five of the source data sets. Voilà, the distribution has the expected shape. Most countries most of the time are readily identified as democracies or non-democracies, but the membership status of a sizable subset of country-years is more uncertain.

Estimated Probabilities of Democracy for All Countries Worldwide, 1989-2007


Of course, we can and should also look at the estimates for specific cases. I know a little more about countries that emerged from the collapse of the Soviet Union than I do about the rest of the world, so I like to start there when eyeballing regime data. The chart below compares scores for several of those countries that have exhibited more variation over the past 20+ years. Most of the rest of the post-Soviet states are slammed up against 1 (Estonia, Latvia, and Lithuania) or 0 (e.g., Uzbekistan, Turkmenistan, Tajikistan), so I left them off the chart. I also limited the range of years to the ones for which data are available from all five sources. By drawing strength from other years and countries, the model can produce estimates for cases with fewer or even no inputs. Still, the estimates will be less reliable for those cases, so I thought I would focus for now on the estimates based on a common set of “votes.”

Estimated Probability of Democracy for Selected Soviet Successor States, 1991-2007


Those estimates look about right to me. For example, Georgia’s status is ambiguous and trending less likely until the Rose Revolution of 2003, after which point it’s probably but not certainly a democracy, and the trend bends down again soon thereafter. Meanwhile, Russia is fairly confidently identified as a democracy after the constitutional crisis of 1993, but its status becomes uncertain around the passage of power from Yeltsin to Putin and then solidifies as most likely authoritarian by the mid-2000s. Finally, Armenia was one of the cases I found most difficult to code when building the Democracy/Autocracy data set for the Political Instability Task Force, so I’m gratified to see its probability of democracy oscillating around 0.5 throughout.

One nice feature of a Bayesian measurement error model is that, in addition to estimating the scores, we can also estimate confidence intervals to help quantify our uncertainty about those scores. The plot below shows Armenia’s trend line with the upper and lower bounds of a 90-percent confidence interval. Here, it’s even easier to see just how unclear this country’s democracy status has been since it regained independence. From 1991 until at least 2007, its 90-percent confidence interval straddled the toss-up line. How’s that for uncertain?

Armenia's Estimated Probability of Democracy with 90% Confidence Interval
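For readers unfamiliar with where such intervals come from, here is a rough Python sketch. It assumes the model is estimated by sampling, so that each country-year has a set of posterior draws of its probability of democracy; the post does not say how the intervals shown here were actually computed. In Bayesian work these are often called credible intervals. The draws and the function names below are made up.

```python
def interval_from_draws(draws, level=0.90):
    """Crude order-statistic interval covering the central `level` share of draws."""
    ordered = sorted(draws)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lower_index = max(0, min(n - 1, round(tail * n)))
    upper_index = max(0, min(n - 1, round((1.0 - tail) * n)))
    return ordered[lower_index], ordered[upper_index]


def straddles(interval, line=0.5):
    """True if the interval contains the toss-up line."""
    lower, upper = interval
    return lower < line < upper


# Hypothetical posterior draws for one country-year, evenly spread from 0.30 to about 0.70.
draws = [0.30 + 0.004 * i for i in range(100)]
interval = interval_from_draws(draws)
print(interval, straddles(interval))
```

An interval that contains 0.5 means the data cannot confidently place the case on either side of the democracy line.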


Sean and I are still talking about ways to tweak this process, but I think the data it’s producing are already useful and interesting. I’m considering using these estimates in a predictive model of coup attempts and seeing if and how the results differ from ones based on the Polity index and the Unified Democracy Scores. Meanwhile, the rest of the MADCOW crew and I are now talking about applying the same process to dichotomous indicators of military rule, one-party rule, personal rule, and monarchy and then experimenting with machine-learning processes that use the results as their targets. There are lots of moving parts in our regime data-making process, and this one isn’t necessarily the highest priority, but it would be great to get to follow this path and see where it leads.

* NSF Award 1259190, Collaborative Research: Automated Real-time Production of Political Indicators

Mining Texts to Generate Fuzzy Measures of Political Regime Type at Low Cost

Political scientists use the term “regime type” to refer to the formal and informal structure of a country’s government. Of course, “government” entails a lot of things, so discussions of regime type focus more specifically on how rulers are selected and how their authority is organized and exercised. The chief distinction in contemporary work on regime type is between democracies and non-democracies, but there’s some really good work on variations of non-democracy as well (see here and here, for example).

Unfortunately, measuring regime type is hard, and conventional measures of regime type suffer from one or both of two crucial drawbacks.

First, many of the data sets we have now represent regime types or their components with bivalent categorical measures that sweep meaningful uncertainty under the rug. Specific countries at specific times are identified as fitting into one and only one category, even when researchers knowledgeable about those cases might be unsure or disagree about where they belong. For example, all of the data sets that distinguish categorically between democracies and non-democracies—like this one, this one, and this one—agree that Norway is the former and Saudi Arabia the latter, but they sometimes diverge on the classification of countries like Russia, Venezuela, and Pakistan, and rightly so.

Importantly, the degree of our uncertainty about where a case belongs may itself be correlated with many of the things that researchers use data on regime type to study. As a result, findings and forecasts derived from those data are likely to be sensitive to those bivalent calls in ways that are hard to understand when that uncertainty is ignored. In principle, it should be possible to make that uncertainty explicit by reporting the probability that a case belongs in a specific set instead of making a crisp yes/no decision, but that’s not what most of the data sets we have now do.

Second, virtually all of the existing measures are expensive to produce. These data sets are coded either by hand or through expert surveys, and routinely covering the world this way takes a lot of time and resources. (I say this from knowledge of the budgets for the production of some of these data sets, and from personal experience.) Partly because these data are so costly to make, many of these measures aren’t regularly updated. And, if the data aren’t regularly updated, we can’t use them to generate the real-time forecasts that offer the toughest test of our theories and are of practical value to some audiences.

As part of the NSF-funded MADCOW project*, Michael D. (Mike) Ward, Philip Schrodt, and I are exploring ways to use text mining and machine learning to generate measures of regime type that are fuzzier in a good way from a process that is mostly automated. These measures would explicitly represent uncertainty about where specific cases belong by reporting the probability that a certain case fits a certain regime type instead of forcing an either/or decision. Because the process of generating these measures would be mostly automated, they would be much cheaper to produce than the hand-coded or survey-based data sets we use now, and they could be updated in near-real time as relevant texts become available.

At this week’s annual meeting of the American Political Science Association, I’ll be presenting a paper, co-authored with Mike and Shahryar Minhas of Duke University’s WardLab, that describes preliminary results from this endeavor. Shahryar, Mike, and I started by selecting a corpus of familiar and well-structured texts describing politics and human-rights practices each year in all countries worldwide: the U.S. State Department’s Country Reports on Human Rights Practices, and Freedom House’s Freedom in the World. After pre-processing those texts in a few conventional ways, we dumped the two reports for each country-year into a single bag of words and used text mining to extract features from those bags in the form of vectorized tokens that may be grossly described as word counts. (See this recent post for some things I learned from that process.) Next, we used those vectorized tokens as inputs to a series of binary classification models representing a few different ideal-typical regime types as observed in a few widely used, human-coded data sets. Finally, we applied those classification models to a test set of country-years held out at the start to assess the models’ ability to classify regime types in cases they had not previously “seen.” The picture below illustrates the process and shows how we hope eventually to develop models that can be applied to recent documents to generate new regime data in near-real time.

Overview of MADCOW Regime Classification Process
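The bag-of-words step can be sketched in a few lines of Python. This is a bare-bones illustration under simplifying assumptions, not the project's actual pipeline: the tiny stop-word list, the tokenizing rule, the function names, and the example sentences are all invented, and real pre-processing would typically involve more steps.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "was", "were", "by"}


def bag_of_words(*texts):
    """Pool several texts for one country-year into a single bag of token counts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


def vectorize(bags):
    """Turn a list of bags into count vectors over a shared, sorted vocabulary."""
    vocabulary = sorted(set().union(*bags))
    vectors = [[bag[term] for term in vocabulary] for bag in bags]
    return vocabulary, vectors


# Made-up snippets standing in for the two reports on each of two country-years.
bag_1 = bag_of_words(
    "The military detained journalists and the military banned protests.",
    "Elections were postponed by the military.",
)
bag_2 = bag_of_words("Free elections were held and journalists reported freely.")

vocabulary, vectors = vectorize([bag_1, bag_2])
print(vocabulary)
print(vectors)
```

Each row of the resulting matrix is one country-year and each column is one term, which is the form a classifier needs as input.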


Our initial results demonstrate that this strategy can work. Our classifiers perform well out of sample, achieving high or very high precision and recall scores in cross-validation on all four of the regime types we have tried to measure so far: democracy, monarchy, military rule, and one-party rule. The separation plots below are based on out-of-sample results from support vector machines trained on data from the 1990s and most of the 2000s and then applied to new data from the most recent few years available. When a classifier works perfectly, all of the red bars in the separation plot will appear to the right of all of the pink bars, and the black line denoting the probability of a “yes” case will jump from 0 to 1 at the point of separation. These classifiers aren’t perfect, but they seem to be working very well.

 

Separation Plots for Preliminary SVM Classifiers: Democracy, Military Rule, Monarchy, and One-Party Rule
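For reference, precision and recall can be computed from a classifier's predicted probabilities as in the Python sketch below. The six probabilities and labels are invented, the function name is hypothetical, and the 0.5 threshold is an assumption rather than a setting reported in this post.

```python
def precision_recall(probabilities, labels, threshold=0.5):
    """Precision and recall after turning probabilities into 0/1 predictions."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    true_pos = sum(1 for yhat, y in zip(predicted, labels) if yhat == 1 and y == 1)
    false_pos = sum(1 for yhat, y in zip(predicted, labels) if yhat == 1 and y == 0)
    false_neg = sum(1 for yhat, y in zip(predicted, labels) if yhat == 0 and y == 1)
    # Precision: of the cases flagged as "yes", how many really are?
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    # Recall: of the real "yes" cases, how many did we flag?
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Made-up out-of-sample results for six country-years.
probabilities = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

precision, recall = precision_recall(probabilities, labels)
print(precision, recall)
```

A separation plot shows the same information graphically by sorting cases on their predicted probabilities and coloring them by their observed status.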

Of course, what most of us want to do when we find a new data set is to see how it characterizes cases we know. We can do that here with heat maps of the confidence scores from the support vector machines. The maps below show the values from the most recent year available for two of the four regime types: 2012 for democracy and 2010 for military rule. These SVM confidence scores indicate the distance and direction of each case from the hyperplane used to classify the set of observations into 0s and 1s. The probabilities used in the separation plots are derived from them, but we choose to map the raw confidence scores because they exhibit more variance than the probabilities and are therefore easier to visualize in this form.

World Maps of SVM Confidence Scores: Democracy (2012) and Military Rule (2010)
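To show what a confidence score of this kind means, here is a minimal Python sketch for the simplest case, a linear SVM with two features. The weights, bias, and sigmoid parameters are hypothetical, and the sigmoid mapping (often called Platt scaling) is one common way to turn SVM scores into probabilities; the post does not say which method was used here. With kernel SVMs the same idea applies in the transformed feature space.

```python
import math


def decision_value(weights, bias, features):
    """Raw SVM output: positive on one side of the hyperplane, negative on the other."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def signed_distance(weights, bias, features):
    """Decision value rescaled to geometric distance from the hyperplane."""
    norm = math.sqrt(sum(w * w for w in weights))
    return decision_value(weights, bias, features) / norm


def platt_probability(score, a=-2.0, b=0.0):
    """Sigmoid mapping from score to probability; a and b are assumed, not fitted."""
    return 1.0 / (1.0 + math.exp(a * score + b))


# Hypothetical linear classifier with two features.
weights, bias = (3.0, 4.0), -5.0
far_case = signed_distance(weights, bias, (3.0, 4.0))   # well inside the "yes" side
near_case = signed_distance(weights, bias, (0.0, 0.0))  # on the "no" side

print(far_case, near_case)
print(platt_probability(2.0), platt_probability(4.0))
```

Because the sigmoid flattens out quickly, cases at distances of 2 and 4 end up with nearly identical probabilities, which is one reason raw scores can be easier to distinguish on a map.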

 

On the whole, cases fall out as we would expect them to. The democracy classifier confidently identifies Western Europe, Canada, Australia, and New Zealand as democracies; shows interesting variations in Eastern Europe and Latin America; and confidently identifies nearly all of the rest of the world as non-democracies (with democracy defined for this task as a Polity score of 10). Meanwhile, the military rule classifier sees Myanmar, Pakistan, and (more surprisingly) Algeria as likely examples in 2010, and is less certain about the absence of military rule in several West African and Middle Eastern countries than in the rest of the world.

These preliminary results demonstrate that it is possible to generate probabilistic measures of regime type from publicly available texts at relatively low cost. That does not mean we’re fully satisfied with the output and ready to move to routine data production, however. For now, we’re looking at a couple of ways to improve the process.

First, the texts included in the relatively small corpus we have assembled so far only cover a narrow set of human-rights practices and political procedures. In future iterations, we plan to expand the corpus to include annual or occasional reports that discuss a broader range of features in each country’s national politics. Eventually, we hope to add news stories to the mix. If we can develop models that perform well on an amalgamation of occasional reports and news stories, we will be able to implement this process in near-real time, constantly updating probabilistic measures of regime type for all countries of the world at very low cost.

Second, the stringent criteria we used to observe each regime type in constructing the binary indicators on which the classifiers are trained also appear to be shaping the results in undesirable ways. We started this project with a belief that membership in these regime categories is inherently fuzzy, and we are trying to build a process that uses text mining to estimate degrees of membership in those fuzzy sets. If set membership is inherently ambiguous in a fair number of cases, then our approximation of a membership function should be bimodal, but not too neatly so. Most cases most of the time can be placed confidently at one end of the range of degrees of membership or the other, but there is considerable uncertainty at any moment in time about a non-trivial number of cases, and our estimates should reflect that fact.

If that’s right, then our initial estimates are probably too tidy, and we suspect that the stringent operationalization of each regime type in the training data is partly to blame. In future iterations, we plan to experiment with less stringent criteria—for example, by identifying a case as military rule if any of our sources tags it as such. With help from Sean J. Taylor, we’re also looking at ways we might use Bayesian measurement error models to derive fuzzy measures of regime type from multiple categorical data sets, and then use that fuzzy measure as the target in our machine-learning process.
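The difference between stringent and less stringent coding rules is easy to see in a short Python sketch. The case names and source codings below are invented, and the function names are hypothetical; the sketch only contrasts an "all sources agree" rule with an "any source says yes" rule.

```python
def code_strict(source_codes):
    """Code a case as 1 only when every source tags it as the regime type."""
    return 1 if all(code == 1 for code in source_codes) else 0


def code_lenient(source_codes):
    """Code a case as 1 when at least one source tags it as the regime type."""
    return 1 if any(code == 1 for code in source_codes) else 0


# Made-up codings of military rule from three hypothetical sources.
cases = {
    "Case A": [1, 1, 1],  # unanimous yes
    "Case B": [1, 0, 1],  # sources disagree
    "Case C": [0, 0, 0],  # unanimous no
}

strict = {name: code_strict(codes) for name, codes in cases.items()}
lenient = {name: code_lenient(codes) for name, codes in cases.items()}
print(strict)
print(lenient)
```

Only the contested case changes between the two rules, and those contested cases are exactly the ones where a fuzzy measure should show uncertainty.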

So, stay tuned for more, and if you’ll be at APSA this week, please come to our Friday-morning panel and let us know what you think.

* NSF Award 1259190, Collaborative Research: Automated Real-time Production of Political Indicators
