Notes From a First Foray into Text Mining

Guess what? Text mining isn’t push-button, data-making magic, either. As Phil Schrodt likes to say, there is no Data Fairy.

[Image: Data Fairy meme]

I’m quickly learning this point from my first real foray into text mining. Under a grant from the National Science Foundation, I’m working with Phil Schrodt and Mike Ward to use these techniques to develop new measures of several things, including national political regime type.

I wish I could say that I’m doing the programming for this task, but I’m not there yet. For the regime-data project, the heavy lifting is being done by Shahryar Minhas, a sharp and able Ph.D. student in political science at Duke University, where Mike leads the WardLab. Shahryar and I are scheduled to present preliminary results from this project at the upcoming Annual Meeting of the American Political Science Association in Washington, DC (see here for details).

When we started work on the project, I imagined a relatively simple and mostly automatic process running from location and ingestion of the relevant texts to data extraction, model training, and, finally, data production. Now that we’re actually doing it, though, I’m finding that, as always, the devil is in the details. Here are just a few of the difficulties and decision points we’ve had to confront so far.

First, the structure of the documents available online often makes it difficult to scrape and organize them. We initially hoped to include annual reports on politics and human-rights practices from four or five different organizations, but some of the ones we wanted weren’t posted online in a format we could readily scrape. At least one was scrapable but not organized by country, so we couldn’t properly group the text for analysis. In the end, we wound up with just two sets of documents in our initial corpus: the U.S. State Department’s Country Reports on Human Rights Practices, and Freedom House’s annual Freedom in the World documents.

Differences in naming conventions almost tripped us up, too. For our first pass at the problem, we are trying to create country-year data, so we want to treat all of the documents describing a particular country in a particular year as a single bag of words. As it happens, the State Department labels its human rights reports for the year on which they report, whereas Freedom House labels its Freedom in the World report for the year in which it’s released. So, for example, both organizations have already issued their reports on conditions in 2013, but Freedom House dates that report to 2014 while State dates its version to 2013. Fortunately, we knew this and made a simple adjustment before blending the texts. If we hadn’t known about this difference in naming conventions, however, we would have ended up combining reports for different years from the two sources and made a mess of the analysis.
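The adjustment itself is trivial once you know it's needed. Here's a minimal Python sketch of the idea; the document tuples and the `coverage_year` helper are invented for illustration, not our actual pipeline (which is written in R):

```python
from collections import defaultdict

def coverage_year(source, label_year):
    """Return the year a report describes, given how its publisher labels it."""
    if source == "FreedomHouse":
        return label_year - 1  # Freedom in the World 2014 covers 2013
    return label_year          # State labels reports by the year covered

# Invented document tuples: (publisher, country, label year, text)
docs = [
    ("State", "Syria", 2013, "state department text ..."),
    ("FreedomHouse", "Syria", 2014, "freedom house text ..."),
]

# Pool texts into country-year bags keyed on the year actually covered
bags = defaultdict(list)
for source, country, label_year, text in docs:
    bags[(country, coverage_year(source, label_year))].append(text)

# Both reports now land in the same ("Syria", 2013) bag
```

Skip that one-line offset and, as noted above, you quietly blend reports about different years.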

Once ingested, those documents include some text that isn’t relevant to our task, or that is relevant but the meaning of which is tacit. Common stop words like “the”, “a”, and “an” are obvious and easy to remove. More challenging are the names of people, places, and organizations. For our regime-data task, we’re interested in the abstract roles behind some of those proper names—president, prime minister, ruling party, opposition party, and so on—rather than the names themselves, but text mining can’t automatically derive the one for the other.

For our initial analysis, we decided to omit all proper names and acronyms to focus the classification models on the most general language. In future iterations, though, it would be neat if we could borrow dictionaries developed for related tasks and use them to replace those proper names with more general markers. For example, in a report or story on Russia, Vladimir Putin might get translated into <head of government>, the FSB into <police>, and Chechen Republic of Ichkeria into <rebel group>. This approach would preserve the valuable tacit information in those names while making it explicit and uniform for the pattern-recognition stage.

That’s not all, but it’s enough to make the point. These things are always harder than they look, and text mining is no exception. In any case, we’ve now run this gantlet once and made our way to an encouraging set of initial results. I’ll post something about those results closer to the conference when the paper describing them is ready for public consumption. In the meantime, though, I wanted to share a few of the things I’ve already learned about these techniques with others who might be thinking about applying them, or who already do and can commiserate.

Forecasting Round-Up No. 4

Another in an occasional series of posts calling out interesting work on forecasting. See here, here, and here for earlier ones.

1. A gaggle of researchers at Penn State, including Phil Schrodt, have posted a new conference paper (PDF) showing how they are using computer-generated data on political interactions around the world (the oft-mentioned GDELT) to forecast various forms of political crisis with respectable accuracy.

One important finding from their research so far: models that mix dynamic data on political interactions with slow-changing data on relevant structural conditions (political, social, economic) produce more accurate forecasts than models that use only one or the other. That’s not surprising, but it is a useful confirmation nonetheless. Thanks to GDELT’s public release, I predict that we’ll see a lot more social-science modelers doing that kind of mixing in the near future.

2. Kaiser Fung reviews Predictive Analytics, a book by Eric Siegel. I haven’t read it, but Kaiser’s review makes me think it would be a good addition to my short list of recommended readings for forecasters.

3. Finally, the 2013 edition of the Failed States Index (FSI) is now up on Foreign Policy‘s web site (here). I call it out here to air a few grievances.

First, it makes me a little crazy that it’s hard to pin down exactly what this index is supposed to do. Is FSI meant to summarize recent conditions or to help forecast new troubles down the road? In their explication of the methodology behind it, the makers of the FSI acknowledge that it’s largely the former but also slide into describing it as an early-warning tool. And what exactly is “state failure,” anyway? They never quite say, which makes it hard to use the index as either a snapshot or a forecast.

Second, as I’ve said before on this blog, I’m also not a big fan of indices that roll up so many different things into a single value on the basis of assumptions alone. Statistical models also combine a lot of information, but they do so with weights that are derived from a systematic exploration of empirical evidence. FSI simply assumes all of its 12 components are equally relevant when there’s ample opportunity to check that assumption against the historical record. Maybe some of the index’s components are more informative than others, so why not use models to try to find out?

Last but not least, regarding the way FSI is presented: I think the angry reactions it elicits (see comments on previous editions or my Twitter feed whenever FSI is released) are a useful reminder of the risks of presenting rank-ordered lists based on minor variations in imprecise numbers. People spend a lot of time venting about relatively small differences between states (e.g., “Why is Ethiopia two notches higher than Syria?”) when those aren’t very informative, and aren’t really meant to be. I’ve run into the same problem when I’ve posted statistical forecasts of things like coup attempts and nonviolent uprisings, and I’m increasingly convinced that those rank-ordered lists are a distraction. To use the results without fetishizing the numbers, we might do better to focus on the counter-intuitive results (surprises) and on cases whose scores change a lot across iterations.

Challenges in Measuring Violent Conflict, Syria Edition

As part of a larger (but, unfortunately, gated) story on how the terrific new Global Database of Events, Language, and Tone (GDELT) might help social scientists forecast violent conflicts, the New Scientist recently posted some graphics using GDELT to chart the ongoing civil war in Syria. Among those graphics was this time-series plot of violent events per day in Syria since the start of 2011:

[Image: New Scientist chart of daily violent-event counts in Syria]

Based on that chart, the author of the story (not the producers of GDELT, mind you) wrote:

As Western leaders ponder intervention, the resulting view suggests that the violence has subsided in recent months, from a peak in the third quarter of 2012.

That inference is almost certainly wrong, and why it’s wrong underscores one of the fundamental challenges in using event data—whether it’s collected and coded by software or humans or some combination thereof—to observe the dynamics of violent conflict.

I say that inference is almost certainly wrong because concurrent data on deaths and refugees suggest that violence in Syria has only intensified in the past year. One of the most reputable sources on deaths from the war is the Syria Tracker. A screenshot of their chart of monthly counts of documented killings is shown below. Like GDELT, their data identify a sharp increase in violence in late 2012. Unlike GDELT, their data indicate that the intensity of the violence has remained very high since then, and that’s true even though the process of documenting killings inevitably lags behind the actual violence.

[Image: Syria Tracker chart of monthly documented killings]

We see a similar pattern in data from the U.N. High Commissioner for Refugees (UNHCR) on people fleeing the fighting in Syria. If anything, the flow of refugees has only increased in 2013, suggesting that the violence in Syria is hardly abating.

[Image: UNHCR plot of registered Syrian refugees]

The reason GDELT’s count of violent events has diverged from other measures of the intensity of the violence in Syria in recent months is probably something called “media fatigue.” Data sets of political events generally depend on news sources to spot events of interest, and it turns out that news coverage of large-scale political violence follows a predictable arc. As Deborah Gerner and Phil Schrodt describe in a paper from the late 1990s, press coverage of a sustained and intense conflict is often high when hostilities first break out but then declines steadily thereafter. That decline can happen because editors and readers get bored, burned out, or distracted. It can also happen because the conflict gets so intense that it becomes, in a sense, too dangerous to cover. In the case of Syria, I suspect all of these things are at work.

My point here isn’t to knock GDELT, which is still recording scores or hundreds of events in Syria every day, automatically, using open-source code, and then distributing those data to the public for free. Instead, I’m just trying to remind would-be users of any data set of political events to infer with caution. Event counts are one useful way to track variation over time in political processes we care about, but they’re only one part of the proverbial elephant, and they are inevitably constrained by the limitations of the sources from which they draw. To get a fuller sense of the beast, we need as often as possible to cross-reference those event data with other sources of information. Each of the sources I’ve cited here has its own blind spots and selection biases, but a comparison of trends from all three—and, importantly, an awareness of the likely sources of those biases—is enough to give me confidence that the civil war in Syria is only continuing to intensify. That says something important about Syria, of course, but it also says something important about the risks of drawing conclusions from event counts alone.
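That kind of cross-referencing doesn't have to be elaborate. Here's a toy Python sketch of the simplest version: compare the direction of recent change across independent monthly series before drawing conclusions. The numbers are invented for illustration, shaped to mimic the divergence described above.

```python
def trend(series, window=3):
    """Compare the means of the last two windows and report the direction."""
    earlier = sum(series[-2 * window:-window]) / window
    recent  = sum(series[-window:]) / window
    return "up" if recent > earlier else "down" if recent < earlier else "flat"

# Invented monthly series illustrating the pattern in the text
event_counts = [120, 150, 140, 110, 90, 80]        # media-fatigued event data
deaths       = [900, 1100, 1300, 1350, 1400, 1500] # documented killings
refugees     = [40, 60, 90, 130, 200, 280]         # registered refugees (000s)

print({name: trend(s) for name, s in
       [("events", event_counts), ("deaths", deaths), ("refugees", refugees)]})
```

When the event counts point one way and everything else points the other, that disagreement is itself the finding: go looking for the bias before trusting any single series.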

PS. For a great discussion of other sources of bias in the study of political violence, see Stathis Kalyvas’ 2004 essay on “The Urban Bias in Research on Civil Wars” (PDF).

Road-Testing GDELT as a Resource for Monitoring Atrocities

As I said here a few weeks ago, I think the Global Database of Events, Language, and Tone (GDELT) is a fantastic new resource that really embodies some of the ways in which technological changes are coming together to open lots of new doors for social-scientific research. GDELT’s promise is obvious: more than 200 million political events from around the world over the past 30 years, all spotted and coded by well-trained software instead of the traditional armies of undergrad RAs, and with daily updates coming online soon. Or, as Adam Elkus’ t-shirt would have it, “200 million observations. Only one boss.”

BUT! Caveat emptor! Like every other data-collection effort ever, GDELT is not alchemy, and it’s important that people planning to use the data, or even just to consume analysis based on it, understand what its limitations are.

I’m starting to get a better feel for those limitations from my own efforts to use GDELT to help observe atrocities around the world, as part of a consulting project I’m doing for the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide. The core task of that project is to develop plans for a public early-warning system that would allow us to assess the risk of onsets of atrocities in countries worldwide more accurately and earlier than current practice.

When I heard about GDELT last fall, though, it occurred to me that we could use it (and similar data sets in the pipeline) to support efforts to monitor atrocities as well. The CAMEO coding scheme on which GDELT is based includes a number of event types that correspond to various forms of violent attack and other variables indicating who was attacking whom. If we could develop a filter that reliably pulled events of interest to us from the larger stream of records, we could produce something like a near-real time bulletin on recent violence against civilians around the world. Our record would surely have some blind spots—GDELT only tracks a limited number of news sources, and some atrocities just don’t get reported, period—but I thought it would reliably and efficiently alert us to new episodes of violence against civilians and help us identify trends in ongoing ones.

Well, you know what they say about plans and enemies and first contact. After digging into GDELT, I still think we can accomplish those goals, but it’s going to take more human effort than I originally expected. Put bluntly, GDELT is noisier than I had anticipated, and for the time being the only way I can see to sharpen that signal is to keep a human in the loop.

Imagine (fantasize?) for a moment that there’s a perfect record somewhere of all the political interactions GDELT is trying to identify. For kicks, let’s call it the Encyclopedia Eventum (EE). Like any detection system, GDELT can mess up in two basic ways: 1) errors of omission, in which GDELT fails to spot something that’s in the EE; and 2) errors of commission, in which it mistakenly records an event that isn’t in the EE (or, relatedly, is in the EE but in a different place). We might also call these false negatives and false positives, respectively.

At this point, I can’t say anything about how often GDELT is making errors of omission, because I don’t have that Encyclopedia Eventum handy. A more realistic strategy for assessing the rate of errors of omission would involve comparing a subset of GDELT to another event data set that’s known to be a fairly reliable measure for some time and place of something GDELT is meant to track—say, protest and coercion in Europe—and see how well they match up, but that’s not a trivial task, and I haven’t tried it yet.

Instead, the noise I’m seeing is on the other side of that coin: the errors of commission, or false positives. Here’s what I mean:

To start developing my atrocities-monitoring filter, I downloaded the reduced and compressed version of GDELT recently posted on the Penn State Event Data Project page and pulled the tab-delimited text files for a couple of recent years. I’ve worked with event data before, so I’m familiar with basic issues in their analysis, but every data set has its own idiosyncrasies. After trading emails with a few CAMEO pros and reading Jay Yonamine’s excellent primer on event aggregation strategies, I started tinkering with a function in R that would extract the subset of events that appeared to involve lethal force against civilians. That function would involve rules to select on three features: event type, source (the doer), and target.

  • Event Type. For observing atrocities, type 20 (“Engage in Unconventional Mass Violence”) was an obvious choice. Based on advice from those CAMEO pros, I also focused on 18 (“Assault”) and 19 (“Fight”) but was expecting that I would need to be more restrictive about the subtypes, sources, and targets in those categories to avoid errors of commission.
  • Source. I’m trying to track violence by state and non-state agents, so I focused on GOV (government), MIL (military), COP (police), and SPY (intelligence agencies) for the former and REB (militarized opposition groups) and SEP (separatist groups) for the latter. The big question mark was how to handle records with just a country code (e.g., “SYR” for Syria) and no indication of the source’s type. My CAMEO consultants told me these would usually refer in some way to the state, so I should at least consider including them.
  • Target. To identify violence against civilians, I figured I would get the most mileage out of the OPP (non-violent political opposition), CVL (“civilians,” people in general), and REF (refugees) codes, but I wanted to see if the codes for more specific non-state actors (e.g., LAB for labor, EDU for schools or students, HLH for health care) would also help flag some events of interest.

After tinkering with the data a bit, I decided to write two separate functions, one for events with state perpetrators and another for events with non-state perpetrators. If you’re into that sort of thing, you can see the state-perpetrator version of that filtering function on Github, here.
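The function on Github is in R, but the selection logic is simple enough to re-sketch in a few lines of Python. Everything below is a simplification: the record layout is assumed, the code lists are abbreviated, and substring matching on actor codes is a crude stand-in for proper parsing.

```python
STATE_SOURCES    = ("GOV", "MIL", "COP", "SPY")  # state-agent role codes
CIVILIAN_TARGETS = ("OPP", "CVL", "REF")         # civilian-ish target codes

def is_candidate_atrocity(record):
    """record: (date, source_code, target_code, event_root) as strings."""
    date, source, target, root = record
    if root not in ("18", "19", "20"):   # Assault, Fight, Unconv. Mass Violence
        return False
    if not any(tag in source for tag in STATE_SOURCES):
        return False
    return any(tag in target for tag in CIVILIAN_TARGETS)

# Invented records in the assumed layout
records = [
    ("20110610", "SYRGOV", "SYRCVL", "20"),  # state force against civilians
    ("20110115", "FRABUS", "FRACVL", "18"),  # business source: filtered out
]
hits = [r for r in records if is_candidate_atrocity(r)]
print(hits)
```

As the next paragraphs show, passing a filter like this is only a first cut; plenty of what comes through still needs a human eye.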

When I ran the more than 9 million records in the “2011.reduced.txt” file through that function, I got back 2,958 events. So far, so good. As soon as I started poking around in the results, though, I saw a lot of records that looked suspicious. The current release of GDELT doesn’t include text from or links to the source material, so it’s hard to say for sure what real-world event any one record describes. Still, some of the perpetrator-and-target combos looked odd to me, and web searches for relevant stories either came up empty or reinforced my suspicions that the records were probably errors of commission. Here are a few examples, showing the date, event type, source, and target:

  • 1/8/2011 193 USAGOV USAMED. Type 193 is “Fight with small arms and light weapons,” but I don’t think anyone from the U.S. government actually got in a shootout or knife fight with American journalists that day. In fact, that event-source-target combination popped up a lot in my subset.
  • 1/9/2011 202 USAMIL VNMCVL. Taken on its face, this record says that U.S. military forces killed Vietnamese civilians on January 9, 2011. My hunch is that the story on which this record is based was actually talking about something from the Vietnam War.
  • 4/11/2011 202 RUSSPY POLCVL. This record seems to suggest that Russian intelligence agents “engaged in mass killings” of Polish civilians in central Siberia two years ago. I suspect the story behind this record was actually talking about the Katyn Massacre and associated mass deportations that occurred in April 1940.

That’s not to say that all the records looked wacky. Interleaved with these suspicious cases were records representing exactly the kinds of events I was trying to find. For example, my filter also turned up a 202 GOV SYRCVL for June 10, 2011, a day on which one headline blared “Dozens Killed During Syrian Protests.”

Still, it’s immediately clear to me that GDELT’s parsing process is not quite at the stage where we can peruse the codebook like a menu, identify the morsels we’d like to consume, phone our order in, and expect to have exactly the meal we imagined waiting for us when we go to pick it up. There’s lots of valuable information in there, but there’s plenty of chaff, too, and for the time being it’s on us as researchers to take time to try to sort the two out. This sorting will get easier to do if and when the posted version adds information about the source article and relevant text, but “easier” in this case will still require human beings to review the results and do the cross-referencing.

Over time, researchers who work on specific topics—like atrocities, or interstate war, or protest activity in specific countries—will probably be able to develop supplemental coding rules and tweak their filters to automate some of what they learn. I’m also optimistic that the public release of GDELT will accelerate improvements in the software and dictionaries it uses, expanding its reach while shrinking the error rates. In the meantime, researchers are advised to stick to the same practices they’ve always used (or should have, anyway): take time to get to know your data; parse it carefully; and, when there’s no single parsing that’s obviously superior, check the sensitivity of your results to different permutations.

PS. If you have any suggestions on how to improve the code I’m using to spot potential atrocities or otherwise improve the monitoring process I’ve described, please let me know. That’s an ongoing project, and even marginal improvements in the fidelity of the filter would be a big help.

PPS. For more on these issues and the wider future of automated event coding, see this ensuing post from Phil Schrodt on his blog.

The Future of Political Science Just Showed Up

I recently wrote about how data sets just starting to come online are going to throw open doors to projects that political scientists have been hoping to do for a while but haven’t had the evidence to handle. Well, one of those shiny new trains just pulled into the station: the Global Database of Events, Language, and Tone, a.k.a. GDELT, is now in the public domain.

GDELT is primarily the work of Kalev Leetaru, a University Fellow at the University of Illinois Graduate School of Library and Information Science, but its intellectual and practical origins—and its journey into the public domain—also owe a lot to the great Phil Schrodt. The data set includes records summarizing more than 200 million events that have occurred around the world from 1979 to the present. Those records are created by software that grabs and parses news reports from a number of international sources, including Agence France Press, the Associated Press, and Xinhua. Each record indicates who did or said what to whom, where, and when.

The “did what” part of each record is based on the CAMEO coding scheme, which sorts actions into a fairly detailed set of categories covering many different forms of verbal and material cooperation and conflict, from public statements of support to attacks with weapons of mass destruction. The “who” and “to whom” parts use carefully constructed dictionaries to identify specific actors and targets by type and proper name. So, for example, “Philippine soldiers” gets identified as Philippines military (PHLMIL), while “Philippine Secretary of Agriculture” gets tagged as Philippine government (PHLGOV). The “where” part uses place names and other clues in the stories to geolocate each event as specifically as possible.

I try to avoid using words like “revolutionary” when talking about the research process, but in this case I think it fits. I suspect this is going to be the data set that launches a thousand dissertations. As Josh Keating noted on his War of Ideas blog at Foreign Policy,

Similar event databases have been built for particular regions, and DARPA has been working along similar lines for the Pentagon with a project known as ICEWS, but for a publicly accessible program…GDELT is unprecedented in its geographic and historic scale.

To Keating’s point about the data set’s scale, I would add two other ways that GDELT is a radical departure from past practice in the discipline. First, it’s going to be updated daily (watch this space). Second, it’s freely available to the public.

Yes, you read that right: a global data set summarizing all sorts of political cooperation and conflict with daily updates is now going to be available to anyone with an Internet connection at no charge. As in: FREE. As I said in a tweet-versation about GDELT this afternoon, contractors have been trying for years (and probably succeeding) to sell closed systems like this to the U.S. government for hundreds of thousands or millions of dollars. If I’m not mistaken, that market just crashed, or at the very least shrank by a whole lot.

GDELT isn’t perfect, of course. I’ve already been tinkering with it a bit as part of a project I’m doing for the Holocaust Museum’s Center for the Prevention of Genocide, on monitoring and predicting mass atrocities, and the data on the “Engage in Unconventional Mass Violence” events I’m hoping to use as a marker of atrocities look more reliable in some cases than others. Still, getting a data set of this size and quality in the public domain is a tremendous leap forward for empirical political science, and the fact that it’s open will allow lots of other smart people to find the flaws and work on eliminating or mitigating them.

Last but not least, I think it’s worth noting that GDELT was made possible, in part, through support from the National Science Foundation. It may be free to you, and it’s orders of magnitude cheaper to produce than the artisanal, hand-crafted event data of yesteryear (like, yesterday). But that doesn’t mean it’s been free to develop, produce, or share, and you can thank the NSF for helping various parts of that process happen.

Some Suggested Readings for Political Forecasters

A few people have recently asked me to recommend readings on political forecasting for people who aren’t already immersed in the subject. Since the question keeps coming up, I thought I’d answer with a blog post. Here, in no particular order, are books (and one article) I’d suggest to anyone interested in the subject.

Thinking, Fast and Slow, by Daniel Kahneman. A really engaging read on how we think, with special attention to cognitive biases and heuristics. I think forecasters should read it in hopes of finding ways to mitigate the effects of these biases on their own work, and of getting better at spotting them in the thinking of others.

Numbers Rule Your World, by Kaiser Fung. Even if you aren’t going to use statistical models to forecast, it helps to think statistically, and Fung’s book is the most engaging treatment of that topic that I’ve read so far.

The Signal and the Noise, by Nate Silver. A guided tour of how forecasters in a variety of fields do their work, with some useful general lessons on the value of updating and being an omnivorous consumer of relevant information.

The Theory that Would Not Die, by Sharon Bertsch McGrayne. A history of Bayesian statistics in the real world, including successful applications to some really hard prediction problems, like the risk of accidents with atomic bombs and nuclear power plants.

The Black Swan, by Nassim Nicholas Taleb. If you can get past the derisive tone—and I’ll admit, I initially found that hard to do—this book does a great job explaining why we should be humble about our ability to anticipate rare events in complex systems, and how forgetting that fact can hurt us badly.

Expert Political Judgment: How Good Is It? How Can We Know?, by Philip Tetlock. The definitive study to date on the limits of expertise in political forecasting and the cognitive styles that help some experts do a bit better than others.

Counterfactual Thought Experiments in World Politics, edited by Philip Tetlock and Aaron Belkin. The introductory chapter is the crucial one. It’s ostensibly about the importance of careful counterfactual reasoning to learning from history, but it applies just as well to thinking about plausible futures, an important skill for forecasting.

The Foundation Trilogy, by Isaac Asimov. A great fictional exploration of the Modernist notion of social control through predictive science. These books were written half a century ago, and it’s been more than 25 years since I read them, but they’re probably more relevant than ever, what with all the talk of Big Data and the Quantified Self and such.

“The Perils of Policy by P-Value: Predicting Civil Conflicts,” by Michael Ward, Brian Greenhill, and Kristin Bakke. This one’s really for practicing social scientists, but still. The point is that the statistical models we typically construct for hypothesis testing often won’t be very useful for forecasting, so proceed with caution when switching between tasks. (The fact that they often aren’t very good for hypothesis testing, either, is another matter. On that and many other things, see Phil Schrodt’s “Seven Deadly Sins of Contemporary Quantitative Political Analysis.”)

I’m sure I’ve missed a lot of good stuff and would love to hear more suggestions from readers.

And just to be absolutely clear: I don’t make any money if you click through to those books or buy them or anything like that. The closest thing I have to a material interest in this list are ongoing professional collaborations with three of the authors listed here: Phil Tetlock, Phil Schrodt, and Mike Ward.

Big Data Won’t Kill the Theory Star

A few years ago, Wired editor Chris Anderson trolled the scientific world with an essay called “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.” After talking about the fantastic growth in the scale and specificity of data that was occurring at the time—and that growth has only gotten a lot faster since—Anderson argued that

Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

In other words, with data this rich, theory becomes superfluous.

Like many of my colleagues, I think Anderson is wrong about the increasing irrelevance of theory. Mark Graham explains why in a year-old post on the Guardian‘s Datablog:

We may one day get to the point where sufficient quantities of big data can be harvested to answer all of the social questions that most concern us. I doubt it though. There will always be digital divides; always be uneven data shadows; and always be biases in how information and technology are used and produced.

And so we shouldn’t forget the important role of specialists to contextualise and offer insights into what our data do, and maybe more importantly, don’t tell us.

At the same time, I also worry that we’re overreacting to Anderson and his ilk by dismissing Big Data as nothing but marketing hype. From my low perch in one small corner of the social-science world, I get the sense that anyone who sounds excited about Big Data is widely seen as either a fool or a huckster. As Christopher Zorn wrote on Twitter this morning, “‘Big data is dead’ is the geek-hipster equivalent of ‘I stopped liking that band before you even heard of them.’”

Of course, I say that as one of those people who’s really excited about the social-scientific potential these data represent. I think a lot of people who dismiss Big Data as marketing hype misunderstand the status quo in social science. If you don’t regularly try to use data to test and develop hypotheses about things like stasis and change in political institutions or the ebb and flow of political violence around the world, you might not realize how scarce and noisy the data we have now really are. On many things our mental models tell us to care about, we simply don’t have reliable measures.

Take, for example, the widely held belief that urban poverty and unemployment drive political unrest in poor countries. Is this true? Well, who knows? For most poor countries, the data we have on income are sparse and often unreliable, and we don’t have any data on unemployment, ever. And that’s at the national level. The micro-level data we’d need to link individuals’ income and employment status to their participation in political mobilization and violence? Apart from a few projects on specific cases (e.g., here and here), fuggeddaboudit.

Lacking the data we need to properly test our models, we fill the space with stories. As Daniel Kahneman describes on p. 201 of Thinking, Fast and Slow,

You cannot help dealing with the limited information you have as if it were all there is to know. You build the best possible story from the information available to you, and if it is a good story, you believe it. Paradoxically, it is easier to construct a coherent story when you know little, when there are fewer pieces to fit into the puzzle. Our comforting conviction that the world makes sense rests on a secure foundation: our almost unlimited ability to ignore our ignorance.

When that’s the state of the art, more data can only make things better. Sure, some researchers will poke around in these data sets until they find “statistically significant” associations and then pretend that’s what they expected to find the whole time. But, as Phil Schrodt points out, plenty of people are already doing that now.

Meanwhile, other researchers with important but unproven ideas about social phenomena will finally get a chance to test and refine those ideas in ways they’d never been able to do before. Barefoot empiricism will play a role, too, but science has always been an iterative process that way, bouncing around between induction and deduction until it hits on something that works. If the switch from data-poor to data-rich social science brings more of that, I feel lucky to be present for its arrival.

It’s Not Just The Math

This week, statistics-driven political forecasting won a big slab of public vindication after the U.S. election predictions of an array of number-crunching analysts turned out to be remarkably accurate. As John Sides said over at the Monkey Cage, “2012 was the Moneyball election.” The accuracy of these forecasts, some of them made many months before Election Day,

…shows us that we can use systematic data—economic data, polling data—to separate momentum from no-mentum, to dispense with the gaseous emanations of pundits’ “guts,” and ultimately to forecast the winner. The means and methods of political science, social science, and statistics, including polls, are not perfect, and Nate Silver is not our “algorithmic overlord” (a point I don’t think he would disagree with). But 2012 has showed how useful and necessary these tools are for understanding how politics and elections work.

Now I’ve got a short piece up at Foreign Policy explaining why I think statistical forecasts of world politics aren’t at the same level and probably won’t be very soon. I hope you’ll read the whole thing over there, but the short version is: it’s the data. If U.S. electoral politics is a data hothouse, most of international politics is a data desert. Statistical models make very powerful forecasting tools, but they can’t run on thin air, and the density and quality of the data available for political forecasting drops off precipitously as you move away from U.S. elections.

Seriously: you don’t have to travel far in the data landscape to start running into trouble. In a piece posted yesterday, Stephen Tall asks rhetorically why there isn’t a British Nate Silver and then explains that it’s because “we [in the U.K.] don’t have the necessary quality of polls.” And that’s the U.K., for crying out loud. Now imagine how things look in, say, Ghana or Sierra Leone, both of which are holding their own national elections this month.

Of course, difficult does not mean impossible. I’m a bit worried, actually, that some readers of that Foreign Policy piece will hear me saying that most political forecasting is still stuck in the Dark Ages, when that’s really not what I meant. I think we actually do pretty well with statistical forecasting on many interesting problems in spite of the dearth of data, as evidenced by the predictive efforts of colleagues like Mike Ward and Phil Schrodt and some of the work I’ve posted here on things like coups and popular uprisings.

I’m also optimistic that the global spread of digital connectivity and associated developments in information-processing hardware and software are going to help fill some of those data gaps in ways that will substantially improve our ability to forecast many political events. I haven’t seen any big successes along those lines yet, but the changes in the enabling technologies are pretty radical, so it’s plausible that the gains in data quality and forecasting power will happen in big leaps, too.

Meanwhile, while we wait for those leaps to happen, there are some alternatives to statistical models that can help fill some of the gaps. Based partly on my own experiences and partly on my read of relevant evidence (see here, here, and here for a few tidbits), I’m now convinced that prediction markets and other carefully designed systems for aggregating judgments can produce solid forecasts. These tools are most useful in situations where the outcome isn’t highly predictable but relevant information is available to those who dig for it. They’re somewhat less useful for forecasting the outcomes of decision processes that are idiosyncratic and opaque, like those of the North Korean government or even the U.S. Supreme Court. There’s no reason to let the perfect be the enemy of the good, but we should use these tools with full awareness of their limitations as well as their strengths.
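To make the aggregation idea concrete, here is a minimal sketch of the simplest such scheme, an unweighted average of several analysts' probability judgments, scored with the Brier score (squared error). The forecasts and the outcome below are invented for illustration; real prediction markets and aggregation systems use more sophisticated weighting and elicitation than this.

```python
# Minimal sketch of judgment aggregation: pool several analysts'
# probability forecasts for a yes/no event by simple averaging.
# All numbers here are hypothetical.

def brier_score(forecast, outcome):
    """Squared error of a probability forecast against a 0/1 outcome; lower is better."""
    return (forecast - outcome) ** 2

# Hypothetical forecasts from five analysts for one event that occurred (outcome = 1).
forecasts = [0.9, 0.6, 0.7, 0.4, 0.8]
outcome = 1

# Unweighted linear opinion pool: just the mean of the individual forecasts.
pooled = sum(forecasts) / len(forecasts)

individual_scores = [brier_score(f, outcome) for f in forecasts]
mean_individual = sum(individual_scores) / len(individual_scores)
pooled_score = brier_score(pooled, outcome)

print(f"pooled forecast: {pooled:.2f}")               # 0.68
print(f"mean individual Brier: {mean_individual:.3f}")  # 0.132
print(f"pooled Brier: {pooled_score:.3f}")              # 0.102
```

Because squared error is a convex loss, the pooled forecast can never score worse than the average individual forecast, which is one reason even crude averaging of judgments tends to beat the typical lone forecaster.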

More generally, though, I remain convinced that, when trying to forecast political events around the world, there’s a complexity problem we will never overcome no matter how many terabytes of data we produce and consume, how fast our processors run, and how sophisticated our methods become. Many of the events that observers of international politics care about are what Nassim Nicholas Taleb calls “gray swans”—“rare and consequential, but somewhat predictable, particularly to those who are prepared for them and have the tools to understand them.”

These events are hard to foresee because they bubble up from a complex adaptive system that’s constantly evolving underfoot. The patterns we think we discern in one time and place can’t always be generalized to others, and the farther into the future we try to peer, the thinner those strands get stretched. Events like these “are somewhat tractable scientifically,” as Taleb puts it, but we should never expect to predict their arrival the way we can foresee the outcomes of more orderly processes like U.S. elections.

In Defense of Political Science and Forecasting

Under the headline “Political Scientists Are Lousy Forecasters,” today’s New York Times includes an op-ed by Jacqueline Stevens that takes a big, sloppy swipe at most of the field. The money line:

It’s an open secret in my discipline: in terms of accurate political predictions (the field’s benchmark for what counts as science), my colleagues have failed spectacularly and wasted colossal amounts of time and money.

As she sees it, this poor track record is an inevitability. Referencing the National Science Foundation‘s history of funding research in which she sees little value, Stevens writes:

Government can—and should—assist political scientists, especially those who use history and theory to explain shifting political contexts, challenge our intuitions and help us see beyond daily newspaper headlines. Research aimed at political prediction is doomed to fail. At least if the idea is to predict more accurately than a dart-throwing chimp.

I don’t have much time to write today, so I was glad to see this morning that Henry Farrell has already penned a careful rebuttal that mirrors my own reactions. On the topic of predictions in particular, Farrell writes:

The claim here—that “accurate political prediction” is the “field’s benchmark for what counts as science”—is quite wrong. There really isn’t much work at all by political scientists that aspires to predict what will happen in the future…It is reasonable to say that the majority position in political science is a kind of soft positivism, which focuses on the search for law-like generalizations. But that is neither a universal benchmark (I, for one, don’t buy into it), nor indeed, the same thing as accurate prediction, except where strong covering laws (of the kind that few political scientists think are generically possible) can be found.

To Farrell’s excellent rebuttals, I would add a couple of things.

First and most important, there’s a strong case to be made that political scientists don’t engage in enough forecasting and really ought to do more of it. Contrary to Stevens’ assertion in that NYT op-ed, most political scientists eschew forecasting, seeing description and explanation as the goals of their research instead. As Phil Schrodt argues in “Seven Deadly Sins of Quantitative Political Science” (PDF), however, to the extent that we see our discipline as a form of science, political scientists ought to engage in forecasting, because prediction is an essential part of the scientific method.

Explanation in the absence of prediction is not somehow scientifically superior to predictive analysis, it isn’t scientific at all! It is, instead, “pre-scientific.”

In a paper on predicting civil conflicts, Mike Ward, Brian Greenhill, and Kristin Bakke help to explain why:

Scholars need to make and evaluate predictions in order to improve our models. We have to be willing to make predictions explicitly – and plausibly be wrong, even appear foolish – because our policy prescriptions need to be undertaken with results that are drawn from robust models that have a better chance of being correct. The whole point of estimating risk models is to be able to apply them to specific cases. You wouldn’t expect your physician to tell you that all those cancer risk factors from smoking don’t actually apply to you. Predictive heuristics provide a useful, possibly necessary, strategy that may help scholars and policymakers guard against erroneous recommendations.

Second, I think Stevens actually gets the historical record wrong. It drives me crazy when I see people take the conventional wisdom about a topic—say, the possibility of the USSR’s collapse, or a wave of popular uprisings in the Middle East and North Africa—and turn it into a blanket statement that “no one predicted X.” Most of the time, we don’t really know what most people would have predicted, because they weren’t asked to make predictions. The absence of a positive assertion that X will happen is not the same thing as a forecast that X will not happen. In fact, in at least one of the cases Stevens discusses—the USSR’s collapse—we know that some observers did forecast its eventual collapse, albeit usually without much specificity about the timing of that event.

More generally, I think it’s fair to say that, on just about any topic, there will be a distribution of forecasts—from high to low, impossible to inevitable, and so on. Often, that distribution will have a clear central tendency, as did expectations about the survival of authoritarian regimes in the USSR or the Arab world, but that central tendency should not be confused with a consensus. Instead, this divergence of expectations is precisely where the most valuable information will be found. Eventually, some of those predictions will prove correct while others will not, and, as Phil and Mike and co. remind us, that variation in performance tells us something very useful about the power of the explanatory models—quantitative, qualitative, it doesn’t really matter—from which they were derived.

PS. For smart rebuttals to other aspects of Stevens’ jeremiad, see Erik Voeten’s post at the Monkey Cage and Steve Saideman’s rejoinder at Saideman’s Semi-Spew.

PPS. Stevens provides some context for her op-ed on her own blog, here. (I would have added this link sooner, but I’ve just seen it myself.)

PPPS. For some terrific ruminations on uncertainty, statistics, and scientific knowledge, see this latecomer response from Anton Strezhnev.
