Measuring Trends in Human Rights Practices

I wrote a thing for Foreign Policy’s Democracy Lab on evidence that widely used data on human rights practices understate the improvements that have occurred around the world in the past few decades:

“It’s Getting Better All The Time”

The idea for the piece came from reading Chris Fariss’s May 2014 article in American Political Science Review and then digging around in the other work he and others have done on the topic. It’s hard to capture the subtleties of a debate as technical as this one in a short piece for a general audience, so if you’re really interested in the subject, I would encourage you to read further. See especially the other relevant papers on Chris’s Publications page and the 2013 article by Anne Marie Clark and Kathryn Sikkink.

In the piece, I report that “some human rights scholars see Fariss’ statistical adjustments as a step in the right direction.” Among others I asked, Christian Davenport wrote to me that he agrees with Fariss about how human rights reporting has evolved over time, and what that implies for measurement of these trends. And Will Moore described Fariss’s estimates in an email as a “dramatic improvement” over previous measures. As it happens, Will is working with Courtenay Conrad on a data set of allegations of torture incidents around the world from specific watchdog groups (see here). Like Chris, Will presumes that the information we see about human rights violations is incomplete, so he encourages researchers to treat available information as a biased sample and use statistical models to better estimate the underlying conditions of concern.

When I asked David Cingranelli, one of the co-creators of what started out as the Cingranelli and Richards (CIRI) data set, for comment, he had this to say (and more, but I’ll just quote this bit here):

I’m not convinced that either the “human rights information paradox” or the “changing standard of accountability” produce a systematic bias in CIRI data. More importantly, the evidence presented by Clark and Sikkink and the arguments made by Chris Fariss do not convince me that there is a better alternative to the CIRI method of data recording that would be less likely to suffer from biases and imprecision. The CIRI method is not perfect, but it provides an optimal trade-off between data precision and transparency of data collection. Statistically advanced indexes (scores) might improve the precision but would for sure significantly reduce the ability of scholars to understand and replicate the data generation process. Overall, the empirical research would suffer from such modifications.

I hope this piece draws wider attention to this debate, which interests me in two ways. The first is the substance: How have human rights practices changed over time? I don’t think Fariss’s findings settle that question in some definitive and permanent way, but they did convince me that the trend in the central tendency over the past 30 or 40 years is probably better than the raw data imply.

The second way this debate interests me is as another example of the profound challenges involved in measuring political behavior. As is the case with violent conflict and other forms of contentious politics, almost every actor at every step in the process of observing human rights practices has an agenda—who doesn’t?—and those agendas shape what information bubbles up, how it gets reported, and how it gets summarized as numeric data. The obvious versions of this are the attempts by violators to hide their actions, but activists and advocates also play important roles in selecting and shaping information about human rights practices. And, of course, there are also technical and practical features of local and global political economies that filter and alter the transmission of this information, including but certainly not limited to language and cultural barriers and access to communications technologies.

This blog post is now about half as long as the piece it’s meant to introduce, so I’ll stop here. If you work in this field or otherwise have insight into these issues and want to weigh in, please leave a comment here or at Democracy Lab.

Halloween, Quantified

Some parents dress up for Halloween. Some throw parties. In our house, we—well, I, really; my wife was bemused, my younger son vaguely interested, and my elder son embarrassed—I collect and chart the data.

First, the flow of trick-or-treaters. The figure below shows counts in 15-minute bins of kids who came to our door for candy. The first arrival, a little girl in a fairy/princess costume, showed up around 5:50 PM, well before sunset. The deluge came an hour later, when a mob from a party next door blended with an uptick in other arrivals. The other peak came almost an hour after that and probably had a much higher median age than the earlier one. The final handful strolled through around 8:40, right when we were shutting down so we could fetch and drop off our own teenage boys from other parts of town.


This year, I also tallied which candy the trick-or-treaters chose. The figure below plots the resulting data. If the line ends early, it means we ran out of that kind of candy. As my wife predicted, the kids’ taste is basically the inverse of ours, which, as one costumed adult chaperoning his child pointed out, is “perfect.”


To collect the data, I sat on my front porch in a beach chair with a notepad, greeted the arriving kids, asked them to pick one, and then jotted tick marks as they left. Colleague Ali Necamp suggested that I put the candies in separate containers to make it easier to track who took what; I did, and she was right. Only a couple of people asked me why the candies were laid out in bins, and I clearly heard one kid approaching the house ask, “Mommy, why is that man sitting on the porch?”

Measurement Is Hard, Especially of Politics, and Everything Is Political

If occasional readers of this blog remember only one thing from their time here, I’d like it to be this: we may be getting better at measuring political things around the world, but huge gaps remain, sometimes on matters that seem basic or easy to see, and we will never close those gaps completely.

Two items this week reminded me of this point. The first came from the World Bank, which blogged that only about half of the countries they studied for a recent paper had “adequate” data on poverty. As a chart from an earlier World Bank blog post showed, the number of countries suffering from “data deprivation” on this topic has declined since the early 1990s, but it’s still quite large. Also notice that the period covered by the 2015 study ends in 2011. So, in addition to “everywhere”, we’ve still got serious problems with the “all the time” part of the Big Data promise, too.

The other thing that reminded me of data gaps was a post on the Lowy Institute’s Interpreter blog about Myanmar’s military, the Tatmadaw. According to Andrew Selth,

Despite its dominance of Burma’s national affairs for decades, the Tatmadaw remains in many respects a closed book. Even the most basic data is beyond the reach of analysts and other observers. For example, the Tatmadaw’s current size is a mystery, although most estimates range between 300,000 and 350,000. Official statistics put Burma’s defence expenditure this year at 3.7 % of GDP, but the actual level is unknown.

This kind of situation may be especially pernicious. It looks like we have data—350,000 troops, 3.7 percent of GDP—but the subject-matter expert knows that those data are not reliable. For those of us trying to do cross-national analysis of things like conflict dynamics or coup risk, the temptation to plow ahead with the numbers we have is strong, but we shouldn’t trust the inferences we draw from them.

The size and capability of a country’s military are obviously political matters. It’s not hard to imagine why governments might want to mislead others about the true values of those statistics.

Measuring poverty might seem less political and thus more amenable to technical fixes or workarounds, but that really isn’t true. At each step in the measurement process, the people being observed or doing the observing may have reasons to obscure or mislead. Survey respondents might not trust their observers; they may fear the personal or social consequences of answering or not answering certain ways, or just not like the intrusion. When the collection is automated, they may develop ways to fool the routines. Local officials who sometimes oversee the collection of those data may be tempted to fudge numbers that affect their prospects for promotion or permanent exile. National governments might seek to mislead other governments as a way to make their countries look stronger or weaker than they really are—stronger to deter domestic and international adversaries or get a leg up in ideological competitions, or weaker to attract aid or other help.

As social scientists, we dream of data sets that reliably track all sorts of human behavior. Our training should also make us sensitive to the many reasons why that dream is impossible and, in many cases, undesirable. Measurement begets knowledge; knowledge begets power; and struggles over power will never end.

Big Data Doesn’t Automatically Produce Better Predictions

At FiveThirtyEight, Neil Paine and Rob Arthur report on an intriguing puzzle:

In an age of unprecedented baseball data, we somehow appear to be getting worse at knowing which teams are — and will be — good.

Player-level predictions are as good as, if not better than, they used to be, but team-level predictions of performance are getting worse. Paine and Arthur aren’t sure why, but they rank a couple of trends in the industry (significant changes in the age structure of the league’s players and, ironically, the increased use of predictive analytics in team management) among the likely culprits.

This story nicely illustrates a fact that breathless discussions of the power of “Big Data” often elide: more and better data don’t automatically lead to more accurate predictions. Observation and prediction are interrelated, but the latter does not move in lock step with the former. At least two things can weaken the link between those two steps in the analytical process.

First, some phenomena are just inherently difficult or impossible to predict with much accuracy. That’s not entirely true of baseball; as Paine and Arthur show, team-level performance predictions have been pretty good in the past. It is true of many other phenomena or systems, however. Take earthquakes; we can now detect and record these events with tremendous precision, but we’re still pretty lousy at anticipating when they’ll occur and how strong they will be. So far, better observation hasn’t led to big gains in prediction.

Second, the systems we’re observing sometimes change, even as we get better at observing them. This is what Paine and Arthur imply is occurring in baseball when they identify trends in the industry as likely explanations for a decline in the predictive power of models derived from historical data. It’s like trying to develop a cure for a disease that’s evolving rapidly as you work on it; the cure you develop in the lab might work great on the last version you captured, but by the time you deploy it, the disease has evolved further, and the treatment doesn’t have much effect.

I wonder if this is also the trajectory social science will follow over the next few decades. Right now, we’re getting hit by the leading edge of what will probably be a large and sustained flood tide of new data on human behavior. That inflow is producing some rather optimistic statements about how predictable human behavior in general, and sometimes politics in particular, will become as we discover deeper patterns in those data.

I don’t share that confidence. A lot of human behavior is predictably routine, and a torrent of new behavioral data will almost certainly make us even better at predicting these actions and events. For better or for worse, though, those routines are not especially interesting or important to most political scientists. Political scientists are more inclined to focus on “high” politics, which remains fairly opaque, or on system-level outcomes like wars and revolutions that emerge from changes in individual-level behavior in non-obvious ways. I suspect we’ll get a little better at predicting these things as we accumulate richer data on various parts of those systems, but I am pretty sure we won’t ever get great at it. The processes are too complex, and the systems themselves are constantly evolving, maybe even at an accelerating rate.

Finding the Right Statistic

Earlier this week, Think Progress reported that at least five black women have died in police custody in the United States since mid-July. The author of that post, Carimah Townes, wrote that those deaths “[shine] an even brighter spotlight on the plight of black women in the criminal justice system and [fuel] the Black Lives Matter movement.” I saw the story on Facebook, where the friend who posted it inferred that “a disproportionate percentage of those who died in jail are from certain ethnic minorities.”

As a citizen, I strongly support efforts to draw attention to implicit and explicit racism in the U.S. criminal justice system, and in the laws that system is supposed to enforce. The inequality of American justice across racial and ethnic groups is a matter of fact, not opinion, and its personal and social costs are steep.

As a social scientist, though, I wondered how much the number in that Think Progress post — five — really tells us. To infer bias, we need to make comparisons to other groups. How many white women died in police custody during that same time? What about black men and white men? And so on for other subsets of interest.

Answering those questions would still get us only partway there, however. To make the comparisons fair, we would also need to know how many people from each of those groups passed through police custody during that time. In epidemiological jargon, what we want are incidence rates for each group: the number of cases from some period divided by the size of the population during that period. Here, cases are deaths, and the population of interest is the number of people from that group who spent time in police custody.
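To make the arithmetic concrete, here is a minimal sketch in Python of the incidence-rate calculation described above. The group names and counts are invented purely for illustration; they are not real data, and the helper `incidence_rate` is a hypothetical name.

```python
def incidence_rate(cases, population, per=100_000):
    """Cases divided by the exposed population, scaled to a rate per `per` people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases / population * per


# Hypothetical counts, invented for illustration only (not real data).
groups = {
    "group_a": {"deaths": 5, "in_custody": 40_000},
    "group_b": {"deaths": 8, "in_custody": 100_000},
}

rates = {
    name: incidence_rate(g["deaths"], g["in_custody"])
    for name, g in groups.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} deaths per 100,000 people in custody")
```

In this made-up case, the group with fewer deaths has the higher rate, because far fewer of its members passed through custody. That reversal is exactly what a raw count cannot show.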

I don’t have those data for the United States for the second half of July, and I doubt that they exist in aggregate at this point. What we do have now, however, is a U.S. Department of Justice report from October 2014 on mortality in local jails and state prisons (PDF). This isn’t exactly what we’re after, but it’s close.

So what do those data say? Here’s an excerpt from Table 6, which reports the “mortality rate per 100,000 local jail inmates by selected decedent characteristics, 2000–2012”:

                    2008     2009     2010     2011     2012
By Sex
Male                 123      129      125      123      129
Female               120      120      124      122      123

Race/Hispanic Origin
White                185      202      202      212      220
Black/Afr. Am.       109      100      102       94      109
Hispanic/Latino       70       71       58       67       60
Other                 41       53       36       28       31

Given what we know about the inequality of American justice, these figures surprised me. According to data assembled by the DOJ, the mortality rate of blacks in local jails in those recent years was about half the rate for whites. For Latinos, it was about one-third the rate for whites.
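The “about half” and “about one-third” comparisons can be checked directly against the excerpt from Table 6. The sketch below, in Python, takes a simple unweighted mean of the five annual rates for each group. That averaging is a simplifying assumption of this example, not something the DOJ report does, and it ignores year-to-year changes in the size of the jail population.

```python
from statistics import mean

# Mortality rates per 100,000 local jail inmates, 2008-2012,
# copied from the Table 6 excerpt above.
table6 = {
    "White": [185, 202, 202, 212, 220],
    "Black/Afr. Am.": [109, 100, 102, 94, 109],
    "Hispanic/Latino": [70, 71, 58, 67, 60],
    "Other": [41, 53, 36, 28, 31],
}

# Unweighted mean of the annual rates (an assumption of this sketch).
avg = {group: mean(values) for group, values in table6.items()}

# Each group's average rate relative to the average rate for whites.
ratio_to_white = {group: avg[group] / avg["White"] for group in avg}

for group in table6:
    print(f"{group:<16} mean rate {avg[group]:6.1f}   "
          f"ratio to White {ratio_to_white[group]:.2f}")
```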

That table got me wondering why those rates were so different from what I’d expected. Table 8 in the same report offers some clues. It provides death rates by cause for each of those same subgroups for the whole 13-year period. According to that table, white inmates committed suicide in local jails at a much higher rate than blacks and Latinos: 80 per 100,000 versus 14 and 25, respectively. Those figures jibe with ones on suicide rates for the general population. White inmates also died from heart disease and drug and alcohol intoxication at a higher rate than their black and Latino counterparts. In short, it looks like whites are more likely than blacks or Latinos to die while in local jails, mostly because they are much more likely to commit suicide there.

These statistics tell us nothing about whether or not racism or malfeasance played a role in the deaths of any of those five black women mentioned in the Think Progress post. They also provide a woefully incomplete picture of the treatment of different racial and ethnic groups by police and the U.S. criminal justice system. For example, as FiveThirtyEight reported just a few days ago, DOJ statistics also show that the rate of arrest-related deaths by homicide is almost twice as high for blacks as whites (3.4 per 100,000 compared to 1.8). In many parts of the U.S., blacks convicted of murder are more likely than their white counterparts to get the death penalty, even when controlling for similarities in the crimes involved and especially when the victims were white (see here). A 2013 Pew Research Center study found that, in 2010, black men were six times as likely as white men to be incarcerated in federal, state and local jails.

Bearing all of that in mind, what I hope those figures do is serve as a simple reminder that, when mustering evidence of a pattern, it’s important to consider the right statistic for the question. Raw counts will rarely be that statistic. If we want to make comparisons across groups, we need to think about differences in group size and other factors that might affect group exposure, too.


The Armed Conflict Location & Event Data Project, a.k.a. ACLED, produces up-to-date event data on certain kinds of political conflict in Africa and, as of 2015, parts of Asia. In this post, I’m not going to dwell on the project’s sources and methods, which you can read about on ACLED’s About page, in the 2010 journal article that introduced the project, or in the project’s user’s guides. Nor am I going to dwell on the necessity of using all political event data sets, including ACLED, with care—understanding the sources of bias in how they observe events and error in how they code them and interpreting (or, in extreme cases, ignoring) the resulting statistics accordingly.

Instead, my only aim here is to share an R script I’ve written that largely automates the process of downloading and merging ACLED’s historical and current Africa data and then creates a new data frame with counts of events by type at the country-month level. If you use ACLED in R, this script might save you some time and some space on your hard drive.

You can find the R script on GitHub, here.

The chief problem with this script is that the URLs and file names of ACLED’s historical and current data sets change with every update, so the code will need to be modified each time that happens. If the names were modular and the changes to them predictable, it would be easy to rewrite the code to keep up with those changes automatically. Unfortunately, they aren’t, so the best I can do for now is to give step-by-step instructions in comments embedded in the script on how to update the relevant four fields by hand. As long as the basic structure of the .csv files posted by ACLED doesn’t change, though, the rest should keep working.

[UPDATE: I revised the script so it will scrape the link addresses from the ACLED website and parse the file names from them. The new version worked after ACLED updated its real-time file earlier today, when the old version would have broken. Unless ACLED changes its file-naming conventions or the structure of its website, the new version should work for the rest of 2015. In case it does fail, instructions on how to hard-code a workaround are included as comments at the bottom of the script.]

It should also be easy to adapt the part of the script that generates country-month event counts to slice the data even more finely, or to count by something other than event type. To do that, you would just need to add variables to the group_by() part of the block of code that produces those counts. For example, if you wanted to get counts of events by type at the level of the state or province, you would revise that line to read group_by(gwno, admin1, year, month, event_type). Or, if you wanted country-month counts of events by the type(s) of actor involved, you could use group_by(gwno, year, month, interaction) and then see this user’s guide to decipher those codes. You get the drift.
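The script itself is written in R, but the grouping logic is easy to illustrate in any language. The following is a minimal standard-library Python sketch of the same idea, not a translation of the script. The field names (gwno, admin1, year, month, event_type) come from the group_by() calls above; the records, their values, and the helper `count_events` are invented for illustration.

```python
from collections import Counter

# Toy event records, invented for illustration. Only the field names
# come from the post; the values are placeholders, not ACLED data.
events = [
    {"gwno": 1, "admin1": "North", "year": 2015, "month": 1, "event_type": "Battle"},
    {"gwno": 1, "admin1": "North", "year": 2015, "month": 1, "event_type": "Battle"},
    {"gwno": 1, "admin1": "South", "year": 2015, "month": 1, "event_type": "Battle"},
    {"gwno": 1, "admin1": "South", "year": 2015, "month": 1, "event_type": "Riots/Protests"},
    {"gwno": 2, "admin1": "East", "year": 2015, "month": 2, "event_type": "Battle"},
]


def count_events(records, keys):
    """Count records for each distinct combination of the given key fields."""
    return Counter(tuple(rec[k] for k in keys) for rec in records)


# Country-month counts by event type.
country_month = count_events(events, ["gwno", "year", "month", "event_type"])

# Adding admin1 to the grouping keys slices the same data by province.
province_month = count_events(
    events, ["gwno", "admin1", "year", "month", "event_type"]
)

for key, n in sorted(province_month.items()):
    print(key, n)
```

The only thing that changes between the two calls is the list of grouping keys, which is the same move as adding a variable to group_by() in the R script.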

The script also shows a couple of examples of how to use ‘ggplot2’ to generate time-series plots of those monthly counts. Here’s one I made of monthly counts of battle events by country for the entire period covered by ACLED as of this writing: January 1997–June 2015. A production-ready version of this plot would require some more tinkering with the size of the country names and the labeling of the x-axis, but this kind of small-multiples chart offers a nice way to explore the data before analysis.

Monthly counts of battle events, January 1997-June 2015


If you use the script and find flaws in it or have ideas on how to make it work better or do more, please email me at ulfelder <at> gmail <dot> com.

Another Tottering Step Toward a New Era of Data-Making

Ken Benoit, Drew Conway, Benjamin Lauderdale, Michael Laver, and Slava Mikhaylov have an article forthcoming in the American Political Science Review that knocked my socks off when I read it this morning. Here is the abstract from the ungated version I saw:

Empirical social science often relies on data that are not observed in the field, but are transformed into quantitative variables by expert researchers who analyze and interpret qualitative raw sources. While generally considered the most valid way to produce data, this expert-driven process is inherently difficult to replicate or to assess on grounds of reliability. Using crowd-sourcing to distribute text for reading and interpretation by massive numbers of non-experts, we generate results comparable to those using experts to read and interpret the same texts, but do so far more quickly and flexibly. Crucially, the data we collect can be reproduced and extended transparently, making crowd-sourced datasets intrinsically reproducible. This focuses researchers’ attention on the fundamental scientific objective of specifying reliable and replicable methods for collecting the data needed, rather than on the content of any particular dataset. We also show that our approach works straightforwardly with different types of political text, written in different languages. While findings reported here concern text analysis, they have far-reaching implications for expert-generated data in the social sciences.

The data-making strategy they develop is really innovative, and the cost of implementing it is, I estimate from the relevant tidbits in the paper, 2–3 orders of magnitude lower than the cost of the traditional expert-centric approach. In other words, this is potentially a BIG DEAL for social-science data-making, which, as Sinan Aral reminds us, is a BIG DEAL for doing better social science.

That said, I do wonder how much structure is baked into the manifesto-coding task that isn’t there in most data-making problems, and that makes it especially well suited to the process the authors develop. In the exercise the paper describes:

  1. The relevant corpus (party manifestos) is self-evident, finite, and not too large;
  2. The concepts of interest (economic vs. social policy, left vs. right) are fairly intuitive; and
  3. The inferential task is naturally “fractal”; that is, the concepts of interest inhere in individual sentences (and maybe even words) as well as whole documents.

None of those attributes holds when it comes to coding latent socio-political structural features like de facto forms of government (a.k.a. regime type) or whether or not a country is in a state of civil war. These features are fundamental to analyses of international politics, but the high cost of producing them means that we sometimes don’t get them at all, and when we do, we usually don’t get them updated as quickly or as often as we would need to do more dynamic analysis and prediction. Maybe it’s my lack of imagination, but I can’t quite see how to extend the authors’ approach to those topics without stretching it past the breaking point. I can think of ways to keep the corpus manageable, but the concepts are not as intuitive, and the inferential task is not fractal. Ditto for coding event data, where I suspect that 2 from the list above would mostly hold; 3 would sometimes hold; but 1 absolutely would not.*

In short, I’m ga-ga about this paper and the directions in which it points us, but I’m not ready yet to declare imminent victory in the struggle to drag political science into a new and much healthier era of data-making. (Fool me once…)

* If you think I’m overlooking something here, please leave a comment explaining how you think it might be do-able.

Visualizing Strike Activity in China

In my last post, I suggested that the likelihood of social unrest in China is probably higher than a glance at national economic statistics would suggest, because those statistics conceal the fact that economic malaise is hitting some areas much harder than others and local pockets of unrest can have national effects (ask Mikhail Gorbachev about that one). Near the end of the post, I effectively repeated this mistake by showing a chart that summarized strike activity over the past few years…at the national level.

So, what does the picture look like if we disaggregate that national summary?

The best current data on strike activity in China come from China Labour Bulletin (CLB), a Hong Kong–based NGO that collects incident reports from various Chinese-language sources, compiles them in a public data set, and visualizes them in an online map. Those data include a few fields that allow us to disaggregate our analysis, including the province in which an incident occurred (Location), the industry involved (Industry), and the claims strikers made (Demands). On May 28, I downloaded a spreadsheet with data for all available dates (January 2011 to the present) for all types of incidents and wrote an R script that uses small multiples to compare strike activity across groups within each of those categories.

First, here’s the picture by province. This chart shows that Guangdong has been China’s most strike-prone province over the past several years, but several other provinces have seen large increases in labor unrest in the past two years, including Henan, Hebei, Hubei, Shandong, Sichuan, and Jiangsu. Right now, I don’t have monthly or quarterly province-level data on population size and economic growth to model the relationship among these things, but a quick eyeballing of the chart from the FT in my last post indicates that these more strike-prone provinces skew toward the lower end of the range of recent GDP growth rates, as we would expect.


Now here’s the picture by industry. This chart makes clear that almost all of the surge in strike activity in the past year has come from two sectors: manufacturing and construction. Strikes in the manufacturing sector have been trending upward for a while, but the construction sector really got hit by a wave in just the past year that crested around the time of the Lunar New Year in early 2015. Other sectors also show signs of increased activity in recent months, though, including services, mining, and education, and the transportation sector routinely contributes a non-negligible slice of the national total.


And, finally, we can compare trends over time in strikers’ demands. This analysis took a little more work, because the CLB data on Demands do not follow best coding practices in which a set of categories is established a priori and each demand is assigned to one of those categories. In the CLB data, the Demands field is a set of comma-delimited phrases that are mostly but not entirely standardized (e.g., “wage arrears” and “social security” but also “reduction of their operating territory” and “gas-filing problem and too many un-licensed cars”). So, to aggregate the data on this dimension, I created a few categories of my own and used searches for regular expressions to find records that belonged in them. For example, all events for which the Demands field included “wage arrear”, “pay”, “compensation”, “bonus” or “ot” got lumped together in a Pay category, while events involving claims marked as “social security” or “pension” got combined in a Social Security category (see the R script for details).
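As a rough illustration of that kind of regular-expression bucketing, here is a short Python sketch. It is not the R script; the search terms are the ones listed above, while the case-insensitive matching, the word boundaries around “ot” (so that it does not fire on words like “not”), and the helper name `categorize` are assumptions of this example.

```python
import re

# Search terms taken from the examples above. Case-insensitivity and the
# word boundaries around "ot" are assumptions of this sketch.
CATEGORIES = {
    "Pay": re.compile(r"wage arrear|pay|compensation|bonus|\bot\b", re.IGNORECASE),
    "Social Security": re.compile(r"social security|pension", re.IGNORECASE),
}


def categorize(demands):
    """Return the sorted categories whose pattern appears in a Demands string."""
    return sorted(
        name for name, pattern in CATEGORIES.items() if pattern.search(demands)
    )


examples = [
    "wage arrears, social security",
    "pension",
    "reduction of their operating territory",
]

for text in examples:
    print(f"{text!r} -> {categorize(text)}")
```

Because one record can list several demands, a single event can land in more than one category, so category counts built this way need not sum to the number of events.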

The results appear below. As CLB has reported, almost all of the strike activity in China is over pay, usually wage arrears. There’s been an uptick in strikes over layoffs in early 2015, but getting paid better, sooner, or at all for work performed is by far the chief concern of strikers in China, according to these data.


In closing, a couple of caveats.

First, we know these data are incomplete, and we know that we don’t know exactly how they are incomplete, because there is no “true” record to which they can be compared. It’s possible that the apparent increase in strike activity in the past year or two is really the result of more frequent reporting or more aggressive data collection on a constant or declining stock of events.

I doubt that’s what’s happening here, though, for two reasons. One, other sources have reported that the Chinese government has actually gotten more aggressive about censoring reports of social unrest in the past two years, so if anything we should expect the selection bias from that process to bend the trend in the opposite direction. Two, theory derived from historical observation suggests that strike activity should increase as the economy slows and the labor market tightens, and the observed data are consistent with those expectations. So, while the CLB data are surely incomplete, we have reason to believe that the trends they show are real.

Second, the problem I originally identified at the national level also applies at these levels. China’s provinces are larger than many countries in the world, and industry segments like construction and manufacturing contain a tremendous variety of activities. To really escape the ecological fallacy, we would need to drill down much further to the level of specific towns, factories, or even individuals. As academics would say, though, that task lies beyond the scope of the current blog post.

An Applied Forecaster’s Bad Dream

This is the sort of thing that freaks me out every time I’m getting ready to deliver or post a new set of forecasts:

In its 2015 States of Fragility report, the Organization for Economic Co-operation and Development (OECD) decided to complicate its usual one-dimensional list of fragile states by assessing five dimensions of fragility: Violence, Justice, Institutions, Economic Foundations and Resilience…

Unfortunately, something went wrong during the calculations. In my attempts to replicate the assessment, I found that the OECD misclassified a large number of states.

That’s from a Monkey Cage post by Thomas Leo Scherer, published today. Here, per Scherer, is why those errors matter:

Recent research by Judith Kelley and Beth Simmons shows that international indicators are an influential policy tool. Indicators focus international attention on low performers to positive and negative effect. They cause governments in poorly ranked countries to take action to raise their scores when they realize they are being monitored or as domestic actors mobilize and demand change after learning how they rate versus other countries. Given their potential reach, indicators should be handled with care.

For individuals or organizations involved in scientific or public endeavors, the best way to mitigate that risk is transparency. We can and should argue about concepts, measures, and model choices, but given a particular set of those elements, we should all get essentially the same results. When one or more of those elements is hidden, we can’t fully understand what the reported results represent, and researchers who want to improve the design by critiquing and perhaps extending it are forced to box shadows. Also, individuals and organizations can double- and triple-check their own work, but errors are almost inevitable. When getting the best possible answers matters more than the risk of being seen making mistakes, then transparency is the way to go. This is why the Early Warning Project shares the data and code used to produce its statistical risk assessments in a public repository, and why Reinhart and Rogoff probably (hopefully?) wish they’d done something similar.

Of course, even though transparency improves the probability of catching errors and improving on our designs, it doesn’t automatically produce those goods. What’s more, we can know that we’re doing the right thing and still dread the public discovery of an error. Add to that risk the near-certainty of other researchers scoffing at your terrible code, and it’s easy to see why even the best practices won’t keep you from breaking out in a cold sweat each time you hit “Send” or “Publish” on a new piece of work.


The Myth of Comprehensive Data

“What about using Twitter sentiment?”

That suggestion came to me from someone at a recent Data Science DC meetup, after I’d given a short talk on assessing risks of mass atrocities for the Early Warning Project, and as the next speaker started his presentation on predicting social unrest. I had devoted the first half of my presentation to a digression of sorts, talking about how the persistent scarcity of relevant public data still makes it impossible to produce global forecasts of rare political crises—things like coups, insurgencies, regime breakdowns, and mass atrocities—that are as sharp and dynamic as we would like.

The meetup wasn’t the first time I’d heard that suggestion, and I think all of the well-intentioned people who have made it to me have believed that data derived from Twitter would escape or overcome those constraints. In fact, the Twitter stream embodies them. Over the past two decades, technological, economic, and political changes have produced an astonishing surge in the amount of information available from and about the world, but that surge has not occurred evenly around the globe.

Think of the availability of data as plant life in a rugged landscape, where dry peaks are places of data scarcity and fertile valleys represent data-rich environments. The technological developments of the past 20 years are like a weather pattern that keeps dumping more and more rain on that topography. That rain falls unevenly across the landscape, however, and it doesn’t have the same effect everywhere it lands. As a result, plants still struggle to grow on many of those rocky peaks, and much of the new growth occurs where water already collected and flora were already flourishing.

The Twitter stream exemplifies this uneven distribution of data in a couple of important ways. Take a look at the map below, a screenshot I took after letting Tweetping run for about 16 hours spanning May 6–7, 2015. The brighter the glow, the more Twitter activity Tweetping saw.

[Map: Tweetping screenshot of Twitter activity, 15:30 on May 6 to 08:05 on May 7, 2015]

Some of the spatial variation in that map reflects differences in the distribution of human populations, but not all of it. Here’s a map of population density, produced by Daysleeper using data from CIESIN (source). If you compare this one to the map of Twitter usage, you’ll see that they align pretty well in Europe, the Americas, and some parts of Asia. In Africa and other parts of Asia, though, not so much. If it were just a matter of population density, then India and eastern China should burn brightest, but they, and especially China, are relatively dark compared to “the West.” Meanwhile, in Africa, we see pockets of activity, but there are whole swathes of the continent that are populated as densely as, or more densely than, the brighter parts of South America, and from which we see virtually no Twitter activity.

[Map: world population density]

So why are some pockets of human settlement less visible than others? Two forces stand out: wealth and politics.

First and most obvious, access to Twitter depends on electricity and telecommunications infrastructure and gadgets and literacy and health and time, all of which are much scarcer in poorer parts of the world than they are in richer places. The map below shows lights at night, as seen from space by U.S. satellites 20 years ago and then mapped by NASA (source). These light patterns are sometimes used as a proxy for economic development (e.g., here).


This view of the world helps explain some of the holes in our map of Twitter activity, but not all of them. For example, many of the densely populated parts of Africa don’t light up much at night, just as they don’t on Tweetping, because they lack the relevant infrastructure and power production. Even 20 years ago, though, India and China looked much brighter through this lens than they do on our Twitter usage map.

So what else is going on? The intensity and character of Twitter usage also depend on freedoms of information and speech (the ability and desire to access the platform and to speak openly on it), and this political layer keeps other areas in the dark in that Tweetping map. China, North Korea, Cuba, Ethiopia, Eritrea: if you’re trying to anticipate important political crises, these are all countries you would want to track closely, but Twitter is barely used or unavailable in all of them as a direct or indirect consequence of public policy. And, of course, there are also many places where Twitter is accessible and used but censorship distorts the content of the stream. For example, Saudi Arabia lights up pretty well on the Twitter-usage map, but it’s hard to imagine people speaking freely on it when a tweet can land you in prison.

Clearly, wealth and political constraints still strongly shape the view of the world we can get from new data sources like Twitter. Contrary to the heavily-marketed myth of “comprehensive data,” poverty and repression continue to keep large swathes of the world out of our digital sight, or to distort the glimpses we get of them.

Unfortunately for efforts to forecast rare political crises, those two structural features that so strongly shape the production and quality of data also correlate with the risks we want to anticipate. The map below shows the Early Warning Project’s most recent statistical assessments of the risk of onsets of state-led mass-killing episodes. Now flash back to the visualization of Twitter usage above, and you’ll see that many of the countries colored most brightly on this map are among the darkest on that one. Even in 2015, the places about which we most need more information to sharpen our forecasts of rare political crises are the ones that are still hardest to see.

Statistically, this is the second-worst of all possible worlds, the worst one being the total absence of information. Data are missing not at random, and the processes producing those gaps are the same ones that put places at greater risk of mass atrocities and other political calamities. This association means that models we estimate with those data will often be misleading. There are ways to mitigate these problems, but they aren’t necessarily simple, cheap, or effective, and that’s before we even start in on the challenges of extracting useful measures from something as heterogeneous and complex as the Twitter stream.
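To make that point a bit more concrete, here is a toy simulation in Python. It is only a sketch under made-up assumptions, not anything drawn from the Early Warning Project's models or from real data: the `simulate` function, the single "repression" score, and every number in it are hypothetical. Each simulated place gets a repression score between 0 and 1; higher scores raise the chance of a crisis and lower the chance that the place shows up in our data at all. The crisis rate we would compute from the visible places alone then comes out lower than the true rate.

```python
import random


def simulate(n_places=100_000, seed=42):
    """Compare the true crisis rate with the rate seen in visible data.

    All parameters are illustrative assumptions, not estimates.
    """
    rng = random.Random(seed)
    crises_all = 0   # crises across every simulated place
    crises_seen = 0  # crises among places that appear in the data
    n_seen = 0       # number of places that appear in the data

    for _ in range(n_places):
        # Hypothetical repression score, uniform on [0, 1).
        repression = rng.random()
        # Assumption: more repression means higher crisis risk.
        crisis = rng.random() < 0.02 + 0.20 * repression
        # Assumption: more repression means the place is less visible.
        visible = rng.random() < 1.0 - 0.90 * repression

        crises_all += crisis
        if visible:
            n_seen += 1
            crises_seen += crisis

    return crises_all / n_places, crises_seen / n_seen


true_rate, observed_rate = simulate()
print(f"true crisis rate:     {true_rate:.3f}")
print(f"rate in visible data: {observed_rate:.3f}")
```

Because the same factor drives both the risk and the gaps, the visible sample is tilted toward low-risk places, and any summary computed from it understates the risk. Real cases are messier than this, with many interacting factors rather than one score, but the direction of the problem is the same.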

So that’s what I see when I hear people suggest that social media or Google Trends or other forms of “digital exhaust” have mooted the data problems about which I so often complain. Lots of organizations are spending a lot of money trying to overcome these problems, but the political and economic topography producing them does not readily yield. The Internet is part of this complex adaptive system, not a space outside it, and its power to transform that system is neither as strong nor as fast-acting as many of us—especially in the richer and freer parts of the world—presume.
