The Myth of Comprehensive Data

“What about using Twitter sentiment?”

That suggestion came to me from someone at a recent Data Science DC meetup, after I’d given a short talk on assessing risks of mass atrocities for the Early Warning Project, and as the next speaker started his presentation on predicting social unrest. I had devoted the first half of my presentation to a digression of sorts, talking about how the persistent scarcity of relevant public data still makes it impossible to produce global forecasts of rare political crises—things like coups, insurgencies, regime breakdowns, and mass atrocities—that are as sharp and dynamic as we would like.

The meetup wasn’t the first time I’d heard that suggestion, and I think all of the well-intentioned people who have made it to me have believed that data derived from Twitter would escape or overcome those constraints. In fact, the Twitter stream embodies them. Over the past two decades, technological, economic, and political changes have produced an astonishing surge in the amount of information available from and about the world, but that surge has not occurred evenly around the globe.

Think of the availability of data as plant life in a rugged landscape, where dry peaks are places of data scarcity and fertile valleys represent data-rich environments. The technological developments of the past 20 years are like a weather pattern that keeps dumping more and more rain on that topography. That rain falls unevenly across the landscape, however, and it doesn’t have the same effect everywhere it lands. As a result, plants still struggle to grow on many of those rocky peaks, and much of the new growth occurs where water already collected and flora were already flourishing.

The Twitter stream exemplifies this uneven distribution of data in a couple of important ways. Take a look at the map below, a screenshot I took after letting Tweetping run for about 16 hours spanning May 6–7, 2015. The brighter the glow, the more Twitter activity Tweetping saw.

[Map: Tweetping screenshot of Twitter activity, 15:30 on May 6 to 08:05 on May 7, 2015]

Some of the spatial variation in that map reflects differences in the distribution of human populations, but not all of it. Here’s a map of population density, produced by Daysleeper using data from CIESIN (source). If you compare this one to the map of Twitter usage, you’ll see that they align pretty well in Europe, the Americas, and some parts of Asia. In Africa and other parts of Asia, though, not so much. If it were just a matter of population density, then India and eastern China should burn brightest, but they, and especially China, are relatively dark compared to “the West.” Meanwhile, in Africa, we see pockets of activity, but whole swathes of the continent are populated as densely as, or more densely than, the brighter parts of South America, and yet we see virtually no Twitter activity from them.

[Map: world population density]

So why are some pockets of human settlement less visible than others? Two forces stand out: wealth and politics.

First and most obvious, access to Twitter depends on electricity and telecommunications infrastructure and gadgets and literacy and health and time, all of which are much scarcer in poorer parts of the world than they are in richer places. The map below shows lights at night, as seen from space by U.S. satellites 20 years ago and then mapped by NASA (source). These light patterns are sometimes used as a proxy for economic development (e.g., here).

[Map: Earth’s lights at night]

This view of the world helps explain some of the holes in our map of Twitter activity, but not all of them. For example, many of the densely populated parts of Africa don’t light up much at night, just as they don’t on Tweetping, because they lack the relevant infrastructure and power production. Even 20 years ago, though, India and China looked much brighter through this lens than they do on our Twitter usage map.

So what else is going on? The intensity and character of Twitter usage also depend on freedoms of information and speech (the ability and desire to access the platform and to speak openly on it), and this political layer keeps other areas in the dark in that Tweetping map. China, North Korea, Cuba, Ethiopia, Eritrea: if you’re trying to anticipate important political crises, these are all countries you would want to track closely, but Twitter is barely used or unavailable in all of them as a direct or indirect consequence of public policy. And, of course, there are also many places where Twitter is accessible and used but censorship distorts the content of the stream. For example, Saudi Arabia lights up pretty well on the Twitter-usage map, but it’s hard to imagine people speaking freely on it when a tweet can land you in prison.

Clearly, wealth and political constraints still strongly shape the view of the world we can get from new data sources like Twitter. Contrary to the heavily marketed myth of “comprehensive data,” poverty and repression continue to keep large swathes of the world out of our digital sight, or to distort the glimpses we get of them.

Unfortunately for efforts to forecast rare political crises, those two structural features that so strongly shape the production and quality of data also correlate with the risks we want to anticipate. The map below shows the Early Warning Project’s most recent statistical assessments of the risk of onsets of state-led mass-killing episodes. Now flash back to the visualization of Twitter usage above, and you’ll see that many of the countries colored most brightly on this map are among the darkest on that one. Even in 2015, the places about which we most need more information to sharpen our forecasts of rare political crises are the ones that are still hardest to see.

[Map: Early Warning Project statistical risk assessments, world, 2014]

Statistically, this is the second-worst of all possible worlds, the worst one being the total absence of information. Data are missing not at random, and the processes producing those gaps are the same ones that put places at greater risk of mass atrocities and other political calamities. This association means that models we estimate with those data will often be misleading. There are ways to mitigate these problems, but they aren’t necessarily simple, cheap, or effective, and that’s before we even start in on the challenges of extracting useful measures from something as heterogeneous and complex as the Twitter stream.
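
To make that statistical point concrete, here is a toy simulation. It is a minimal sketch, not anything the Early Warning Project actually runs, and every number in it is an assumption chosen for illustration: a single latent score `z` stands in for poverty and repression, and the same score raises the chance of a crisis while lowering the chance that a place shows up in the data at all.

```python
import math
import random


def logistic(x):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate(n_units=20000, seed=0):
    """Return (true crisis rate, crisis rate among observed units, share observed).

    All coefficients below are hypothetical. The only feature that matters is
    that z pushes risk up and visibility down at the same time.
    """
    rng = random.Random(seed)
    crises_all = 0
    crises_seen = 0
    n_seen = 0
    for _ in range(n_units):
        z = rng.gauss(0.0, 1.0)                             # latent poverty/repression
        crisis = rng.random() < logistic(-3.0 + 1.5 * z)    # higher z, higher risk
        seen = rng.random() < logistic(1.0 - 2.0 * z)       # higher z, less visible
        crises_all += crisis
        if seen:
            n_seen += 1
            crises_seen += crisis
    return crises_all / n_units, crises_seen / n_seen, n_seen / n_units


true_rate, observed_rate, share_seen = simulate()
print(f"share of units observed:         {share_seen:.3f}")
print(f"crisis rate, all units:          {true_rate:.3f}")
print(f"crisis rate, observed units only: {observed_rate:.3f}")
```

Because the units that drop out of view are disproportionately the high-risk ones, the rate computed from the observed units understates the true rate, and collecting more data from the same visible places does not close that gap.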

So that’s what I see when I hear people suggest that social media or Google Trends or other forms of “digital exhaust” have mooted the data problems about which I so often complain. Lots of organizations are spending a lot of money trying to overcome these problems, but the political and economic topography producing them does not readily yield. The Internet is part of this complex adaptive system, not a space outside it, and its power to transform that system is neither as strong nor as fast-acting as many of us—especially in the richer and freer parts of the world—presume.

12 Comments

  1. Grant / May 7, 2015

    There’s also the problem of alternatives being used. In China, even if they abruptly made freedom of speech a real thing, I suspect that many users would stick with Weibo. So you would need to use more than just Twitter anyway.

    Also you’d need to be able to translate all of the different languages and different versions of the languages (not to mention slang and abbreviations) well enough to understand what people were even saying.

    • I was going to make this exact point. China has an equivalent to just about every form of social media and blogging site out there.

      I also saw a presentation by IOM at the SAS Global Forum last month on using Twitter to respond to Typhoon Haiyan and better plan refugee camps. It was interesting, but when I asked if they looked at tweets in Tagalog, they said they only looked at English language tweets – and that doesn’t even take into account the many dialects on various islands.

  2. In the words of Otto von Bismarck, politics is “the art of the possible”, and predicting the possible is evidently far more difficult than predicting what is somewhat certain to happen.

    Surely, Twitter (or Facebook, Google+ et cetera) can provide lawmakers with significant information about situations around the world, especially when tweets are written by users from distant areas that are not directly monitored.
    However, owning a potentially infinite amount of data needs plenty of men, skills and time to analyze them in order to select those truly worthy of being taken into account.

    Last but not least, as this article rightly notes, Twitter, just like other social networking sites, has an audience whose size is still too small. In my opinion, it would be vague probability calculation rather than plausible forecasting.

  3. Much of the promise of urban innovation has been hung on big data. You rightfully note that “big data” is not comprehensive or even representative. Still, there are solutions. Governmental administrative data can fill gaps, and triangulation of multiple datasets can provide insights. Yet there will always be “distorted glimpses” in our “digital sight,” as you note. New ways to connect those who are outside of the mainstream, our vision, and our Twitter feed must be developed. My work has focused on building sidewalks between “big data” and real life. The last mile is often the most difficult and most expensive, but there is great value in the investment.

  4. Interesting post. The data probably demonstrates visually survival of the monetarily fittest and the disproportionate influence of the wealthy in the Twitter-sphere.

  5. Reblogged this on My Oracle OLTP and Warehousing Adventures and commented:

    Using truly large data sources makes one rethink the RDBMS – we are presently playing with both Oracle, transactional and warehouse databases and HADOOP. Will update here on our progress, lack of or pure gobsmacked hilarity of our attempts at intelligence…

  6. Reblogged this on David Rayner and commented:
    Ubiquitous and fast can get close to predicting chaotic?

  7. Reblogged this on dyke writer and commented:

    That quantity and quality have an inverse relationship I think is something that all Mass Data Analyzers hit that bump and going with your own observational punchline or visceral rhetorical is better than looking for a random person to do better than you, the data expert.

  8. Jack / October 12, 2015

    Reblogged this on The Missal.

