Playing Telephone with Data Science

You know the telephone game, where a bunch of people sit in a circle or around a table and pass a whispered sentence from person to person until it comes back to the one who started it? Then that person says the version they heard out loud, and you all crack up at how garbled it got.

Well, I wonder if John Beieler is cracking up or crying right now, because the same thing is happening with a visualization he created using data from the recently released Global Dataset on Events, Language, and Tone, a.k.a. GDELT.

Back at the end of July, John posted a terrific animated set of maps of protest activity worldwide since 1979. In a previous post on a single slice of the data used in that animation, John was careful to attach a number of caveats to the work: the maps only include events covered in the sources GDELT scours, GDELT sometimes codes events that didn’t happen, GDELT sometimes struggles to put events in their proper geographic location, event labels in the CAMEO event classification scheme GDELT uses don’t always mean what you think they mean, counts of events don’t tell you anything about the size or duration of the events being counted, etc., etc. In the blogged cover letter for the animated series, John added one more very important caveat about the apparent increase in the incidence of protest activity over time:

When dealing with the time-series of data, however, one additional, and very important, point also applies. The number of events recorded in GDELT grows exponentially over time, as noted in the paper introducing the dataset. This means that over time there appears to be a steady increase in events, but this should not be mistaken as a rise in the actual amount of behavior X (protest behavior in this case). Instead, due to changes in reporting and the digital recording of news stories, it is simply the case that there are more events of every type over time. In some preliminary work that is not yet publicly released, protest behavior seems to remain relatively constant over time as a percentage of the total number of events. This means that while there was an explosion of protest activity in the Middle East, and elsewhere, during the past few years, identifying visible patterns is a tricky endeavor due to the nature of the underlying data.

John’s post deservedly caught the eye of J. Dana Stuster, an assistant editor at Foreign Policy, who wrote a bit about it last week. Stuster’s piece was careful to repeat many of John’s caveats, but the headline (“Mapped: Every Protest on the Planet since 1979”) got sloppy, essentially shedding several of the most important qualifiers. As John had taken pains to note, what we see in the maps is not all that there is, and some of what’s shown in the maps didn’t really happen.

Well, you can probably see where this is going. Not long after that Foreign Policy piece appeared, I saw this tweet from Chelsea Clinton:

In fewer than 140 characters, Clinton impressively managed to put back the caveat Foreign Policy had dropped in its headline about press coverage vs. reality, but the message had already been garbled, and now it was going viral. Fast forward to this past weekend, when the phrase “Watch a Jaw-dropping Visualization of Every Protest since 1979” made repeated appearances in my Twitter timeline. This next iteration came from Ultraculture blogger Jason Louv, and it included this bit:

Also fruitful: Comparing this data with media coverage and treatment of protest. Why is it easy to think of the 1960s and 70s as a time of dissent and our time as a more ordered, controlled and conformist period when the data so clearly shows that there is no comparison in how much protest there is now compared to then? Media distortion much?

So now we get a version that ignores both the caveat about GDELT’s coverage not being exhaustive or perfect and the related one about the apparent increase in protest volume over time being at least in part an artifact of “changes in reporting and the digital recording of news stories.” What started out as a simple proof-of-concept exercise—“The areas that are ‘bright’ are those that would generally be expected to be so,” John wrote in his initial post—had been twisted into a definitive visual record of protest activity around the world in the past 35 years.
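To see why that second caveat matters so much, here is a minimal sketch of the normalization John describes, using entirely made-up numbers rather than anything drawn from GDELT. It assumes the total number of recorded events grows 15 percent a year while protests stay at a fixed 2 percent of the total; the growth rate, the share, the base count, and the function name `protest_share` are all hypothetical choices for illustration only.

```python
# Hypothetical yearly counts, invented for illustration; these are NOT GDELT figures.
# Assumption: total recorded events grow 15% per year, and protests are a
# constant 2% of whatever gets recorded.
years = list(range(1979, 2014))
total_events = [round(100_000 * 1.15 ** (y - 1979)) for y in years]
protest_events = [round(0.02 * t) for t in total_events]


def protest_share(protests, totals):
    """Return protest events as a fraction of all recorded events, year by year."""
    return [p / t for p, t in zip(protests, totals)]


shares = protest_share(protest_events, total_events)

# Compare the last year to the first, once on raw counts and once on shares.
raw_growth = protest_events[-1] / protest_events[0]
share_change = shares[-1] / shares[0]

print(f"Raw protest count grew {raw_growth:.0f}-fold between {years[0]} and {years[-1]}")
print(f"Protest share of all events changed by a factor of {share_change:.2f}")
```

In this toy setup the raw protest count balloons simply because everything is being recorded more often, while the share of protest in the record barely moves. A map or chart built on the raw counts would look like an explosion of dissent even though, by construction, nothing about the underlying behavior has changed.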

As someone who thinks that GDELT is an analytical gusher and believes that it’s useful and important to make work like this accessible to broader audiences, I don’t know what to learn from this example. John was as careful as could be, but the work still mutated as it spread. How do you prevent this from happening, or at least mitigate the damage when it does?

If anyone’s got some ideas, I’d love to hear them.


12 Comments

  1. Jonas

     /  August 26, 2013

    The medium is the message. We live in the telephone game medium now – every message is going to be distorted like this. People just have to sort it out after the initial wave of publicity is over. That’s the flip side of being fast with information. People will be forming opinions and taking action based on misunderstanding.

  2. I think the only way to pull it off is probably to embed the caveats in the most easily replicated form of the presentation. So have a little box in your animation with an abbreviated list or the like.

    Not that I practice this, but maybe I should.

    • Thanks. Brendan Nyhan suggested something similar on Twitter earlier today. It sounds like a good idea to me.

    • Jonas

       /  August 28, 2013

      I think the incentive is to promote virality over caveats/accuracy. If you do that, you’ll just be drowned out by people who are less caveated. Doesn’t mean it isn’t worthwhile to do for your own integrity, but the problem will continue.

      Also, bots will extract your presentation and simplify it and take your clicks.

  3. Felix

     /  August 27, 2013

    I get that this was an exercise in proof of concept, and a very valuable one. You ask what can be learned from the public reaction. How about this: don’t publish online a time-series of counts on a super-sexy topic if the data shouldn’t be read as counts.

