Last May, I wrote a post about my preliminary efforts to use a new data set called GDELT to monitor reporting on atrocities around the world in near-real time. Those efforts represent one part of the work I’m doing on a public early-warning system for the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide, and they have continued in fits and starts over the ensuing eight months. With help from Dartmouth’s Dickey Center, Palantir, and the GDELT crew, we’ve made a lot of progress. I thought I’d post an update now because I’m excited about the headway we’ve made; I think others might benefit from seeing what we’re doing; and I hope this transparency can help us figure out how to do this task even better.
So, let’s cut to the chase: Here is a screenshot of an interactive map locating the nine events captured in GDELT in the first week of January 2014 that looked like atrocities to us and occurred in a place that the Google Maps API recognized when queried. (One event was left off the map because Google Maps didn’t recognize its reported location.) The size of the bubbles corresponds to the number of civilian deaths, which in this map range from one to 31. To really get a feel for what we’re trying to do, though, head over to the original visualization on CartoDB (here), where you can zoom in and out and click on the bubbles to see a hyperlink to the story from which each event was identified.
Looks simple, right? Well, it turns out it isn’t, not by a long shot.
As this blog’s regular readers know, GDELT uses software to scour the web for new stories about political interactions all around the world and parses those stories to identify and record information about who did or said what to whom, when, and where. It currently covers the period 1979-present and is now updated every day, and each of those daily updates contains some 100,000-140,000 new records. Miraculously and crucial to a non-profit pilot project like ours, GDELT is also available for free.
The nine events plotted in the map above were sifted from the tens of thousands of records GDELT dumped on us in the first week of 2014. Unfortunately, that data-reduction process is only partially automated.
The first step in that process is the quickest. As originally envisioned back in May, we are using an R script (here) to download GDELT’s daily update file and sift it for events that look, from the event type and actors involved, like they might involve what we consider to be an atrocity—that is, deliberate, deadly violence against one or more noncombatant civilians in the context of a wider political conflict.
Unfortunately, the stack of records that filtering script returns—something like 100-200 records per day—still includes a lot of stuff that doesn’t interest us. Some records are properly coded but involve actions that don’t meet our definition of an atrocity (e.g., clashes between rioters and police or rebels and troops); some involve atrocities but are duplicates of events we’ve already captured; and some are just miscoded (e.g., a mention of the film industry “shooting” movies that gets coded as soldiers shooting civilians).
After we saw how noisy our data set would be if we stopped screening there, we experimented with a monitoring system that would acknowledge GDELT’s imperfections and try to work with them. As Phil Schrodt recommended at the recent GDELT DC Hackathon, we looked to “embrace the suck.” Instead of trying to use GDELT to generate a reliable chronicle of atrocities around the world, we would watch for interesting and potentially relevant perturbations in the information stream, noise and all, and those perturbations would produce alerts that users of our system could choose to investigate further. Working with Palantir, we built a system that would estimate country-specific prior moving averages of daily event counts returned by our filtering script and would generate an alert whenever a country’s new daily count landed more than two standard deviations above or below that average.
That system sounded great to most of the data pros in our figurative room, but it turned out to be a non-starter with some other constituencies of importance to us. The issue was credibility. Some of the events causing those perturbations in the GDELT stream were exactly what we were looking for, but others—a pod of beached whales in Brazil, or Congress killing a bill on healthcare reform—were laughably far from the mark. If our supposedly high-tech system confused beached whales and Congressional procedures for mass atrocities, we would risk undercutting the reputation for reliability and technical acumen that we are striving to achieve.
So, back to the drawing board we went. To separate the signal from the static and arrive at something more like that valid chronicle we’d originally envisioned, we decided that we needed to add a second, more laborious step to our data-reduction process. After our R script had done its work, we would review each of the remaining records by hand to decide if it belonged in our data set or not and, when necessary, to correct any fields that appeared to have been miscoded. While we were at it, we would also record the number of deaths each event produced. We wrote a set of rules to guide those decisions; had two people (a Dartmouth undergraduate research assistant and I) apply those rules to the same sets of daily files; and compared notes and made fixes. After a few iterations of that process over a few months, we arrived at the codebook we’re using now (here).
This process radically reduces the amount of data involved. Each of those two steps drops us down multiple orders of magnitude: from 100,000-140,000 records in the daily updates, to about 150 in our auto-filtered set, to just one or two in our hand-filtered set. The figure below illustrates the extent of that reduction. In effect, we’re treating GDELT as a very powerful but error-prone search and coding tool, a source of raw ore that needs refining to become the thing we’re after. This isn’t the only way to use GDELT, of course, but for our monitoring task as presently conceived, it’s the one that we think will work best.
Once that second data-reduction step is done, we still have a few tasks left to enable the kind of mapping and analysis we aim to do. We want to trim the data set to keep only the atrocities we’ve identified, and we need to consolidate the original and corrected fields in those remaining records and geolocate them. All of that work gets done with a second R script (here), which is applied to the spreadsheet the coder saves after completing her work. The much smaller file that script produces is then ready to upload to a repository where it can be combined with other days’ outputs to produce the global chronicle our monitoring project aims to produce.
From start to finish, each daily update now takes about 45 minutes, give or take 15. We’d like to shrink that further if we can but don’t see any real opportunities to do so at the moment. Perhaps more important, we still have to figure out the bureaucratic procedures that will allow us to squeeze daily updates from a “human in the loop” process in a world where there are weekends and holidays and people get sick and take vacations and sometimes even quit. Finally, we also have not yet built the dashboard that will display and summarize and provide access to these data on our program’s web site, which we expect to launch some time this spring.
We know that the data set this process produces will be incomplete. I am 100-percent certain that during the first week of January 2014, more than 10 events occurred around the world that met our definition of an atrocity. Unfortunately, we can only find things where GDELT looks, and even a scan of every news story produced every day everywhere in the world would fail to see the many atrocities that never make the news.
On the whole, though, I’m excited about the progress we’ve made. As soon as we can launch it, this monitoring process should help advocates and analysts more efficiently track atrocities globally in close to real time. As our data set grows, we also hope it will serve as the foundation for new research on forecasting, explaining, and preventing this kind of violence. Even with its evident shortcomings, we believe this data set will prove to be useful, and as GDELT’s reach continues to expand, so will ours.
PS For a coda discussing the great ideas people had in response to this post, go here.
[Erratum: The original version of this post said there were about 10,000 records in each daily update from GDELT. The actual figure is 100,000-140,000. The error has been corrected and the illustration of data reduction updated accordingly.]