Last May, I wrote a post about my preliminary efforts to use a new data set called GDELT to monitor reporting on atrocities around the world in near-real time. Those efforts represent one part of the work I’m doing on a public early-warning system for the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide, and they have continued in fits and starts over the ensuing eight months. With help from Dartmouth’s Dickey Center, Palantir, and the GDELT crew, we’ve made a lot of progress. I thought I’d post an update now because I’m excited about the headway we’ve made; I think others might benefit from seeing what we’re doing; and I hope this transparency can help us figure out how to do this task even better.
So, let’s cut to the chase: Here is a screenshot of an interactive map locating the nine events captured in GDELT in the first week of January 2014 that looked like atrocities to us and occurred in a place that the Google Maps API recognized when queried. (One event was left off the map because Google Maps didn’t recognize its reported location.) The size of the bubbles corresponds to the number of civilian deaths, which in this map range from one to 31. To really get a feel for what we’re trying to do, though, head over to the original visualization on CartoDB (here), where you can zoom in and out and click on the bubbles to see a hyperlink to the story from which each event was identified.
Looks simple, right? Well, it turns out it isn’t, not by a long shot.
As this blog's regular readers know, GDELT uses software to scour the web for new stories about political interactions all around the world and parses those stories to identify and record information about who did or said what to whom, when, and where. It currently covers the period from 1979 to the present and is now updated every day, and each of those daily updates contains some 100,000-140,000 new records. Miraculously, and crucially for a non-profit pilot project like ours, GDELT is also available for free.
The nine events plotted in the map above were sifted from the tens of thousands of records GDELT dumped on us in the first week of 2014. Unfortunately, that data-reduction process is only partially automated.
The first step in that process is the quickest. As originally envisioned back in May, we are using an R script (here) to download GDELT’s daily update file and sift it for events that look, from the event type and actors involved, like they might involve what we consider to be an atrocity—that is, deliberate, deadly violence against one or more noncombatant civilians in the context of a wider political conflict.
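In skeleton form, that first step looks something like the sketch below. This is not the actual script linked above; the column positions, CAMEO event codes, and actor codes shown here are illustrative assumptions, and our real selection rules are longer and fussier.

```r
# A minimal sketch of the daily filtering step. Column positions and the
# event/actor codes below are illustrative assumptions, not our actual rules.
library(data.table)

gdelt.day <- "20140101"
url <- paste0("http://data.gdeltproject.org/events/", gdelt.day, ".export.CSV.zip")
zipfile <- paste0(gdelt.day, ".export.CSV.zip")
download.file(url, zipfile, mode = "wb")
daily <- fread(unzip(zipfile), sep = "\t", header = FALSE)

# Name only the fields used below, per the GDELT event codebook (positions assumed).
setnames(daily, c(1, 6, 16, 29),
         c("GLOBALEVENTID", "Actor1Code", "Actor2Code", "EventRootCode"))

# Keep records whose CAMEO root code suggests deadly violence (18 = assault,
# 20 = unconventional mass violence) and whose target actor includes a
# civilian (CIV) component. The real filter applies a longer set of rules.
candidates <- daily[EventRootCode %in% c(18, 20) & grepl("CIV", Actor2Code), ]
write.csv(candidates, paste0("filtered.", gdelt.day, ".csv"), row.names = FALSE)
```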
Unfortunately, the stack of records that filtering script returns—something like 100-200 records per day—still includes a lot of stuff that doesn’t interest us. Some records are properly coded but involve actions that don’t meet our definition of an atrocity (e.g., clashes between rioters and police or rebels and troops); some involve atrocities but are duplicates of events we’ve already captured; and some are just miscoded (e.g., a mention of the film industry “shooting” movies that gets coded as soldiers shooting civilians).
After we saw how noisy our data set would be if we stopped screening there, we experimented with a monitoring system that would acknowledge GDELT’s imperfections and try to work with them. As Phil Schrodt recommended at the recent GDELT DC Hackathon, we looked to “embrace the suck.” Instead of trying to use GDELT to generate a reliable chronicle of atrocities around the world, we would watch for interesting and potentially relevant perturbations in the information stream, noise and all, and those perturbations would produce alerts that users of our system could choose to investigate further. Working with Palantir, we built a system that would estimate country-specific prior moving averages of daily event counts returned by our filtering script and would generate an alert whenever a country’s new daily count landed more than two standard deviations above or below that average.
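Stripped to its essentials, that alert rule amounts to something like the following sketch. This is my own reconstruction, not the Palantir implementation, and the 30-day window and toy counts are assumptions for illustration.

```r
# Sketch of the alert rule: flag a country-day when its count of filtered
# events falls more than two standard deviations from its trailing average.
# The 30-day window and the simulated counts are assumptions.
set.seed(20140101)
counts <- data.frame(
  country = rep(c("NGA", "SYR"), each = 60),
  date    = rep(seq(as.Date("2013-11-01"), by = "day", length.out = 60), 2),
  n       = rpois(120, lambda = 5)
)

flag_alerts <- function(df, window = 30) {
  df <- df[order(df$date), ]
  alert <- rep(FALSE, nrow(df))
  for (i in (window + 1):nrow(df)) {
    prior <- df$n[(i - window):(i - 1)]
    alert[i] <- abs(df$n[i] - mean(prior)) > 2 * sd(prior)
  }
  df$alert <- alert
  df
}

alerts <- do.call(rbind, lapply(split(counts, counts$country), flag_alerts))
subset(alerts, alert)  # country-days that would have triggered an alert
```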
That system sounded great to most of the data pros in our figurative room, but it turned out to be a non-starter with some other constituencies of importance to us. The issue was credibility. Some of the events causing those perturbations in the GDELT stream were exactly what we were looking for, but others—a pod of beached whales in Brazil, or Congress killing a bill on healthcare reform—were laughably far from the mark. If our supposedly high-tech system confused beached whales and Congressional procedures for mass atrocities, we would risk undercutting the reputation for reliability and technical acumen that we are striving to achieve.
So, back to the drawing board we went. To separate the signal from the static and arrive at something more like that valid chronicle we’d originally envisioned, we decided that we needed to add a second, more laborious step to our data-reduction process. After our R script had done its work, we would review each of the remaining records by hand to decide if it belonged in our data set or not and, when necessary, to correct any fields that appeared to have been miscoded. While we were at it, we would also record the number of deaths each event produced. We wrote a set of rules to guide those decisions; had two people (a Dartmouth undergraduate research assistant and I) apply those rules to the same sets of daily files; and compared notes and made fixes. After a few iterations of that process over a few months, we arrived at the codebook we’re using now (here).
This process radically reduces the amount of data involved. Each of those two steps drops us down multiple orders of magnitude: from 100,000-140,000 records in the daily updates, to about 150 in our auto-filtered set, to just one or two in our hand-filtered set. The figure below illustrates the extent of that reduction. In effect, we’re treating GDELT as a very powerful but error-prone search and coding tool, a source of raw ore that needs refining to become the thing we’re after. This isn’t the only way to use GDELT, of course, but for our monitoring task as presently conceived, it’s the one that we think will work best.
Once that second data-reduction step is done, we still have a few tasks left to enable the kind of mapping and analysis we aim to do. We want to trim the data set to keep only the atrocities we've identified, and we need to consolidate the original and corrected fields in those remaining records and geolocate them. All of that work gets done with a second R script (here), which is applied to the spreadsheet the coder saves after completing her work. The much smaller file that script produces is then ready to upload to a repository, where it can be combined with other days' outputs into the global chronicle our monitoring project aims to produce.
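Again in skeleton form, that post-coding script does something like the sketch below. The column names and the use of ggmap's Google geocoder are stand-ins I've chosen for illustration, not necessarily what the linked script uses.

```r
# Sketch of the post-coding step: keep confirmed atrocities, prefer the
# coder's corrected fields over the originals, and geolocate each event.
# Column names and the ggmap geocoder are assumptions for illustration.
library(ggmap)  # geocode() queries the Google Maps API

coded <- read.csv("coded.20140101.csv", stringsAsFactors = FALSE)

# Keep only records the coder confirmed as atrocities.
events <- subset(coded, keep == 1)

# Where the coder supplied a corrected value, use it; otherwise keep the original.
consolidate <- function(original, corrected) {
  ifelse(is.na(corrected) | corrected == "", original, corrected)
}
events$location <- consolidate(events$location_original, events$location_corrected)
events$deaths   <- as.numeric(consolidate(events$deaths_original, events$deaths_corrected))

# Geolocate the consolidated place names for mapping.
coords <- geocode(events$location)
events <- cbind(events, coords)

write.csv(events, "atrocities.20140101.csv", row.names = FALSE)
```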
From start to finish, each daily update now takes about 45 minutes, give or take 15. We’d like to shrink that further if we can but don’t see any real opportunities to do so at the moment. Perhaps more important, we still have to figure out the bureaucratic procedures that will allow us to squeeze daily updates from a “human in the loop” process in a world where there are weekends and holidays and people get sick and take vacations and sometimes even quit. Finally, we also have not yet built the dashboard that will display and summarize and provide access to these data on our program’s web site, which we expect to launch some time this spring.
We know that the data set this process produces will be incomplete. I am 100-percent certain that during the first week of January 2014, more than 10 events occurred around the world that met our definition of an atrocity. Unfortunately, we can only find things where GDELT looks, and even a scan of every news story produced every day everywhere in the world would fail to see the many atrocities that never make the news.
On the whole, though, I’m excited about the progress we’ve made. As soon as we can launch it, this monitoring process should help advocates and analysts more efficiently track atrocities globally in close to real time. As our data set grows, we also hope it will serve as the foundation for new research on forecasting, explaining, and preventing this kind of violence. Even with its evident shortcomings, we believe this data set will prove to be useful, and as GDELT’s reach continues to expand, so will ours.
PS For a coda discussing the great ideas people had in response to this post, go here.
[Erratum: The original version of this post said there were about 10,000 records in each daily update from GDELT. The actual figure is 100,000-140,000. The error has been corrected and the illustration of data reduction updated accordingly.]
Felix Haass (@felixhaass) / January 14, 2014
Jay, this is fascinating work and I am certain you and your team will come up with a system that will be useful for both academics and practitioners.
Just a quick suggestion that came to my mind when I saw this:
and this comment in your codebook:
I don’t know much about this stuff, but your problem looks a lot like the challenge email spam filters face, i.e., a classic machine learning problem: how can your email program automatically classify certain messages as spam while ignoring others? I’m pretty certain there is a way to “teach” an algorithm many of the decisions you as coders make. Some of these techniques are already applied in the core GDELT script, if I am not mistaken, but you would need to write a script that fits your specific needs for atrocity detection. Since you have a pretty uniform data source (news texts), this should be possible, although you’d probably need to teach the algorithm some of the key words manually (e.g., to exclude film shooting and related terms). Although this might not automate your procedure completely, you could still end up with a pre-categorization (similar to spam filters) that speeds up the process.
There are many online tutorials and books on the topic, from what I gather from the #rstats Twitter feed and r-bloggers.com, since I am not really an expert on the subject. 🙂 I haven’t read the book, but Machine Learning with R has received some pretty good reviews on r-bloggers. The sample chapter actually deals with creating a spam filter. Maybe you can adapt the example scripts to fit your tasks.
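In rough code, the kind of pre-classifier I mean might look something like this sketch; the labeled headlines, features, and package choices are just placeholders, not a working atrocity detector:

```r
# Placeholder sketch of a spam-filter-style pre-classifier for candidate records.
library(tm)
library(e1071)

# A handful of hand-labeled headlines: 1 = looks like an atrocity, 0 = false positive.
train <- data.frame(
  text  = c("Gunmen kill 12 villagers in night raid",
            "Soldiers shoot protesters, several dead",
            "Studio begins shooting new action film",
            "Senate kills health care reform bill"),
  label = factor(c(1, 1, 0, 0)),
  stringsAsFactors = FALSE
)

# Bag-of-words features: which terms appear in each headline.
corpus <- VCorpus(VectorSource(tolower(train$text)))
dtm <- DocumentTermMatrix(corpus)
to_factors <- function(m) {
  as.data.frame(lapply(as.data.frame(m > 0), factor, levels = c("FALSE", "TRUE")))
}
features <- to_factors(as.matrix(dtm))

# Train a naive Bayes classifier, the same family of model used in spam filters.
model <- naiveBayes(features, train$label, laplace = 1)

# Score a new record against the same vocabulary before a human ever sees it.
new_text <- "Militia kills dozens of civilians in market attack"
new_dtm <- DocumentTermMatrix(VCorpus(VectorSource(tolower(new_text))),
                              control = list(dictionary = Terms(dtm)))
predict(model, to_factors(as.matrix(new_dtm)))
```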
dartthrowingchimp / January 14, 2014
Thanks very much, Felix. In fact, we have talked quite a lot about trying to do what you propose and continue to kick it around.
The big challenge is that the data-generating process really isn’t fixed. I’m not talking about the evolution of GDELT itself, although that’s not a trivial issue, either. Instead, I’m talking about variation over time in the locations and actors involved in the “production” of atrocities, the chief features on which we could try to train a machine-learning algorithm. The places where atrocities are occurring now and the actors committing and being killed in those events probably won’t be the same in six months or a year, so an algorithm that learns to spot the January 2014 atrocities will probably work poorly on the July 2014 batch and even worse on the January 2015 set. Without many years and a jillion examples to train on, I don’t see any way to finesse this problem.
If we had a higher tolerance for errors of omission, this wouldn’t matter so much, but the occasional “surprises” are actually the events we’re most interested in capturing. For this system’s intended audience, knowing that there’s another massacre in Nigeria will often be less important than knowing there’s been a “first” massacre in Niger, but any automated second layer we build would probably be good at seeing the former and liable to miss the latter. Maybe this is solvable if we get into natural-language processing of the source articles, but at that point we’re basically rebuilding an atrocities-specific GDELT from the ground up, and that’s definitely beyond our resources.
Felix Haass (@felixhaass) / January 14, 2014
I see the challenge. I hadn’t considered that. Of course the most important atrocity events are precisely the ones which you would’ve missed by training your data on previous events; thanks for clarifying.
But instead of building an atrocity-detection filter, would it then be easier to create a GDELT/atrocity spam filter that filters out non-atrocity events, such as film shootings and “killings” of Congressional bills? I’m assuming this is easier, but then, I don’t know how many of these false positives actually occur or in what variety. But it might still help you sift through the daily updates, although it would probably fall short of completely automating the task.
John Beieler / January 15, 2014
I think it would be completely possible to go from news stories -> filtered list -> human-coded events using machine learning techniques, and I’ve discussed this with Jay. In fact, this is exactly what the Militarized Interstate Dispute data does (download news articles, run an SVM over the stories to create a reduced list, have humans code from the reduced list).
In addition to the issues already pointed out, though, you have to figure out how well the algos would work and what would form the corpus you’re pulling from. Do you want to use GDELT as a link aggregator? Scrape your own sites? Something else? Those are all very different things that present their own issues. Once you have the stories, you have to find an algo that will actually be able to pick out the atrocities. Let’s assume the true number is a couple of orders of magnitude greater than the 10 atrocities that Jay observed in the current data (I know this isn’t right, but for the sake of illustration); that’s still only about 0.1% positive observations on the response variable per week. Not a whole lot for an algo to leverage.
Again, I think this approach could probably work, but it presents a pretty difficult problem.
Oral Hazard / January 15, 2014
Machine learning algorithms are precisely what the Intelligence Community has used for many years in enabling high-throughput digital scanning of satellite images. This really does seem like a perfect application for it.
It also occurs to me that the lexicon of mass killings involves neologisms, euphemisms and cultural idioms, and thus may be a daunting challenge in real time. To use an easy example: shoah/holocaust/Final Solution/Endlösung.
dartthrowingchimp / January 15, 2014
Yeah, but the IC has a slightly bigger budget than we do. (c;
alexhanna / January 14, 2014
The issues Jay raises with using machine learning may be mitigated by replacing formal names of places with generic nouns. Both Niger and Nigeria becomes . Then the features extracted from the training set are focused on other words, e.g., verbs associated with atrocities.
But yeah, Jay is also right that it is still tantamount to rebuilding GDELT from scratch. GDELT is already applying a filter through its own dictionary-based coding of news text. At that point you also need to start building a huge training set, by hand, of atrocities from all the news sources GDELT ingests on a daily basis, a filtering problem that is not very feasible.
Grant / January 14, 2014
I’m not sure what Niger and Nigeria become from your post.
alexhanna / January 14, 2014
Oh, whoops. Don’t try to use HTML-style tags within WordPress, I guess. I wrote [country_name].
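In practice that substitution is just a regular-expression pass over the source text, something like the sketch below (the country list here is only a stub):

```r
# Sketch of the place-name substitution: swap formal country names for a
# generic token before extracting features, so the classifier keys on the
# verbs rather than on particular countries. The country list is a stub.
countries <- c("Niger", "Nigeria", "Syria", "South Sudan")
pattern <- paste0("\\b(", paste(countries, collapse = "|"), ")\\b")

headline <- "Gunmen kill 12 villagers in Nigeria; Niger deploys troops to border"
gsub(pattern, "[country_name]", headline, perl = TRUE)
# "Gunmen kill 12 villagers in [country_name]; [country_name] deploys troops to border"
```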
caseyhilland / January 14, 2014
Interesting read. Any reason you didn’t use the GDELTtools package to at the very least fetch the data? I know mileage has sort of varied with the package, so I’m curious about your reasoning for using the extensive fetch code. Or was the package just not available when this was put together? Thanks.
dartthrowingchimp / January 14, 2014
The package wasn’t available when I started this project, and once I had something that worked, I didn’t see any reason to mess with it.
John Beieler / January 15, 2014
Casey, what issues were you having with GDELTtools? If you don’t mind, could you shoot me an email at jub270 [at] psu [dot] edu? We’re constantly trying to improve and expand the toolset that surrounds GDELT and I’d love to hear any pain points you have with the existing tools.
(Sorry to hijack the post, Jay)
dartthrowingchimp / January 15, 2014
Hey, no worries, John. Constructive conversation is the whole idea.
Jim Walsh / January 14, 2014
Very interesting work. Have you thought about crowdsourcing the human piece? Seems to me like a lot of people might be willing to volunteer their time for this important issue–human rights activists and NGO types, area experts, professors and their students (who might even get some credit and learn about coding). If you had a large enough cadre of volunteers, you could assign many (10 or more?) to each day’s data and generate some sort of average or modal response. You would need someone to organize the volunteers, and I’m not sure how this would be implemented online, but it might be do-able. Thanks.
dartthrowingchimp / January 15, 2014
Thanks, Jim. We are using crowds in a number of ways on other parts of our larger early-warning system and have talked a little about the idea you table here.
I think this approach is exactly right in principle, but I have two major concerns about how it would work in practice.
The first is the recruiting problem, which is a bit of a Catch-22: people are less inclined to do this if they don’t believe the system works, but it’s hard to convince them that the system works if you don’t already have a crowd involved.
The second is the principal-agent problem. Our goal is to make the data as reliable and valid as possible, but there are surely people out there who would bring different goals to the process that could confound ours. Imagine, for example, that Absurdistan appears in the filtered-but-not-yet-coded data to be committing atrocities, but citizens (or paid agents) of Absurdistan don’t like that idea and so organize to vote those events out of the data set. Maybe our project would be too far under the radar for anyone to bother, but our ambitions are larger than that, and the deeply political nature of our subject matter works against us here.
All of that said, I do hope we get a chance at least to experiment with something like this in the future. On Twitter, someone suggested using Mechanical Turk, too, which is similar in spirit and would probably resolve the politicization part.
So many great ideas, so little time…