I love doing research in the Internet Age. As I’d hoped it would, my post yesterday on the latest iteration of our atrocities-monitoring system in the works has already sparked a lot of really helpful responses. Some of those responses are captured in comments on the post, but not all of them are. So, partly as a public good and partly for my own record-keeping, I thought I’d write a coda to that post enumerating the leads it generated and some of my reactions to them.
Give the Machines Another Shot at It
As a way to reduce or even eliminate the burden placed on our human(s) in the loop, several people suggested something we’ve been considering for a while: use machine-learning techniques to develop classifiers that can be used to further reduce the data left after our first round of filtering. These classifiers could consider all of the features in GDELT, not just the event and actor types we’re using in our R script now. If we’re feeling really ambitious, we could go all the way back to the source stories and use natural-language processing to look for additional discriminatory power there. This second round might not eliminate the need for human review, but it certainly could lighten the load.
The comment threads on this topic (here and here) nicely capture what I see as the promise and likely limitations of this strategy, so I won’t belabor it here. For now, I’ll just note that how well this would work is an empirical question, and it’s one we hope to get a chance to answer once we’ve accumulated enough screened data to give those classifiers a fighting chance.
Leverage GDELT’s Global Knowledge Graph
Related to the first idea, GDELT co-creator Kalev Leetaru has suggested on a couple of occasions that we think about ways to bring the recently-created GDELT Global Knowledge Graph (GKG) to bear on our filtering task. As Kalev describes in a post on the GDELT blog, GKG consists of two data streams, one that records mentions of various counts and another that captures connections in each day’s news between “persons, organizations, locations, emotions, themes, counts, events, and sources.” That second stream in particular includes a bunch of data points that we can connect to specific event records and thus use as additional features in the kind of classifiers described under the previous header. In response to my post, Kalev sent this email to me and a few colleagues:
I ran some very very quick numbers on the human coding results Jay sent me where a human coded 922 articles covering 9 days of GDELT events and coded 26 of them as atrocities. Of course, 26 records isn’t enough to get any kind of statistical latch onto to build a training model, but the spectral response of the various GKG themes is quite informative. For events tagged as being an atrocity, themes such as ETHNICITY, RELIGION, HUMAN_RIGHTS, and a variety of functional actors like Villagers, Doctors, Prophets, Activists, show up in the top themes, whereas in the non-atrocities the roles are primarily political leaders, military personnel, authorities, etc. As just a simple example, the HUMAN_RIGHTS theme appeared in just 6% of non-atrocities, but 30% of atrocities, while Activists show up in 33% of atrocities compared with just 4% of non-atrocities, and the list goes on.
Again, 26 articles isn’t enough to build a model on, but just glancing over the breakdown of the GKG themes for the two there is a really strong and clear breakage between the two across the entire set of themes, and the breakdown fits precisely what baysean classifiers like (they are the most accurate for this kind of separation task and outperform SVM and random forest).
So, Jay, the bottom line is that if you can start recording each day the list of articles that you guys review and the ones you flag as an atrocity and give me a nice dataset over time, should be pretty easy to dramatically filter these down for you at the very least.
As I’ve said throughout this process, its not that event data can’t do what is needed, its that often you have to bring additional signals into the mix to accomplish your goals when the thing you’re after requires signals beyond what the event records are capturing.
What Kalev suggests at the end there—keep a record of all the events we review and the decisions we make on them—is what we’re doing now, and I hope we can expand on his experiment in the next several months.
Jim Walsh left a thoughtful comment suggesting that we crowdsource the human coding:
Seems to me like a lot of people might be willing to volunteer their time for this important issue–human rights activists and NGO types, area experts, professors and their students (who might even get some credit and learn about coding). If you had a large enough cadre of volunteers, could assign many (10 or more?) to each day’s data and generate some sort of average or modal response. Would need someone to organize the volunteers, and I’m not sure how this would be implemented online, but might be do-able.
As I said in my reply to him, this is an approach we’ve considered but rejected for now. We’re eager to take advantage of the wisdom of interested crowds and are already doing so in big ways on other parts of our early-warning system, but I have two major concerns about how well it would work for this particular task.
The first is the recruiting problem, and here I see a Catch-22: people are less inclined to do this if they don’t believe the system works, but it’s hard to convince them that the system works if we don’t already have a crowd involved to make it go. This recruiting problem becomes especially acute in a system with time-sensitive deliverables. If we promise daily updates, we need to produce daily updates, and it’s hard to do that reliably if we depend on self-organized labor.
My second concern is the principal-agent problem. Our goal is to make reliable and valid data in a timely way, but there are surely people out there who would bring goals to the process that might not align with ours. Imagine, for example, that Absurdistan appears in the filtered-but-not-yet-coded data to be committing atrocities, but citizens (or even paid agents) of Absurdistan don’t like that idea and so organize to vote those events out of the data set. It’s possible that our project would be too far under the radar for anyone to bother, but our ambitions are larger than that, so we don’t want to assume that will be true. If we succeed at attracting the kind of attention we hope to attract, the deeply political and often controversial nature of our subject matter would make crowdsourcing this task more vulnerable to this kind of failure.
Use Mechanical Turk
Both of the concerns I have about the downsides of crowdsourcing the human-coding stage could be addressed by Ryan Briggs’ suggestion via Twitter to have Amazon Mechanical Turk do it. A hired crowd is there when you need it and (usually) doesn’t bring political agendas to the task. It’s also relatively cheap, and you only pay for work performed.
Thanks to our collaboration with Dartmouth’s Dickey Center, the marginal cost of the human coding isn’t huge, so it’s not clear that Mechanical Turk would offer much advantage on that front. Where it could really help is in routinizing the daily updates. As I mentioned in the initial post, when you depend on human action and have just one or a few people involved, it’s hard to establish a set of routines that covers weekends and college breaks and sick days and is robust to periodic changes in personnel. Primarily for this reason, I hope we’ll be able to run an experiment with Mechanical Turk where we can compare its cost and output to what we’re paying and getting now and see if this strategy might make sense for us.
Don’t Forget About Errors of Omission
Last but not least, a longtime colleague had this to say in an email reacting to the post (hyperlinks added):
You are effectively describing a method for reducing errors of commission, events coded by GDELT as atrocities that, upon closer inspection, should not be. It seems like you also need to examine errors of omission. This is obviously harder. Two possible opportunities would be to compare to either [the PITF Worldwide Atrocities Event Data Set] or to ACLED. There are two questions. Is GDELT “seeing” the same source info (and my guess is that it is and more, though ACLED covers more than just English sources and I’m not sure where GDELT stands on other languages). Then if so (and there are errors of omission) why aren’t they showing up (coded as different types of events or failed to trigger any coding at all)[?]
It’s true that our efforts so far have focused almost exclusively on avoiding errors of commission, with the important caveat that it’s really our automated filtering process, not GDELT, that commits most of these errors. The basic problem for us is that GDELT, or really the CAMEO scheme on which it’s based, wasn’t designed to spot atrocities per se. As a result, most of what we filter out in our human-coding second stage aren’t things that were miscoded by GDELT. Instead, they’re things that were properly coded by GDELT as various forms of violent action but upon closer inspection don’t appear to involve the additional features of atrocities as we define them.
Of course, that still leaves us with this colleague’s central concern about errors of omission, and on that he’s absolutely right. I have experimented with different actor and event-type criteria to make sure we’re not missing a lot of events of interest in GDELT, but I haven’t yet compared what we’re finding in GDELT to what related databases that use different sources are seeing. Once we accumulate a few month’s worth of data, I think this is something we’re really going to need to do.
Stay tuned for Take 3…