“Supercomputer predicts revolution,” shouts a recent headline on the BBC News Technology site.
Feeding a supercomputer with news stories could help predict major world events, according to US research. A study, based on millions of articles, charted deteriorating national sentiment ahead of the recent revolutions in Libya and Egypt. While the analysis was carried out retrospectively, scientists say the same processes could be used to anticipate upcoming conflict. The system also picked up early clues about Osama Bin Laden’s location.
As is often the case in these gee-whiz stories, the headline makes a stronger claim than the ensuing story, but in this case, the headline accurately reflects the optimism of the project’s author. In a paper describing this project in the peer-reviewed online journal First Monday, the University of Illinois’ Kalev Leetaru claims that, “Pooling together the global tone of all news mentions of a country over time appears to accurately forecast its near–term stability, including predicting the revolutions in Egypt, Tunisia, and Libya, conflict in Serbia, and the stability of Saudi Arabia.”
That’s an ambitious claim. As someone who’s spent the past 10 years working on models to forecast political instability, I know how hard it is to make accurate predictions about these big but rare events. (See this previous post for more on that point.)
Unfortunately, it’s also a claim that the facts don’t seem to support. To evaluate the forecasting power of Leetaru’s measures, it’s not enough to look after the fact at cases where various events of interest occurred and show that a measure trended in some relevant way beforehand, as Leetaru does in his paper. That process bears no resemblance to the forecaster’s real-world challenges. In practice, forecasting is a real-time exercise, not a retrospective one. To be useful, forecasters can’t just say, “I think something bad is increasingly likely to happen.” Instead, they need to say what might happen, and how likely it is to happen, either in absolute terms or relative to other cases or other possible outcomes.
To use this media-sentiment measure as a predictive tool, then, we would have to specify in advance a) what we are trying to predict, b) where we are trying to predict it, and c) how the value of the measure relates to the likelihood of that event's occurrence. Leetaru implicitly specifies the first and second elements by emphasizing the measure's global relevance and applying it to several Arab countries where governments recently have or have not been toppled by popular uprisings. From that, I gather that it's meant to be useful in all countries worldwide, and that the author believes it can help predict successful rebellions. Fine.
Where things get slippery is in relating the value of the measure to the likelihood of a successful rebellion. Shown below is Figure 5 from Leetaru’s paper. As the caption notes, the figure plots variation in the tone of media coverage of Tunisia, measured in standard deviations. After some caveats about the scarcity of reporting on Tunisia, Leetaru points out that “the two–week period prior to Tunisian President Ben Ali’s resignation was the sixth–most negative period in the last 30 years.”
Figure 5. Tone of country–level coverage mentioning Tunisia, Summary of World Broadcasts, January 1979–March 2011 (December 2010 is 1–17 December). Y–axis is Z–scores (standard deviations from mean).
When I look at that plot, I see a much weaker signal than Leetaru apparently does. There are at least 10 months in the past 30 years when tone dropped lower than it did in early 2011, just before Ben Ali was overthrown. That's a fair number of false positives to go with the one true positive we get when we cherry-pick a warning threshold to fit the event we already know occurred. (In actual practice, we would need to specify that threshold in advance.)
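To make the false-positive problem concrete, here is a minimal sketch of what a real-time warning rule looks like. The z-scores and dates below are invented for illustration, not Leetaru's actual data: the point is only that the threshold must be fixed before the fact, and every month it is crossed without a subsequent event counts as a false alarm.

```python
# Sketch of a real-time warning rule with a pre-specified threshold.
# The z-scores below are invented for illustration; they are NOT
# Leetaru's data. Each entry: (month, tone z-score, event occurred?).
series = [
    ("1986-04", -2.3, False),  # threshold crossed, no revolution: false alarm
    ("1991-01", -2.6, False),  # false alarm
    ("1999-07", -1.1, False),  # no warning issued
    ("2008-10", -2.1, False),  # false alarm
    ("2011-01", -2.4, True),   # threshold crossed, revolution: true positive
]

THRESHOLD = -2.0  # must be chosen BEFORE looking at the outcomes

true_positives = sum(1 for _, z, event in series if z <= THRESHOLD and event)
false_alarms = sum(1 for _, z, event in series if z <= THRESHOLD and not event)

print(f"warnings issued: {true_positives + false_alarms}")
print(f"true positives: {true_positives}, false alarms: {false_alarms}")
```

With this toy series, one true warning comes bundled with three false alarms, which is roughly the pattern an eyeball of the Tunisia plot suggests.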
Leetaru sees an even stronger signal in the plot of his data on Egypt, shown below. He writes, "Only twice in the last 30 years has the global tone about Egypt dropped more than three standard deviations below average: January 1991 (the U.S. aerial bombardment of Iraqi troops in Kuwait) and 1–24 January 2011, ahead of the mass uprising. The only other period of sharp negative movement was March 2003, the launch of the U.S. invasion of neighboring Iraq."
Figure 2: Tone of coverage mentioning Egypt, Summary of World Broadcasts, January 1979–March 2011 (January 2011 is 1–24 January). Y–axis is Z–scores (standard deviations from mean).
I agree that the signal is stronger in this case, but how would we have known in advance that -3 was the magic number for prediction in Egypt while -2 was the relevant value for Tunisia? If we apply the cherry-picked Tunisia threshold of -2 to the Egypt data, we see at least a dozen months in the past 30 years when the tone of Egypt coverage got that low. If we apply the cherry-picked Egypt threshold of -3 to the Tunisia data, then the tone of Tunisia coverage never got low enough to warn of an impending revolution.
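The threshold-shopping problem can be sketched in a few lines. The series below are invented stand-ins (not the actual Egypt or Tunisia data) whose minima mirror the pattern described above: a rule tuned to one country's history can fail completely on another's.

```python
# Illustration of threshold sensitivity with invented z-score series
# (NOT the actual Egypt/Tunisia data). A warning rule tuned to one
# country's history can fail completely on another's.

def warnings(series, threshold):
    """Return the months whose tone z-score falls at or below threshold."""
    return [month for month, z in series if z <= threshold]

# Hypothetical minima: Tunisia's pre-revolution dip lands near -2,
# Egypt's near -3, mirroring the pattern described in the text.
tunisia = [("2008-06", -2.1), ("2010-12", -2.2), ("2011-01", -1.5)]
egypt = [("1991-01", -3.1), ("2003-03", -2.6), ("2011-01", -3.2)]

print(warnings(tunisia, -2))  # the Tunisia-tuned rule flags Tunisia's dips
print(warnings(tunisia, -3))  # the Egypt-tuned rule flags nothing in Tunisia
print(warnings(egypt, -2))    # the Tunisia-tuned rule flags every Egypt dip
```

Neither threshold works for both series: -3 misses Tunisia's revolution entirely, while -2 floods the Egypt series with warnings.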
We could repeat this process for all of the cases Leetaru covers in his paper — he shows similar plots for Libya, Saudi Arabia (a non-event), and the Balkans — but I think that’s enough to make the point. When forecasting rare events, it’s easy to be right almost all of the time by saying that the event of interest will never happen, but it’s not very useful. The hard part is spotting the approaching wolf at the right times without also crying “wolf” so often the rest of the time that your audience gets desensitized to those warnings. Without access to the original data, it’s impossible for me to evaluate rigorously the forecasting power of Leetaru’s measure of media tone. Just by eyeballing those plots, though, I think we can see that the predictive signal in that measure is not nearly as distinct from the noise as the author’s strong concluding statement claims.
My aim here isn’t to denigrate Leetaru’s project, which seems to have produced a powerful and potentially very useful tool for measuring the tone of media coverage. To be fair to Leetaru, the numbers of potential false positives (warnings followed by no event) we see in those two plots are not huge, and it does look like the measure is doing a good job discriminating in a general way between periods of broader stability and instability. Instead, my point is to show that those measures are not nearly as sharp as the paper (and the credulous reporting on it) suggests. That’s okay; after all, this is a really hard problem. But there are standard ways to evaluate forecasts’ accuracy, and it would be nice to see them used more often, both in papers that claim predictive power and in the press coverage those papers receive.
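One of those standard tools is the Brier score, the mean squared error between probabilistic forecasts and binary outcomes. A minimal sketch, with forecast probabilities and outcomes invented for illustration (this is a generic verification measure, not a method from Leetaru's paper):

```python
# Brier score: mean squared error between probabilistic forecasts and
# binary outcomes (0 = no event, 1 = event). Lower is better; a perfect
# forecaster scores 0, and always forecasting the base rate sets a
# natural benchmark. Numbers below are invented for illustration.

def brier_score(forecasts, outcomes):
    """Mean squared error of probability forecasts against 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

forecasts = [0.1, 0.2, 0.7, 0.05]  # stated probability of upheaval each period
outcomes = [0, 0, 1, 0]            # what actually happened

print(f"Brier score: {brier_score(forecasts, outcomes):.4f}")
```

Scoring rules like this force the forecaster to commit to probabilities in advance, which is exactly the discipline that retrospective eyeballing of plots avoids.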
POSTSCRIPT: After writing this post, I emailed Prof. Leetaru to bring it to his attention and give him a chance to respond. He replied the same day and told me that future publications will address the questions I raise about forecasting accuracy. “The current paper was designed to introduce the overall method and approach,” he wrote. “The next phase of the work is in demonstrating the approach applied to a wide range of countries from across the world and illustrating its application in a realtime forecasting environment when run over 30 years of data.” I look forward to seeing those papers and will discuss the results here when I do.