2014 NFL Football Season Predictions

Professional (American) football season starts tonight when the Green Bay Packers visit last year’s champs, the Seattle Seahawks, for a Thursday-night opener thing that still seems weird to me. (SUNDAY, people. Pro football is played on Sunday.) So, who’s likely to win?

With the final preseason scores from our pairwise wiki survey in hand, we can generate a prediction for that game, along with all 255 other regular-season contests on the 2014 schedule. As I described in a recent post, this wiki survey offers a novel way to crowdsource the problem of estimating team strength before the season starts. We can use last year’s preseason survey data and game results to estimate a simple statistical model that accounts for two teams’ strength differential and home-field advantage. Then, we can apply that model to this year’s survey results to get game-level forecasts.

In the last post, I used the initial model estimates to generate predicted net scores (home minus visitor) and confidence intervals. This time, I thought I’d push it a little further and use predictive simulations. Following Gelman and Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models (2007), I generated 1,000 simulated net scores for each game and then summarized the distributions of those scores to get my statistics of interest.

The means of those simulated net scores for each game represent point estimates of the outcome, and the variance of those distributions gives us another way to compute confidence intervals. Those means and confidence intervals closely approximate the ones we’d get from a one-shot application of the predictive model to the 2014 survey results, however, so there’s no real new information there.

What we can do with those distributions that is new is compute win probabilities. The share of simulated net scores above 0 gives us an estimate of the probability of a home-team win, and 1 minus that estimate gives us the probability of a visiting-team win.

A couple of pictures make this idea clearer. First, here’s a histogram of the simulated net scores for tonight’s Packers-Seahawks game. The Packers fared pretty well in the preseason wiki survey, ranking 5th overall with a score of 77.5 out of 100. The defending-champion Seahawks got the highest score in the survey, however (a whopping 92.6), and they have home-field advantage, which is worth about 3.1 points on average, according to my model. In my predictive simulations, 673 of the 1,000 games had a net score above 0, suggesting a win probability of 67%, or 2:1 odds, in favor of the Seahawks. The mean predicted net score is 5.8, which is pretty darn close to the current spread of -5.5.

[Figure: histogram of simulated net scores, Seattle Seahawks vs. Green Bay Packers]
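For readers who want to see the mechanics, here is a minimal Python sketch of the win-probability calculation (the actual analysis was done in R). The survey scores and the 3.1-point home-field advantage come from the text above. The slope on the strength differential and the residual standard deviation are illustrative values chosen so the output lands near the reported numbers; they are not the fitted estimates. The sketch also simplifies by holding the coefficients fixed and drawing only game-level noise, whereas a full predictive simulation would draw the coefficients as well.

```python
import random

def simulate_net_scores(home_score, away_score, home_edge, slope, sigma,
                        n_sims=1000, seed=2014):
    """Draw simulated net scores (home minus visitor) for one game."""
    rng = random.Random(seed)
    # Expected net score: home-field advantage plus scaled strength differential.
    expected_net = home_edge + slope * (home_score - away_score)
    return [rng.gauss(expected_net, sigma) for _ in range(n_sims)]

def home_win_probability(net_scores):
    """Share of simulated net scores above zero."""
    return sum(score > 0 for score in net_scores) / len(net_scores)

# Seahawks (home, survey score 92.6) vs. Packers (visitor, 77.5).
# slope and sigma are ASSUMED values for illustration only.
sims = simulate_net_scores(92.6, 77.5, home_edge=3.1, slope=0.18, sigma=13.0)
mean_net = sum(sims) / len(sims)
p_home = home_win_probability(sims)
print(f"mean net score: {mean_net:.1f}, P(home win): {p_home:.2f}")
```

The visiting team's win probability is just 1 minus `p_home`, and the spread of `sims` is what would feed an interval around the point estimate.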

Things look a little tighter for the Bengals-Ravens contest, which I’ll be attending with my younger son on Sunday in our once-annual pilgrimage to M&T Bank Stadium. The Ravens wound up 10th in the wiki survey with a score of 60.5, but the Bengals are just a few rungs down the ladder, in 13th, with a score of 54.7. Add in home-field advantage, though, and the simulations give the Ravens a win probability of 62%, or about 3:2 odds. Here, the mean net score is 3.6, noticeably higher than the current spread of -1.5 but on the same side of the win/loss line. (N.B. Because the two teams’ survey scores are so close, the tables turn when Cincinnati hosts in Week 8, and the predicted probability of a home win is 57%.)

[Figure: simulated net scores, Baltimore Ravens vs. Cincinnati Bengals]

Once we’ve got those win probabilities ginned up, we can use them to move from game-level to season-level forecasts. It’s tempting to think of the wiki survey results as season-level forecasts already, but what they don’t do is account for variation in strength of schedule. Other things being equal, a strong team with a really tough schedule might not be expected to do much better than a mediocre team with a relatively easy schedule. The model-based simulations refract those survey results through the 2014 schedule to give us a clearer picture of what we can expect to happen on the field this year.

The table below (made with the handy ‘textplot’ command in R’s gplots package) turns the predictive simulations into season-level forecasts for all 32 teams.* I calculated two versions of a season summary and juxtaposed them to the wiki survey scores and resulting rankings. Here’s what’s in the table:

  • WikiRank shows each team’s ranking in the final preseason wiki survey results.
  • WikiScore shows the score on which that ranking is based.
  • WinCount counts the number of games in which each team has a win probability above 0.5. This process gives us a familiar number, the first half of a predicted W-L record, but it also throws out a lot of information by treating forecasts close to 0.5 the same as ones where we’re more confident in our prediction of the winner.
  • WinSum is the sum of each team’s win probabilities across the 16 games. This expected number of wins is a better estimate of each team’s anticipated results than WinCount, but it’s also a less familiar one, so I thought I would show both.
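The difference between the last two columns is easy to see in a few lines of Python. This is only a sketch: the 16 win probabilities below are made up for illustration and do not belong to any team in the table.

```python
def season_summary(win_probs):
    """Return (WinCount, WinSum) for one team's list of game win probabilities."""
    # WinCount: games in which the team is favored (probability above 0.5).
    win_count = sum(1 for p in win_probs if p > 0.5)
    # WinSum: expected number of wins, the sum of the probabilities.
    win_sum = sum(win_probs)
    return win_count, win_sum

# HYPOTHETICAL win probabilities for one team's 16 games.
example_probs = [0.67, 0.55, 0.48, 0.52, 0.71, 0.45, 0.60, 0.51,
                 0.49, 0.58, 0.44, 0.63, 0.53, 0.47, 0.66, 0.50]
win_count, win_sum = season_summary(example_probs)
print(win_count, round(win_sum, 2))
```

This made-up team is favored in 10 games but has only about 8.8 expected wins, because so many of its games are near coin flips. That gap is the information WinCount throws away.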

Teams appear in the table in descending order of WinSum, which I consider the single best estimate in this table of a team’s 2014 performance. It’s interesting (to me, anyway) to see how the rank order changes from the survey to the win totals because of differences in strength of schedule. So, for example, the Patriots ranked 4th in the wiki survey, but they get the second-highest expected number of wins this year (9.8), just behind the Seahawks (9.9). Meanwhile, the Steelers scored 16th in the wiki survey, but they rank 11th in expected number of wins with an 8.4. That’s a smidgen better than the Cincinnati Bengals (8.3) and not much worse than the Baltimore Ravens (9.0), suggesting an even tighter battle for the AFC North division title than the wiki survey results alone would imply.

2014 NFL Season-Level Forecasts from 1,000 Predictive Simulations Using Preseason Wiki Survey Results and Home-Field Advantage

There are a lot of other interesting quantities we could extract from the results of the game-level simulations, but that’s all I’ve got time to do now. If you want to poke around in the original data and simulation results, you can find them all in a .csv on my Google Drive (here). I’ve also posted a version of the R script I used to generate the game-level and season-level forecasts on Github (here).

At this point, I don’t have plans to try to update the forecasts during the season, but I will be seeing how the preseason predictions fare and occasionally reporting the results here. Meanwhile, if you have suggestions on other ways to use these data or to improve these forecasts, please leave a comment here on the blog.

* The version of this table I initially posted had an error in the WikiRank column where 18 was skipped and the rankings ran to 33. This version corrects that error. Thanks to commenter C.P. Liberatore for pointing it out.

What are all these violent images doing to us?

Early this morning, I got up, made some coffee, sat down at my desk, and opened Twitter to read the news and pass some time before I had to leave for a conference. One of the first things I saw in my timeline was a still from a video of what was described in the tweet as an ISIS fighter executing a group of Syrian soldiers. The soldiers lay on their stomachs in the dirt, mostly undressed, hands on their heads. They were arranged in a tightly packed row, arms and legs sometimes overlapping. The apparent killer stood midway down the row, his gun pointed down, smoke coming from its barrel.

That experience led me to this pair of tweets:

tweet 1

tweet 2

If you don’t use Twitter, you probably don’t know that, starting in 2013, Twitter tweaked its software so that photos and other images embedded in tweets would automatically appear in users’ timelines. Before that change, you had to click on a link to open an embedded image. Now, if you follow someone who appends an image to his or her tweet, you instantly see the image when the tweet appears in your timeline. The system also includes a filter of sorts that’s supposed to inform you before showing media that may be sensitive, but it doesn’t seem to be very reliable at screening for violence, and it can be turned off.

As I said this morning, I think the automatic display of embedded images is great for sharing certain kinds of information, like data visualizations. Now, tweets can become charticles.

I am increasingly convinced, though, that this feature becomes deeply problematic when people choose to share disturbing images. After I tweeted my complaint, Werner de Pooter pointed out a recent study on the effects of frequent exposure to graphic depictions of violence on the psychological health of journalists. The study’s authors found that daily exposure to violent images was associated with higher scores on several indices of psychological distress and depression. The authors conclude:

Given that good journalism depends on healthy journalists, news organisations will need to look anew at what can be done to offset the risks inherent in viewing User Generated Content material [which includes graphic violence]. Our findings, in need of replication, suggest that reducing the frequency of exposure may be one way to go.

I mostly use Twitter to discover stories and ideas I don’t see in regular news outlets, to connect with colleagues, and to promote my own work. Because I study political violence and atrocities, a fair share of my feed deals with potentially disturbing material. Where that material used to arrive only as text, it increasingly includes photos and video clips of violent or brutal acts as well. I am starting to wonder how routine exposure to those images may be affecting my mental health. The study de Pooter pointed out has only strengthened that concern.

I also wonder if the emotional power of those images is distorting our collective sense of the state of the world. Psychologists talk about the availability heuristic, a cognitive shortcut in which the ease of recalling examples of certain things drives our expectations about the likelihood or risk of those things. As Daniel Kahneman describes on p. 138 of Thinking, Fast and Slow,

Unusual events (such as botulism) attract disproportionate attention and are consequently perceived as less unusual than they really are. The world in our heads is not a precise replica of reality; our expectations about the frequency of events are distorted by the prevalence and emotional intensity of the messages to which we are exposed.

When those images of brutal violence pop into our view, they grab our attention, pack a lot of emotional intensity, and are often too hard to shake. The availability heuristic implies that frequent exposure to those images leads us to overestimate the threat or risk of things associated with them.

This process could even be playing some marginal role in a recent uptick in stories about how the world is coming undone. According to Twitter, its platform now has more than 270 million monthly active users. Many journalists and researchers covering world affairs probably fall in that 270 million. I suspect that those journalists and researchers spend more time watching their timelines than the average user, and they are probably more likely to turn off that “sensitive content” warning, too.

Meanwhile, smartphones and easier Internet access make it increasingly likely that acts of violence will be recorded and then shared through those media, and Twitter’s default settings now make it more likely that we see them when they are. Presumably, some of the organizations perpetrating this violence—and, sometimes, ones trying to mobilize action to stop it—are aware of the effects these images can have and deliberately push them to us to try to elicit that response.

As a result, many writers and analysts are now seeing much more of this material than they used to, even just a year or two ago. Whatever the actual state of the world, this sudden increase in exposure to disturbing material could be convincing many of us that the world is scarier and therefore more dangerous than ever before.

This process could have larger consequences. For example, lately I’ve had trouble getting thoughts of James Foley’s killing out of my mind, even though I never watched the video of it. What about the journalists and policymakers and others who did see those images? How did that exposure affect them, and how much is that emotional response shaping the public conversation about the threat the Islamic State poses and how our governments should respond to it?

I’m not sure what to do about this problem. As an individual, I can choose to unfollow people who share these images or spend less time on Twitter, but both of those actions carry some professional costs as well. The thought of avoiding these images also makes me feel guilty, as if I am failing the people whose suffering they depict and the ones who could be next. By hiding from those images, do I become complicit in the wider violence and injustice they represent?

As an organization, Twitter could decide to revert to the old no-show default, but that almost certainly won’t happen. I suspect this isn’t an issue for the vast majority of users, and it’s hard to imagine any social-media platform retreating from visual content as sites like Instagram and Snapchat grow quickly. Twitter could also try to remove embedded images that contain potentially disturbing material. As a fan of unfettered speech, though, I don’t find that approach appealing, either, and the unreliability of the current warning system suggests it probably wouldn’t work so well anyway.

In light of all that uncertainty, I’ll conclude with an observation instead of a solution: this is one hell of a huge psychological experiment we’re running right now, and its consequences for our own mental health and how we perceive the world around us may be more substantial than we realize.

Deriving a Fuzzy-Set Measure of Democracy from Several Dichotomous Data Sets

In a recent post, I described an ongoing project in which Shahryar Minhas, Mike Ward, and I are using text mining and machine learning to produce fuzzy-set measures of various political regime types for all countries of the world. As part of the NSF-funded MADCOW project,* our ultimate goal is to devise a process that routinely updates those data in near-real time at low cost. We’re not there yet, but our preliminary results are promising, and we plan to keep tinkering.

One of the crucial choices we had to make in our initial analysis was how to measure each regime type for the machine-learning phase of the process. This choice is important because our models are only going to be as good as the data from which they’re derived. If the targets in that machine-learning process don’t reliably represent the concepts we have in mind, then the resulting models will be looking for the wrong things.

For our first cut, we decided to use dichotomous measures of several regime types, and to base those dichotomous measures on stringent criteria. So, for example, we identified as democracies only those cases with a score of 10, the maximum, on Polity’s scalar measure of democracy. For military rule, we only coded as 1 those cases where two major data sets agreed that a regime was authoritarian and only military-led, with no hybrids or modifiers. Even though the targets of our machine-learning process were crisply bivalent, we could get fuzzy-set measures from our classifiers by looking at the probabilities of class membership they produce.

In future iterations, though, I’m hoping we’ll get a chance to experiment with targets that are themselves fuzzy or that just take advantage of a larger information set. Bayesian measurement error models offer a great way to generate those targets.

Imagine that you have a set of cases that may or may not belong in some category of interest—say, democracy. Now imagine that you’ve got a set of experts who vote yes (1) or no (0) on the status of each of those cases and don’t always agree. We can get a simple estimate of the probability that a given case is a democracy by averaging the experts’ votes, and that’s not necessarily a bad idea. If, however, we suspect that some experts are more error prone than others, and that the nature of those errors follows certain patterns, then we can do better with a model that gleans those patterns from the data and adjusts the averaging accordingly. That’s exactly what a Bayesian measurement error model does. Instead of an unweighted average of the experts’ votes, we get an inverse-error-rate-weighted average, which should be more reliable than the unweighted version if the assumption about predictable patterns in those errors is largely correct.
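To make the contrast with simple averaging concrete, here is a rough Python sketch. It is not a Bayesian measurement error model. It swaps in a simpler cousin, a two-class latent class model fit by maximum likelihood with the EM algorithm (in the spirit of Dawid and Skene), which learns each expert's error rates from the pattern of agreement and weights votes accordingly. The function name, the smoothing constant, and the toy votes are all assumptions made for illustration.

```python
def fit_latent_class(votes, n_iter=200, smooth=0.5):
    """EM for a two-class latent class model of binary expert votes.

    votes: one row per case; each row holds 0, 1, or None (missing) per expert.
    Assumes every case has at least one vote.
    Returns (posterior, sensitivity, specificity).
    """
    n_experts = len(votes[0])
    # Start from the unweighted average of the available votes.
    posterior = []
    for row in votes:
        seen = [v for v in row if v is not None]
        posterior.append(sum(seen) / len(seen))

    for _ in range(n_iter):
        # M-step: update prevalence and each expert's error rates.
        prevalence = sum(posterior) / len(posterior)
        sens, spec = [], []
        for k in range(n_experts):
            rated = [(p, row[k]) for p, row in zip(posterior, votes)
                     if row[k] is not None]
            yes_weight = sum(p for p, _ in rated)
            no_weight = sum(1 - p for p, _ in rated)
            hits = sum(p * v for p, v in rated)
            rejections = sum((1 - p) * (1 - v) for p, v in rated)
            # Light smoothing keeps the rates away from exactly 0 or 1.
            sens.append((hits + smooth) / (yes_weight + 2 * smooth))
            spec.append((rejections + smooth) / (no_weight + 2 * smooth))

        # E-step: update each case's probability of set membership.
        posterior = []
        for row in votes:
            like_yes, like_no = prevalence, 1 - prevalence
            for k, v in enumerate(row):
                if v is None:
                    continue
                like_yes *= sens[k] if v == 1 else 1 - sens[k]
                like_no *= 1 - spec[k] if v == 1 else spec[k]
            posterior.append(like_yes / (like_yes + like_no))
    return posterior, sens, spec

# TOY votes from five experts; the fifth disagrees with the rest more often.
votes = ([[1, 1, 1, 1, 1]] * 6 + [[0, 0, 0, 0, 0]] * 6 +
         [[1, 1, 1, 1, 0]] * 2 + [[0, 0, 0, 0, 1]] * 2 + [[1, 1, 0, 0, 1]])
naive = [sum(row) / len(row) for row in votes]
posterior, sens, spec = fit_latent_class(votes)
print(round(naive[-1], 2), round(posterior[-1], 2))
```

In the toy data, the fifth expert ends up with lower estimated sensitivity and specificity than the others, so that expert's vote counts for less on the contested last case. A Bayesian version would add priors and return a full posterior distribution rather than a point estimate.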

I’m not trained in Bayesian data analysis and don’t know my way around the software used to estimate these models, so I sought and received generous help on this task from Sean J. Taylor. I compiled yes/no measures of democracy from five country-year data sets that ostensibly use similar definitions and coding criteria:

  • Cheibub, Gandhi, and Vreeland’s Democracy and Dictatorship (DD) data set, 1946–2008 (here);
  • Boix, Miller, and Rosato’s dichotomous coding of democracy, 1800–2007 (here);
  • A binary indicator of democracy derived from Polity IV using the Political Instability Task Force’s coding rules, 1800–2013;
  • The lists of electoral democracies in Freedom House’s annual Freedom in the World reports, 1989–2013; and
  • My own Democracy/Autocracy data set, 1955–2010 (here).

Sean took those five columns of zeroes and ones and used them to estimate a model with no prior assumptions about the five sources’ relative reliability. James Melton, Stephen Meserve, and Daniel Pemstein use the same technique to produce the terrific Unified Democracy Scores. What we’re doing is a little different, though. Where their approach treats democracy as a scalar concept and estimates a composite index from several measures, we’re accepting the binary conceptualization underlying our five sources and estimating the probability that a country qualifies as a democracy. In fuzzy-set terms, this probability represents a case’s degree of membership in the democracy set, not how democratic it is.

The distinction between a country’s degree of membership in that set and its degree of democracy is subtle but potentially meaningful, and the former will sometimes be a better fit for an analytic task than the latter. For example, if you’re looking to distinguish categorically between democracies and autocracies in order to estimate the difference in some other quantity across the two sets, it makes more sense to base that split on a probabilistic measure of set membership than an arbitrarily chosen cut point on a scalar measure of democracy-ness. You would still need to choose a threshold, but “greater than 0.5” has a natural interpretation (“probably a democracy”) that suits the task in a way that an arbitrary cut point on an index doesn’t. And, of course, you could still perform a sensitivity analysis by moving the cut point around and seeing how much that choice affects your results.
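A sensitivity check of that kind takes only a few lines. The Python sketch below uses made-up membership probabilities, not estimates from our model, and the function name is just for illustration.

```python
def classify(probabilities, cut=0.5):
    """Label a case a democracy (1) when its membership probability exceeds cut."""
    return [1 if p > cut else 0 for p in probabilities]

# HYPOTHETICAL membership probabilities for six country-years.
probs = [0.97, 0.88, 0.61, 0.52, 0.34, 0.03]
baseline = classify(probs)  # the "probably a democracy" split at 0.5
for cut in (0.4, 0.5, 0.6):
    labels = classify(probs, cut)
    flipped = sum(a != b for a, b in zip(labels, baseline))
    print(f"cut={cut}: {sum(labels)} democracies, {flipped} changed vs. 0.5")
```

Only the cases near the threshold change sides as the cut point moves, which is where you would want to look if your downstream results turn out to be fragile.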

So that’s the theory, anyway. What about the implementation?

I’m excited to report that the estimates from our initial measurement model of democracy look great to me. As someone who has spent a lot of hours wringing my hands over the need to make binary calls on many ambiguous regimes (Russia in the late 1990s? Venezuela under Hugo Chavez? Bangladesh between coups?), I think these estimates are accurately distinguishing the hazy cases from the rest and even doing a good job estimating the extent of that uncertainty.

As a first check, let’s take a look at the distribution of the estimated probabilities. The histogram below shows the estimates for the period 1989–2007, the only years for which we have inputs from all five of the source data sets. Voilà, the distribution has the expected shape. Most countries most of the time are readily identified as democracies or non-democracies, but the membership status of a sizable subset of country-years is more uncertain.

Estimated Probabilities of Democracy for All Countries Worldwide, 1989-2007

Of course, we can and should also look at the estimates for specific cases. I know a little more about countries that emerged from the collapse of the Soviet Union than I do about the rest of the world, so I like to start there when eyeballing regime data. The chart below compares scores for several of those countries that have exhibited more variation over the past 20+ years. Most of the rest of the post-Soviet states are slammed up against 1 (Estonia, Latvia, and Lithuania) or 0 (e.g., Uzbekistan, Turkmenistan, Tajikistan), so I left them off the chart. I also limited the range of years to the ones for which data are available from all five sources. By drawing strength from other years and countries, the model can produce estimates for cases with fewer or even no inputs. Still, the estimates will be less reliable for those cases, so I thought I would focus for now on the estimates based on a common set of “votes.”

Estimated Probability of Democracy for Selected Soviet Successor States, 1991-2007

Those estimates look about right to me. For example, Georgia’s status is ambiguous and trending less likely until the Rose Revolution of 2003, after which point it’s probably but not certainly a democracy, and the trend bends down again soon thereafter. Meanwhile, Russia is fairly confidently identified as a democracy after the constitutional crisis of 1993, but its status becomes uncertain around the passage of power from Yeltsin to Putin and then solidifies as most likely authoritarian by the mid-2000s. Finally, Armenia was one of the cases I found most difficult to code when building the Democracy/Autocracy data set for the Political Instability Task Force, so I’m gratified to see its probability of democracy oscillating around 0.5 throughout.

One nice feature of a Bayesian measurement error model is that, in addition to estimating the scores, we can also estimate confidence intervals to help quantify our uncertainty about those scores. The plot below shows Armenia’s trend line with the upper and lower bounds of a 90-percent confidence interval. Here, it’s even easier to see just how unclear this country’s democracy status has been since it regained independence. From 1991 until at least 2007, its 90-percent confidence interval straddled the toss-up line. How’s that for uncertain?

Armenia’s Estimated Probability of Democracy with 90% Confidence Interval
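For anyone curious how an interval like this is typically read off a fitted Bayesian model, here is a small Python sketch. It assumes you have a set of posterior draws of the probability for one country-year and takes the 5th and 95th percentiles (what Bayesians usually call a credible interval). The draws below are simulated from an arbitrary Beta distribution purely for illustration; they are not Armenia's actual estimates.

```python
import random

def interval(draws, level=0.90):
    """Equal-tailed interval from a list of posterior draws."""
    ordered = sorted(draws)
    tail = (1 - level) / 2
    lower = ordered[int(tail * (len(ordered) - 1))]
    upper = ordered[int((1 - tail) * (len(ordered) - 1))]
    return lower, upper

# HYPOTHETICAL posterior draws for one ambiguous country-year.
rng = random.Random(1991)
draws = [rng.betavariate(2, 2) for _ in range(2000)]
lower, upper = interval(draws)
straddles_toss_up = lower < 0.5 < upper
print(round(lower, 2), round(upper, 2), straddles_toss_up)
```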

Sean and I are still talking about ways to tweak this process, but I think the data it’s producing are already useful and interesting. I’m considering using these estimates in a predictive model of coup attempts and seeing if and how the results differ from ones based on the Polity index and the Unified Democracy Scores. Meanwhile, the rest of the MADCOW crew and I are now talking about applying the same process to dichotomous indicators of military rule, one-party rule, personal rule, and monarchy and then experimenting with machine-learning processes that use the results as their targets. There are lots of moving parts in our regime data-making process, and this one isn’t necessarily the highest priority, but it would be great to get to follow this path and see where it leads.

* NSF Award 1259190, Collaborative Research: Automated Real-time Production of Political Indicators

Mining Texts to Generate Fuzzy Measures of Political Regime Type at Low Cost

Political scientists use the term “regime type” to refer to the formal and informal structure of a country’s government. Of course, “government” entails a lot of things, so discussions of regime type focus more specifically on how rulers are selected and how their authority is organized and exercised. The chief distinction in contemporary work on regime type is between democracies and non-democracies, but there’s some really good work on variations of non-democracy as well (see here and here, for example).

Unfortunately, measuring regime type is hard, and conventional measures of regime type suffer from one or two crucial drawbacks.

First, many of the data sets we have now represent regime types or their components with bivalent categorical measures that sweep meaningful uncertainty under the rug. Specific countries at specific times are identified as fitting into one and only one category, even when researchers knowledgeable about those cases might be unsure or disagree about where they belong. For example, all of the data sets that distinguish categorically between democracies and non-democracies—like this one, this one, and this one—agree that Norway is the former and Saudi Arabia the latter, but they sometimes diverge on the classification of countries like Russia, Venezuela, and Pakistan, and rightly so.

Importantly, the degree of our uncertainty about where a case belongs may itself be correlated with many of the things that researchers use data on regime type to study. As a result, findings and forecasts derived from those data are likely to be sensitive to those bivalent calls in ways that are hard to understand when that uncertainty is ignored. In principle, it should be possible to make that uncertainty explicit by reporting the probability that a case belongs in a specific set instead of making a crisp yes/no decision, but that’s not what most of the data sets we have now do.

Second, virtually all of the existing measures are expensive to produce. These data sets are coded either by hand or through expert surveys, and routinely covering the world this way takes a lot of time and resources. (I say this from knowledge of the budgets for the production of some of these data sets, and from personal experience.) Partly because these data are so costly to make, many of these measures aren’t regularly updated. And, if the data aren’t regularly updated, we can’t use them to generate the real-time forecasts that offer the toughest test of our theories and are of practical value to some audiences.

As part of the NSF-funded MADCOW project*, Michael D. (Mike) Ward, Philip Schrodt, and I are exploring ways to use text mining and machine learning to generate measures of regime type that are fuzzier in a good way from a process that is mostly automated. These measures would explicitly represent uncertainty about where specific cases belong by reporting the probability that a certain case fits a certain regime type instead of forcing an either/or decision. Because the process of generating these measures would be mostly automated, they would be much cheaper to produce than the hand-coded or survey-based data sets we use now, and they could be updated in near-real time as relevant texts become available.

At this week’s annual meeting of the American Political Science Association, I’ll be presenting a paper, co-authored with Mike and Shahryar Minhas of Duke University’s WardLab, that describes preliminary results from this endeavor. Shahryar, Mike, and I started by selecting a corpus of familiar and well-structured texts describing politics and human-rights practices each year in all countries worldwide: the U.S. State Department’s Country Reports on Human Rights Practices, and Freedom House’s Freedom in the World. After pre-processing those texts in a few conventional ways, we dumped the two reports for each country-year into a single bag of words and used text mining to extract features from those bags in the form of vectorized tokens that may be grossly described as word counts. (See this recent post for some things I learned from that process.) Next, we used those vectorized tokens as inputs to a series of binary classification models representing a few different ideal-typical regime types as observed in a few widely used, human-coded data sets. Finally, we applied those classification models to a test set of country-years held out at the start to assess the models’ ability to classify regime types in cases they had not previously “seen.” The picture below illustrates the process and shows how we hope eventually to develop models that can be applied to recent documents to generate new regime data in near-real time.

Overview of MADCOW Regime Classification Process
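The feature-extraction step is simpler than it may sound. The Python sketch below pools two short texts per country-year into one bag of words and turns the bags into count vectors over a shared vocabulary. The snippets are invented, the tokenizer is deliberately crude, and the real pipeline's pre-processing choices and classifiers are not reproduced here.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(*documents):
    """Pool several documents for one country-year into a single bag of counts."""
    bag = Counter()
    for doc in documents:
        bag.update(tokenize(doc))
    return bag

def vectorize(bags):
    """Turn bags into count vectors over a shared, sorted vocabulary."""
    vocabulary = sorted(set().union(*bags))
    return vocabulary, [[bag[token] for token in vocabulary] for bag in bags]

# INVENTED snippets standing in for the two annual reports on each case.
case_a = bag_of_words("The military junta suspended elections.",
                      "Elections were suspended; the junta rules by decree.")
case_b = bag_of_words("Free and fair elections were held.",
                      "Elections were competitive and free.")
vocabulary, vectors = vectorize([case_a, case_b])
print(vocabulary)
print(vectors)
```

Each row of `vectors` is what a classifier would see for one country-year, paired with a 0/1 label from a human-coded data set during training.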

Our initial results demonstrate that this strategy can work. Our classifiers perform well out of sample, achieving high or very high precision and recall scores in cross-validation on all four of the regime types we have tried to measure so far: democracy, monarchy, military rule, and one-party rule. The separation plots below are based on out-of-sample results from support vector machines trained on data from the 1990s and most of the 2000s and then applied to new data from the most recent few years available. When a classifier works perfectly, all of the red bars in the separation plot will appear to the right of all of the pink bars, and the black line denoting the probability of a “yes” case will jump from 0 to 1 at the point of separation. These classifiers aren’t perfect, but they seem to be working very well.

 

[Figures: separation plots from the preliminary SVM classifiers for democracy, military rule, monarchy, and one-party rule]
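For readers less familiar with these diagnostics, here is a short Python sketch of how precision and recall are computed, along with the "perfect separation" condition a separation plot is designed to show. The labels and probabilities are toy values, not our results.

```python
def precision_recall(actual, predicted):
    """Precision and recall for binary labels (1 = case belongs to the regime type)."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

def perfectly_separated(actual, probabilities):
    """True when every actual 1 gets a higher probability than every actual 0."""
    ones = [p for a, p in zip(actual, probabilities) if a == 1]
    zeros = [p for a, p in zip(actual, probabilities) if a == 0]
    return min(ones) > max(zeros)

# TOY held-out cases.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
probabilities = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.3, 0.7]
print(precision_recall(actual, predicted))
print(perfectly_separated(actual, probabilities))
```

Precision asks how many of the cases the classifier flagged really belong to the category; recall asks how many of the true members it found.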

Of course, what most of us want to do when we find a new data set is to see how it characterizes cases we know. We can do that here with heat maps of the confidence scores from the support vector machines. The maps below show the values from the most recent year available for two of the four regime types: 2012 for democracy and 2010 for military rule. These SVM confidence scores indicate the distance and direction of each case from the hyperplane used to classify the set of observations into 0s and 1s. The probabilities used in the separation plots are derived from them, but we choose to map the raw confidence scores because they exhibit more variance than the probabilities and are therefore easier to visualize in this form.

[Figures: world maps of SVM confidence scores for democracy (2012) and military rule (2010)]
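Here is a toy Python illustration of the relationship between those two quantities for a linear classifier. The confidence score is computed as the signed distance from the hyperplane, and the probability comes from squashing that score through a sigmoid, as in Platt scaling. Whether our software derives its probabilities exactly this way is an assumption, and the weights and sigmoid parameters are invented.

```python
import math

def confidence_score(weights, bias, features):
    """Signed distance of a case from the separating hyperplane w.x + b = 0."""
    raw = sum(w * x for w, x in zip(weights, features)) + bias
    norm = math.sqrt(sum(w * w for w in weights))
    return raw / norm

def to_probability(score, a=-2.0, b=0.0):
    """Platt-style sigmoid; a and b are ASSUMED here, normally fit on held-out data."""
    return 1.0 / (1.0 + math.exp(a * score + b))

# INVENTED two-feature classifier and three cases.
weights, bias = [3.0, 4.0], -5.0
for features in ([3.0, 4.0], [1.0, 1.0], [0.0, 0.0]):
    score = confidence_score(weights, bias, features)
    print(round(score, 2), round(to_probability(score), 3))
```

The sigmoid flattens out quickly, so cases that sit at quite different distances from the hyperplane can end up with nearly identical probabilities close to 0 or 1. That compression is why the raw scores make for more informative maps.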

 

On the whole, cases fall out as we would expect them to. The democracy classifier confidently identifies Western Europe, Canada, Australia, and New Zealand as democracies; shows interesting variations in Eastern Europe and Latin America; and confidently identifies nearly all of the rest of the world as non-democracies (with democracy defined for this task as a Polity score of 10). Meanwhile, the military rule classifier sees Myanmar, Pakistan, and (more surprisingly) Algeria as likely examples in 2010, and is less certain about the absence of military rule in several West African and Middle Eastern countries than in the rest of the world.

These preliminary results demonstrate that it is possible to generate probabilistic measures of regime type from publicly available texts at relatively low cost. That does not mean we’re fully satisfied with the output and ready to move to routine data production, however. For now, we’re looking at a couple of ways to improve the process.

First, the texts included in the relatively small corpus we have assembled so far only cover a narrow set of human-rights practices and political procedures. In future iterations, we plan to expand the corpus to include annual or occasional reports that discuss a broader range of features in each country’s national politics. Eventually, we hope to add news stories to the mix. If we can develop models that perform well on an amalgamation of occasional reports and news stories, we will be able to implement this process in near-real time, constantly updating probabilistic measures of regime type for all countries of the world at very low cost.

Second, the stringent criteria we used to observe each regime type in constructing the binary indicators on which the classifiers are trained also appear to be shaping the results in undesirable ways. We started this project with a belief that membership in these regime categories is inherently fuzzy, and we are trying to build a process that uses text mining to estimate degrees of membership in those fuzzy sets. If set membership is inherently ambiguous in a fair number of cases, then our approximation of a membership function should be bimodal, but not too neatly so. Most cases most of the time can be placed confidently at one end of the range of degrees of membership or the other, but there is considerable uncertainty at any moment in time about a non-trivial number of cases, and our estimates should reflect that fact.

If that’s right, then our initial estimates are probably too tidy, and we suspect that the stringent operationalization of each regime type in the training data is partly to blame. In future iterations, we plan to experiment with less stringent criteria—for example, by identifying a case as military rule if any of our sources tags it as such. With help from Sean J. Taylor, we’re also looking at ways we might use Bayesian measurement error models to derive fuzzy measures of regime type from multiple categorical data sets, and then use that fuzzy measure as the target in our machine-learning process.

So, stay tuned for more, and if you’ll be at APSA this week, please come to our Friday-morning panel and let us know what you think.

* NSF Award 1259190, Collaborative Research: Automated Real-time Production of Political Indicators

The Worst World EVER…in the Past 5 or 10 Years

A couple of months ago, the head of the UN’s refugee agency announced that, in 2013, “the number of people displaced by violent conflict hit the highest level since World War II,” and he noted that the number was still growing in 2014.

A few days ago, under the headline “Countries in Crisis at Record High,” Foreign Policy’s The Cable reported that the UN’s Inter-Agency Standing Committee for the first time ever had identified four situations worldwide—Syria, Iraq, South Sudan, and Central African Republic—as level 3 humanitarian emergencies, its highest (worst) designation.

Today, the Guardian reported that “last year was the most dangerous on record for humanitarian workers, with 155 killed, 171 seriously wounded and 134 kidnapped as they attempted to help others in some of the world’s most dangerous places.”

If you read those stories, you might infer that the world has become more insecure than ever, or at least the most insecure it’s been since the last world war. That would be reasonable, but probably also wrong. These press accounts of record-breaking trends often omit or underplay a crucial detail: the data series on which these claims rely don’t extend very far into the past.

In fact, we don’t know how the current number of displaced persons compares to all years since World War II, because the UN only has data on that since 1989. In absolute terms, the number of refugees worldwide is now the largest it’s been since record-keeping began 25 years ago. Measured as a share of global population, however, the number of displaced persons in 2013 had not yet matched the peak of the early 1990s (see the Addendum here).

The Cable accurately states that having four situations designated as level-3 humanitarian disasters by the UN is “unprecedented,” but we only learn late in the story that the system which makes these designations has only existed for a few years. In other words, unprecedented…since 2011.

Finally, while the Guardian correctly reports that 2013 was the most dangerous year on record for aid workers, it fails to note that those records only reach back to the late 1990s.

I don’t mean to make light of worrisome trends in the international system or any of the terrible conflicts driving them. From the measures I track—see here and here, for example, and here for an earlier post on causes—I’d say that global levels of instability and violent conflict are high and waxing, but they have not yet exceeded the peaks we saw in the early 1990s and probably the 1960s. Meanwhile, the share of states worldwide that are electoral democracies remains historically high, and the share of the world’s population living in poverty has declined dramatically in the past few decades. The financial crisis of 2008 set off a severe and persistent global recession, but that collapse could have been much worse, and institutions of global governance deserve some credit for helping to stave off an even deeper failure.

How can all of these things be true at the same time? It’s a bit like climate change. Just as one or even a few unusually cool years wouldn’t reverse or disprove the clear long-term trend toward a hotter planet, an extended phase of elevated disorder and violence doesn’t instantly undo the long-term trends toward a more peaceful and prosperous human society. We are currently witnessing (or suffering) a local upswing in disorder that includes numerous horrific crises, but in global historical terms, the world has not fallen apart.

Of course, if it’s a mistake to infer global collapse from these local trends, it’s also a mistake to infer that global collapse is impossible from the fact that it hasn’t occurred already. The war that is already consuming Syria and Iraq is responsible for a substantial share of the recent increase in refugee flows and casualties, and it could spread further and burn hotter for some time to come. Probably more worrisome to watchers of long-term trends in international relations, the crisis in Ukraine and recent spate of confrontations between China and its neighbors remind us that war between major powers could happen again, and this time those powers would both or all have nuclear weapons. Last but not least, climate change seems to be accelerating with consequences unknown.

Those are all important sources of elevated uncertainty, but uncertainty and breakdown are not the same thing. Although those press stories describing unprecedented crises are all covering important situations and trends, I think their historical perspective is too shallow. I’m forty-four years old. The global system is less orderly than it’s been in a while, but it’s still not worse than it’s ever been in my lifetime, and it’s still nowhere near as bad as it was when my parents were born. I won’t stop worrying or working on ways to try to make things a tiny bit better, but I will keep that frame of reference in mind.

Notes From a First Foray into Text Mining

Guess what? Text mining isn’t push-button, data-making magic, either. As Phil Schrodt likes to say, there is no Data Fairy.

data fairy meme

I’m quickly learning this point from my first real foray into text mining. Under a grant from the National Science Foundation, I’m working with Phil Schrodt and Mike Ward to use these techniques to develop new measures of several things, including national political regime type.

I wish I could say that I’m doing the programming for this task, but I’m not there yet. For the regime-data project, the heavy lifting is being done by Shahryar Minhas, a sharp and able Ph.D. student in political science at Duke University, where Mike leads the WardLab. Shahryar and I are scheduled to present preliminary results from this project at the upcoming Annual Meeting of the American Political Science Association in Washington, DC (see here for details).

When we started work on the project, I imagined a relatively simple and mostly automatic process running from location and ingestion of the relevant texts to data extraction, model training, and, finally, data production. Now that we’re actually doing it, though, I’m finding that, as always, the devil is in the details. Here are just a few of the difficulties and decision points we’ve had to confront so far.

First, the structure of the documents available online often makes it difficult to scrape and organize them. We initially hoped to include annual reports on politics and human-rights practices from four or five different organizations, but some of the ones we wanted weren’t posted online in a format we could readily scrape. At least one was scrapable but not organized by country, so we couldn’t properly group the text for analysis. In the end, we wound up with just two sets of documents in our initial corpus: the U.S. State Department’s Country Reports on Human Rights Practices, and Freedom House’s annual Freedom in the World documents.

Differences in naming conventions almost tripped us up, too. For our first pass at the problem, we are trying to create country-year data, so we want to treat all of the documents describing a particular country in a particular year as a single bag of words. As it happens, the State Department labels its human rights reports for the year on which they report, whereas Freedom House labels its Freedom in the World report for the year in which it’s released. So, for example, both organizations have already issued their reports on conditions in 2013, but Freedom House dates that report to 2014 while State dates its version to 2013. Fortunately, we knew this and made a simple adjustment before blending the texts. If we hadn’t known about this difference in naming conventions, however, we would have ended up combining reports for different years from the two sources and made a mess of the analysis.
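The adjustment described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's code: the source labels, the tuple layout, and the country name "Ruritania" are all hypothetical.

```python
from collections import defaultdict


def year_covered(source, labeled_year):
    """Return the year a report describes.

    Freedom House labels reports by release year, one year after the
    conditions they cover; State labels reports by the year covered.
    """
    if source == "freedom_house":
        return labeled_year - 1
    return labeled_year


def build_country_year_corpus(reports):
    """Group report texts into one bag of text per (country, year)."""
    grouped = defaultdict(list)
    for source, country, labeled_year, text in reports:
        key = (country, year_covered(source, labeled_year))
        grouped[key].append(text)
    return {key: " ".join(texts) for key, texts in grouped.items()}


# Made-up records: (source, country, year on the label, text).
reports = [
    ("state", "Ruritania", 2013, "state text on 2013"),
    ("freedom_house", "Ruritania", 2014, "freedom house text on 2013"),
]
corpus = build_country_year_corpus(reports)
print(corpus)
```

Without the one-year shift, the two records would land under different keys, and the Freedom House text would be blended with State's report on the following year.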

Once ingested, those documents include some text that isn’t relevant to our task, or that is relevant but the meaning of which is tacit. Common stop words like “the”, “a”, and “an” are obvious and easy to remove. More challenging are the names of people, places, and organizations. For our regime-data task, we’re interested in the abstract roles behind some of those proper names—president, prime minister, ruling party, opposition party, and so on—rather than the names themselves, but text mining can’t automatically derive the one from the other.

For our initial analysis, we decided to omit all proper names and acronyms to focus the classification models on the most general language. In future iterations, though, it would be neat if we could borrow dictionaries developed for related tasks and use them to replace those proper names with more general markers. For example, in a report or story on Russia, Vladimir Putin might get translated into <head of government>, the FSB into <police>, and Chechen Republic of Ichkeria into <rebel group>. This approach would preserve the valuable tacit information in those names while making it explicit and uniform for the pattern-recognition stage.
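A rough Python sketch of that dictionary idea, combined with stop-word removal, might look like the following. It is only an illustration: the three-entry dictionary reuses the examples from the paragraph above, the stop-word list is deliberately tiny, and the function names are hypothetical.

```python
import re

STOP_WORDS = {"the", "a", "an"}

# Hypothetical name-to-role dictionary; a real one would be borrowed
# from a related project and would be far larger.
ROLE_MARKERS = {
    "Vladimir Putin": "<head of government>",
    "FSB": "<police>",
    "Chechen Republic of Ichkeria": "<rebel group>",
}


def replace_names(text, markers=ROLE_MARKERS):
    """Swap known proper names for general role markers."""
    # Longest names first, so multi-word names win over shorter overlaps.
    for name in sorted(markers, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        text = re.sub(pattern, markers[name], text)
    return text


def bag_of_words(text):
    """Tokenize, keep role markers whole, and drop stop words."""
    tokens = re.findall(r"<[^>]+>|[A-Za-z']+", replace_names(text))
    tokens = [t if t.startswith("<") else t.lower() for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]


sample = "Vladimir Putin praised the FSB after an operation."
print(bag_of_words(sample))
```

The role markers survive tokenization as single tokens, so the classifier would see the same feature whoever happens to hold the office in a given country and year.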

That’s not all, but it’s enough to make the point. These things are always harder than they look, and text mining is no exception. In any case, we’ve now run this gantlet once and made our way to an encouraging set of initial results. I’ll post something about those results closer to the conference when the paper describing them is ready for public consumption. In the meantime, though, I wanted to share a few of the things I’ve already learned about these techniques with others who might be thinking about applying them, or who already do and can commiserate.

Turning Crowdsourced Preseason NFL Strength Ratings into Game-Level Forecasts

For the past week, nearly all of my mental energy has gone into the Early Warning Project and a paper for the upcoming APSA Annual Meeting here in Washington, DC. Over the weekend, though, I found some time for a toy project on forecasting pro-football games. Here are the results.

The starting point for this toy project is a pairwise wiki survey that turns a crowd’s beliefs about relative team strength into scalar ratings. Regular readers will recall that I first experimented with one of these before the 2013-2014 NFL season, and the predictive power wasn’t terrible, especially considering that the number of participants was small and the ratings were completed before the season started.

This year, to try to boost participation and attract a more knowledgeable crowd of respondents, I paired with Trey Causey to announce the survey on his pro-football analytics blog, The Spread. The response has been solid so far. Since the survey went up, the crowd—that’s you!—has cast nearly 3,400 votes in more than 100 unique user sessions (see the Data Visualizations section here).

The survey will stay open throughout the season, but that doesn’t mean it’s too early to start seeing what it’s telling us. One thing I’ve already noticed is that the crowd does seem to be updating in response to preseason action. For example, before the first round of games, I noticed that the Baltimore Ravens, my family’s favorites, were running mid-pack with a rating of about 50. After they trounced the defending NFC champion 49ers in their preseason opener, however, the Ravens jumped to the upper third with a rating of 59. (You can always see up-to-the-moment survey results here, and you can cast your own votes here.)

The wiki survey is a neat way to measure team strength. On their own, though, those ratings don’t tell us what we really want to know, which is how each game is likely to turn out, or how well our team might be expected to do this season. The relationship between relative strength and game outcomes should be pretty strong, but we might want to consider other factors, too, like home-field advantage. To turn a strength rating into a season-level forecast for a single team, we need to consider the specifics of its schedule. In game play, it’s relative strength that matters, and some teams will have tougher schedules than others.

A statistical model is the best way I can think of to turn ratings into game forecasts. To get a model to apply to this season’s ratings, I estimated a simple linear one from last year’s preseason ratings and the results of all 256 regular-season games (found online in .csv format here). The model estimates net score (home minus visitor) from just one feature, the difference between the two teams’ preseason ratings (again, home minus visitor). Because the net scores are all ordered the same way and the model also includes an intercept, though, it implicitly accounts for home-field advantage as well.

The scatterplot below shows the raw data on those two dimensions from the 2013 season. The model estimated from these data has an intercept of 3.1 and a slope of 0.1 for the rating differential. In other words, the model identifies a net home-field advantage of 3 points—consistent with the conventional wisdom—and it suggests that every point of advantage on the wiki-survey ratings translates into a net swing of one-tenth of a point on the field. I also tried a generalized additive model with smoothing splines to see if the association between the survey-score differential and net game score was nonlinear, but as the scatterplot suggests, it doesn’t seem to be.
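For readers who want to see the arithmetic, here is a small Python sketch of the fitted line. The intercept and slope are the values reported above; everything else (the function names, the example ratings, and the four made-up data points used to show the least-squares fit) is illustrative and is not the 2013 data.

```python
INTERCEPT = 3.1  # estimated home-field advantage, in points
SLOPE = 0.1      # points of net score per point of rating differential


def predict_net_score(home_rating, visitor_rating):
    """Predicted net score (home minus visitor) from preseason ratings."""
    return INTERCEPT + SLOPE * (home_rating - visitor_rating)


def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# A home team rated 59 hosting a visitor rated 50 (hypothetical matchup).
print(round(predict_net_score(59, 50), 1))

# Made-up points lying exactly on the line, just to exercise the fit.
diffs = [-10, 0, 10, 20]
nets = [2.1, 3.1, 4.1, 5.1]
print(fit_line(diffs, nets))
```

One implication of these coefficients is that a visitor would need a rating edge of about 31 points to be predicted to break even on the road, because 3.1 divided by 0.1 is 31.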

2013 NFL Games Arranged by Net Game Score and Preseason Wiki Survey Rating Differentials


In sample, the linear model’s accuracy was good, not great. If we convert the net scores the model postdicts to binary outcomes and compare those postdictions to actual outcomes, we see that the model correctly classifies 60 percent of the games. That’s in sample, but it’s also based on nothing more than home-field advantage and a single preseason rating for each team from a survey with a small number of respondents. So, all things considered, it looks like a potentially useful starting point.
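The conversion from net scores to binary outcomes is simple enough to sketch in Python. The five games below are made up for illustration, and the helper names are hypothetical; the sketch also assumes that an actual tie (net score of 0) counts as "not a home win."

```python
def predicts_home_win(net_score):
    """A positive net score (home minus visitor) means a home-team win."""
    return net_score > 0


def classification_accuracy(predicted_scores, actual_scores):
    """Share of games where the predicted and actual winners match."""
    hits = sum(
        predicts_home_win(p) == predicts_home_win(a)
        for p, a in zip(predicted_scores, actual_scores)
    )
    return hits / len(predicted_scores)


# Made-up net scores for five games, not the 2013 results.
predicted = [4.0, 1.5, -2.0, 6.0, -0.5]
actual = [7, -3, -10, 3, 14]

accuracy = classification_accuracy(predicted, actual)
print(f"{accuracy:.0%} of games classified correctly")
```

This measure ignores margins entirely: a forecast that misses by 20 points still counts as a hit if it picks the right winner.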

Whatever its limitations, that model gives us the tool we need to convert 2014 wiki survey results into game-level predictions. To do that, we also need a complete 2014 schedule. I couldn’t find one in .csv format, but I found something close (here) that I saved as text, manually cleaned in a minute or so (deleted extra header rows, fixed remaining header), and then loaded and merged with a .csv of the latest survey scores downloaded from the manager’s view of the survey page on All Our Ideas.
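The load-and-merge step can be sketched with Python's standard csv module. I did this work in R, so treat the following as a hypothetical equivalent: the column names, team labels, and ratings are all invented for the example.

```python
import csv
import io

# Tiny stand-ins for the cleaned schedule and the survey-score export.
SCHEDULE_CSV = """week,visitor,home
1,AAA,BBB
1,CCC,DDD
"""

RATINGS_CSV = """team,score
AAA,70
BBB,80
CCC,55
DDD,50
"""


def load_ratings(text):
    """Map each team to its survey rating."""
    return {row["team"]: float(row["score"]) for row in csv.DictReader(io.StringIO(text))}


def merge_schedule(schedule_text, ratings):
    """Attach the home-minus-visitor rating differential to every game."""
    games = []
    for row in csv.DictReader(io.StringIO(schedule_text)):
        games.append({
            "week": int(row["week"]),
            "visitor": row["visitor"],
            "home": row["home"],
            # A KeyError here flags a team name that differs between files.
            "rating_diff": ratings[row["home"]] - ratings[row["visitor"]],
        })
    return games


ratings = load_ratings(RATINGS_CSV)
games = merge_schedule(SCHEDULE_CSV, ratings)
for game in games:
    print(game)
```

Letting the lookup fail loudly on an unknown team is a deliberate choice here, because mismatched team names between the two files are the most likely thing to go wrong in a merge like this.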

I’m not going to post forecasts for all 256 games—at least not now, with three more preseason games to learn from and, hopefully, lots of votes yet to be cast. To give you a feel for how the model is working, though, I’ll show a couple of cuts on those very preliminary results.

The first is a set of forecasts for all Week 1 games. The labels show Visitor-Home, and the net score is ordered the same way. So, a predicted net score greater than 0 means the home team (second in the paired label) is expected to win, while a predicted net score below 0 means the visitor (first in the paired label) is expected to win. The lines around the point predictions represent 90-percent confidence intervals, giving us a partial sense of the uncertainty around these estimates.

Week 1 Game Forecasts from Preseason Wiki Survey Results on 10 August 2014


Of course, as a fan of a particular team, I’m most interested in what the model says about how my guys are going to do this season. The next plot shows predictions for all 16 of Baltimore’s games. Unfortunately, the plotting command orders the data by label, and my R skills and available time aren’t sufficient to reorder them by week, but the information is all there. In this plot, the dots for the point predictions are colored red if they predict a Baltimore win and black for an expected loss. The good news for Ravens fans is that this plot suggests an 11-5 season, good enough for a playoff berth. The bad news is that an 8-8 season also lies within the 90-percent confidence intervals, so the playoffs don’t look like a lock.
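One way to tally a season from game-level forecasts is sketched below in Python. It is a hypothetical reading of the plot, not the procedure behind it: the four games and their intervals are made up, and the function names are my own. Because net scores are ordered home minus visitor, road games have to be flipped before counting.

```python
def team_perspective(point, lower, upper, is_home):
    """Re-express a home-minus-visitor forecast from one team's side."""
    if is_home:
        return point, lower, upper
    # On the road, the team's margin is the negative of the net score,
    # so the interval bounds swap places as well as signs.
    return -point, -upper, -lower


def season_outlook(games):
    """Count expected wins, firm wins, and possible wins.

    A firm win has its whole interval above zero; a possible win has
    at least part of its interval above zero.
    """
    sided = [team_perspective(*game) for game in games]
    expected = sum(1 for point, _, _ in sided if point > 0)
    firm = sum(1 for _, lower, _ in sided if lower > 0)
    possible = sum(1 for _, _, upper in sided if upper > 0)
    return expected, firm, possible


# Made-up forecasts: (point, lower, upper, team is at home?).
games = [
    (4.0, 1.0, 7.0, True),
    (2.0, -1.0, 5.0, True),
    (3.0, 0.5, 5.5, False),
    (-2.5, -5.0, 0.0, False),
]
print(season_outlook(games))
```

The gap between expected wins and firm wins is a rough guide to how many predicted victories could plausibly flip, which is the spirit of the 11-5 versus 8-8 comparison above.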

2014 Game-Level Forecasts for the Baltimore Ravens from 10 August 2014 Wiki Survey Scores


So that’s where the toy project stands now. My intuition tells me that the predicted net scores aren’t as well calibrated as I’d like, and the estimated confidence intervals surely understate the true uncertainty around each game (“On any given Sunday…”). Still, I think this exercise demonstrates the potential of this forecasting process. If I were a betting man, I wouldn’t lay money on these estimates. As an applied forecaster, though, I can imagine using these predictions as priors in a more elaborate process that incorporates additional and, ideally, more dynamic information about each team and game situation over the course of the season. Maybe my doppelganger can take that up while I get back to my day job…

Postscript. After I published this post, Jeff Fogle suggested via Twitter that I compare the Week 1 forecasts to the current betting lines for those games. The plot below shows the median point spread from an NFL odds-aggregating site as blue dots on top of the statistical forecasts already shown above. As you can see, the statistical forecasts are tracking the betting lines pretty closely. There’s only one game—Carolina at Tampa Bay—where the predictions from the two series fall on different sides of the win/loss line, and it’s a game the statistical model essentially sees as a toss-up. It’s also reassuring that there isn’t a consistent direction to the differences, so the statistical process doesn’t seem to be biased in some fundamental way.

Week 1 Game-Level Forecasts Compared to Median Point Spread from Betting Sites on 11 August 2014


Forecasting Round-Up No. 7

1. I got excited when I heard on Twitter yesterday about a machine-learning process that turns out to be very good at predicting U.S. Supreme Court decisions (blog post here, paper here). I got even more excited when I saw that the guys who built that process have also been running a play-money prediction market on the same problem for the past several years, and that the most accurate forecasters in that market have done even better than that model (here). It sounds like they are now thinking about more rigorous ways to compare and cross-pollinate the two. That’s part of what we’re trying to do with the Early Warning Project, so I hope that they do and we can learn from their findings.

2. A paper in the current issue of the Journal of Personality and Social Psychology (here, but paywalled; hat-tip to James Igoe Walsh) adds to the growing pile of evidence on the forecasting power of crowds, with an interesting additional finding on the willingness of others to trust and use those forecasts:

We introduce the select-crowd strategy, which ranks judges based on a cue to ability (e.g., the accuracy of several recent judgments) and averages the opinions of the top judges, such as the top 5. Through both simulation and an analysis of 90 archival data sets, we show that select crowds of 5 knowledgeable judges yield very accurate judgments across a wide range of possible settings—the strategy is both accurate and robust. Following this, we examine how people prefer to use information from a crowd. Previous research suggests that people are distrustful of crowds and of mechanical processes such as averaging. We show in 3 experiments that, as expected, people are drawn to experts and dislike crowd averages—but, critically, they view the select-crowd strategy favorably and are willing to use it. The select-crowd strategy is thus accurate, robust, and appealing as a mechanism for helping individuals tap collective wisdom.
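The select-crowd strategy in that abstract is easy to sketch in Python. This is a bare-bones illustration, not the paper's implementation: the judges, their recent errors, and their current judgments are all made up, and recent mean absolute error is just one possible "cue to ability."

```python
def select_crowd_forecast(recent_errors, current_judgments, k=5):
    """Average the current judgments of the k judges with the lowest
    recent error."""
    ranked = sorted(recent_errors, key=recent_errors.get)  # best first
    top = ranked[:k]
    return sum(current_judgments[judge] for judge in top) / len(top)


def whole_crowd_forecast(current_judgments):
    """Plain average over every judge, for comparison."""
    return sum(current_judgments.values()) / len(current_judgments)


# Seven hypothetical judges: recent error and current estimate.
recent_errors = {"a": 1.0, "b": 2.0, "c": 0.5, "d": 3.0, "e": 1.5, "f": 9.0, "g": 8.0}
current_judgments = {"a": 10, "b": 12, "c": 11, "d": 9, "e": 13, "f": 40, "g": -20}

print(select_crowd_forecast(recent_errors, current_judgments))
print(round(whole_crowd_forecast(current_judgments), 2))
```

In this toy case the two least accurate judges are also the most extreme, so dropping them tightens the estimate while still averaging over several opinions.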

3. Adam Elkus recently spotlighted two interesting papers involving agent-based modeling (ABM) and forecasting.

  • The first (here) “presents a set of guidelines, imported from the field of forecasting, that can help social simulation and, more specifically, agent-based modelling practitioners to improve the predictive performance and the robustness of their models.”
  • The second (here), from 2009 but new to me, describes an experiment in deriving an agent-based model of political conflict from event data. The results were pretty good; a model built from event data and then tweaked by a subject-matter expert was as accurate as one built entirely by hand, and the hybrid model took much less time to construct.

4. Nautilus ran a great piece on Lewis Fry Richardson, a pioneer in weather forecasting who also applied his considerable intellect to predicting violent conflict. As the story notes,

At the turn of the last century, the notion that the laws of physics could be used to predict weather was a tantalizing new idea. The general idea—model the current state of the weather, then apply the laws of physics to calculate its future state—had been described by the pioneering Norwegian meteorologist Vilhelm Bjerknes. In principle, Bjerknes held, good data could be plugged into equations that described changes in air pressure, temperature, density, humidity, and wind velocity. In practice, however, the turbulence of the atmosphere made the relationships among these variables so shifty and complicated that the relevant equations could not be solved. The mathematics required to produce even an initial description of the atmosphere over a region (what Bjerknes called the “diagnostic” step) were massively difficult.

Richardson helped solve that problem in weather forecasting by breaking the task into many more manageable parts—atmospheric cells, in this case—and thinking carefully about how those parts fit together. I wonder if we will see similar advances in forecasts of social behavior in the next 100 years. I doubt it, but the trajectory of weather prediction over the past century should remind us to remain open to the possibility.

5. Last, a bit of fun: Please help Trey Causey and me forecast the relative strength of this year’s NFL teams by voting in this pairwise wiki survey! I did this exercise last year, and the results weren’t bad, even though the crowd was pretty small and probably not especially expert. Let’s see what happens if more people participate, shall we?

In Praise of a Measured Response to the Ukraine Crisis

Yesterday afternoon, I tweeted that the Obama administration wasn’t getting enough credit for its measured response to the Ukraine crisis so far, asserting that sanctions were really hurting Russia and noting that “we”—by which I meant the United States—were not directly at war.

Not long after I said that, someone I follow tweeted that he hadn’t seen a compelling explanation of how sanctions are supposed to work in this case. That’s an important question, and one I also haven’t seen or heard answered in depth. I don’t know how U.S. or European officials see this process beyond what they say in public, but I thought I would try to spell out the logic as a way to back up my own assertion in support of the approach the U.S. and its allies have pursued so far.

I’ll start by clarifying what I’m talking about. When I say “Ukraine crisis,” I am referring to the tensions created by Russia’s annexation of Crimea and its evident and ongoing support for a separatist rebellion in eastern Ukraine. These actions are only the latest in a long series of interactions with the U.S. and Europe in Russia’s “near abroad,” but their extremity and the aggressive rhetoric and action that have accompanied them have sharply amplified tensions between the larger powers that abut Ukraine on either side. For the first time in a while, there has been open talk of a shooting war between Russia and NATO. Whatever you make of the events that led to it and however you assign credit or blame for them, this state of affairs represents a significant and undesirable escalation.

Faced with this crisis, the U.S. and its NATO allies have three basic options: compel, cajole, or impel.

Compel in this case means to push Russia out of Ukraine by force—in other words, to go to war. So far, the U.S. and Europe appear to have concluded—correctly, in my opinion—that Russia’s annexation of Crimea and its support for separatists in eastern Ukraine do not warrant a direct military response. The likely and possible costs of war between two nuclear powers are simply too great to bear for the sake of Ukraine’s autonomy or territorial integrity.

Cajoling would mean persuading Russian leaders to reverse course through positive incentives—carrots of some kind. It’s hard to imagine what the U.S. and E.U. could offer that would have the desired effect, however. Russian leaders consider Ukraine a vital interest, and the West has nothing comparably valuable to offer in exchange. More important, the act of making such an offer would reward Russia for its aggression, setting a precedent that could encourage Russia to grab for more and could also affect other countries’ perceptions of the U.S.’s tolerance for seizures of territory.

That leaves impel—to impose costs on Russia to the point where its leaders feel obliged to change course. The chief tool that U.S. and European leaders have to impose costs on Russia is economic and financial sanctions. Those leaders are using this tool, and it seems to be having the desired effect. Sanctions are encouraging capital flight, raising the costs of borrowing, increasing inflation, and slowing Russia’s already-anemic economic growth (see here and here for some details). Investors, bankers, and consumers are partly responding to the specific constraints of sanctions, but they are also responding to the broader economic uncertainty associated with those sanctions and the threat of wider war they imply. “It’s pure geopolitical risk,” one analyst told Bloomberg.

These costs can directly and indirectly shape Russian policy. They can directly affect Russian policy if and as the present leadership comes to view them as unbearable, or at least not worth the trade-offs against other policy objectives. That seems unlikely in the short term but increasingly likely over the long term, if the sanctions are sustained and markets continue to react so negatively. Sustained capital flight, rising inflation, and slower growth will gradually shrink Russia’s domestic policy options and its international power by eroding its fiscal health, and at some point these costs should come to outweigh the putative gains of territorial expansion and stronger leverage over Ukrainian policy.

These costs can also indirectly affect Russian policy by increasing the risk of internal instability. In authoritarian regimes, significant reforms usually occur in the face of popular unrest that may or may not be egged on by elites who defect from the ruling coalition. We are already seeing signs of infighting among regime insiders, and rising inflation and slowing growth should increase the probability of popular unrest.

To date, sanctions have not dented Putin’s soaring approval rating, but social unrest is not a referendum. Unrest only requires a small but motivated segment of the population to get started, and once it starts, its very occurrence can help persuade others to follow. I still wouldn’t bet on Putin’s downfall in the near future, but I believe the threat of significant domestic instability is rising, and I think that Putin & co. will eventually care more about this domestic risk than the rewards of continued adventurism abroad. In fact, I think we see some evidence that Putin & co. are already worrying more about this risk in their ever-expanding crackdown on domestic media and their recent moves to strengthen punishment for unauthorized street rallies and, ironically, calls for separatism. Even if this mobilization does not come, the increased threat of it should weigh on the Russian administration’s decision-making.

In my tweet on the topic, I credited the Obama administration for using measured rhetoric and shrewd policy in response to this crisis. Importantly, though, the success of this approach also depends heavily on cooperation between the U.S. and the E.U., and that seems to be happening. It’s not clear who deserves the credit for driving this process, but as one anonymous tweeter pointed out, the downing of flight MH17 appears to have played a role in deepening it.

Concerns are growing that sanctions may, in a sense, be too successful. Some observers fear that apparent capitulation to the U.S. and Europe would cost Russian leaders too much at home at a time when nationalist fervor has reached fever pitch. Confronted with a choice between wider war abroad or a veritable lynch mob at home, Putin & co. will, they argue, choose the former.

I think that this line of reasoning overstates the extent to which the Russian administration’s hands are tied at home. Putin & co. are arguably no more captive to the reinvigorated radical-nationalist fringe than they were to the liberal fringe that briefly threatened to oust them after the last presidential election.

Still, it is at least a plausible scenario, and the U.S. and E.U. have to be prepared for the possibility that Russian aggression will get worse before it gets better. This is where rhetorical and logistical efforts to bolster NATO are so important. NATO is predicated on a promise of collective defense; an attack on any one member state is regarded as an attack on all. By strengthening Russian policy-makers’ beliefs that this promise is credible, NATO can lead them to fear that escalations beyond certain thresholds will carry extreme costs and even threaten their very survival. So far, that’s just what the alliance has been doing with a steady flow of words and actions. Russian policy-makers could still choose wider war for various reasons, but theory and experience suggest that they are less likely to do so than they would be in the absence of this response.

In sum, given a short menu of unpalatable options, I think that the Obama administration and its European allies have chosen the best line of action and, so far, made the most of it. To expect Russia quickly to reverse course by withdrawing from Crimea and stopping its rabble-rousing in eastern Ukraine without being compelled by force to do so is unrealistic. The steady, measured approach the U.S. and E.U. have adopted appears to be having the intended effects. Russia could still react to the rising structural pressures on it by lashing out, but NATO is taking careful steps to discourage that response and to prepare for it if it comes. Under such lousy circumstances, I think this is about as well as we could expect the Obama administration and its E.U. counterparts to do.

Uncertainty About How Best to Convey Uncertainty

NPR News ran a series of stories this week under the header Risk and Reason, on “how well we understand and act on probabilities.” I thought the series nicely represented how uncertain we are about how best to convey forecasts to people who might want to use them. There really is no clear standard here, even though it is clear that the choices we make in presenting forecasts and other statistics on risks to their intended consumers strongly shape what they hear.

This uncertainty about how best to convey forecasts was on full display in the piece on how CIA analysts convey predictive assessments (here). Ken Pollack, a former analyst who now teaches intelligence analysis, tells NPR that, at CIA, “There was a real injunction that no one should ever use numbers to explain probability.” Asked why, he says that,

Assigning numerical probability suggests a much greater degree of certainty than you ever want to convey to a policymaker. What we are doing is inherently difficult. Some might even say it’s impossible. We’re trying to predict the future. And, you know, saying to someone that there’s a 67 percent chance that this is going to happen, that sounds really precise. And that makes it seem like we really know what’s going to happen. And the truth is that we really don’t.

In that same segment, though, Dartmouth professor Jeff Friedman, who studies decision-making about national security issues, says we should provide a numeric point estimate of an event’s likelihood, along with some information about our confidence in that estimate and how malleable it may be. (See this paper by Friedman and Richard Zeckhauser for a fuller treatment of this argument.) The U.S. Food and Drug Administration apparently agrees; according to the same NPR story, the FDA “prefers numbers and urges drug companies to give numerical values for risk—and to avoid using vague terms such as ‘rare, infrequent and frequent.’”

Instead of numbers, Pollack advocates for using words: “Almost certainly or highly likely or likely or very unlikely,” he tells NPR. As noted by one of the other stories in the series (here), however—on the use of probabilities in medical decision-making—words and phrases are ambiguous, too, and that ambiguity can be just as problematic.

Doctors, including Leigh Simmons, typically prefer words. Simmons is an internist and part of a group practice that provides primary care at Mass General. “As doctors we tend to often use words like, ‘very small risk,’ ‘very unlikely,’ ‘very rare,’ ‘very likely,’ ‘high risk,’ ” she says.

But those words can be unclear to a patient.

“People may hear ‘small risk,’ and what they hear is very different from what I’ve got in my mind,” she says. “Or what’s a very small risk to me, it’s a very big deal to you if it’s happened to a family member.”

Intelligence analysts have sometimes tried to remove that ambiguity by standardizing the language they use to convey likelihoods, most famously in Sherman Kent’s “Words of Estimative Probability.” It’s not clear to me, though, how effective this approach is. For one thing, consumers are often lazy about trying to understand just what information they’re being given, and templates like Kent’s don’t automatically solve that problem. This laziness came across most clearly in NPR’s Risk and Reason segment on meteorology (here). Many of us routinely consume probabilistic forecasts of rainfall and make decisions in response to them, but it turns out that few of us understand what those forecasts actually mean. With Kent’s words of estimative probability, I suspect that many readers of the products that use them haven’t memorized the table that spells out their meaning and don’t bother to consult it when they come across those phrases, even when it’s reproduced in the same document.

Equally important, a template that works well for some situations won’t necessarily work for all. I’m thinking in particular of forecasts on the kinds of low-probability, high-impact events that I usually analyze and that are essential to the CIA’s work, too. Here, what look like small differences in probability can sometimes be very meaningful. For example, imagine that it’s August 2001 and you have three different assessments of the risk of a major terrorist attack on U.S. soil in the next few months. One pegs the risk at 1 in 1,000; another at 1 in 100; and another at 1 in 10. Using Kent’s table, all three of those assessments would get translated into a statement that the event is “almost certainly not” going to happen, but I imagine that most U.S. decision-makers would have felt very differently about risks of 0.1%, 1%, and 10% with a threat of that kind.
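To make that collapse concrete, here is a minimal Python sketch of a words-of-estimative-probability lookup. The cut points in BINS and the helper name to_phrase are assumptions chosen for illustration; they are loosely patterned on Kent’s phrases and are not his exact figures. The point is only that a coarse verbal scale can assign one phrase to risks that differ by two orders of magnitude.

```python
# Illustrative cut points only: these are assumptions for this sketch,
# NOT the exact ranges from Kent's table.
# Each pair is (upper bound of the bin, phrase).
BINS = [
    (0.15, "almost certainly not"),
    (0.40, "probably not"),
    (0.60, "chances about even"),
    (0.85, "probable"),
    (1.00, "almost certain"),
]


def to_phrase(p):
    """Return the phrase for the first bin whose upper bound is >= p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    for upper, phrase in BINS:
        if p <= upper:
            return phrase
    raise AssertionError("unreachable: the last bin ends at 1.0")


# The three hypothetical August 2001 assessments from the text.
assessments = [0.001, 0.01, 0.10]
phrases = [to_phrase(p) for p in assessments]

# Number of distinct phrases the three assessments map to.
distinct_phrases = set(phrases)

for p, phrase in zip(assessments, phrases):
    print(f"{p:.1%} -> {phrase}")
```

Under these assumed cut points, the three assessments all land in the lowest bin, so a reader who sees only the phrase cannot tell a 1-in-1,000 risk from a 1-in-10 risk.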

There are lots of rare but important events that inhabit this corner of the probability space: nuclear accidents, extreme weather events, medical treatments, and mass atrocities, to name a few. We could create a separate lexicon for assessments in these areas, as the European Medicines Agency has done for adverse reactions to medical therapies (here, via NPR). I worry, though, that we ask too much of consumers of these and other forecasts if we expect them to remember multiple lexicons and to correctly code-switch between them. We also know that the relevant scale will differ across audiences, even on the same topic. For example, an individual patient considering a medical treatment might not care much about the difference between a mortality risk of 1 in 1,000 and 1 in 10,000, but a drug company and the regulators charged with overseeing it hopefully do.

If there’s a general lesson here, it’s that producers of probabilistic forecasts should think carefully about how best to convey their estimates to specific audiences. In practice, that means thinking about the nature of the decision processes those forecasts are meant to inform and, if possible, trying different approaches and checking to find out how each is being understood. Ideally, consumers of those forecasts should also take time to try to educate themselves on what they’re getting. I’m not optimistic that many will do that, but we should at least make it as easy as possible for them to do so.
