Big Data Doesn’t Automatically Produce Better Predictions

At FiveThirtyEight, Neil Payne and Rob Arthur report on an intriguing puzzle:

In an age of unprecedented baseball data, we somehow appear to be getting worse at knowing which teams are — and will be — good.

Player-level predictions are as good if not better than they used to be, but team-level predictions of performance are getting worse. Payne and Arthur aren’t sure why, but they rank a couple of trends in the industry — significant changes in the age structure of the league’s players and, ironically, the increased use of predictive analytics in team management — among the likely culprits.

This story nicely illustrates a fact that breathless discussions of the power of “Big Data” often elide: more and better data don’t automatically lead to more accurate predictions. Observation and prediction are interrelated, but the latter does not move in lock step with the former. At least two things can weaken the link between those two steps in the analytical process.

First, some phenomena are just inherently difficult or impossible to predict with much accuracy. That’s not entirely true of baseball; as Payne and Arthur show, team-level performance predictions have been pretty good in the past. It is true of many other phenomena or systems, however. Take earthquakes; we can now detect and record these events with tremendous precision, but we’re still pretty lousy at anticipating when they’ll occur and how strong they will be. So far, better observation hasn’t led to big gains in prediction.

Second, the systems we’re observing sometimes change, even as we get better at observing them. This is what Payne and Arthur imply is occurring in baseball when they identify trends in the industry as likely explanations for a decline in the predictive power of models derived from historical data. It’s like trying to develop a cure for a disease that’s evolving rapidly as you work on it; the cure you develop in the lab might work great on the last version you captured, but by the time you deploy it, the disease has evolved further, and the treatment doesn’t have much effect.

I wonder if this is also the trajectory social science will follow over the next few decades. Right now, we’re getting hit by the leading edge of what will probably be a large and sustained flood tide of new data on human behavior.  That inflow is producing some rather optimistic statements about how predictable human behavior in general, and sometimes politics in particular, will become as we discover deeper patterns in those data.

I don’t share that confidence. A lot of human behavior is predictably routine, and a torrent of new behavioral data will almost certainly make us even better at predicting these actions and events. For better or for worse, though, those routines are not especially interesting or important to most political scientists. Political scientists are more inclined to focus on “high” politics, which remains fairly opaque, or on system-level outcomes like wars and revolutions that emerge from changes in individual-level behavior in non-obvious ways. I suspect we’ll get a little better at predicting these things as we accumulate richer data on various parts of those systems, but I am pretty sure we won’t ever get great at it. The processes are too complex, and the systems themselves are constantly evolving, maybe even at an accelerating rate.

Interactive 2015 NFL Forecasts

As promised in my last post, I’ve now built and deployed a web app that lets you poke through my preseason forecasts for the 2015 NFL regular season:

2015 NFL Forecasts

I learned several new tricks in the course of generating these forecasts and building this app, so the exercise served its didactic purpose. (You can find the code for the app here, on GitHub.) I also got lucky with the release of a new R package that solved a crucial problem I was having when I started to work on this project a couple of weeks ago. Open source software can be a wonderful thing.

The forecasts posted right now are based on results of the pairwise wiki survey through the morning of Monday, August 17. At that point, the survey had already logged upwards of 12,000 votes, triple the number cast in last year’s edition. This time around, I posted a link to the survey on the r/nfl subreddit, and that post produced a brief torrent of activity from what I hope was a relatively well-informed crowd.

The regular season doesn’t start until September, and I will update these forecasts at least once more before that happens. With so many votes already cast, though, the results will only change significantly if a) a large number of new votes are cast and b) those new votes differ substantially from the ones already cast, and it’s unlikely that both of those things will happen.

One thing these forecasts help to illustrate is how noisy a game professional football is. By noisy, I mean hard to predict with precision. Even in games where one team is much stronger than the other, we still see tremendous variance in the simulated net scores and the associated outcomes. Heavy underdogs will win big every once in a while, and games we’d consider close when watching can produce a wide range of net scores.

Take, for example, the week 1 match-up between the Bears and Packers. Even though Chicago’s the home team, the simulation results (below) favor Green Bay by more than eight points. At the same time, those simulations also include a smattering of outcomes in which the Bears win by multiple touchdowns, and the distribution of simulated outcomes is fairly broad and flat around its peak. Some of that variance results from the many imperfections of the model and survey scores, but a lot of it is baked into the game, and plots of the predictive simulations nicely illustrate that noisiness.

[Figure: distribution of simulated net scores for the week 1 Packers at Bears game]

The big thing that’s still missing from these forecasts is updating during the season. The statistical model that generates the predictive simulations takes just two inputs for each game — the difference between the two teams’ strength scores and the name of the home team — and, barring catastrophe, only one of those inputs can change as the season passes. I could leave the wiki survey running throughout the season, but the model that turns survey votes into scores doesn’t differentiate between recent and older votes, so updating the forecasts with the latest survey scores is unlikely to move the needle by much.*

I’m now hoping to use this problem as an entry point to learning about Bayesian updating and how to program it in R. Instead of updating the actual survey scores, we could treat the preseason scores as priors and then use observed game scores or outcomes to sequentially update estimates of them. I haven’t figured out how to implement this idea yet, but I’m working on it and will report back if I do.
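To make that idea a bit more concrete, here is a minimal sketch of one way the updating could work, using a conjugate normal-normal model of team strength. The function, the priors, and all the numbers below are illustrative assumptions, not the forecasting code I actually use:

```r
# Sketch only: treat a team's preseason strength (on a net-points scale) as a
# normal prior and update it after each game with the observed point differential.
# A fuller version would also adjust for the opponent's current estimate.
update_strength <- function(prior_mean, prior_var, observed, obs_var = 13^2) {
  # Conjugate normal-normal update: posterior precision is the sum of the
  # precisions; posterior mean is the precision-weighted average.
  post_var  <- 1 / (1 / prior_var + 1 / obs_var)
  post_mean <- post_var * (prior_mean / prior_var + observed / obs_var)
  c(mean = post_mean, var = post_var)
}

# Hypothetical example: a team with preseason strength +3 beats an average
# opponent by 10 points
update_strength(prior_mean = 3, prior_var = 25, observed = 10)
```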

* The pairwise wiki survey runs on open source software, and I can imagine modifying the instrument to give more weight to recent votes than older ones. Right now, I don’t have the programming skills to make those modifications, but I’m still hoping to find someone who might want to work with me, or just take it upon himself or herself, to do this.

Yes, By Golly, I Am Ready for Some Football

The NFL’s 2015 season sort of got underway last night with the Hall of Fame Game. Real preseason play doesn’t start until this weekend, and the first kickoff of the regular season is still a month away.

No matter, though — I’m taking the Hall of Fame Game as my cue to launch this year’s football forecasting effort. As it has for the past two years (see here and here), the process starts with me asking you to help assess the strength of this year’s teams by voting in a pairwise wiki survey:

In the 2015 NFL season, which team will be better?

That survey produces scores on a scale of 0–100. Those scores will become the crucial inputs into simulations based on a simple statistical model estimated from the past two years’ worth of survey data and game results. Using an R function I wrote, I’ve determined that I should be able to improve the accuracy of my forecasts a bit this year by basing them on a mixed-effects model with random intercepts to account for variation in home-team advantages across the league. Having another season’s worth of predicted and actual outcomes should help, too; with two years on the books, my model-training sample has doubled.
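For anyone curious what that kind of model looks like in code, here is a hedged sketch using the lme4 package. The data frame, column names, and coefficients are stand-ins for my actual training data, not the real thing:

```r
library(lme4)

# Fake stand-in for two seasons of games: net_score = home score minus visitor
# score, strength_diff = home survey score minus visitor survey score
set.seed(1)
games <- data.frame(
  home_team     = sample(paste("Team", 1:32), 512, replace = TRUE),
  strength_diff = runif(512, -50, 50)
)
games$net_score <- 2 + 0.2 * games$strength_diff + rnorm(512, sd = 13)

# Mixed-effects model with a random intercept for the home team, so the size of
# home-field advantage is allowed to vary across the league
fit <- lmer(net_score ~ strength_diff + (1 | home_team), data = games)
summary(fit)
```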

An improvement in accuracy would be great, but I’m also excited about using RStudio’s Shiny to build a web page that will let you explore the forecasts at a few levels: by game, by team, and by week. Here’s a screenshot of the game-level tab from a working version using the 2014 data. It plots the distribution of the net scores (home – visitor) from the 1,000 simulations, and it reports win probabilities for both teams and a line (the median of the simulated scores).

[Screenshot: game-level tab of the forecast app, using 2014 data]

The “By team” tab lets you pick a team to see a plot of the forecasts for all 16 of their games, along with their predicted wins (count of games with win probabilities over 0.5) and expected wins (sum of win probabilities for all games) for the year. The “By week” tab (shown below) lets you pick a week to see the forecasts for all the games happening in that slice of the season. Before the season starts, I plan to add annotations to the plot reporting the lines those forecasts imply (e.g., Texans by 7).

[Screenshot: week-level tab of the forecast app, using 2014 data]
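For readers who haven’t worked with Shiny, here is a bare-bones sketch of how an app like this hangs together. Everything in it, from the object names to the example data, is a placeholder rather than the app’s real code:

```r
library(shiny)
library(ggplot2)

# Hypothetical stand-in for the predictive simulations: 1,000 simulated net
# scores (home minus visitor) for a single game
sims <- data.frame(
  week      = 1,
  home      = "Bears",
  visitor   = "Packers",
  net_score = round(rnorm(1000, mean = -8, sd = 13))
)
sims$game <- paste(sims$visitor, "at", sims$home)

ui <- fluidPage(
  titlePanel("2015 NFL Forecasts (sketch)"),
  tabsetPanel(
    tabPanel("By game",
             selectInput("game", "Game", choices = unique(sims$game)),
             plotOutput("gamePlot")),
    tabPanel("By team", "..."),
    tabPanel("By week", "...")
  )
)

server <- function(input, output) {
  output$gamePlot <- renderPlot({
    d <- sims[sims$game == input$game, ]
    ggplot(d, aes(x = net_score)) +
      geom_histogram(binwidth = 3) +
      labs(x = "Simulated net score (home - visitor)", y = "Count",
           title = sprintf("Home-team win probability: %.2f",
                           mean(d$net_score > 0)))
  })
}

shinyApp(ui, server)
```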

Of course, the quality of the forecasts displayed in that app will depend heavily on participation in the wiki survey. Without a diverse and well-informed set of voters, it will be hard to do much better than guessing that each team will do as well this year as it did last year. So, please vote here; please share this post or the survey link with friends and family who know something about pro football; and please check back in a few weeks for the results.

Finding the Right Statistic

Earlier this week, Think Progress reported that at least five black women have died in police custody in the United States since mid-July. The author of that post, Carimah Townes, wrote that those deaths “[shine] an even brighter spotlight on the plight of black women in the criminal justice system and [fuel] the Black Lives Matter movement.” I saw the story on Facebook, where the friend who posted it inferred that “a disproportionate percentage of those who died in jail are from certain ethnic minorities.”

As a citizen, I strongly support efforts to draw attention to implicit and explicit racism in the U.S. criminal justice system, and in the laws that system is supposed to enforce. The inequality of American justice across racial and ethnic groups is a matter of fact, not opinion, and its personal and social costs are steep.

As a social scientist, though, I wondered how much the number in that Think Progress post — five — really tells us. To infer bias, we need to make comparisons to other groups. How many white women died in police custody during that same time? What about black men and white men? And so on for other subsets of interest.

Answering those questions would still get us only partway there, however. To make the comparisons fair, we would also need to know how many people from each of those groups passed through police custody during that time. In epidemiological jargon, what we want are incidence rates for each group: the number of cases from some period divided by the size of the population during that period. Here, cases are deaths, and the population of interest is the number of people from that group who spent time in police custody.
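A toy example with entirely hypothetical numbers shows why the denominator matters: the group with more deaths in absolute terms can still have the lower incidence rate if far more of its members passed through custody.

```r
# Hypothetical counts for two groups, A and B
deaths     <- c(A = 5,    B = 12)
population <- c(A = 2000, B = 15000)  # people who passed through custody

# Incidence rate per 100,000: cases divided by population at risk
rate_per_100k <- deaths / population * 100000
rate_per_100k
#>   A   B
#> 250  80
```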

I don’t have those data for the United States for the second half of July, and I doubt that they exist in aggregate at this point. What we do have now, however, is a U.S. Department of Justice report from October 2014 on mortality in local jails and state prisons (PDF). This isn’t exactly what we’re after, but it’s close.

So what do those data say? Here’s an excerpt from Table 6, which reports the “mortality rate per 100,000 local jail inmates by selected decedent characteristics, 2000–2012”:

                    2008     2009     2010     2011     2012
By Sex
Male                 123      129      125      123      129
Female               120      120      124      122      123

Race/Hispanic Origin
White                185      202      202      212      220
Black/Afr. Am.       109      100      102       94      109
Hispanic/Latino       70       71       58       67       60
Other                 41       53       36       28       31

Given what we know about the inequality of American justice, these figures surprised me. According to data assembled by the DOJ, the mortality rate of blacks in local jails in those recent years was about half the rate for whites. For Latinos, it was about one-third the rate for whites.

That table got me wondering why those rates were so different from what I’d expected. Table 8 in the same report offers some clues. It provides death rates by cause for each of those same subgroups for the whole 13-year period. According to that table, white inmates committed suicide in local jails at a much higher rate than blacks and Latinos: 80 per 100,000 versus 14 and 25, respectively. Those figures jibe with ones on suicide rates for the general population. White inmates also died from heart disease and drug and alcohol intoxication at a higher rate than their black and Latino counterparts. In short, it looks like whites are more likely than blacks or Latinos to die while in local jails, mostly because they are much more likely to commit suicide there.

These statistics tell us nothing about whether or not racism or malfeasance played a role in the deaths of any of those five black women mentioned in the Think Progress post. They also provide a woefully incomplete picture of the treatment of different racial and ethnic groups by police and the U.S. criminal justice system. For example, and as FiveThirtyEight reported just a few days ago, DOJ statistics also show that the rate of arrest-related deaths by homicide is almost twice as high for blacks as whites — 3.4 per 100,000 compared to 1.8. In many parts of the U.S., blacks convicted of murder are more likely than their white counterparts to get the death penalty, even when controlling for similarities in the crimes involved and especially when the victims were white (see here). A 2013 Pew Research Center study found that, in 2010, black men were six times as likely as white men to be incarcerated in federal, state, and local jails.

Bearing all of that in mind, what I hope those figures do is serve as a simple reminder that, when mustering evidence of a pattern, it’s important to consider the right statistic for the question. Raw counts will rarely be that statistic. If we want to make comparisons across groups, we need to think about differences in group size and other factors that might affect group exposure, too.

Be Vewy, Vewy Quiet

This blog has gone relatively quiet of late, and it will probably stay that way for a while. That’s partly a function of my personal life, but it also reflects a conscious decision to spend more time improving my abilities as a programmer.

I want to get better at scraping, making, munging, summarizing, visualizing, and analyzing data. So, instead of contemplating world affairs, I’ve been starting to learn Python; using questions on Stack Overflow as practice problems for R; writing scripts that force me to expand my programming skills; and building Shiny apps that put those skills to work. Here’s a screenshot of one app I’ve made—yes, it actually works—that interactively visualizes ACLED’s latest data on violence against civilians in Africa, based partly on this script for scraping ACLED’s website:

[Screenshot: Shiny app visualizing ACLED data on violence against civilians in Africa]

When I started on this kick, I didn’t plan to stop writing blog posts about international affairs. As I’ve gotten into it, though, I’ve found that my curiosity about current events has ebbed, and the pilot light for my writing brain has gone out. Normally, writing ideas flare up throughout the day, but especially in the early morning. Lately, I wake up thinking about the coding problems I’m stuck on.

I think it’s a matter of attention, not interest. Programming depends on the tiniest details. All those details quickly clog the brain’s RAM, leaving no room for the unconscious associations that form the kernels of new prose. That clogging happens even faster when other parts of your life are busy, stressful, or off kilter, as they are for many of us, and as they are for me right now.

That’s what I think, anyway. Whatever the cause, though, I know that I’m rarely feeling the impulse to write, and I know that shift has sharply slowed the pace of publishing here. I’m leaving the channel open and hope I can find the mental and temporal space to keep using it, but who knows what tomorrow may bring?

ACLED in R

The Armed Conflict Location & Event Data Project, a.k.a. ACLED, produces up-to-date event data on certain kinds of political conflict in Africa and, as of 2015, parts of Asia. In this post, I’m not going to dwell on the project’s sources and methods, which you can read about on ACLED’s About page, in the 2010 journal article that introduced the project, or in the project’s user’s guides. Nor am I going to dwell on the necessity of using all political event data sets, including ACLED, with care—understanding the sources of bias in how they observe events and error in how they code them and interpreting (or, in extreme cases, ignoring) the resulting statistics accordingly.

Instead, my only aim here is to share an R script I’ve written that largely automates the process of downloading and merging ACLED’s historical and current Africa data and then creates a new data frame with counts of events by type at the country-month level. If you use ACLED in R, this script might save you some time and some space on your hard drive.

You can find the R script on GitHub, here.

The chief problem with this script is that the URLs and file names of ACLED’s historical and current data sets change with every update, so the code will need to be modified each time that happens. If the names were modular and the changes to them predictable, it would be easy to rewrite the code to keep up with those changes automatically. Unfortunately, they aren’t, so the best I can do for now is to give step-by-step instructions in comments embedded in the script on how to update the relevant four fields by hand. As long as the basic structure of the .csv files posted by ACLED doesn’t change, though, the rest should keep working.

[UPDATE: I revised the script so it will scrape the link addresses from the ACLED website and parse the file names from them. The new version worked after ACLED updated its real-time file earlier today, when the old version would have broken. Unless ACLED changes its file-naming conventions or the structure of its website, the new version should work for the rest of 2015. In case it does fail, instructions on how to hard-code a workaround are included as comments at the bottom of the script.]

It should also be easy to adapt the part of the script that generates country-month event counts to slice the data even more finely, or to count by something other than event type. To do that, you would just need to add variables to the group_by() part of the block of code that produces the object ACLED.cm. For example, if you wanted to get counts of events by type at the level of the state or province, you would revise that line to read group_by(gwno, admin1, year, month, event_type). Or, if you wanted country-month counts of events by the type(s) of actor involved, you could use group_by(gwno, year, month, interaction) and then see this user’s guide to decipher those codes. You get the drift.
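As a sketch of that aggregation step, here is roughly what the relevant block of code does, with a tiny mock data frame standing in for the real merged ACLED table:

```r
library(dplyr)

# Mock stand-in for the merged event-level ACLED data; the real object has
# these columns (and many more), with one row per recorded event
ACLED <- data.frame(
  gwno       = c(432, 432, 501),
  admin1     = c("Kidal", "Kidal", "Nairobi"),
  year       = c(2015, 2015, 2015),
  month      = c(6, 6, 6),
  event_type = c("Battle-No change of territory", "Riots/Protests", "Riots/Protests")
)

# Country-month counts of events by type
ACLED.cm <- ACLED %>%
  group_by(gwno, year, month, event_type) %>%
  summarise(events = n())

# Slicing more finely just means adding variables to group_by(), e.g. by
# first-level administrative unit
ACLED.cm.admin1 <- ACLED %>%
  group_by(gwno, admin1, year, month, event_type) %>%
  summarise(events = n())
```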

The script also shows a couple of examples of how to use ‘ggplot2’ to generate time-series plots of those monthly counts. Here’s one I made of monthly counts of battle events by country for the entire period covered by ACLED as of this writing: January 1997–June 2015. A production-ready version of this plot would require some more tinkering with the size of the country names and the labeling of the x-axis, but this kind of small-multiples chart offers a nice way to explore the data before analysis.
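Here is a stripped-down sketch of that kind of small-multiples plot, with made-up monthly counts standing in for the real country-month data:

```r
library(ggplot2)

# Hypothetical monthly battle counts for two countries
set.seed(1)
dat <- expand.grid(
  country = c("Somalia", "Nigeria"),
  date    = seq(as.Date("1997-01-01"), as.Date("2015-06-01"), by = "month")
)
dat$battles <- rpois(nrow(dat), lambda = 10)

# One panel per country, monthly counts over time
ggplot(dat, aes(x = date, y = battles)) +
  geom_line() +
  facet_wrap(~ country) +
  labs(x = "", y = "Battle events per month")
```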

[Figure: monthly counts of battle events by country, January 1997–June 2015]

If you use the script and find flaws in it or have ideas on how to make it work better or do more, please email me at ulfelder <at> gmail <dot> com.

The Dilemma of Getting to Civilian Control

A country can’t really qualify as a democracy without civilian control of its own security forces, but the logic that makes that statement true also makes civilian control hard to achieve, as events in Burkina Faso are currently reminding us.

The essential principle of democracy is popular sovereignty. Civilian control is fundamental to democracy because popular sovereignty requires that elected officials rule, but leaders of security forces—military and police—are not elected. Neither are the leaders of many other influential organizations, of course, but security forces occupy a special status in national affairs by virtue of their particular skills. To perform their designated roles, national rulers must determine and try to implement policies involving the collection of revenue and the production of public goods, including security. To do that, rulers need to wield the threat of coercion, and security forces supply that threat.

That necessity creates a dependency, and that dependency conveys power. In principle—and, historically, often in practice—leaders of security forces can use that dependency as leverage to bargain for bigger budgets or other policies they prefer for parochial reasons. Because those leaders are not held accountable to the public through elections, strong versions of that bargaining contravene the principle of popular sovereignty. Of course, security forces’ specific skills also make them uniquely capable of claiming political authority for themselves at any time. Military leaders rarely flex that muscle, but the threat of a coup only enhances their bargaining power with elected rulers, and thus further constrains popular sovereignty.

This logic implies that democracy only really obtains when state security forces reliably subordinate themselves to the authority of those elected civilian rulers. That arrangement seems clear in principle, but it turns out to be hard to achieve in practice. The problem is that the transition to civilian control demands that security forces concede their power. Organizations of all kinds are rarely excited about doing that, but it is especially hard for rulers to compel security forces to do it, because those forces are the stick those rulers would normally wield in that act of compellence. When pushed, military and police leaders can ask “Or what?” and civilian rulers will rarely have a strong answer. Under those circumstances, attempts to force the issue may have the opposite of their intended effect, provoking security forces into seizing power for themselves as a way to dispatch the civilian threat to their established position.

In short, the problem of getting to civilian control confronts civilian rulers with a dilemma: assert their authority and risk provoking a hard coup, or tolerate security forces’ continuing political power and accept what amounts to a perpetual soft coup.

This dilemma is bedeviling politics in Burkina Faso right now. Last fall, mass demonstrations in Burkina Faso triggered a military coup that toppled longtime autocratic ruler Blaise Compaoré. Under domestic and international pressure, the ensuing junta established a transitional process that is supposed to lead to democratic civilian rule after general elections on October 11, 2015.

Gen. Honore Nabere Traore leads an October 2014 press conference announcing that he would serve as president following Blaise Compaore’s apparent resignation. Traore was promptly supplanted by Lt. Col. Isaac Zida, who in November 2014 stepped aside for a civilian interim president, who then appointed Zida to the post of interim prime minister. (Photo: Theo Renault/Associated Press)

On paper, a civilian now rules Burkina Faso as interim president, but attempts to clarify the extent of the interim government’s power, and to circumscribe the role of certain security organs in Burkinabe politics, are generating the expected friction and heat. Here is how Alex Thurston described the situation on his Sahel Blog:

In recent weeks, NGOs and media outlets have buzzed with discussions of tension between the Presidential Security Regiment (RSP) and Prime Minister Yacouba Isaac Zida, a conflict that could, at worst, derail the transition. Although both Zida and Compaore belonged to the RSP in the past, the elite unit has reasons to fear that it will be disbanded and punished: in December, Zida called for its dismantling, and in February, a political crisis unfolded when Zida attempted to reshuffle the RSP’s officer corps (French).

The most recent crisis (French) involves suspicions in some quarters of the government that the RSP was planning to arrest Zida upon his return from a trip to Taiwan – suspicions that were serious enough to make Zida land at a military base instead of at the airport as planned (French). On June 29, the day after Zida got home, gendarmes in the capital questioned three RSP officers, including Lieutenant Colonel Céleste Coulibaly, about their involvement in the suspected plot. That evening, shots were heard coming from the RSP’s barracks, which sits behind the presidential palace. Rumors then spread that Zida was resigning under RSP pressure, but he quickly stated that he was not stepping down.

These incidents have passed without bloodshed, but they have raised fears of an RSP-led coup. For its part, the RSP says (French) that there are no plots, but that it wants Zida and other military officers, such as Minister of Territorial Administration and Security Auguste Barry, to leave the government (French). Both sides accuse the other of seeking to undermine the planned transition. Many observers now look to interim President Michel Kafando to mediate (French) between the parties.

In a recent briefing, the International Crisis Group (ICG) surveyed that landscape and argued in favor of deferring any clear decisions on the RSP’s status until after the elections. Thurston sympathizes with ICG’s view but worries that deferral of those decisions will produce “an atmosphere of impunity.” History says that Thurston is right to worry, but so is ICG. In other words, there are no obvious ways to climb down from the horns of this dilemma.

How Likely Is (Nuclear) War Between the United States and Russia?

Last week, Vox ran a long piece by Max Fisher claiming that “the prospect of a major war, even a nuclear war, in Europe has become thinkable, [experts] warn, even plausible.” Without ever clarifying what “thinkable” or “plausible” mean in this context, Fisher seems to be arguing that a nuclear war between the United States and Russia, while still unlikely, is no longer a remote possibility, and that its probability is rising.

I finished Fisher’s piece and wondered: Is that true? As someone who’s worked on a couple of projects (here and here) that use “wisdom of crowds” methods to make educated guesses about how likely various geopolitical events are, I know that one way to try to answer that question is to ask a bunch of informed people for their best estimates and then average them.

So, on Thursday morning, I went to SurveyMonkey and set up a two-question survey that asks respondents to assess the likelihood of war between the United States and Russia before 2020 and, if war were to happen, the likelihood that one or both sides would use nuclear weapons. To elicit responses, I tweeted the link once and posted it to the Conflict Research Group on Facebook and the IRstudies subreddit. The survey is still running [UPDATE: It’s now closed, because SurveyMonkey won’t show me more than the first 100 responses without a paid subscription], but 100 people have taken it so far, and here are the results—first, on the risk of war:

[Figure: survey responses on the likelihood of war between the United States and Russia before 2020]

And then on the risk that one or both sides would use nuclear weapons, conditional on the occurrence of war:

[Figure: survey responses on the likelihood of nuclear use, conditional on war]

These results come from a convenience sample, so we shouldn’t put too much stock in them. Still, my confidence in their reliability got a boost when I learned yesterday that a recent survey of international-relations experts around the world asked an almost-identical question about the risk of a war and obtained similar results. In its 2014 survey, the TRIP project asked: “How likely is war between the United States and Russia over the next decade? Please use the 0–10 scale with 10 indicating that war will definitely occur.” They got 2,040 valid responses to that question, and here’s how they were distributed:

[Figure: distribution of TRIP survey responses on the likelihood of US–Russia war over the next decade]

Those results are centered a little further to the right than the ones from my survey, but TRIP asked about a longer time period (“next decade” vs. “before 2020”), and those additional five years could explain the difference. It’s also important to note that the scales aren’t directly comparable; where the TRIP survey’s bins implicitly lie on a linear scale, mine were labeled to give respondents more options toward the extremes (e.g., “Certainly not” and “Almost certainly not”).

In light of that corroborating evidence, let’s assume for the moment that the responses to my survey are not junk. So then, how likely is a US/Russia war in the next several years, and how likely is it that such a war would go nuclear if it happened? To get to estimated probabilities of those events, I did two things:

  1. Assuming that the likelihoods implicit in my survey’s labels follow a logistic curve, I converted them to predicted probabilities as follows: p(war) = exp(response – 5)/(1 + exp(response – 5)). That rule produces the following sequence for the 0–10 bins: 0.007, 0.018, 0.047, 0.119, 0.269, 0.500, 0.731, 0.881, 0.953, 0.982, 0.993.

  2. I calculated the unweighted average of those predicted probabilities.

Here are the estimates that process produced, rounded up to the nearest whole percentage point:

  • Probability of war: 11%
  • Probability that one or both sides will use nuclear weapons, conditional on war: 18%

To translate those figures into a single number representing the crowd’s estimate of the probability of nuclear war between the US and Russia before 2020, we take their product: 2%.
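For anyone who wants to check the arithmetic, here is a small R sketch of that conversion and aggregation. The response vectors are made-up stand-ins for the actual survey data, so the final number won’t match the 2% reported above:

```r
# Map a 0-10 response to a probability with a logistic curve centered at 5
to_prob <- function(response) exp(response - 5) / (1 + exp(response - 5))

round(to_prob(0:10), 3)
#> 0.007 0.018 0.047 0.119 0.269 0.500 0.731 0.881 0.953 0.982 0.993

# Hypothetical response vectors standing in for the survey results
war_responses  <- c(0, 1, 1, 2, 3, 0, 2, 4)
nuke_responses <- c(1, 2, 0, 3, 2, 1)

p_war  <- mean(to_prob(war_responses))   # unweighted average
p_nuke <- mean(to_prob(nuke_responses))

# Crowd estimate of p(nuclear war before 2020) = p(war) * p(nukes | war)
p_war * p_nuke
```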

Is that number different from what Max Fisher had in mind when he wrote that a nuclear war between the US and Russia is now “thinkable,” “plausible,” and “more likely than you think”? I don’t know. To me, “thinkable” and “plausible” seem about as specific as “possible,” a descriptor that applies to almost any geopolitical event you can imagine. I think Max’s chief concern in writing that piece was to draw attention to a risk that he believes to be dangerously under-appreciated, but it would be nice if he had asked his sources to be more specific about just how likely they think this calamity is.

More important, is that estimate “true”? As Ralph Atkins argued in a recent Financial Times piece about estimating the odds of Grexit, it’s impossible to say. For unprecedented and at least partially unique events like these—an exit from the euro zone, or a nuclear war between major powers—we can never know the event-generating process well enough to estimate their probabilities with high confidence. What we get instead are summaries of peoples’ current beliefs about those events’ likelihood. That’s highly imperfect, but it’s still informative in its own way.

2015 Tour de France Predictions

I like to ride bikes, I like to watch the pros race their bikes, and I make forecasts for a living, so I thought it would be fun to try to predict the outcome of this year’s Tour de France, which starts this Saturday and ends on July 26. I’m also interested in continuing to explore the predictive power of pairwise wiki surveys, a crowdsourcing tool that I’ve previously used to try to forecast mass-killing onsets, coup attempts, and pro football games, and that ESPN recently used to rank NBA draft prospects.

So, a couple of weeks ago, I used All Our Ideas to create a survey that asks, “Which rider is more likely to win the 2015 Tour de France?” I seeded the survey with the names of 11 riders—the 10 seen by bookmakers at Paddy Power as the most likely winners, plus Peter Sagan because he’s fun to watch—posted a link to the survey on Tumblr, and trolled for respondents on Twitter and Facebook. The survey got off to a slow start, but then someone posted a link to it in the r/cycling subreddit, and the votes came pouring in. As of this afternoon, the survey had garnered more than 4,000 votes in 181 unique user sessions that came from five continents (see the map below). The crowd also added a handful of other riders to the set under consideration, bringing the list up to 16.

[Figure: map of wiki survey voting sessions around the world]

So how does that self-selected crowd handicap the race? The dot plot below shows the riders in descending order by their survey scores, which range from 0 to 100 and indicate the probability that that rider would beat a randomly chosen other rider for a randomly chosen respondent. In contrast to Paddy Power, which currently shows Chris Froome as the clear favorite and gives Nairo Quintana a slight edge over Alberto Contador, this survey sees Contador as the most likely winner (survey score of 90), followed closely by Froome (87) and a little further by Quintana (80). Both sources put Vincenzo Nibali as fourth likeliest (73) and Tejay van Garderen (65) and Thibaut Pinot (51) in the next two spots, although Paddy Power has them in the opposite order. Below that, the distances between riders’ chances get smaller, but the wiki survey’s results still approximate the handicapping of the real-money markets pretty well.

[Figure: dot plot of riders’ wiki survey scores]

There are at least a couple of ways to try to squeeze some meaning out of those scores. One is to read the chart as a predicted finishing order for the 16 riders listed. That’s useful for something like a bike race, where we—well, some of us, anyway—care not only about who wins but also about where the other riders finish.

We can also try to convert those scores to predicted probabilities of winning. The chart below shows what happens when we do that by dividing each rider’s score by the sum of all scores and then multiplying the result by 100. The probabilities this produces are all pretty low and more tightly bunched than seems reasonable, but I’m not sure how else to do this conversion. I tried squaring and cubing the scores; the results came closer to what the betting-market odds suggest are the “right” values, but I couldn’t think of a principled reason to do that, so I’m not showing those here. If you know a better way to get from those model scores to well-calibrated win probabilities, please let me know in the comments.
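Here is the gist of that conversion in R, using the scores mentioned above for the top six riders. The real calculation uses all 16 riders, so the actual probabilities are lower than what this subset produces:

```r
# Survey scores for the top six riders (subset of the full 16-rider list)
scores <- c(Contador = 90, Froome = 87, Quintana = 80,
            Nibali = 73, vanGarderen = 65, Pinot = 51)

# Naive conversion: each rider's share of the total score, as a percentage
win_prob <- scores / sum(scores) * 100
round(win_prob, 1)

# Squaring the scores before normalizing spreads the probabilities out more,
# closer to the betting-market odds, but I can't think of a principled reason
# to prefer it
round(scores^2 / sum(scores^2) * 100, 1)
```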

[Figure: riders’ predicted win probabilities derived from the survey scores]

So that’s what the survey says. After the Tour concludes in a few weeks, I’ll report back on how the survey’s predictions fared. Meanwhile, here’s wishing the athletes a crash-, injury-, and drug-free Tour. Judging by the other big races I’ve seen so far this year, it should be a great one to watch.

The Birth of Crowdsourcing?

From p. 106 of the first paperback edition of The Professor and the Madman, a slightly overwrought but enjoyable history of the origins of the Oxford English Dictionary, found on the shelf of a vacation rental:

The new venture that [Richard Chenevix] Trench seemed now to be proposing would demonstrate not merely the meaning but the history of meaning, the life story of each word. And that would mean the reading of everything and the quoting of everything that showed anything of the history of the words that were to be cited. The task would be gigantic, monumental, and—according to the conventional thinking of the times—impossible.

Except that here Trench presented an idea, an idea that—to those ranks of conservative and frock-coated men who sat silently in the [London Library] on that dank and foggy evening [in 1857]—was potentially dangerous and revolutionary. But it was the idea that in the end made the whole venture possible.

The undertaking of the scheme, he said, was beyond the ability of any one man. To peruse all of English literature—and to comb the London and New York newspapers and the most literate of the magazines and journals—must be instead “the combined action of many.” It would be necessary to recruit a team—moreover, a huge one—probably comprising hundreds and hundreds of unpaid amateurs, all of them working as volunteers.

The audience murmured with surprise. Such an idea, obvious though it may sound today, had never been put forward before. But then, some members said as the meeting was breaking up, it did have some real merit.

And here’s what that crowdsourcing process ended up looking like in practice:

[Frederick] Furnivall then issued a circular calling for volunteer readers. They could select from which period of history they would like to read books—from 1250 to 1526, the year of the New English Testament; from then to 1674, the year when Milton died; or from 1674 to what was then the present day. Each period, it was felt, represented the existence of different trends in the development of the language.

The volunteers’ duties were simple enough, if onerous. They would write to the society offering their services in reading certain books; they would be asked to read and make word-lists of all that they read, and would then be asked to look, super-specifically, for certain words that currently interested the dictionary team. Each volunteer would take a slip of paper, write at its top left-hand side the target word, and below, also on the left, the date of the details that followed: These were, in order, the title of the book or paper, its volume and page number, and then, below that, the full sentence that illustrated the use of the target word. It was a technique that has been undertaken by lexicographers to the present day.

Herbert Coleridge became the first editor of what was to be called A New English Dictionary on Historical Principles. He undertook as his first task what may seem prosaic in the extreme: the design of a small stack of oak-board pigeonholes, nine holes wide and six high, which could accommodate the anticipated sixty to one hundred thousand slips of paper that would come in from the volunteers. He estimated that the first volume of the dictionary would be available to the world within two years. “And were it not for the dilatoriness of many contributors,” he wrote, clearly in a tetchy mood, “I should not hesitate to name an earlier period.”

Everything about these forecasts was magnificently wrong. In the end more than six million slips of paper came in from the volunteers; and Coleridge’s dreamy estimate that it might take two years to have the first salable section of the dictionary off the presses—for it was to be sold in parts, to help keep revenues coming in—was wrong by a factor of ten. It was this kind of woefully naive underestimate—of work, of time, of money—that at first so hindered the dictionary’s advance. No one had a clue what they were up against: They were marching blindfolded through molasses.

So, even with all those innovations, this undertaking also produced a textbook example of the planning fallacy. I wonder how quickly and cheaply the task could have been completed with Mechanical Turk, or with some brush-clearing assistance from text mining?
