A First-Person Reminder of How Not to Do Statistics and Science

I recently re-learned a lesson in statistical and scientific thinking by getting—or, really, putting—some egg on my face. I think this experience has some didactic value, so I thought I would share it here.

On Monday, the New York Times ran a story claiming that “cities across the nation are seeing a startling rise in murders after years of declines.” The piece included a chart showing year-over-year changes in murder counts in 10 cities, many of them substantial, and it discussed various ideas about why homicide rates are spiking now after years of declines.

I read the piece and thought of claims made in the past decade about the relationship between lead (the metal) and crime. I don’t know the science on that topic, but I read about it in 2013 in Mother Jones, where Kevin Drum wrote:

We now have studies at the international level, the national level, the state level, the city level, and even the individual level. Groups of children have been followed from the womb to adulthood, and higher childhood blood lead levels are consistently associated with higher adult arrest rates for violent crimes. All of these studies tell the same story: Gasoline lead is responsible for a good share of the rise and fall of violent crime over the past half century.

When I read the NYT piece, though, I thought: If murder rates are now spiking in the U.S. but ambient lead levels remain historically low, doesn’t that disprove or at least undercut the claim that lead was responsible for the last crime wave? So I tweeted:

Jordan Wilcox pushed back:

Jordan was right, and I had made two basic mistakes en route to my glib but erroneous conclusion.

First and dumbest, I didn’t check the numbers. The Times only reported statistics from a small and non-representative sample of U.S. cities, and it only compared them across two years. In my experience, that’s not uncommon practice in popular-press trend pieces.

As Bruce Frederick argues in a Marshall Project commentary responding to the same NYT piece, however, that’s not a sound way to look for patterns. When Frederick took a deeper look at the latest police data across a more representative set of cases, he found that almost no U.S. cities appear to be experiencing changes in murder rates outside what we would expect from normal variation around historically low means of recent years. He concludes: “Neither the Times analysis nor my own yields compelling evidence that there has been a pervasive increase in homicides that is substantively meaningful.” On the Washington Post‘s Wonkblog, Max Ehrenfreund showed the same.

Second, even with the flawed statistics I had, I didn’t think carefully about how they related to the Pb-crime hypothesis. Instead, I thought: “We are experiencing a new crime wave and lead levels are still low; therefore lead does not explain the current wave; therefore lead can’t explain the last wave, either.”

In that simple chain of reasoning, I had failed to consider the possibility that different crime waves could have different causes—or really contributing factors, as no one doing careful work on this topic would offer a monocausal explanation of crime. Just as leaded gasoline came and went, other potentially relevant “treatments” that might affect crime rates could come and go, and those subsequent variations would provide little new information about the effects of lead at an earlier time. Imagine that in the near future smoking is virtually eliminated and yet we still see a new wave of lung-cancer cases; would that new wave disprove the link between smoking and lung cancer? No. It might help produce a sharper estimate of the size of that earlier effect and give us a clearer picture of the causal mechanisms at work, but there’s almost always more than one pathway to the same outcome, and the affirmation of one does not disprove the possibility of another.

After reading more about the crime stats and thinking more about the evidence on lead, I’m back where I started. I believe that rates of homicide and other crimes remain close to historical lows in most U.S. cities, and I believe that lead exposure probably had a significant effect on crime rates in previous decades. That’s not terribly interesting, but it’s truer than the glib and provocative thing I tweeted, and it’s easier to see when I slow down and work more carefully through the basics.

ACLED in R

The Armed Conflict Location & Event Data Project, a.k.a. ACLED, produces up-to-date event data on certain kinds of political conflict in Africa and, as of 2015, parts of Asia. In this post, I’m not going to dwell on the project’s sources and methods, which you can read about on ACLED’s About page, in the 2010 journal article that introduced the project, or in the project’s user guides. Nor am I going to dwell on the necessity of using all political event data sets, including ACLED, with care—understanding the sources of bias in how they observe events and of error in how they code them, and interpreting (or, in extreme cases, ignoring) the resulting statistics accordingly.

Instead, my only aim here is to share an R script I’ve written that largely automates the process of downloading and merging ACLED’s historical and current Africa data and then creates a new data frame with counts of events by type at the country-month level. If you use ACLED in R, this script might save you some time and some space on your hard drive.

You can find the R script on GitHub, here.

The chief problem with this script is that the URLs and file names of ACLED’s historical and current data sets change with every update, so the code will need to be modified each time that happens. If the names were modular and the changes to them predictable, it would be easy to rewrite the code to keep up with those changes automatically. Unfortunately, they aren’t, so the best I can do for now is to give step-by-step instructions in comments embedded in the script on how to update the relevant four fields by hand. As long as the basic structure of the .csv files posted by ACLED doesn’t change, though, the rest should keep working.

[UPDATE: I revised the script so it will scrape the link addresses from the ACLED website and parse the file names from them. The new version worked after ACLED updated its real-time file earlier today, when the old version would have broken. Unless ACLED changes its file-naming conventions or the structure of its website, the new version should work for the rest of 2015. In case it does fail, instructions on how to hard-code a workaround are included as comments at the bottom of the script.]
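
For anyone curious what that scraping step looks like, here is a minimal sketch of the general approach, assuming the ‘rvest’ package. The page URL and the file-name pattern below are placeholders, not ACLED’s actual values, so see the script on GitHub for the real thing.

    library(rvest)

    # Hypothetical page URL; substitute the address of the page that lists the data files
    acled.page <- "http://www.example.com/acled/data"

    # Pull every link on the page, then keep the ones that look like data files
    links <- read_html(acled.page) %>% html_nodes("a") %>% html_attr("href")
    data.links <- grep("\\.csv$|\\.xlsx$", links, value = TRUE)

    # The file names are the last piece of each address
    file.names <- basename(data.links)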

It should also be easy to adapt the part of the script that generates country-month event counts to slice the data even more finely, or to count by something other than event type. To do that, you would just need to add variables to the group_by() part of the block of code that produces the object ACLED.cm. For example, if you wanted to get counts of events by type at the level of the state or province, you would revise that line to read group_by(gwno, admin1, year, month, event_type). Or, if you wanted country-month counts of events by the type(s) of actor involved, you could use group_by(gwno, year, month, interaction) and then see this user’s guide to decipher those codes. You get the drift.
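
To make that concrete, here is roughly what that counting block looks like in ‘dplyr’ form. The data frame name ACLED and the column names follow the ones mentioned in this post, but treat this as a sketch rather than a verbatim excerpt of the script. Note, too, that country-months with no recorded events won’t appear in a table built this way, so you may want to expand the grid and fill in zeroes before doing any modeling.

    library(dplyr)

    # ACLED is assumed to be the merged, event-level data frame, with one row per
    # recorded event and a month column already derived from the event dates
    ACLED.cm <- ACLED %>%
      group_by(gwno, year, month, event_type) %>%
      summarise(count = n()) %>%
      ungroup()

    # The province-level variant described above
    ACLED.am <- ACLED %>%
      group_by(gwno, admin1, year, month, event_type) %>%
      summarise(count = n()) %>%
      ungroup()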

The script also shows a couple of examples of how to use ‘ggplot2’ to generate time-series plots of those monthly counts. Here’s one I made of monthly counts of battle events by country for the entire period covered by ACLED as of this writing: January 1997–June 2015. A production-ready version of this plot would require some more tinkering with the size of the country names and the labeling of the x-axis, but this kind of small-multiples chart offers a nice way to explore the data before analysis. (A rough sketch of that kind of call appears after the chart.)

Monthly counts of battle events, January 1997-June 2015
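
Here is the general shape of that kind of ggplot2 call, assuming the country-month counts from the sketch above plus a country-name column called ‘country’ and a Date-class column called ‘yearmon’; the labels and theme settings in the actual script differ.

    library(dplyr)
    library(ggplot2)

    # Keep only battle events; ACLED's event-type labels are longer than just
    # "Battle", so match on the prefix, then sum across the battle subtypes
    battles <- ACLED.cm %>%
      filter(grepl("Battle", event_type)) %>%
      group_by(country, yearmon) %>%
      summarise(count = sum(count))

    ggplot(battles, aes(x = yearmon, y = count)) +
      geom_line() +
      facet_wrap(~ country) +
      labs(x = "", y = "battle events per month") +
      theme_bw() +
      theme(strip.text = element_text(size = 6))  # shrink the country names to fit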

If you use the script and find flaws in it or have ideas on how to make it work better or do more, please email me at ulfelder <at> gmail <dot> com.

2015 Tour de France Predictions

I like to ride bikes, I like to watch the pros race their bikes, and I make forecasts for a living, so I thought it would be fun to try to predict the outcome of this year’s Tour de France, which starts this Saturday and ends on July 26. I’m also interested in continuing to explore the predictive power of pairwise wiki surveys, a crowdsourcing tool that I’ve previously used to try to forecast mass-killing onsets, coup attempts, and pro football games, and that ESPN recently used to rank NBA draft prospects.

So, a couple of weeks ago, I used All Our Ideas to create a survey that asks, “Which rider is more likely to win the 2015 Tour de France?” I seeded the survey with the names of 11 riders—the 10 seen by bookmakers at Paddy Power as the most likely winners, plus Peter Sagan because he’s fun to watch. I posted a link to the survey on Tumblr and trolled for respondents on Twitter and Facebook. The survey got off to a slow start, but then someone posted a link to it in the r/cycling subreddit, and the votes came pouring in. As of this afternoon, the survey had garnered more than 4,000 votes in 181 unique user sessions that came from five continents (see the map below). The crowd also added a handful of other riders to the set under consideration, bringing the list up to 16.

Map of the locations from which the wiki survey’s votes were cast

So how does that self-selected crowd handicap the race? The dot plot below shows the riders in descending order by their survey scores, which range from 0 to 100 and indicate the probability that that rider would beat a randomly chosen other rider for a randomly chosen respondent. In contrast to Paddy Power, which currently shows Chris Froome as the clear favorite and gives Nairo Quintana a slight edge over Alberto Contador, this survey sees Contador as the most likely winner (survey score of 90), followed closely by Froome (87) and a little further back by Quintana (80). Both sources put Vincenzo Nibali as fourth likeliest (73) and Tejay van Garderen (65) and Thibaut Pinot (51) in the next two spots, although Paddy Power has them in the opposite order. Below that, the distances between riders’ chances get smaller, but the wiki survey’s results still approximate the handicapping of the real-money markets pretty well.

Dot plot of riders’ scores from the wiki survey

There are at least a couple of ways to try to squeeze some meaning out of those scores. One is to read the chart as a predicted finishing order for the 16 riders listed. That’s useful for something like a bike race, where we—well, some of us, anyway—care not only about who wins, but also about where the other riders will finish.

We can also try to convert those scores to predicted probabilities of winning. The chart below shows what happens when we do that by dividing each rider’s score by the sum of all scores and then multiplying the result by 100. The probabilities this produces are all pretty low and more tightly bunched than seems reasonable, but I’m not sure how else to do this conversion. I tried squaring and cubing the scores; the results came closer to what the betting-market odds suggest are the “right” values, but I couldn’t think of a principled reason to do that, so I’m not showing those here. If you know a better way to get from those model scores to well-calibrated win probabilities, please let me know in the comments.
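
In code, that conversion is only a couple of lines. Here is a toy version using a few of the scores mentioned above; the real calculation uses all 16 riders, so the resulting percentages differ from what this snippet returns.

    # A hypothetical subset of the survey scores (0-100 scale)
    scores <- c(Contador = 90, Froome = 87, Quintana = 80, Nibali = 73)

    # Divide each score by the sum of all scores, then scale to percent
    win.prob <- 100 * scores / sum(scores)
    round(win.prob, 1)

    # The squared and cubed variants mentioned above just replace 'scores' with
    # scores^2 or scores^3 before renormalizing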

Predicted probabilities of winning derived from the wiki survey scores

So that’s what the survey says. After the Tour concludes in a few weeks, I’ll report back on how the survey’s predictions fared. Meanwhile, here’s wishing the athletes a crash-, injury-, and drug-free Tour. Judging by the other big races I’ve seen so far this year, it should be a great one to watch.

The Birth of Crowdsourcing?

From p. 106 of the first paperback edition of The Professor and the Madman, a slightly overwrought but enjoyable history of the origins of the Oxford English Dictionary, found on the shelf of a vacation rental:

The new venture that [Richard Chenevix] Trench seemed now to be proposing would demonstrate not merely the meaning but the history of meaning, the life story of each word. And that would mean the reading of everything and the quoting of everything that showed anything of the history of the words that were to be cited. The task would be gigantic, monumental, and—according to the conventional thinking of the times—impossible.

Except that here Trench presented an idea, an idea that—to those ranks of conservative and frock-coated men who sat silently in the [London Library] on that dank and foggy evening [in 1857]—was potentially dangerous and revolutionary. But it was the idea that in the end made the whole venture possible.

The undertaking of the scheme, he said, was beyond the ability of any one man. To peruse all of English literature—and to comb the London and New York newspapers and the most literate of the magazines and journals—must be instead “the combined action of many.” It would be necessary to recruit a team—moreover, a huge one—probably comprising hundreds and hundreds of unpaid amateurs, all of them working as volunteers.

The audience murmured with surprise. Such an idea, obvious though it may sound today, had never been put forward before. But then, some members said as the meeting was breaking up, it did have some real merit.

And here’s what that crowdsourcing process ended up looking like in practice:

[Frederick] Furnivall then issued a circular calling for volunteer readers. They could select from which period of history they would like to read books—from 1250 to 1526, the year of the New English Testament; from then to 1674, the year when Milton died; or from 1674 to what was then the present day. Each period, it was felt, represented the existence of different trends in the development of the language.

The volunteers’ duties were simple enough, if onerous. They would write to the society offering their services in reading certain books; they would be asked to read and make word-lists of all that they read, and would then be asked to look, super-specifically, for certain words that currently interested the dictionary team. Each volunteer would take a slip of paper, write at its top left-hand side the target word, and below, also on the left, the date of the details that followed: These were, in order, the title of the book or paper, its volume and page number, and then, below that, the full sentence that illustrated the use of the target word. It was a technique that has been undertaken by lexicographers to the present day.

Herbert Coleridge became the first editor of what was to be called A New English Dictionary on Historical Principles. He undertook as his first task what may seem prosaic in the extreme: the design of a small stack of oak-board pigeonholes, nine holes wide and six high, which could accommodate the anticipated sixty to one hundred thousand slips of paper that would come in from the volunteers. He estimated that the first volume of the dictionary would be available to the world within two years. “And were it not for the dilatoriness of many contributors,” he wrote, clearly in a tetchy mood, “I should not hesitate to name an earlier period.”

Everything about these forecasts was magnificently wrong. In the end more than six million slips of paper came in from the volunteers; and Coleridge’s dreamy estimate that it might take two years to have the first salable section of the dictionary off the presses—for it was to be sold in parts, to help keep revenues coming in—was wrong by a factor of ten. It was this kind of woefully naive underestimate—of work, of time, of money—that at first so hindered the dictionary’s advance. No one had a clue what they were up against: They were marching blindfolded through molasses.

So, even with all those innovations, this undertaking also produced a textbook example of the planning fallacy. I wonder how quickly and cheaply the task could have been completed with Mechanical Turk, or with some brush-clearing assistance from text mining?

A Plea for More Prediction

The second Annual Bank Conference on Africa happened in Berkeley, CA, earlier this week, and the World Bank’s Development Impact blog has an outstanding summary of the 50-odd papers presented there. If you have to pick between reading this post and that one, go there.

One paper on that roster that caught my eye revisits the choice of statistical models for the study of civil wars. As authors John Paul Dunne and Nan Tian describe, the default choice is logistic regression, although probit gets a little playing time, too. They argue, however, that a zero-inflated Poisson (ZIP) model matches the data-generating process better than either of these traditional picks, and they show that this choice affects what we learn about the causes of civil conflict.

Having worked on statistical models of civil conflict for nearly 20 years, I have some opinions on that model-choice issue, but those aren’t what I want to discuss right now. Instead, I want to wonder aloud why more researchers don’t use prediction as the yardstick—or at least one of the yardsticks—for adjudicating these model comparisons.

In their paper, Dunne and Tian stake their claim about the superiority of ZIP to logit and probit on comparisons of Akaike information criteria (AIC) and Vuong tests. Okay, but if their goal is to see if ZIP fits the underlying data-generating process better than those other choices, what better way to find out than by comparing out-of-sample predictive power?

Prediction is fundamental to the accumulation of scientific knowledge. The better we understand why and how something happens, the more accurate our predictions of it should be. When we estimate models from observational data and only look at how well our models fit the data from which they were estimated, we learn some things about the structure of that data set, but we don’t learn how well those things generalize to other relevant data sets. If we believe that the world isn’t deterministic—that the observed data are just one of many possible realizations of the world—then we need to care about that ability to generalize, because that generalization and the discovery of its current limits are the heart of the scientific enterprise.

From a scientific standpoint, the ideal world would be one in which we could estimate models representing rival theories, then compare the accuracy of the predictions they generate across a large number of relevant “trials” as they unfold in real time. That’s difficult for scholars studying big but rare events like civil wars and wars between states, though; a lot of time has to pass before we’ll see enough new examples to make a statistically powerful comparison across models.

But, hey, there’s an app for that—cross-validation! Instead of using all the data in the initial estimation, hold some out to use as a test set for the models we get from the rest. Better yet, split the data into several equally-sized folds and then iterate the training and testing across all possible groupings of them (k-fold cross-validation). Even better, repeat that process a bunch of times and compare distributions of the resulting statistics.
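
Here is a bare-bones sketch of what that looks like in R. The data frame ‘cw’, its count outcome ‘onsets’, and the covariates ‘x1’ and ‘x2’ are stand-ins, not Dunne and Tian’s actual data or specification, and I have paired the ZIP model with a plain Poisson for simplicity; swapping in a logit for a binary outcome works the same way, with an appropriate accuracy metric.

    library(pscl)  # for zeroinfl()

    k <- 5
    set.seed(20150610)
    folds <- sample(rep(1:k, length.out = nrow(cw)))  # randomly assign rows to folds

    oos <- data.frame(fold = 1:k, poisson = NA, zip = NA)
    for (i in 1:k) {
      train <- cw[folds != i, ]
      test  <- cw[folds == i, ]
      m.pois <- glm(onsets ~ x1 + x2, data = train, family = poisson)
      m.zip  <- zeroinfl(onsets ~ x1 + x2, data = train, dist = "poisson")
      # Score each model on the held-out fold, here with mean absolute error
      oos$poisson[i] <- mean(abs(test$onsets - predict(m.pois, newdata = test, type = "response")))
      oos$zip[i] <- mean(abs(test$onsets - predict(m.zip, newdata = test, type = "response")))
    }
    colMeans(oos[, c("poisson", "zip")])  # average out-of-sample error by model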

Prediction is the gold standard in most scientific fields, and cross-validation is standard practice in many areas of applied forecasting, because they are more informative than in-sample tests. For some reason, political science still mostly eschews both.* Here’s hoping that changes soon.

* For some recent exceptions to this rule on topics in world politics, see Ward, Greenhill, and Bakke and Blair, Blattman, and Hartman on predicting civil conflict; Chadefaux on warning signs of interstate war; Hill and Jones on state repression; and Chenoweth and me on the onset of nonviolent campaigns.

An Applied Forecaster’s Bad Dream

This is the sort of thing that freaks me out every time I’m getting ready to deliver or post a new set of forecasts:

In its 2015 States of Fragility report, the Organization for Economic Co-operation and Development (OECD) decided to complicate its usual one-dimensional list of fragile states by assessing five dimensions of fragility: Violence, Justice, Institutions, Economic Foundations and Resilience…

Unfortunately, something went wrong during the calculations. In my attempts to replicate the assessment, I found that the OECD misclassified a large number of states.

That’s from a Monkey Cage post by Thomas Leo Scherer, published today. Here, per Scherer, is why those errors matter:

Recent research by Judith Kelley and Beth Simmons shows that international indicators are an influential policy tool. Indicators focus international attention on low performers to positive and negative effect. They cause governments in poorly ranked countries to take action to raise their scores when they realize they are being monitored or as domestic actors mobilize and demand change after learning how they rate versus other countries. Given their potential reach, indicators should be handled with care.

For individuals or organizations involved in scientific or public endeavors, the best way to mitigate that risk is transparency. We can and should argue about concepts, measures, and model choices, but given a particular set of those elements, we should all get essentially the same results. When one or more of those elements is hidden, we can’t fully understand what the reported results represent, and researchers who want to improve the design by critiquing and perhaps extending it are forced to shadow-box. Also, individuals and organizations can double- and triple-check their own work, but errors are almost inevitable. When getting the best possible answers matters more than the risk of being seen making mistakes, then transparency is the way to go. This is why the Early Warning Project shares the data and code used to produce its statistical risk assessments in a public repository, and why Reinhart and Rogoff probably (hopefully?) wish they’d done something similar.

Of course, even though transparency improves the probability of catching errors and improving on our designs, it doesn’t automatically produce those goods. What’s more, we can know that we’re doing the right thing and still dread the public discovery of an error. Add to that risk the near-certainty of other researchers scoffing at your terrible code, and it’s easy to see why even the best practices won’t keep you from breaking out in a cold sweat each time you hit “Send” or “Publish” on a new piece of work.

 

Waiting for Data-dot

A suburban house. A desk cluttered with papers, headphones, stray cables, and a pair of socks. Dawn.

Jay, sitting at the desk, opens a browser tab and clicks on a favorited site to see if a data set he needs to produce forecasts has been updated yet. It has not. He pauses, slurps coffee from a large mug, and tries another site. As before.

JAY: (giving up again). Nothing to be done.

 

[Apologies to Samuel Beckett.]

 

 

A Bit More on Country-Month Modeling

My family is riding the flu carousel right now, and my turn came this week. So, in lieu of trying to write from scratch, I wanted to pick up where my last post—on moving from country-year to country-month modeling—left off.

As many of you know, this notion is hardly new. For at least the past decade, many political scientists who use statistical tools to study violent conflict have been advocating and sometimes implementing research designs that shrink their units of observation on various dimensions, including time. The Journal of Conflict Resolution published a special issue on “disaggregating civil war” in 2009. At the time, that publication felt (to me) more like the cresting of a wave of new work than the start of one, and it was motivated, in part, by frustration over all the questions that a preceding wave of country-year civil-war modeling had inevitably left unanswered. Over the past several years, Mike Ward and his WardLab collaborators at Duke have been using ICEWS and other higher-resolution data sets to develop predictive models of various kinds of political instability at the country-month level. Their work has used designs that deal thoughtfully with the many challenges this approach entails, including spatial and temporal interdependence and the rarity of the events of interest. So have others.

Meanwhile, sociologists who study protests and social movements have been pushing in this direction even longer. Scholars trying to use statistical methods to help understand the dynamic interplay between mobilization, activism, repression, and change recognized that those processes can take important turns in weeks, days, or even hours. So, researchers in that field started trying to build event data sets that recorded as exactly as possible when and where various actions occurred, and they often use event history models and other methods that “take time seriously” to analyze the results. (One of them sat on my dissertation committee and had a big influence on my work at the time.)

As far as I can tell, there are two main reasons that all research in these fields hasn’t stampeded in the direction of disaggregation, and one of them is a doozy. The first and lesser one is computing power. It’s no simple thing to estimate models of mutually causal processes occurring across many heterogeneous units observed at high frequency. We still aren’t great at it, but accelerating improvements in computational processing, storage, software—and co-evolving improvements in statistical methods—have made it more tractable than it was even five or 10 years ago.

The second, more important, and more persistent impediment to disaggregated analysis is data, or the lack thereof. Data sets used by statistically minded political scientists come in two basic flavors: global, and case- or region-specific. Almost all of the global data sets of which I’m aware have always used, and continue to use, country-years as their units of observation.

That’s partly a function of the research questions they were built to help answer, but it’s also a function of cost. Data sets were (and mostly still are) encoded by hand by people sifting through or poring over relevant documents. All that labor takes a lot of time and therefore costs a lot of money. One can make (or ask RAs to make) a reasonably reliable summary judgment about something like whether or not a civil war was occurring in a particular country during a particular year much more quickly than one can do that for each month of that year, or each district in that country, or both. This difficulty hasn’t stopped everyone from trying, but the exceptions have been few and often case-specific. In a better world, we could have patched together those case-specific sets to make a larger whole, but they often use idiosyncratic definitions and face different informational constraints, making cross-case comparison difficult.

That’s why I’ve been so excited about the launch of GDELT and Phoenix and now the public release of the ICEWS event data. These are, I think, the leading edge of efforts to solve those data-collection problems in an efficient and durable way. ICEWS data have been available for several years to researchers working on a few contracts, but they haven’t been accessible to most of us until now.  At first I thought GDELT had rendered that problem moot, but concerns about its reliability have encouraged me to keep looking. I think Phoenix’s open-source-software approach holds more promise for the long run, but, as its makers describe, it’s still in “beta release” and “under active development.” ICEWS is a more mature project that has tried carefully to solve some of the problems, like event duplication and errors in geolocation, that diminish GDELT’s utility. (Many millions of dollars help.) So, naturally, I and many others have been eager to start exploring it. And now we can. Hooray!

To really open up analysis at this level, though, we’re going to need comparable and publicly (or at least cheaply) available data sets on a lot more of the things our theories tell us to care about. As I said in the last post, we have a few of those now, but not many. Some of the work I’ve done over the past couple of years—this, especially—was meant to help fill those gaps, and I’m hoping that work will continue. But it’s just a drop in a leaky bucket. Here’s hoping for a hard turn of the spigot.

The Stacked-Label Column Plot

Most of the statistical work I do involves events that occur rarely in places over time. One of the best ways to get or give a feel for the structure of data like that is with a plot that shows variation in counts of those events across sequential, evenly-sized slices of time. For me, that usually means a sequence of annual, global counts of those events, like the one below for successful and failed coup attempts over the past several decades (see here for the R script that generated that plot and a few others and here for the data):

Annual, global counts of successful and failed coup attempts per the Cline Center’s SPEED Project, 1946-2005

One thing I don’t like about those plots, though, is the loss of information that comes from converting events to counts. Sometimes we want to know not just how many events occurred in a particular year but also where they occurred, and we don’t want to have to query the database or look at a separate table to find out.

I try to do both in one go with a type of column chart I’ll call the stacked-label column plot. Instead of building columns from bricks of identical color, I use blocks of text that describe another attribute of each unit—usually country names in my work, but it could be lots of things. In order for those blocks to have comparable visual weight, they need to be equally sized, which usually means using labels of uniform length (e.g., two- or three-letter country codes) and a fixed-width font like Courier New.

I started making these kinds of plots in the 1990s, using Excel spreadsheets or tables in Microsoft Word to plot things like protest events and transitions to and from democracy. A couple decades later, I’m finally trying to figure out how to make them in R. Here is my first reasonably successful attempt, using data I just finished updating on when countries joined the World Trade Organization (WTO) or its predecessor, the General Agreement on Tariffs and Trade (GATT).

Note: Because the Wordpress template I use crams blog-post content into a column that’s only half as wide as the screen, you might have trouble reading the text labels in some browsers. If you can’t make out the letters, try clicking on the plot, then increasing the zoom if needed.

Annual, global counts of countries joining the global free-trade regime, 1960-2014

Without bothering to read the labels, you can see the time trend fine. Since 1960, there have been two waves of countries joining the global free-trade regime: one in the early 1960s, and another in the early 1990s. Those two waves correspond to two spates of state creation, so without the labels, many of us might infer that those stacks are composed mostly or entirely of new states joining.

When we scan the labels, though, we discover a different story. As expected, the wave in the early 1960s does include a lot of newly independent African states, but it also includes a couple of communist countries (Yugoslavia and Poland) and some middle-income cases from other parts of the world (e.g., Argentina and South Korea). Meanwhile, the wave of the early 1990s turns out to include very few post-Communist countries, most of which didn’t join until the end of that decade or early in the next one. Instead, we see a second wave of “developing” countries joining on the eve of the transition from GATT to the WTO, which officially happened on January 1, 1995. I’m sure people who really know the politics of the global free-trade regime, or of specific cases or regions, can spot some other interesting stories in there, too. The point, though, is that we can’t discover those stories if we can’t see the case labels.

Here’s another one that shows which countries had any coup attempts each year between 1960 and 2014, according to Jonathan Powell and Clayton Thyne‘s running list. In this case, color tells us the outcomes of those coup attempts: red if any succeeded, dark grey if they all failed.

Countries with any coup attempts per Powell and Thyne, 1960-2014

One story that immediately catches my eye in this plot is Argentina’s (ARG) remarkable propensity for coups in the early 1960s. It shows up in each of the first four columns, although only in 1962 are any of those attempts successful. Again, this is information we lose when we only plot the counts without identifying the cases.

The way I’m doing it now, this kind of chart requires data to be stored in (or converted to) event-file format, not the time-series cross-sectional format that many of us usually use. Instead of one row per unit–time slice, you want one row for each event. Each row should have at least two columns: one with the case label and one with the time slice in which the event occurred.

If you’re interested in playing around with these types of plots, you can find the R script I used to generate the ones above here. Perhaps some enterprising soul will take it upon him- or herself to write a function that makes it easy to produce this kind of chart across a variety of data structures.

It would be especially nice to have a function that worked properly when the same label appears more than once in a given time slice. Right now, I’m using the function ‘match’ to assign y values that evenly stack the events within each bin. That doesn’t work for the second or third or nth match, though, because the ‘match’ function always returns the position of the first match in the relevant vector. So, for example, if I try to plot all coup attempts each year instead of all countries with any coup attempts each year, the second or later events in the same country get placed in the same position as the first, which ultimately means they show up as blank spaces in the columns. Sadly, I haven’t figured out yet how to identify location in that vector in a more general way to fix this problem.
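
For what it’s worth, here is one way to sidestep that problem: compute a running count of events within each time slice and use it as the y value, instead of relying on ‘match’. The data frame and column names here (‘coups’, ‘year’, ‘iso3c’, ‘outcome’) are hypothetical, so treat this as a sketch of the idea rather than a drop-in patch for the script linked above.

    library(ggplot2)

    # coups: one row per coup attempt, with columns year, iso3c, and outcome.
    # ave() with seq_along returns 1 for the first event in a year, 2 for the
    # second, and so on, so repeated labels within a year no longer collide.
    coups$ypos <- ave(rep(1, nrow(coups)), coups$year, FUN = seq_along)

    ggplot(coups, aes(x = year, y = ypos, label = iso3c, color = outcome)) +
      geom_text(family = "mono", size = 2.5) +  # fixed-width font keeps the blocks even
      scale_color_manual(values = c(failure = "gray30", success = "red")) +
      labs(x = "", y = "coup attempts") +
      theme_minimal()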

Data Science Takes Work, Too

Yesterday, I got an email from the editor of an online publication inviting me to contribute pieces that would bring statistical analysis to bear on some topics they are hoping to cover. I admire the publication, and the topics interest me.

There was only one problem: the money. The honorarium they could offer for a published piece is less than my hourly consulting rate, and all of the suggested projects—as well as most others I can imagine that would fit into this outlet’s mission—would probably take days to do. I would have to find, assemble, and clean the relevant data; explore and then analyze the fruits of that labor; generate and refine visualizations of those results; and, finally, write approximately 1,000 words about it. Extrapolating from past experience, I suspect that if I took on one of these projects, I would be working for less than minimum wage. And, of course, that estimated wage doesn’t account for the opportunity costs of foregoing other work (or leisure) I might have done during that time.

I don’t mean to cast aspersions on this editor. The publication is attached to a non-profit endeavor, so the fact that they were offering any payment at all already puts them well ahead of most peers. I’m also guessing that many of this outlet’s writers have salaried “day” jobs to which their contributions are relevant, so the honorarium is more of a bonus than a wage. And, of course, I spend hours of unpaid time writing posts for this blog, a pattern that some people might reasonably interpret as a signal of how much (or little) I think my time is worth.

Still, I wonder if part of the issue here is that this editor just had no idea how much work those projects would entail. A few days ago, Jeff Leek ran an excellent post on the Simply Statistics blog, about how “data science done well looks easy—and that is a big problem for data scientists.” As Leek points out,

Most well executed and successful data science projects don’t (a) use super complicated tools or (b) fit super complicated statistical models. The characteristics of the most successful data science projects I’ve evaluated or been a part of are: (a) a laser focus on solving the scientific problem, (b) careful and thoughtful consideration of whether the data is the right data and whether there are any lurking confounders or biases and (c) relatively simple statistical models applied and interpreted skeptically.

It turns out doing those three things is actually surprisingly hard and very, very time consuming. It is my experience that data science projects take a solid 2-3 times as long to complete as a project in theoretical statistics. The reason is that inevitably the data are a mess and you have to clean them up, then you find out the data aren’t quite what you wanted to answer the question, so you go find a new data set and clean it up, etc. After a ton of work like that, you have a nice set of data to which you fit simple statistical models and then it looks super easy to someone who either doesn’t know about the data collection and cleaning process or doesn’t care.

All I can say to all of that is: YES. On topics I’ve worked on for years, I realize some economies of scale by knowing where to look for data, knowing what those data look like, and having ready-made scripts that ingest, clean, and combine them. Even on those topics, though, updates sometimes break the scripts, sources come and go, and the choice of model or methods isn’t always obvious. Meanwhile, on new topics, the process invariably takes many hours, and it often ends in failure or frustration because the requisite data don’t exist, or you discover that they can’t be trusted.

The visualization part alone can take a lot of time if you’re finicky about it—and you should be finicky about it, because your charts are what most people are going to see, learn from, and remember. Again, though, I think most people who don’t do this work simply have no idea.

Last year, as part of a paid project, I spent the better part of a day tinkering with an R script to ingest and meld a bunch of time series and then generate a single chart that would compare those time series. When I finally got the chart where I wanted it, I showed the results to someone else working on that project. He liked the chart and immediately proposed some other variations we might try. When I responded by pointing out that each of those variations might take an hour or two to produce, he was surprised and admitted that he thought the chart had come from a canned routine.

We laughed about it at the time, but I think that moment perfectly illustrates the disconnect that Leek describes. What took me hours of iterative code-writing and drew on years of accumulated domain expertise and work experience looked to someone else like nothing more than the result of a few minutes of menu-selecting and button-clicking. When that’s what people think you do, it’s hard to get them to agree to pay you well for what you actually do.
