Data Science Takes Work, Too

Yesterday, I got an email from the editor of an online publication inviting me to contribute pieces that would bring statistical analysis to bear on some topics they are hoping to cover. I admire the publication, and the topics interest me.

There was only one problem: the money. The honorarium they could offer for a published piece is less than my hourly consulting rate, and all of the suggested projects (as well as most others I can imagine that would fit into this outlet’s mission) would probably take days to do. I would have to find, assemble, and clean the relevant data; explore and then analyze the fruits of that labor; generate and refine visualizations of those results; and, finally, write approximately 1,000 words about it. Extrapolating from past experience, I suspect that if I took on one of these projects, I would be working for less than minimum wage. And, of course, that estimated wage doesn’t account for the opportunity costs of forgoing other work (or leisure) I might have done during that time.

I don’t mean to cast aspersions on this editor. The publication is attached to a non-profit endeavor, so the fact that they were offering any payment at all already puts them well ahead of most peers. I’m also guessing that many of this outlet’s writers have salaried “day” jobs to which their contributions are relevant, so the honorarium is more of a bonus than a wage. And, of course, I spend hours of unpaid time writing posts for this blog, a pattern that some people might reasonably interpret as a signal of how much (or little) I think my time is worth.

Still, I wonder if part of the issue here is that this editor just had no idea how much work those projects would entail. A few days ago, Jeff Leek ran an excellent post on the Simply Statistics blog, about how “data science done well looks easy—and that is a big problem for data scientists.” As Leek points out,

Most well executed and successful data science projects don’t (a) use super complicated tools or (b) fit super complicated statistical models. The characteristics of the most successful data science projects I’ve evaluated or been a part of are: (a) a laser focus on solving the scientific problem, (b) careful and thoughtful consideration of whether the data is the right data and whether there are any lurking confounders or biases and (c) relatively simple statistical models applied and interpreted skeptically.

It turns out doing those three things is actually surprisingly hard and very, very time consuming. It is my experience that data science projects take a solid 2-3 times as long to complete as a project in theoretical statistics. The reason is that inevitably the data are a mess and you have to clean them up, then you find out the data aren’t quite what you wanted to answer the question, so you go find a new data set and clean it up, etc. After a ton of work like that, you have a nice set of data to which you fit simple statistical models and then it looks super easy to someone who either doesn’t know about the data collection and cleaning process or doesn’t care.

All I can say to all of that is: YES. On topics I’ve worked on for years, I realize some economies of scale by knowing where to look for data, knowing what those data look like, and having ready-made scripts that ingest, clean, and combine them. Even on those topics, though, updates sometimes break the scripts, sources come and go, and the choice of model or methods isn’t always obvious. Meanwhile, on new topics, the process invariably takes many hours, and it often ends in failure or frustration because the requisite data don’t exist, or you discover that they can’t be trusted.

The visualization part alone can take a lot of time if you’re finicky about it—and you should be finicky about it, because your charts are what most people are going to see, learn from, and remember. Again, though, I think most people who don’t do this work simply have no idea.

Last year, as part of a paid project, I spent the better part of a day tinkering with an R script to ingest and meld a bunch of time series and then generate a single chart that would compare those time series. When I finally got the chart where I wanted it, I showed the results to someone else working on that project. He liked the chart and immediately proposed some other variations we might try. When I responded by pointing out that each of those variations might take an hour or two to produce, he was surprised and admitted that he thought the chart had come from a canned routine.

We laughed about it at the time, but I think that moment perfectly illustrates the disconnect that Leek describes. What took me hours of iterative code-writing and drew on years of accumulated domain expertise and work experience looked to someone else like nothing more than the result of a few minutes of menu-selecting and button-clicking. When that’s what people think you do, it’s hard to get them to agree to pay you well for what you actually do.

The Importance of Thinking Statistically

In his enjoyable and accessible book, Numbers Rule Your World, statistician and blogger Kaiser Fung talks a lot about the value of “thinking statistically.” I was reminded of this point twice in the past 24 hours in ways that illustrate some common traps in our causal reasoning and, more generally, the difficulties of designing useful research.

First, I started my Monday morning with a deeply disturbing but also annoying article in the New York Times, about a Tennessee pastor and his wife whose self-published book advocates corporal punishment as a basic part of child-rearing. The article was really a trend story in two parts. First, the article notes the book’s commercial success, which is linked to a wider resurgence in the use of corporal punishment in America. “More than 670,000 copies of the Pearls’ self-published book are in circulation,” we’re told, “and it is especially popular among Christian home-schoolers.” The real news hook, however, came from the second supposed trend: the deaths of three horribly abused kids in families that had been exposed to the Pearls’ teachings. “Debate over the Pearls’ teachings…gained new intensity after the death of a third child, all allegedly at the hands of parents who kept the Pearls’ book, To Train Up a Child, in their homes.”

The stories of extreme child abuse and neglect are the disturbing part of the article, and they are hard to read. Even so, the “data scientist” in me still managed to get annoyed by the insinuation that the Pearls are partly responsible for the three killings the article describes. The article’s author doesn’t flat-out blame the Pearls for the deaths of those three children, but he certainly entertains the idea.

In my view, this is a classic case of inference by anecdote. We see what looks like a cluster of related events (the three deaths); in looking at those events, we see exposure to a common factor that’s plausibly related to them (the Pearls’ book); and so we deduce that the factor caused or at least contributed to the events’ occurrence. The logic is the same as Michele Bachmann’s absurd reasoning about vaccine safety: I met someone who said her daughter got vaccinated and suffered harm soon after; therefore vaccines are harmful, and parents should consider not using them.

Maybe the Pearls’ teachings do increase the risk of child abuse. To see if that’s true, though, we would need a lot more information. What the three deaths give us is a start on the numerator on one side of a comparison of rates of deadly child abuse among parents who have been exposed to the Pearls’ teachings and parents who have not.

Can we fill in any of those other blanks? Well, the advocacy group Childhelp tells us that more than 1,800 children die each year in the United States from child abuse and neglect (5 per day times 365 days), and the CIA Factbook says there are more than 60 million children under 14 in the U.S. That works out to an annual death rate of about 0.003% (1,800 divided by 60 million). Meanwhile, the New York Times story tells us that 670,000 copies of the Pearls’ book are in circulation. If we assume one copy per household and an average of two children in each of those households, that works out to 1.34 million kids in “exposed” families. For the risk to kids in those exposed families to be higher than the risks kids face in the general population, we would need to see more than 40 deaths from child abuse and neglect each year in that group of 1.34 million (0.003% of 1.34 million is about 40). To be confident that the difference wasn’t occurring by chance, we would need to see many more than 40 deaths from child abuse each year in that group.
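For anyone who wants to check that back-of-the-envelope arithmetic, here is a minimal Python sketch. The inputs are the rough figures quoted above. The Poisson model for yearly death counts, and the function name poisson_upper_tail, are my own assumptions for illustration; they are one simple way to ask how many deaths would be "many more than 40," not a definitive test.

```python
import math

# Rough figures quoted in the post.
deaths_per_year = 1_800          # Childhelp: about 5 per day
children_under_14 = 60_000_000   # CIA Factbook
households_with_book = 670_000   # copies in circulation, treated as households
kids_per_household = 2           # assumption made in the post

base_rate = deaths_per_year / children_under_14
exposed_kids = households_with_book * kids_per_household
expected_deaths = base_rate * exposed_kids


def poisson_upper_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean). The Poisson model is an assumption."""
    lower = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - lower


print(f"base rate: {base_rate:.4%}")
print(f"kids in exposed households: {exposed_kids:,}")
print(f"deaths expected at the base rate: {expected_deaths:.1f}")

# How surprising would various yearly counts be if exposure made no difference?
for observed in (41, 50, 60):
    p = poisson_upper_tail(observed, expected_deaths)
    print(f"P(at least {observed} deaths by chance) = {p:.3f}")
```

Under these assumptions, a count just above 40 is roughly what chance alone would produce, which is why the comparison needs a count well beyond that before it says anything about the book.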

Given the national rate of deaths from child abuse and neglect, it’s highly unlikely that the three killings discussed in the Times story are the only ones that have occurred in households with the Pearls’ book. Even so, once we widen our view beyond that “cluster” of three deaths and try to engage in a little comparison, it should become clear that we really don’t know whether or not the Pearls’ book is putting kids at increased risk of fatal abuse, and it’s arguably irresponsible to imply that it is on the basis of those three deaths alone.

The second time my statistical alarm went off in the past 24 hours was during a conversation on Twitter about the effectiveness of U.S. government support for pro-democracy movements in countries under authoritarian rule. As I’ve articulated elsewhere on this blog (see here and here, for example), I’m skeptical of the claim that U.S. support is required to help activists catalyze democratization and believe that it can sometimes even hurt that cause. That claim got me in a debate of sorts with Catherine Fitzpatrick, a human-rights activist who strongly believes U.S. support for democracy movements in other countries is morally and practically necessary. To rebut my argument, she challenged me to find “a carefully calibrated US-hands-off [movement] that succeeded in the world against a deadly authoritarian regime.”

She’s right that there aren’t many. The problem with reaching a conclusion from that fact alone, though, is that there aren’t many authoritarian regimes in which the U.S. government hasn’t provided some support for pro-democracy advocates. To infer something about the effects of U.S. democracy-promoting activity, we need to compare cases where the U.S. got involved with ones where it didn’t, and there are very few cases in the latter set. In experimental-design terms, we’ve got a large test group and a tiny control group.

Making the inferential job even tougher, countries aren’t randomly assigned to those two groups. I’m not privy to the conversations where these decisions are made, but I presume the U.S. government prefers to support movements in cases where it believes those efforts will be more effective. If those judgments are better than chance, then there is an element of self-fulfilling prophecy to the observation my Twitter debater made. This is what statisticians call a selection effect. The fact that democratization rarely occurs in the small set of cases where the U.S. does not publicly promote it (North Korea comes to mind) may simply be telling us that the U.S. government is capable of recognizing situations where its efforts are almost certain to be wasted and acts accordingly.
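A toy simulation makes the selection effect concrete. In the Python sketch below, every number is invented for illustration: the count of cases, the range of prospects, the noise in the judgment, and the support threshold are all hypothetical. By construction, support has no effect at all on outcomes. The only thing happening is that support goes to cases judged, imperfectly but better than chance, to have decent prospects.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

N_CASES = 10_000          # hypothetical; far more regimes than actually exist
SUPPORT_THRESHOLD = 0.1   # hypothetical: only the most hopeless cases are skipped

supported, unsupported = [], []
for _ in range(N_CASES):
    # Latent chance of democratization; support never changes it.
    prospects = random.random() * 0.5
    # Noisy but better-than-chance judgment of those prospects.
    judged = prospects + random.gauss(0, 0.05)
    democratized = random.random() < prospects
    if judged > SUPPORT_THRESHOLD:
        supported.append(democratized)
    else:
        unsupported.append(democratized)


def rate(outcomes):
    """Share of cases in a group that democratized."""
    return sum(outcomes) / len(outcomes)


print(f"supported:   {len(supported):5d} cases, rate {rate(supported):.3f}")
print(f"unsupported: {len(unsupported):5d} cases, rate {rate(unsupported):.3f}")
```

The supported group ends up both much larger and more successful than the unsupported one, even though support does nothing in this toy world. A naive comparison of the two rates would credit the support for a gap that comes entirely from how cases were chosen.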

I could go on about the difficulties of designing research on the effects of U.S. democracy-promotion efforts, but I’ll save that for another day. The big idea behind this post is that causal inference depends on careful comparisons. In the case of the New York Times story, we’re lured to infer from three deaths that the Pearls’ teachings put children at risk without considering how those kids might have fared had their parents never seen the Pearls’ book. In the case of my Twitter conversation, I’m told to understand that aggressive U.S. assistance to pro-democracy advocates makes democratization happen without considering how those advocates would have fared without U.S. help. In drawing attention to the need for thinking comparatively, I’m not claiming to have disproved those hypotheses. I’m just saying that we can’t tell without more information and, in so doing, inviting the authors of those hypotheses, and us all, to dig a little deeper before forming strong beliefs.
