Data Science Takes Work, Too

Yesterday, I got an email from the editor of an online publication inviting me to contribute pieces that would bring statistical analysis to bear on some topics they are hoping to cover. I admire the publication, and the topics interest me.

There was only one problem: the money. The honorarium they could offer for a published piece is less than my hourly consulting rate, and all of the suggested projects—as well as most others I can imagine that would fit into this outlet’s mission—would probably take days to do. I would have to find, assemble, and clean the relevant data; explore and then analyze the fruits of that labor; generate and refine visualizations of those results; and, finally, write approximately 1,000 words about it. Extrapolating from past experience, I suspect that if I took on one of these projects, I would be working for less than minimum wage. And, of course, that estimated wage doesn’t account for the opportunity costs of forgoing other work (or leisure) I might have done during that time.

I don’t mean to cast aspersions on this editor. The publication is attached to a non-profit endeavor, so the fact that they were offering any payment at all already puts them well ahead of most peers. I’m also guessing that many of this outlet’s writers have salaried “day” jobs to which their contributions are relevant, so the honorarium is more of a bonus than a wage. And, of course, I spend hours of unpaid time writing posts for this blog, a pattern that some people might reasonably interpret as a signal of how much (or little) I think my time is worth.

Still, I wonder if part of the issue here is that this editor just had no idea how much work those projects would entail. A few days ago, Jeff Leek ran an excellent post on the Simply Statistics blog about how “data science done well looks easy—and that is a big problem for data scientists.” As Leek points out,

Most well executed and successful data science projects don’t (a) use super complicated tools or (b) fit super complicated statistical models. The characteristics of the most successful data science projects I’ve evaluated or been a part of are: (a) a laser focus on solving the scientific problem, (b) careful and thoughtful consideration of whether the data is the right data and whether there are any lurking confounders or biases and (c) relatively simple statistical models applied and interpreted skeptically.

It turns out doing those three things is actually surprisingly hard and very, very time consuming. It is my experience that data science projects take a solid 2-3 times as long to complete as a project in theoretical statistics. The reason is that inevitably the data are a mess and you have to clean them up, then you find out the data aren’t quite what you wanted to answer the question, so you go find a new data set and clean it up, etc. After a ton of work like that, you have a nice set of data to which you fit simple statistical models and then it looks super easy to someone who either doesn’t know about the data collection and cleaning process or doesn’t care.

All I can say to all of that is: YES. On topics I’ve worked on for years, I realize some economies of scale by knowing where to look for data, knowing what those data look like, and having ready-made scripts that ingest, clean, and combine them. Even on those topics, though, updates sometimes break the scripts, sources come and go, and the choice of model or methods isn’t always obvious. Meanwhile, on new topics, the process invariably takes many hours, and it often ends in failure or frustration because the requisite data don’t exist, or you discover that they can’t be trusted.

The visualization part alone can take a lot of time if you’re finicky about it—and you should be finicky about it, because your charts are what most people are going to see, learn from, and remember. Again, though, I think most people who don’t do this work simply have no idea.

Last year, as part of a paid project, I spent the better part of a day tinkering with an R script to ingest and meld a bunch of time series and then generate a single chart that would compare those time series. When I finally got the chart where I wanted it, I showed the results to someone else working on that project. He liked the chart and immediately proposed some other variations we might try. When I responded by pointing out that each of those variations might take an hour or two to produce, he was surprised and admitted that he thought the chart had come from a canned routine.

We laughed about it at the time, but I think that moment perfectly illustrates the disconnect that Leek describes. What took me hours of iterative code-writing and drew on years of accumulated domain expertise and work experience looked to someone else like nothing more than the result of a few minutes of menu-selecting and button-clicking. When that’s what people think you do, it’s hard to get them to agree to pay you well for what you actually do.

EVEN BETTER Animated Map of Coup Attempts Worldwide, 1946-2013

[Click here to go straight to the map]

A week ago, I posted an animated map of coup attempts worldwide since 1946 (here). Unfortunately, those maps were built from a country-year data set, so we couldn’t see multiple attempts within a single country over the course of a year. As it happens, though, the lists of coup attempts on which that animation was based do specify the dates of those events. So why toss out all that information?

To get a sharper picture of the distribution of coup attempts across space and time, I rebuilt my mashed-up list of coup attempts from the original sources (Powell & Thyne and Marshall), but now with the dates included. Where only a month was given, I pegged the event to the first day of that month. To avoid double-counting, I then deleted events that appeared to be duplicates (same outcome in the same country within a single week). Finally, to get the animation in CartoDB to give a proper sense of elapsed time, I embedded the results in a larger data frame of all dates over the 68-year period observed. You can find the daily data on my Google Drive (here).
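The three cleaning steps above—pegging month-only dates to the first of the month, dropping apparent duplicates, and padding the series with every date in the observation window—can be sketched roughly as follows. This is a hypothetical illustration in Python (my actual scripts are in R), and the record layout, field names, and sample events are all assumptions for the sake of the example:

```python
from datetime import date, timedelta

# Hypothetical records: (country, outcome, date). Month-only sources
# have already been pegged to the first day of that month.
raw = [
    ("Ghana", "failed", date(1967, 4, 17)),
    ("Ghana", "failed", date(1967, 4, 17)),     # exact duplicate across sources
    ("Bolivia", "success", date(1970, 10, 1)),  # month-only source, pegged to the 1st
    ("Bolivia", "success", date(1970, 10, 6)),  # same outcome within a week: dropped
]

def dedupe(events, window_days=7):
    """Drop events with the same country and outcome within `window_days`."""
    kept = []
    for country, outcome, d in sorted(events, key=lambda e: e[2]):
        if any(c == country and o == outcome and abs((d - kd).days) <= window_days
               for c, o, kd in kept):
            continue
        kept.append((country, outcome, d))
    return kept

events = dedupe(raw)

# Embed the events in a frame of all dates over the observed period, so the
# animation's clock advances at a constant rate even through event-free stretches.
start, end = date(1946, 1, 1), date(2013, 12, 31)
all_days = [start + timedelta(days=n) for n in range((end - start).days + 1)]
```

The padding step matters because a time-animated map that only sees event dates will lurch from coup to coup; joining against the full calendar is what makes quiet decades look quiet.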

WordPress doesn’t seem to let me embed the results of my mapping directly in this post, but you can see and interact with the results at CartoDB (here). I think this version shows more clearly how much the rate of coup attempts has slowed in the past couple of decades, and it still does a good job of showing change over time in the geographic distribution of these events.

The two things I can’t figure out how to do so far are 1) to use color to differentiate between successful and failed attempts and 2) to show the year or month and year in the visualization so we know where we are in time. For differentiating by outcome, there’s a variable in the data set that does this, but it looks like the current implementation of the Torque option in CartoDB won’t let me show multiple layers or differentiate between the events by type. On showing the date, I have no clue. If anyone knows how to do either of these things, please let me know.
