Big Data Doesn’t Automatically Produce Better Predictions

At FiveThirtyEight, Neil Paine and Rob Arthur report on an intriguing puzzle:

In an age of unprecedented baseball data, we somehow appear to be getting worse at knowing which teams are — and will be — good.

Player-level predictions are as good as, if not better than, they used to be, but team-level predictions of performance are getting worse. Paine and Arthur aren’t sure why, but they rank a couple of trends in the industry (significant changes in the age structure of the league’s players and, ironically, the increased use of predictive analytics in team management) among the likely culprits.

This story nicely illustrates a fact that breathless discussions of the power of “Big Data” often elide: more and better data don’t automatically lead to more accurate predictions. Observation and prediction are interrelated, but the latter does not move in lockstep with the former. At least two things can weaken the link between those two steps in the analytical process.

First, some phenomena are just inherently difficult or impossible to predict with much accuracy. That’s not entirely true of baseball; as Paine and Arthur show, team-level performance predictions have been pretty good in the past. It is true of many other phenomena or systems, however. Take earthquakes: we can now detect and record these events with tremendous precision, but we’re still pretty lousy at anticipating when they’ll occur and how strong they will be. So far, better observation hasn’t led to big gains in prediction.

Second, the systems we’re observing sometimes change, even as we get better at observing them. This is what Paine and Arthur imply is occurring in baseball when they identify trends in the industry as likely explanations for a decline in the predictive power of models derived from historical data. It’s like trying to develop a cure for a disease that’s evolving rapidly as you work on it; the cure you develop in the lab might work great on the last version you captured, but by the time you deploy it, the disease has evolved further, and the treatment doesn’t have much effect.
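
To make that second point concrete, here is a toy sketch of my own, not anything from the FiveThirtyEight analysis. It assumes a made-up system in which an outcome is a simple linear function of one input plus noise, fits a model to historical data, and then scores that model twice: once on new data from the unchanged system and once after the system's underlying relationship has shifted. The function names, the slopes (2.0 and 1.5), and the noise level are all illustrative assumptions.

```python
import random


def fit_slope(xs, ys):
    """Least-squares slope for a line through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def simulate(n, slope, rng, noise=1.0):
    """Draw n (x, y) pairs from y = slope * x + Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [slope * x + rng.gauss(0, noise) for x in xs]
    return xs, ys


def mean_abs_error(slope_hat, xs, ys):
    """Average absolute gap between predictions and outcomes."""
    return sum(abs(y - slope_hat * x) for x, y in zip(xs, ys)) / len(xs)


def drift_demo(n_train, seed=0):
    """Fit on the old system, then score on the old and the changed system."""
    rng = random.Random(seed)
    # Historical data: the hypothetical system has a true slope of 2.0.
    train_x, train_y = simulate(n_train, 2.0, rng)
    slope_hat = fit_slope(train_x, train_y)
    # Fresh data from the same, unchanged system.
    same_x, same_y = simulate(1000, 2.0, rng)
    # Fresh data after the system has changed (assumed new slope of 1.5).
    new_x, new_y = simulate(1000, 1.5, rng)
    return (
        mean_abs_error(slope_hat, same_x, same_y),
        mean_abs_error(slope_hat, new_x, new_y),
    )


if __name__ == "__main__":
    for n in (100, 10_000):
        stable, drifted = drift_demo(n)
        print(f"n_train={n}: error if unchanged={stable:.2f}, "
              f"error after change={drifted:.2f}")
```

In this toy setup, growing the historical sample a hundredfold barely moves either number. The error on the unchanged system is already close to the noise floor, and the error on the changed system is dominated by the shift itself, which no amount of old data can reveal.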

I wonder if this is also the trajectory social science will follow over the next few decades. Right now, we’re getting hit by the leading edge of what will probably be a large and sustained flood tide of new data on human behavior. That inflow is producing some rather optimistic statements about how predictable human behavior in general, and sometimes politics in particular, will become as we discover deeper patterns in those data.

I don’t share that confidence. A lot of human behavior is predictably routine, and a torrent of new behavioral data will almost certainly make us even better at predicting these actions and events. For better or for worse, though, those routines are not especially interesting or important to most political scientists. Political scientists are more inclined to focus on “high” politics, which remains fairly opaque, or on system-level outcomes like wars and revolutions that emerge from changes in individual-level behavior in non-obvious ways. I suspect we’ll get a little better at predicting these things as we accumulate richer data on various parts of those systems, but I am pretty sure we won’t ever get great at it. The processes are too complex, and the systems themselves are constantly evolving, maybe even at an accelerating rate.

1 Comment

  1. This reminds me of a favorite article that I frequently assign students: “God Gave Physics the Easy Problems” (http://ejt.sagepub.com/content/6/1/43.abstract). One of the messages is that as we improve our predictions of social behavior, we make prescriptions that change our behavior.
