Computerworld ran a great piece last week called “12 Predictive Analytics Screw-Ups” that catalogs some mistakes commonly made in statistical forecasting projects. Unfortunately, the language and examples are all from “industry”—what we used to call the private sector, I guess—so social scientists might read it and struggle to see the relevance to their work. To make that relevance clearer, I thought I’d give social-science-specific examples of the blunders that looked most familiar to me.
1. Begin without the end in mind.
I read this one as an admonition to avoid forecasting something just because you can, even if it’s not clear what those forecasts are useful for. More generally, though, I think it can also be read as a warning against fishing expeditions. If you poke around enough in a large data set on interstate wars, you’re probably going to find some variables that really boost your R-squared or Area under the ROC Curve, but the models you get from that kind of dredging will often perform a lot worse when you try to use them to forecast in real time.
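To make the dredging problem concrete, here's a minimal sketch (all data invented) of what happens when you screen a big pile of pure-noise predictors against an outcome: the in-sample "winner" looks impressive, but its fit collapses on fresh data.

```python
# A toy illustration of dredging: screen many pure-noise predictors
# against a binary outcome, keep the in-sample "winner," then score it
# on fresh data. Everything here is simulated; no real variables.
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 500                       # observations, candidate predictors
y = rng.integers(0, 2, size=n)        # binary outcome (e.g., war onset)
X_train = rng.normal(size=(n, p))     # predictors that are pure noise
X_test = rng.normal(size=(n, p))      # fresh draws for "real-time" scoring
y_test = rng.integers(0, 2, size=n)

def abs_corr(x, y):
    """Absolute point-biserial correlation; its square is the R-squared
    of a one-predictor linear regression."""
    return abs(np.corrcoef(x, y)[0, 1])

train_fit = np.array([abs_corr(X_train[:, j], y) for j in range(p)])
best = int(train_fit.argmax())        # the "discovered" predictor
in_sample = train_fit[best]
out_of_sample = abs_corr(X_test[:, best], y_test)

print(f"in-sample |r| = {in_sample:.2f}, "
      f"out-of-sample |r| = {out_of_sample:.2f}")
```

With enough candidate variables, the best in-sample correlation is sizable by chance alone, while the out-of-sample correlation hovers near zero.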
2. Define the project around a foundation your data can’t support.
It’s really important to think early and often about how your forecasting process will work in real time. The data you have in hand when generating a forecast will often be incomplete and noisier than the nice, tidy matrix you had when you estimated the model(s) you’re applying, and it’s a good idea to plan around that fact when you’re doing the estimating.
Here’s the sort of thing I have in mind: Let’s say you discover that a country’s score on CIRI’s physical integrity index at the end of one year is a useful predictor of its risk of civil war onset during the next year. Awesome…until you remember that CIRI isn’t updated until late in the calendar year. Now what? You can lag it further, but that’s liable to weaken the predictive signal if the variable is dynamic and recent changes are informative. Alternatively, you can keep the one-year lag and try to update by hand, but then you risk adding even more noise to an already-noisy set of inputs. There’s a reason you use the data made by people who’ve spent a lot of time working on the topic and the coding procedures.
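The lag dilemma above can be sketched in a few lines of pandas. The panel, country names, and `physint` column are all invented stand-ins for a CIRI-style index:

```python
# A toy panel illustrating the trade-off between a fresher lag that may
# not be published in time and a staler lag that is always available.
import pandas as pd

panel = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 4,
    "year":    [2010, 2011, 2012, 2013] * 2,
    "physint": [4, 5, 3, 2, 7, 7, 6, 5],   # stand-in for a CIRI-style index
})

# One-year lag: the stronger signal, but the source may not yet be
# published when you need to issue this year's forecast.
panel["physint_lag1"] = panel.groupby("country")["physint"].shift(1)

# Two-year lag: always available in real time, but it discards the most
# recent (and often most informative) changes.
panel["physint_lag2"] = panel.groupby("country")["physint"].shift(2)

print(panel)
```

Whichever lag you choose at the estimation stage is the one you're stuck with at the forecasting stage, which is why it pays to make the choice deliberately.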
Unfortunately, there’s no elegant escape from this dilemma. The only general rules I can see are 1) to try to anticipate these problems and address them in the design phase and, where possible, 2) to check the impact of these choices on the accuracy of your forecasts and revise when the evidence suggests something else would work better.
3. Don’t proceed until your data is the best it can be.
4. When reviewing data quality, don’t bother to take out the garbage.
6. Don’t just proceed but rush the process because you know your data is perfect.
These three screw-ups underscore the tension between the need to avoid garbage in, garbage out (GIGO) modeling on the one hand and the need to be opportunistic on the other.
On many topics of interest to social scientists, we either have no data or the data we do have are sparse or lousy or both (see here for more on this point). Under these circumstances, you need to find ways to make the most of the information you’ve got, but you don’t want to pretend that you can spin straw into gold.
Again, there’s no elegant escape from these trade-offs. That said, I think there’s almost always a significant payoff to be had from getting familiar with the data you’re using, and from doing what you can to make those data cleaner or more complete without setting yourself up for failure at the forecasting stage (e.g., your multiple-imputation algorithm might expand your historical sample, but it won’t give you the latest values of the predictors it was used to infill).
5. Use data from the future to predict the future.
So you’ve estimated a logistic regression model and, using cross-validation, discovered that it works really well out of sample. Then you look at the estimated coefficients and discover that one variable really seems to be driving that result. Then you look closer and discover that this variable is actually a consequence of the dependent variable. Doh!
I had this happen once when I was trying to develop a model that could be used to forecast transitions to and from democracy (see here for the results). At the exploratory stage, I found that a variable which counts the number of democracy episodes a country has experienced was a really powerful predictor of transitions to democracy. Then I remembered that this counter—which I’d coded—ticks up in the year that a transition occurs, so of course higher values were associated with a higher probability of transition. In this case, the problem was easily solved by lagging the predictor, but the problem and its solution won’t always be that obvious. Again, knowing your data should go a long way toward protecting you against this error.
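The episode-counter leak described above is easy to reproduce in miniature. In this invented example, the counter increments in the very year a transition occurs, so the contemporaneous version smuggles the outcome into the predictors; the one-year lag restores a legitimate signal:

```python
# A toy version of the leak: a "number of democracy episodes" counter
# that ticks up in the transition year itself. All data are invented.
import pandas as pd

df = pd.DataFrame({
    "country":    ["A"] * 5,
    "year":       [2000, 2001, 2002, 2003, 2004],
    "transition": [0, 1, 0, 0, 1],
})

# The counter as originally coded: it increments in the year the
# transition occurs, so it moves in lockstep with the outcome.
df["episodes"] = df.groupby("country")["transition"].cumsum()

# The lagged counter only uses information available before the year
# starts, which is all a real-time forecast can legitimately use.
df["episodes_lag1"] = df.groupby("country")["episodes"].shift(1).fillna(0)

print(df[["year", "transition", "episodes", "episodes_lag1"]])
```

In the 2001 transition year, the contemporaneous counter has already ticked up to 1, while the lagged version still reads 0; only the latter is safe to feed a forecasting model.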
8. Ignore the subject-matter experts when building your model.
For me, a forecasting tournament we conducted as part of the work of the Political Instability Task Force (PITF) really drove this point home (see here). We got better results when we restricted our vision to a smaller set of variables selected by researchers who’d been immersed in the material for a long time than we did when we applied those same methods to a much larger set of variables that we happened to have available to us.
This is probably less likely to be a problem for academics, who are more likely to try to forecast on topics they know and care about, than it is for “data scientists” and “hackers” who are often asked to throw the methods they know at problems on all sorts of unfamiliar topics. Still, even when you’re covering territory that seems familiar, it’s always a good idea to brush up on the relevant literature and ask around as you get started. A single variable often makes a significant difference in the predictive power of a forecasting algorithm.
9. Just assume that the keepers of the data will be fully on board and cooperative.
Data sets that are licensed and are therefore either too expensive to keep buying or impossible to include in replication files. Boutique data sets that cover really important topics but were created once and aren’t updated. Data that are embargoed while someone waits to publish from them. Data sets whose creators start acting differently when they hear that their data are useful for forecasting.
These are all problems I’ve run into, and any one of them can effectively kill an applied forecasting project. Better to clear them up early, or to set aside the relevant data sets, than to paint yourself into this kind of corner.
10. If you build it, they will come; don’t worry about how to serve it up.
I still don’t feel like I have a great handle on how to convey probabilistic forecasts to non-statistical audiences, or which parts of those forecasts are most useful to which kinds of audiences. This is a really hard problem that has a huge effect on the impact of the work, and in my experience, having forecasts that are demonstrably accurate doesn’t automatically knock these barriers down (just ask Nate Silver).
The two larger lessons I take from my struggles with this problem are 1) to incorporate thinking about how the forecasts will be conveyed into the research design and 2) to consider presenting the forecasts in different ways to different audiences.
Regarding the first, the idea is to avoid methods that your intended audience won’t understand or at least tolerate. For example, if your audience is going to want information about the relative influence of various predictors on the forecasts in specific cases, you’re going to want to avoid “black box” algorithms that make it hard or impossible to recover that information.
Regarding the second, my point is not to assume that you know the single “right” way to communicate your forecasts. In fact, I think it’s a good idea to be experimental if you can. Try presenting the forecasts a few different ways—maps or dot plots, point estimates or distributions, cross-sectional comparisons or time series—see which ones resonate with which audiences, and tailor your publications and presentations accordingly.
11. If the results look obvious, throw out the model.
Even if it’s not generating a lot of counter-intuitive results, a reasonably accurate forecasting model can still be really valuable in a couple of ways. First, it’s a great baseline for further research. Second, when a model like that occasionally does serve up a counter-intuitive result, that forecast will often reward a closer look: the cases that land far from the regression line may hold some great leads on variables your initial model overlooked.
This often comes up in my work on forecasting rare events like coups and mass killings. Most years, most of the countries that my forecasts identify as riskiest are pretty obvious. It doesn’t take a model to tell me that Sweden probably won’t have a coup this year but Mali or Sudan might, so people often respond to the forecasts by saying, “I already knew all that.” When they slow down and give the forecasts a closer look, though, they’ll usually find at least a few cases on either tail of the distribution that don’t match their priors. To my mind, these handfuls of surprises are really the point of the exercise. The conversations that start in response to these counter-intuitive results are the reason we use statistical models instead of just asking people what they think.
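One cheap way to surface that handful of surprises is to rank cases by how far the model's forecast diverges from some naive prior, such as last year's assessment. The countries and probabilities below are invented for illustration:

```python
# Rank cases by divergence between model forecast and a naive prior;
# the biggest gaps are the conversation-starters. All numbers invented.
forecasts = {"Sweden": 0.01, "Mali": 0.25, "Sudan": 0.30, "Chile": 0.12}
priors    = {"Sweden": 0.01, "Mali": 0.28, "Sudan": 0.27, "Chile": 0.02}

surprises = sorted(forecasts,
                   key=lambda c: abs(forecasts[c] - priors[c]),
                   reverse=True)
print(surprises)  # cases most worth a closer look come first
```

Here the obvious cases (Sweden, and to a lesser extent Mali and Sudan) sink to the bottom, while the case that moved most against the prior floats to the top.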
If I had to sum up all of these lessons into a single suggestion, it would be to learn by doing. Applied forecasting is a very different problem from hypothesis testing and even from data mining. You have to live the process a few times to really appreciate its difficulties, and those difficulties can vary widely across different forecasting problems. Ideally, you’ll pick a problem, work it, and generate forecasts in real time for a while so you get feedback not just on your accuracy, but also on how sustainable your process is. To avoid hindsight bias, make the forecasts public as you produce them, or at least share them with some colleagues as you go.