I’ve just finished reading Nate Silver’s very good new book, The Signal and the Noise: Why So Many Predictions Fail—But Some Don’t. For me, the book was more of a tossed salad than a cake—a tasty assemblage of conceptually related parts that doesn’t quite cohere into a new whole. Still, I would highly recommend it to anyone interested in forecasting, a category that should include anybody with a pulse, as Silver persuasively argues. We can learn a lot just by listening to a skilled practitioner talk about his craft, and that, to me, is what The Signal and the Noise is really all about.
Instead of trying to review the whole book here, though, I wanted to pull on one particular thread running through it, because I worry about where that thread might lead some readers. That thread concerns the relative merits of statistical models and expert judgment as forecasting tools.
Silver is a professional forecaster who built his reputation on the clever application of statistical tools, but that doesn’t mean he’s a quantitative fundamentalist. To the contrary, one of the strongest messages in The Signal and the Noise is that our judgment may be poor, but we shouldn’t fetishize statistical models, either. Yes, human forecasters are inevitably biased, but so, in a sense, are the statistical models they build. For starters, those models entail a host of assumptions about the reliability and structure of the data, many of which will often be wrong. For another, there is often important information that’s hard to quantify but is useful for forecasting, and we ignore the signals from this space at our own peril. Third, forecasts are usually more accurate when they aggregate information from multiple, independent sources, and subjective forecasts from skilled experts can be a really useful leg in this stool.
Putting all of these concerns together, Silver arrives at a philosophy of forecasting that might be described as “model-assisted,” or maybe just “omnivorous.” Silver recognizes the power of statistics for finding patterns in noisy data and checking our mental models, but he also cautions strongly against putting blind faith in those tools and favors keeping human judgment in the cycle, including at the final stage where we finally make a forecast about some situation of interest.
To illustrate the power of model-assisted forecasting, Silver describes how well this approach has worked in a few areas: baseball scouting, election forecasting, and meteorology, to name a few. About the latter, for example, he writes that “weather forecasting is one of the success stories in this book, a case of man and machine joining forces to understand and sometimes anticipate the complexities of nature.”
All of what Silver says about the pitfalls of statistical forecasting and the power of skilled human forecasters is true, but only to a point. I think Silver’s preferred approach depends on a couple of conditions that are often absent in real-world efforts to forecast complex political phenomena, where I’ve done most of my work. Because Silver doesn’t spell those conditions out, I thought I would, in an effort to discourage readers of The Signal and the Noise from concluding that statistical forecasts can always be improved by adjusting them according to our judgment.
First, the process Silver recommends assumes that the expert tweaking the statistical forecast is the modeler, or at least has a good understanding of the strengths and weaknesses of the model(s) being used. For example, he describes experienced meteorologists improving their forecasts by manually adjusting certain values to correct for a known flaw in the model. Those manual adjustments seem to make the forecasts better, but they depend on a pretty sophisticated knowledge of the underlying algorithm and the idiosyncrasies of the historical data.
Second and probably more important, the scenarios Silver approvingly describes all involve situations where the applied forecaster gets frequent and clear feedback on the accuracy of his or her predictions. This feedback allows the forecaster to look for patterns in the performance of the statistical tool and the adjustments being made to them. It’s the familiar process of trial and error, but that process only works when we can see where the errors are and see if the fixes we attempt are actually working.
Both of these conditions hold in several of the domains Silver discusses, including baseball scouting and and meteorology. These are data-rich environments where forecasters often know the quirks of the data and statistical models they might use and can constantly see how they’re doing.
In the world of international politics, however, most forecasters—and, whether they realize it or not, every analyst is a forecaster—have little or no experience with statistical forecasting tools and are often skeptical of their value. As a result, discussions about the forecasts these tools produce are more likely to degenerate into a competitive, “he said, she said” dynamic than they are to achieve the synergy that Silver praises.
More important, feedback on the predictive performance of analysts in international politics is usually fuzzy or absent. Poker players get constant feedback from the changing size of their chip stacks. By contrast, people who try to forecast politics rarely do so with much specificity, and even when they do, they rarely keep track of their performance over time. What’s worse, the events we try to forecast—things like coups or revolutions—rarely occur, so there aren’t even that many opportunities to assess our performance even if we try. Most of the score-keeping is done in our own heads, but as Phil Tetlock shows, we’re usually poor judges of own performance. We fixate on the triumphs, forget or explain away the misses, and spin the ambiguous cases as successes.
In this context, it’s not clear to me that Silver’s ideal of “model-assisted” forecasting is really attainable, at least not without more structure being imposed from the outside. For example, I could imagine a process where a third party elicits forecasts from human prognosticators and statistical models and then combines the results in a way that accounts for the strengths and blind spots of each input. This process would blend statistics and expert judgment, just not by means of a single individual as often happened in Silver’s favored examples.
Meanwhile, the virtuous circle Silver describes is already built into the process of statistical modeling, at least when done well. For example, careful statistical forecasters will train their models on one sample of cases and then apply them to another sample they’ve never “seen” before. This out-of-sample validation lets modelers know if they’re onto something useful and gives them some sense of the accuracy and precision of their models before they rush out and apply them.
I couldn’t help but wonder how much Silver’s philosophy was shaped by the social part of his experiences in baseball and elections forecasting. In both of those domains, there’s a running culture clash, or at least the perception of one, between statistical modelers and judgment-based forecasters—nerds and jocks in baseball, quants and pundits in politics. When you work in a field like that, you can get a lot of positive social feedback by saying “Everybody’s right!” I’ve sat in many meetings where someone proposed combining statistical forecasts and expert judgment without specifying how that process would work or that we actually check how the combination is affecting forecast accuracy. Almost every time, though, that proposal is met with a murmur of assent: “Of course! Experts are experts! More is more!” I get the sense that this advice will almost always be popular, but I’m not convinced that it’s always sound.
Silver is right, of course, when he argues that we can never escape subjectivity. Modelers still have to choose the data and models they use, both of which bake a host of judgments right into the pie. What we can do with models, though, is discipline our use of those data, and in so doing, more clearly compare sets of assumptions to see which are more useful. Most political forecasters don’t currently inhabit a world where they can get to know the quirks of the statistical models and adjust for them. Most don’t have statistical models or hold them at arm’s length if they do, and they don’t get to watch them perform anywhere near enough to spot and diagnose the biases. When these conditions aren’t met, we need to be very cautious about taking forecasts from a well-designed model and tweaking them because they don’t feel right.