Superforecasting by Philip E. Tetlock & Dan Gardner
1 An Optimistic Skeptic
We are all forecasters. When we think about changing jobs, getting married, buying a home, making an investment, launching a product, or retiring, we decide based on how we expect the future will unfold. These expectations are forecasts. Often we do our own forecasting. But when big events happen—markets crash, wars loom, leaders tremble—we turn to the experts, those in the know
THE ONE ABOUT THE CHIMP
I want to spoil the joke, so I’ll give away the punch line: the average expert was roughly as accurate as a dart-throwing chimpanzee
It goes like this: A researcher gathered a big group of experts—academics, pundits, and the like—to make thousands of predictions about the economy, stocks, elections, wars, and other issues of the day. Time passed, and when the researcher checked the accuracy of the predictions, he found that the average expert did about as well as random guessing. Except that’s not the punch line because random guessing isn’t funny. The punch line is about a dart-throwing chimpanzee. Because chimpanzees are funny.
There is plenty of room to stake out reasonable positions between the debunkers and the defenders of experts and their forecasts. On the one hand, the debunkers have a point. There are shady peddlers of questionable insights in the forecasting marketplace. There are also limits to foresight that may just not be surmountable. Our desire to reach into the future will always exceed our grasp. But debunkers go too far when they dismiss all forecasting as a fool’s errand. I believe it is possible to see into the future, at least in some situations and to some extent, and that any intelligent, open-minded, and hardworking person can cultivate the requisite skills.
Call me an optimistic skeptic.
THE SKEPTIC
Lorenz poured cold rainwater on that dream. If the clock symbolizes perfect Laplacean predictability, its opposite is the Lorenzian cloud. High school science tells us that clouds form when water vapor coalesces around dust particles. This sounds simple but exactly how a particular cloud develops—the shape it takes—depends on complex feedback interactions among droplets. To capture these interactions, computer modelers need equations that are highly sensitive to tiny butterfly-effect errors in data collection. So even if we learn all that is knowable about how clouds form, we will not be able to predict the shape a particular cloud will take. We can only wait and see. In one of history’s great ironies, scientists today know vastly more than their colleagues a century ago, and possess vastly more data-crunching power, but they are much less confident in the prospects for perfect predictability.
This is a big reason for the skeptic half of my optimistic skeptic stance. We live in a world where the actions of one nearly powerless man can have ripple effects around the world—ripples that affect us all to varying degrees
THE OPTIMIST
But it is one thing to recognize the limits on predictability, and quite another to dismiss all prediction as an exercise in futility
We make mundane predictions like these routinely, while others just as routinely make predictions that shape our lives
So much of our reality is this predictable, or more so. I just Googled tomorrow’s sunrise and sunset times for Kansas City, Missouri, and got them down to the minute. Those forecasts are reliable, whether they are for tomorrow, the day after, or fifty years from now
In so many other high-stakes endeavors, forecasters are groping in the dark. They have no idea how good their forecasts are in the short, medium, or long term—and no idea how good their forecasts could become. At best, they have vague hunches. That’s because the forecast-measure-revise procedure operates only within the rarefied confines of high-tech forecasting, such as the work of macroeconomists at central banks or marketing and financial professionals in big companies or opinion poll analysts like Nate Silver.7 More often forecasts are made and then…nothing. Accuracy is seldom determined after the fact and is almost never done with sufficient regularity and rigor that conclusions can be drawn. The reason? Mostly it’s a demand-side problem: The consumers of forecasting—governments, business, and the public—don’t demand evidence of accuracy. So there is no measurement. Which means no revision. And without revision, there can be no improvement. Imagine a world in which people love to run, but they have no idea how fast the average person runs, or how fast the best could run, because runners have never agreed to basic ground rules—stay on the track, begin the race when the gun is fired, end it after a specified distance—and there are no independent race officials and timekeepers measuring results. How likely is it that running times are improving in this world? Not very. Are the best runners running as fast as human beings are physically capable? Again, probably not.
Superforecasting does require minimum levels of intelligence, numeracy, and knowledge of the world, but anyone who reads serious books about psychological research probably has those prerequisites. So what is it that elevates forecasting to superforecasting? As with the experts who had real foresight in my earlier research, what matters most is how the forecaster thinks. I’ll describe this in detail, but broadly speaking, superforecasting demands thinking that is open-minded, careful, curious, and—above all—self-critical. It also demands focus. The kind of thinking that produces superior judgment does not come effortlessly. Only the determined can deliver it reasonably consistently, which is why our analyses have consistently found commitment to self-improvement to be the strongest predictor of performance.
A FORECAST ABOUT FORECASTING
Machines may get better at mimicking human meaning, and thereby better at predicting human behavior, but “there’s a difference between mimicking and reflecting meaning and originating meaning,” Ferrucci said. That’s a space human judgment will always occupy.
In forecasting, as in other fields, we will continue to see human judgment being displaced—to the consternation of white-collar workers—but we will also see more and more syntheses, like freestyle chess, in which humans with computers compete as teams, the human drawing on the computer’s indisputable strengths but also occasionally overriding the computer. The result is a combination that can (sometimes) beat both humans and machines. To reframe the man-versus-machine dichotomy, combinations of Garry Kasparov and Deep Blue may prove more robust than pure-human or pure-machine approaches.
2 Illusions of Knowledge
THINKING ABOUT THINKING
In describing how we think and decide, modern psychologists often deploy a dual-system model that partitions our mental universe into two domains. System 2 is the familiar realm of conscious thought. It consists of everything we choose to focus on. By contrast, System 1 is largely a stranger to us. It is the realm of automatic perceptual and cognitive operations—like those you are running right now to transform the print on this page into a meaningful sentence or to hold the book while reaching for a glass and taking a sip. We have no awareness of these rapid-fire processes but we could not function without them. We would shut down
The numbering of the two systems is not arbitrary. System 1 comes first. It is fast and constantly running in the background. If a question is asked and you instantly know the answer, it sprang from System 1. System 2 is charged with interrogating that answer. Does it stand up to scrutiny? Is it backed by evidence? This process takes time and effort, which is why the standard routine in decision making is this: first System 1 delivers an answer, and only then can System 2 get involved, starting with an examination of what System 1 decided.
Whether System 2 actually will get involved is another matter. Try answering this: A bat and ball together cost $1.10. The bat costs a dollar more than the ball. How much does the ball cost? If you are like just about everybody who has ever read this famous question, you instantly had an answer: Ten cents. You didn’t think carefully to get that. You didn’t calculate anything. It just appeared. For that, you can thank System 1. Quick and easy, no effort required.
But is ten cents right? Think about the question carefully.
You probably realized a couple of things. First, conscious thought is demanding. Thinking the problem through requires sustained focus and takes an eternity relative to the snap judgment you got with a quick look. Second, ten cents is wrong. It feels right. But it’s wrong. In fact, it’s obviously wrong—if you give it a sober second thought
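For readers who want the arithmetic spelled out, here is a quick check (my own illustration, not the book’s): if the ball costs x, the bat costs x plus one dollar, and the pair must sum to $1.10.

```python
# Worked arithmetic for the bat-and-ball problem (illustration, not from the book).
# Ball costs x, bat costs x + 1.00, and together they cost 1.10, so 2x + 1.00 = 1.10.
ball = (1.10 - 1.00) / 2              # x = 0.05
bat = ball + 1.00                     # 1.05
assert abs((ball + bat) - 1.10) < 1e-9
print(round(ball, 2), round(bat, 2))  # 0.05 1.05
# The intuitive answer fails the check: 0.10 + 1.10 = 1.20, not 1.10.
```

The ball costs five cents and the bat $1.05; the intuitive ten-cent answer would make the pair cost $1.20.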
A defining feature of intuitive judgment is its insensitivity to the quality of the evidence on which the judgment is based. It has to be that way. System 1 can only do its job of delivering strong conclusions at lightning speed if it never pauses to wonder whether the evidence at hand is flawed or inadequate, or if there is better evidence elsewhere. It must treat the available evidence as reliable and sufficient. These tacit assumptions are so vital to System 1 that Kahneman gave them an ungainly but oddly memorable label: WYSIATI (What You See Is All There Is)
Of course, System 1 can’t conclude whatever it wants. The human brain demands order. The world must make sense, which means we must be able to explain what we see and think. And we usually can—because we are creative confabulators hardwired to invent stories that impose coherence on the world
Imagine you’re sitting at a table in a research lab, looking at rows of pictures. You pick one, a picture of a shovel. Why are you pointing at that? Of course you can’t answer without more information. But if you were actually at that table, with your finger pointing at a picture of a shovel, simply saying I don’t know would be a lot harder than you might think. Sane people are expected to have sensible-sounding reasons for their actions. It is awkward to tell others, especially white-lab-coated neuroscientists, I have no idea why—I just am.
This is a poor way to build an accurate mental model of a complicated world, but it’s a superb way to satisfy the brain’s desire for order because it yields tidy explanations with no loose ends. Everything is clear, consistent, and settled. And the fact that it all fits gives us confidence that we know the truth. “It is wise to take admissions of uncertainty seriously,” Daniel Kahneman noted, “but declarations of high confidence mainly tell you that an individual has constructed a coherent story in his mind, not necessarily that the story is true.”
BAIT AND SWITCH
So the availability heuristic—like Kahneman’s other heuristics—is essentially a bait-and-switch maneuver. And just as the availability heuristic is usually an unconscious System 1 activity, so too is bait and switch
A better metaphor involves vision. The instant we wake up and look past the tip of our nose, sights and sounds flow to the brain and System 1 is engaged. This perspective is subjective, unique to each of us. Only you can see the world from the tip of your own nose. So let’s call it the tip-of-your-nose perspective
BLINKING AND THINKING
As imperfect as the view from the tip of your nose may be, you shouldn’t discount it entirely
Popular books often draw a dichotomy between intuition and analysis—blink versus think—and pick one or the other as the way to go. I am more of a thinker than a blinker, but blink-think is another false dichotomy. The choice isn’t either/or, it is how to blend them in evolving situations. That conclusion is not as inspiring as a simple exhortation to take one path or the other, but it has the advantage of being true, as the pioneering researchers behind both perspectives came to understand
There is nothing mystical about an accurate intuition like the fire commander’s. It’s pattern recognition. With training or experience, people can encode patterns deep in their memories in vast number and intricate detail—such as the estimated fifty thousand to one hundred thousand chess positions that top players have in their repertoire.20 If something doesn’t fit a pattern—like a kitchen fire giving off more heat than a kitchen fire should—a competent expert senses it immediately
Progress only really began when physicians accepted that the view from the tip of their nose was not enough to determine what works
All too often, forecasting in the twenty-first century looks too much like nineteenth-century medicine. There are theories, assertions, and arguments. There are famous figures, as confident as they are well compensated. But there is little experimentation, or anything that could be called science, so we know much less than most people realize. And we pay the price. Although bad forecasting rarely leads as obviously to harm as does bad medicine, it steers us subtly toward bad decisions and all that flows from them—including monetary losses, missed opportunities, unnecessary suffering, even war and death
3 Keeping Score
When physicians finally learned to doubt themselves, they turned to randomized controlled trials to scientifically test which treatments work. Bringing the rigor of measurement to forecasting might seem easier to do: collect forecasts, judge their accuracy, add the numbers. That’s it. In no time, we’ll know how good Tom Friedman really is
But it’s not nearly so simple. Consider a forecast Steve Ballmer made in 2007, when he was CEO of Microsoft: “There’s no chance that the iPhone is going to get any significant market share. No chance.”
For various reasons, it’s impossible to say these forecasts are right or wrong beyond all dispute. The truth is, the truth is elusive.
Judging forecasts is much harder than often supposed, a lesson I learned the hard way—from extensive and exasperating experience.
JUDGING JUDGMENTS
Forecasts often rely on implicit understandings of key terms rather than explicit definitions—like significant market share in Steve Ballmer’s forecast. This sort of vague verbiage is more the rule than the exception. And it too renders forecasts untestable.
These are among the smaller obstacles to judging forecasts. Probability is a much bigger one
Some forecasts are easy to judge because they claim unequivocally that something will or won’t happen
But let’s imagine we are omnipotent beings and we can conduct that experiment. We rerun history hundreds of times and we find that 63% of those reruns end in nuclear war. Was Schell right? Perhaps. But we still can’t say with confidence—because we don’t know exactly what he meant by very likely.
Sherman Kent suggested a solution. First, the word possible should be reserved for important matters where analysts have to make a judgment but can’t reasonably assign any probability. So something that is possible has a likelihood ranging from almost zero to almost 100%. Of course that’s not helpful, so analysts should narrow the range of their estimates whenever they can. And to avoid confusion, the terms they use should have designated numerical meanings, which Kent set out in a chart
So if the National Intelligence Estimate said something is probable, it would mean a 63% to 87% chance it would happen. Kent’s scheme was simple—and it greatly reduced the room for confusion
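To illustrate how such a standard could be used in practice, here is a minimal sketch. Only the 63% to 87% band for “probable” comes from the passage above; the other bands are placeholders of mine, not Kent’s actual chart.

```python
# Sketch of Kent-style estimative language: each word carries a numeric band.
# Only the "probable" band (63% to 87%) is taken from the text; the rest are
# illustrative placeholders, not Kent's actual chart.
KENT_BANDS = {
    "almost certain": (0.87, 0.99),        # placeholder
    "probable": (0.63, 0.87),              # from the passage above
    "chances about even": (0.40, 0.60),    # placeholder
    "probably not": (0.20, 0.40),          # placeholder
    "almost certainly not": (0.02, 0.13),  # placeholder
}

def consistent(word: str, probability: float) -> bool:
    """Does a numeric estimate fall inside the band the chosen word implies?"""
    low, high = KENT_BANDS[word]
    return low <= probability <= high

print(consistent("probable", 0.70))  # True
print(consistent("probable", 0.55))  # False: say "chances about even" instead
```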
But it was never adopted. People liked clarity and precision in principle but when it came time to make clear and precise forecasts they weren’t so keen on numbers. Some said it felt unnatural or awkward, which it does when you’ve spent a lifetime using vague language, but that’s a weak argument against change. Others expressed an aesthetic revulsion
A more serious objection—then and now—is that expressing a probability estimate with a number may imply to the reader that it is an objective fact, not the subjective judgment it is. That is a danger. But the answer is not to do away with numbers
But a more fundamental obstacle to adopting numbers relates to accountability and what I call the wrong-side-of-maybe fallacy
If a meteorologist says there is a 70% chance of rain and it doesn’t rain, is she wrong? Not necessarily. Implicitly, her forecast also says there is a 30% chance it will not rain. So if it doesn’t rain, her forecast may have been off, or she may have been exactly right. It’s not possible to judge with only that one forecast in hand. The only way to know for sure would be to rerun the day hundreds of times. If it rained in 70% of those reruns, and didn’t rain in 30%, she would be bang on. Of course we’re not omnipotent beings, so we can’t rerun the day—and we can’t judge. But people do judge. And they always judge the same way: they look at which side of maybe—50%—the probability was on. If the forecast said there was a 70% chance of rain and it rains, people think the forecast was right; if it doesn’t rain, they think it was wrong. This simple mistake is extremely common. Even sophisticated thinkers fall for it
The prevalence of this elementary error has a terrible consequence. Consider that if an intelligence agency says there is a 65% chance that an event will happen, it risks being pilloried if it does not—and because the forecast itself says there is a 35% chance it will not happen, that’s a big risk. So what’s the safe thing to do? Stick with elastic language. Forecasters who use “a fair chance” and “a serious possibility” can even make the wrong-side-of-maybe fallacy work for them: If the event happens, “a fair chance” can retroactively be stretched to mean something considerably bigger than 50%—so the forecaster nailed it. If it doesn’t happen, it can be shrunk to something much smaller than 50%—and again the forecaster nailed it. With perverse incentives like these, it’s no wonder people prefer rubbery words over firm numbers.
When we combine calibration and resolution, we get a scoring system that fully captures our sense of what good forecasters should do. Someone who says there is a 70% chance of X should do fairly well if X happens. But someone who says there is a 90% chance of X should do better. And someone bold enough to correctly predict X with 100% confidence gets top marks. But hubris must be punished. The forecaster who says X is a slam dunk should take a big hit if X does not happen. How big a hit is debatable, but it’s reasonable to think of it in betting terms
The math behind this system was developed by Glenn W. Brier in 1950, hence results are called Brier scores. In effect, Brier scores measure the distance between what you forecast and what actually happened. So Brier scores are like golf scores: lower is better. Perfection is 0. A hedged fifty-fifty call, or random guessing in the aggregate, will produce a Brier score of 0.5. A forecast that is wrong to the greatest possible extent—saying there is a 100% chance that something will happen and it doesn’t, every time—scores a disastrous 2.0, as far from The Truth as it is possible to get.
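A minimal implementation of the scoring rule as described here, using the two-outcome form in which a yes/no forecast contributes a squared error for each possible outcome (my own sketch):

```python
def brier(prob_yes: float, outcome_yes: bool) -> float:
    """Two-outcome Brier score: sum of squared errors over both outcomes.
    0 is perfect, 0.5 is a hedged fifty-fifty call, 2.0 is maximally wrong."""
    outcome = 1.0 if outcome_yes else 0.0
    return (prob_yes - outcome) ** 2 + ((1 - prob_yes) - (1 - outcome)) ** 2

print(brier(1.0, True))                     # 0.0  -- confident and right
print(brier(0.5, True))                     # 0.5  -- hedged fifty-fifty call
print(brier(1.0, False))                    # 2.0  -- confident and wrong
print(brier(0.7, True), brier(0.7, False))  # ~0.18 vs ~0.98
```

Averaging this score over many forecasts gives a forecaster’s overall Brier score.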
THE MEANING OF THE MATH
Not quite. Remember that the whole point of this exercise is to judge the accuracy of forecasts so we can then figure out what works in forecasting and what doesn’t. To do that, we have to interpret the meaning of the Brier scores, which requires two more things: benchmarks and comparability.
Another key benchmark is other forecasters. Who can beat everyone else? Who can beat the consensus forecast? How do they pull it off? Answering these questions requires comparing Brier scores, which, in turn, requires a level playing field
EXPERT POLITICAL JUDGMENT
The first questions put to the experts were about themselves. Age? (The average was forty-three.) Relevant work experience? (The average was 12.2 years.) Education? (Almost all had postgraduate training; half had PhDs.) We also asked about their ideological leanings and preferred approaches to solving political problems.
Forecast questions covered time frames that ranged from one to five to ten years out, and tapped into diverse topics drawn from the news of the day: political and economic, domestic and international. Experts were asked about whatever topics they could be found expounding on in the media and halls of power, which meant they would sometimes be asked to forecast in their zone of expertise, but more often not—which let us compare the accuracy of true subject-matter experts with that of smart, well-informed laypeople. In total, our experts made roughly twenty-eight thousand predictions.
Asking questions took years. Then came the waiting, a test of patience for even the tenured. I began the experiment when Mikhail Gorbachev and the Soviet Politburo were key players shaping the fate of the world; by the time I started to write up the results, the USSR existed only on historical maps and Gorbachev was doing commercials for Pizza Hut. The final results appeared in 2005—twenty-one years, six presidential elections, and three wars after I sat on the National Research Council panel that got me thinking about forecasting. I published them in the academic treatise Expert Political Judgment: How Good Is It? How Can We Know? To keep things simple, I’ll call this whole research program EPJ.
AND THE RESULTS…
If you didn’t know the punch line of EPJ before you read this book, you do now: the average expert was roughly as accurate as a dart-throwing chimpanzee. But as students are warned in introductory statistics classes, averages can obscure. Hence the old joke about statisticians sleeping with their feet in an oven and their head in a freezer because the average temperature is comfortable
In the EPJ results, there were two statistically distinguishable groups of experts. The first failed to do better than random guessing, and in their longer-range forecasts even managed to lose to the chimp. The second group beat the chimp, though not by a wide margin, and they still had plenty of reason to be humble. Indeed, they only barely beat simple algorithms like always predict no change or predict the recent rate of change. Still, however modest their foresight was, they had some
So why did one group do better than the other? It wasn’t whether they had PhDs or access to classified information. Nor was it what they thought—whether they were liberals or conservatives, optimists or pessimists. The critical factor was how they thought
One group tended to organize their thinking around Big Ideas, although they didn’t agree on which Big Ideas were true or false
The other group consisted of more pragmatic experts who drew on many analytical tools, with the choice of tool hinging on the particular problem they faced. These experts gathered as much information from as many sources as they could
This aggregation of many perspectives is bad TV. But it’s good forecasting. Indeed, it’s essential.
DRAGONFLY EYE
Some reverently call it the miracle of aggregation but it is easy to demystify. The key is recognizing that useful information is often dispersed widely, with one person possessing a scrap, another holding a more important piece, a third having a few bits, and so on. When Galton watched people guessing the weight of the doomed ox, he was watching them translate whatever information they had into a number. When a butcher looked at the ox, he contributed the information he possessed thanks to years of training and experience. When a man who regularly bought meat at the butcher’s store made his guess, he added a little more. A third person, who remembered how much the ox weighed at last year’s fair, did the same. And so it went. Hundreds of people added valid information, creating a collective pool far greater than any one of them possessed
How well aggregation works depends on what you are aggregating. Aggregating the judgments of many people who know nothing produces a lot of nothing. Aggregating the judgments of people who know a little is better, and if there are enough of them, it can produce impressive results, but aggregating the judgments of an equal number of people who know lots about lots of different things is most effective because the collective pool of information becomes much bigger
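A toy simulation makes the point concrete. The true weight, the number of judges, and the size of their errors below are my own assumptions, and the simulation presumes the judges’ errors are roughly independent and unbiased:

```python
import random

random.seed(42)
TRUE_WEIGHT = 1198  # an ox's weight in pounds; purely illustrative

# Each judge sees the truth plus idiosyncratic error (some know more, some less).
guesses = [TRUE_WEIGHT + random.gauss(0, 75) for _ in range(800)]

crowd_estimate = sum(guesses) / len(guesses)
avg_individual_error = sum(abs(g - TRUE_WEIGHT) for g in guesses) / len(guesses)

print(f"crowd estimate: {crowd_estimate:.0f} "
      f"(error {abs(crowd_estimate - TRUE_WEIGHT):.1f})")
print(f"typical individual error: {avg_individual_error:.1f}")
# The crowd average lands within a few pounds; a typical individual misses by ~60.
```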
What I should have done is look at the problem from both perspectives—the perspectives of both logic and psycho-logic—and combine what I saw
Unfortunately, aggregation doesn’t come to us naturally. The tip-of-your-nose perspective insists that it sees reality objectively and correctly, so there is no need to consult other perspectives. All too often we agree. We don’t consider alternative views—even when it’s clear that we should
Stepping outside ourselves and really getting a different view of reality is a struggle
But remember the old reflexivity-paradox joke. There are two types of people in the world: those who think there are two types and those who don’t. I’m of the second type. My fox/hedgehog model is not a dichotomy. It is a spectrum. In EPJ, my analyses extended to what I called hybrids—fox-hedgehogs, who are foxes a little closer to the hedgehog end of the spectrum, and hedgehog-foxes, who are hedgehogs with a little foxiness. But even expanding the categories to four doesn’t capture people’s styles of thinking. People can and do think differently in different circumstances—cool and calculating at work, perhaps, but intuitive and impulsive when shopping. And our thinking habits are not immutable. Sometimes they evolve without our awareness of the change. But we can also, with effort, choose to shift gears from one mode to another.23
No model captures the richness of human nature. Models are supposed to simplify things, which is why even the best are flawed. But they’re necessary. Our minds are full of models. We couldn’t function without them. And we often function pretty well because some of our models are decent approximations of reality. “All models are wrong,” the statistician George Box observed, “but some are useful.” The fox/hedgehog model is a starting point, not the end.
Forget the dart-throwing-chimp punch line. What matters is that EPJ found modest but real foresight, and the critical ingredient was the style of thinking. The next step was figuring out how to advance that insight
4 Superforecasters
The question “How accurate are the forecasts of intelligence analysts?” can’t be answered. Of course, some think they know. Senior officials may claim that the IC is right 80% or 90% of the time. But these are just guesses. Like nineteenth-century physicians who were sure that their treatments cured patients 80% or 90% of the time, they may be right, or close to right, or very wrong. Absent accuracy metrics, there is no meaningful way to hold intelligence analysts accountable for accuracy.
Note the word meaningful in that last sentence. When the director of National Intelligence is dragged into Congress for a blown call, that is accountability for accuracy. It may be ill informed or capricious, and serve no purpose beyond political grandstanding, but it is accountability. By contrast, meaningful accountability requires more than getting upset when something goes awry. It requires systematic tracking of accuracy—for all the reasons laid out earlier. But the intelligence community’s forecasts have never been systematically assessed.
What there is instead is accountability for process: Intelligence analysts are told what they are expected to do when researching, thinking, and judging, and then held accountable to those standards. Did you consider alternative hypotheses? Did you look for contrary evidence? It’s sensible stuff, but the point of making forecasts is not to tick all the boxes on the how to make forecasts checklist. It is to foresee what’s coming. To have accountability for process but not accuracy is like ensuring that physicians wash their hands, examine the patient, and consider all the symptoms, but never checking to see whether the treatment works
The intelligence community is not alone in operating this way. The list of organizations that produce or buy forecasts without bothering to check for accuracy is astonishing. But thanks to the shock of the Iraqi WMD debacle, and the prodding of the National Research Council report, and the efforts of some dedicated civil servants, the IC decided to do something about it. Or more precisely, IARPA decided.
The Intelligence Advanced Research Projects Activity is an agency few outside the intelligence community have heard of, and for good reason. IARPA doesn’t have spies doing cloak-and-dagger work, or analysts who interpret information. Its job is to identify and support high-risk, high-payoff research with the potential to improve the IC’s capabilities. That makes IARPA similar to DARPA, but DARPA is much more famous because it’s bigger, has been around longer, and often funds whiz-bang technology. Most intelligence research isn’t so exotic and yet it can be just as important to national security.
IARPA’s plan was to create tournament-style incentives for top researchers to generate accurate probability estimates for Goldilocks-zone questions.8 The research teams would compete against one another and an independent control group. Teams had to beat the combined forecast—the wisdom of the crowd—of the control group, and by margins we all saw as intimidating
A project on this scale costs many millions of dollars a year. But that’s not what made it bureaucratically gutsy for IARPA to do this. After all, the intelligence community’s annual budget is around $50 billion, which is more than most countries’ annual gross domestic product. Next to that mountain of cash, the cost of the IARPA tournament was an anthill. No, what made it gutsy was what it could reveal.
Here’s one possible revelation: Imagine you get a couple of hundred ordinary people to forecast geopolitical events. You see how often they revise their forecasts and how accurate those forecasts prove to be and use that information to identify the forty or so who are the best. Then you have everyone make lots more forecasts. This time, you calculate the average forecast of the whole group—the wisdom of the crowd—but with extra weight given to those forty top forecasters. Then you give the forecast a final tweak: You extremize it, meaning you push it closer to 100% or zero. If the forecast is 70% you might bump it up to, say, 85%. If it’s 30%, you might reduce it to 15%.
Now imagine that the forecasts you produce this way beat those of every other group and method available, often by large margins. Your forecasts even beat those of professional intelligence analysts inside the government who have access to classified information—by margins that remain classified
Think how shocking it would be to the intelligence professionals who have spent their lives forecasting geopolitical events—to be beaten by a few hundred ordinary people and some simple algorithms
It actually happened. What I’ve described is the method we used to win IARPA’s tournament. There is nothing dazzlingly innovative about it. Even the extremizing tweak is based on a pretty simple insight: When you combine the judgments of a large group of people to calculate the wisdom of the crowd you collect all the relevant information that is dispersed among all those people. But none of those people has access to all that information. One person knows only some of it, another knows some more, and so on. What would happen if every one of those people were given all the information? They would become more confident—raising their forecasts closer to 100% or zero. If you then calculated the wisdom of the crowd it too would be more extreme. Of course it’s impossible to give every person all the relevant information—so we extremize to simulate what would happen if we could
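Here is a bare-bones sketch of that recipe: weight the proven forecasters more heavily, average, then push the result away from 50%. The particular weights and the extremizing formula below are common illustrative choices of mine; the book does not spell out GJP’s exact math.

```python
def aggregate(forecasts, weights):
    """Weighted average of individual probability forecasts."""
    return sum(p * w for p, w in zip(forecasts, weights)) / sum(weights)

def extremize(p, a=2.0):
    """Push a probability toward 0 or 1. a > 1 extremizes; a = 1 leaves it alone.
    (A common odds-style transform; illustrative, not GJP's exact recipe.)"""
    return p ** a / (p ** a + (1 - p) ** a)

# Toy example: four ordinary forecasters plus one proven top forecaster (weight 3).
forecasts = [0.60, 0.65, 0.70, 0.75, 0.80]
weights   = [1, 1, 1, 1, 3]

crowd = aggregate(forecasts, weights)
print(round(crowd, 2))              # 0.73
print(round(extremize(crowd), 2))   # ~0.88, nudged toward certainty
```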
Thanks to IARPA, we now know a few hundred ordinary people and some simple math can not only compete with professionals supported by a multibillion-dollar apparatus but also beat them.
And that’s just one of the unsettling revelations IARPA’s decision made possible. What if the tournament discovered ordinary people who could—without the assistance of any algorithmic magic—beat the IC? Imagine how threatening that would be
With his gray beard, thinning hair, and glasses, Doug Lorch doesn’t look like a threat to anyone. He looks like a computer programmer, which he was, for IBM. He is retired now. He lives in a quiet neighborhood in Santa Barbara with his wife, an artist who paints lovely watercolors
So he volunteered for the Good Judgment Project. Once a day, for an hour or so, his dining room table became his forecasting center, where he opened his laptop, read the news, and tried to anticipate the fate of the world.
Doug’s accuracy was as impressive as his volume. At the end of the first year, Doug’s overall Brier score was 0.22, putting him in fifth spot among the 2,800 competitors in the Good Judgment Project. Remember that the Brier score measures the gap between forecasts and reality, where 2.0 is the result if your forecasts are the perfect opposite of reality, 0.5 is what you would get by random guessing, and 0 is the center of the bull’s-eye. So 0.22 is prima facie impressive, given the difficulty of the questions
In year 2, Doug joined a superforecaster team and did even better, with a final Brier score of 0.14, making him the best forecaster of the 2,800 GJP volunteers
By any mortal standard, Doug Lorch did astonishingly well. The only way to make Doug look unimpressive would be to compare him to godlike omniscience—a Brier score of 0—which would be like belittling Tiger Woods in his prime for failing to hit holes in one.
That made Doug Lorch a threat. This is a man with no applicable experience or education, and no access to classified information. The only payment he received was the $250 Amazon gift certificate that all volunteers got at the end of each season
Of course if Doug Lorch were a uniquely gifted oracle, he would pose little threat to the status quo. There is only so much forecasting one man can do. But Doug isn’t unique. We have already met Bill Flack, the retired Department of Agriculture employee from Nebraska. There were 58 others among the 2,800 volunteers who scored at the top of the charts in year 1. They were our first class of superforecasters
IARPA knew this could happen when it bankrolled the tournament, which is why a decision like that is so unusual. Testing may obviously be in the interest of an organization, but organizations consist of people who have interests of their own, most notably preserving and enhancing a comfortable status quo. Just as famous, well-remunerated pundits are loath to put their reputations at risk by having their accuracy publicly tested, so too are the power players inside organizations unlikely to try forecasting tournaments if it means putting their own judgment to the test. Bob in the CEO suite does not want to hear, much less let others hear, that Dave in the mail room is better at forecasting the company’s business trends than he is
And yet, IARPA did just that: it put the intelligence community’s mission ahead of the interests of the people inside the intelligence community—at least ahead of those insiders who didn’t want to rock the bureaucratic boat
RESISTING GRAVITY–BUT FOR HOW LONG?
The purpose of laying out an argument as I have done here is to convince the reader, but I hope you’re not convinced about these superforecasters—yet
Most things in life involve skill and luck, in varying proportions. The mix may be almost all luck and a little skill, or almost all skill and a little luck, or it could be one of a thousand other possible variations. That complexity makes it hard to figure out what to chalk up to skill and what to luck—a subject probed in depth by Michael Mauboussin, a global financial strategist, in his book The Success Equation. But as Mauboussin noted, there is an elegant rule of thumb that applies to athletes and CEOs, stock analysts and superforecasters. It involves regression to the mean.
Some statistical concepts are both easy to understand and easy to forget. Regression to the mean is one of them
So regression to the mean is an indispensable tool for testing the role of luck in performance
So how did superforecasters hold up across years? That’s the key question. And the answer is phenomenally well. For instance, in years 2 and 3 we saw the opposite of regression to the mean: the superforecasters as a whole, including Doug Lorch, actually increased their lead over all other forecasters.
But, as Wall Streeters well know, mortals can only defy the laws of statistical gravity for so long. The consistency in performance of superforecasters as a group should not mask the inevitable turnover among some top performers over time. The correlation between how well individuals do from one year to the next is about 0.65, modestly higher than that between the heights of fathers and sons. So we should still expect considerable regression toward the mean. And we observe just that: in any given year, roughly 30% of the individual superforecasters fall out of the top 2% the following year.
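To see what a year-to-year correlation of about 0.65 implies, here is the standard regression-to-the-mean arithmetic, a back-of-the-envelope illustration of mine that assumes roughly normal scores, not a figure from the study:

```python
from statistics import NormalDist

R = 0.65  # year-to-year correlation in performance, as reported above

def expected_next_year_z(this_year_z: float, r: float = R) -> float:
    """Expected standardized score next year, regressed toward the mean."""
    return r * this_year_z

z_now = 2.0                       # a forecaster two standard deviations above average
z_next = expected_next_year_z(z_now)
print(round(z_next, 2))           # 1.3

# Rough translation into percentiles (assumes roughly normal scores).
nd = NormalDist()
print(round(nd.cdf(z_now) * 100))   # ~98th percentile this year
print(round(nd.cdf(z_next) * 100))  # ~90th percentile expected next year
```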
All of this suggests two key conclusions. One, we should not treat the superstars of any given year as infallible, not even Doug Lorch. Luck plays a role and it is only to be expected that the superstars will occasionally have a bad year and produce ordinary results—just as superstar athletes occasionally look less than stellar
But more basically, and more hopefully, we can conclude that the superforecasters were not just lucky. Mostly, their results reflected skill
Which raises the big question: Why are superforecasters so good?
5 Supersmart?
So to understand the role of intelligence and knowledge in superforecaster success, we have to go a step further. We must compare the superforecasters’ intelligence and knowledge not only with that of other forecasters, but with that of the general population of the United States.
What did we find? Regular forecasters scored higher on intelligence and knowledge tests than about 70% of the population. Superforecasters did better, placing higher than about 80% of the population.
Note three things. First, the big jumps in intelligence and knowledge are from the public to the forecasters, not from forecasters to superforecasters. Second, although superforecasters are well above average, they did not score off-the-charts high and most fall well short of so-called genius territory, a problematic concept often arbitrarily defined as the top 1%, or an IQ of 135 and up
So it seems intelligence and knowledge help but they add little beyond a certain threshold—so superforecasting does not require a Harvard PhD and the ability to speak five languages. I find that conclusion satisfying because it squares nicely with the hunch Daniel Kahneman shared with me all those years ago, when I started this research—that high-powered subject-matter experts would not be much better forecasters than attentive readers of the New York Times. It should also satisfy the reader. If you’ve made it this far, you’ve probably got the right stuff.
But having the requisite intelligence and knowledge is not enough. Many clever and informed forecasters in the tournament fell far short of superforecaster accuracy. And history is replete with brilliant people who made forecasts that proved considerably less than prescient
“The foundations of our decision making were gravely flawed,” McNamara wrote in his autobiography. “We failed to analyze our assumptions critically, then or later.”
Ultimately, it’s not the crunching power that counts. It’s how you use it.
FERMI-IZE
The Italian American physicist Enrico Fermi—a central figure in the invention of the atomic bomb
What Fermi understood is that by breaking down the question, we can better separate the knowable and the unknowable. So guessing—pulling a number out of the black box—isn’t eliminated. But we have brought our guessing process out into the light of day where we can inspect it. And the net result tends to be a more accurate estimate than whatever number happened to pop out of the black box when we first read the question.
Of course, all this means we have to overcome our deep-rooted fear of looking dumb. Fermi-izing dares us to be wrong. In that spirit
Fermi was renowned for his estimates. With little or no information at his disposal, he would often do back-of-the-envelope calculations like this to come up with a number that subsequent measurement revealed to be impressively accurate. In many physics and engineering faculties, Fermi estimates or Fermi problems—strange tests like estimate the number of square inches of pizza consumed by all the students at the University of Maryland during one semester—are part of the curriculum
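In the spirit of the pizza question, a Fermi decomposition can be written out so that every guess is visible and arguable. All of the inputs below are made-up assumptions of mine; making them explicit is the point of the exercise.

```python
import math

# Fermi-style estimate: square inches of pizza eaten at a large university in a semester.
# Every input is an explicit guess you can argue with -- that's the point of Fermi-izing.
students = 40_000                        # assumed enrollment
weeks_in_semester = 15
pizza_meals_per_student_week = 1         # assumed: about one pizza meal a week
slices_per_meal = 2.5                    # assumed
sq_inches_per_slice = math.pi * 9**2 / 8 # one-eighth of an 18-inch pie, ~32 sq in

total_sq_inches = (students * weeks_in_semester *
                   pizza_meals_per_student_week * slices_per_meal *
                   sq_inches_per_slice)
print(f"{total_sq_inches:.2e} square inches")  # ~5e7, i.e., tens of millions
```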
I shared Levitin’s discussion of Fermi estimation with a group of superforecasters and it drew a chorus of approval. Sandy Sillman told me Fermi estimation was so critical to his job as a scientist working with atmospheric models that it became a part of my natural way of thinking.
That’s a huge advantage for a forecaster, as we shall see
This was only the beginning, but thanks to Bill’s Fermi-style analysis he had already avoided the bait-and-switch tiger trap and laid out a road map for subsequent analysis. He was off to a terrific start.
OUTSIDE FIRST
You may wonder why the outside view should come first. After all, you could dive into the inside view and draw conclusions, then turn to the outside view. Wouldn’t that work as well? Unfortunately, no, it probably wouldn’t. The reason is a basic psychological concept called anchoring.
When we make estimates, we tend to start with some number and adjust. The number we start with is called the anchor. It’s important because we typically underadjust, which means a bad anchor can easily produce a bad estimate. And it’s astonishingly easy to settle on a bad anchor. In classic experiments, Daniel Kahneman and Amos Tversky showed you could influence people’s judgment merely by exposing them to a number—any number, even one that is obviously meaningless, like one randomly selected by the spin of a wheel.10 So a forecaster who starts by diving into the inside view risks being swayed by a number that may have little or no meaning. But if she starts with the outside view, her analysis will begin with an anchor that is meaningful. And a better anchor is a distinct advantage
THE INSIDE VIEW
You’ve Fermi-ized the question, consulted the outside view, and now, finally, you can delve into the inside view
If you aimlessly examine one tree, then another, and another, you will quickly become lost in the forest. A good exploration of the inside view does not involve wandering around, soaking up any and all information and hoping that insight somehow emerges. It is targeted and purposeful: it is an investigation, not an amble.11
Again, Fermi-ization is key
Start with the first hypothesis
Each of these elements could then be researched—looking for evidence pro and con—to get a sense of how likely they are to be true, and therefore how likely the hypothesis is to be true. Then it’s on to the next hypothesis. And the next
This sounds like detective work because it is—or to be precise, it is detective work as real investigators do it, not the detectives on TV shows. It’s methodical, slow, and demanding. But it works far better than wandering aimlessly in a forest of information
THESIS, ANTITHESIS, SYNTHESIS
So you have an outside view and an inside view. Now they have to be merged, just as your brain merges the different perspectives of your two eyeballs into a single vision
There are many different ways to obtain new perspectives. What do other forecasters think? What outside and inside views have they come up with? What are experts saying? You can even train yourself to generate different perspectives
But writing his judgment down is also a way of distancing himself from it, so he can step back and scrutinize it: “It’s an auto-feedback thing,” he says. “Do I agree with this? Are there holes in this? Should I be looking for something else to fill this in? Would I be convinced by this if I were somebody else?”
That is a very smart move. Researchers have found that merely asking people to assume their initial judgment is wrong, to seriously consider why that might be, and then make another judgment, produces a second estimate which, when combined with the first, improves accuracy almost as much as getting a second estimate from another person.
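A small simulation suggests why this works: your second guess shares some error with your first, so averaging with yourself helps, though usually less than averaging with an independent person. The error sizes below are invented for illustration, and the size of the gain depends entirely on how correlated the two guesses are.

```python
import random

random.seed(0)
TRUTH = 100.0
N = 100_000

def rmse(errors):
    return (sum(e * e for e in errors) / len(errors)) ** 0.5

first_only, self_avg, two_people = [], [], []
for _ in range(N):
    shared_bias = random.gauss(0, 6)                     # error that persists across your guesses
    first  = TRUTH + shared_bias + random.gauss(0, 8)    # your first estimate
    second = TRUTH + shared_bias + random.gauss(0, 8)    # your own second estimate
    other  = TRUTH + random.gauss(0, 10)                 # an independent person's estimate
    first_only.append(first - TRUTH)
    self_avg.append((first + second) / 2 - TRUTH)
    two_people.append((first + other) / 2 - TRUTH)

print(round(rmse(first_only), 1))  # ~10.0  single estimate
print(round(rmse(self_avg), 1))    # ~8.2   averaging with yourself helps...
print(round(rmse(two_people), 1))  # ~7.1   ...a second person helps a bit more
```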
The billionaire financier George Soros exemplifies it. A key part of his success, he has often said, is his mental habit of stepping back from himself so he can judge his own thinking and offer a different perspective—to himself.
DRAGONFLY FORECASTING
Outside views, inside view, other outside and inside views, second opinions from yourself…that’s a lot of perspectives—and inevitably a lot of dissonant information. David Rogg’s neat synthesis of contrary outside and inside views made it look easy, but it’s not. And the difficulty only mounts as the number of perspectives being synthesized increases
The commentary that superforecasters post on GJP forums is rife with on the one hand/on the other dialectical banter. And superforecasters have more than two hands
That is dragonfly eye in operation. And yes, it is mentally demanding. Superforecasters pursue point-counterpoint discussions routinely, and they keep at them long past the point where most people would succumb to migraines. They are the polar opposite of the people who blurt out Ten cents! on the Cognitive Reflection Test—which is why, to the surprise of no one, they did superbly on the CRT. Forget the old advice to think twice. Superforecasters often think thrice—and sometimes they are just warming up to do a deeper-dive analysis
An element of personality is also likely involved. In personality psychology, one of the Big Five traits is openness to experience, which has various dimensions, including preference for variety and intellectual curiosity. It’s unmistakable in many superforecasters
But ultimately, as with intelligence, this has less to do with traits someone possesses and more to do with behavior. A brilliant puzzle solver may have the raw material for forecasting, but if he doesn’t also have an appetite for questioning basic, emotionally charged beliefs he will often be at a disadvantage relative to a less intelligent person who has a greater capacity for self-critical thinking. It’s not the raw crunching power you have that matters most. It’s what you do with it
Active open-mindedness (AOM) is a concept coined by the psychologist Jonathan Baron, who has an office next to mine at the University of Pennsylvania. Baron’s test for AOM asks whether you agree or disagree with statements like:
Quite predictably, superforecasters score highly on Baron’s test. But more important, superforecasters illustrate the concept. They walk the talk
For superforecasters, beliefs are hypotheses to be tested, not treasures to be guarded. It would be facile to reduce superforecasting to a bumper-sticker slogan, but if I had to, that would be it
6 Superquants?
But the fact that superforecasters are almost uniformly highly numerate people is not mere coincidence. Superior numeracy does help superforecasters, but not because it lets them tap into arcane math models that divine the future. The truth is simpler, subtler, and much more interesting.
WHERE’S OSAMA?
Is Osama bin Laden there? Yes or no? This is no-fucking-bullshit thinking. That’s how Maya thinks. And she’s right.
Or so it seems until you engage System 2 and think twice. In reality, Maya is unreasonable.
THE THIRD SETTING
Sitting in the White House’s legendary Situation Room, Obama listened as an array of CIA officers expressed their opinions on the identity of the man in the mysterious Pakistani compound. The CIA’s team leader told the president he was almost sure it was bin Laden. He put his confidence level at 95%, Mark Bowden wrote in The Finish: The Killing of Osama bin Laden, Bowden’s account of the decision making behind one of the most famous commando raids in history
“OK, this is a probability thing,” the president said in response, according to Bowden’s account.
Bowden reported that Obama told him in a later interview, “In this situation, what you started to get was probabilities that disguised uncertainty as opposed to actually providing you with useful information.” Bowden then wrote that Obama had no trouble admitting it to himself. If he acted on this, he was going to be taking a gamble, pure and simple. A big gamble.
One is that Obama means it literally. He heard an array of views and settled on 50% as the probability closest to the mark. If so, he’s misguided. The collective judgment is higher than that and based on Bowden’s account he has no reasonable basis for thinking 50% is more accurate. It’s a number plucked out of the air
If so, that may have been reasonable. Obama was an executive making a critical decision. He may well have felt that he would order the strike if there was any significant possibility that bin Laden was in the compound. It didn’t matter whether the probability was 90%, 70%, or perhaps even 30%. So rather than waste time trying to nail down a number, he cut the discussion short and moved on.
Of course I don’t know if that was Obama’s thinking. And there is another possible explanation—one that is much less defensible
PROBABILITY FOR THE STONE AGE
The deeply counterintuitive nature of probability explains why even very sophisticated people often make elementary mistakes
The confusion caused by the three-setting mental dial is pervasive
PROBABILITY FOR THE INFORMATION AGE
Scientists come at probability in a radically different way
They relish uncertainty, or at least accept it, because in scientific models of reality, certainty is illusory. Leon Panetta may not be a scientist, but he captured that insight perfectly when he said, “Nothing is one hundred percent.”
Robert Rubin is a probabilistic thinker. As a student at Harvard, he heard a lecture in which a philosophy professor argued there is no provable certainty and “it just clicked with everything I’d sort of thought,” he told me. It became the axiom that guided his thinking through twenty-six years at Goldman Sachs, as an adviser to President Bill Clinton, and as secretary of the Treasury. It’s in the title of his autobiography: In an Uncertain World. By rejecting certainty, Rubin made everything a matter of probability, and he wanted as much precision as possible. “One of the first times I met with him, he asked me if a bill would make it through Congress, and I said ‘absolutely,’” a young Treasury aide told the journalist Jacob Weisberg. “He didn’t like that one bit. Now I say the probability is 60%—and we can argue over whether it’s 59 or 60%.”
Probabilistic thinking and the two- and three-setting mental dials that come more naturally to us are like fish and birds—they are fundamentally different creatures. Each rests on different assumptions about reality and how to cope with it. And each can seem surpassingly strange to someone accustomed to thinking the other way.
UNCERTAIN SUPERS
Robert Rubin would not have to pound the table to make superforecasters understand that an 80% chance of something happening implies a 20% chance it won’t. Thanks in part to their superior numeracy, superforecasters, like scientists and mathematicians, tend to be probabilistic thinkers
An awareness of irreducible uncertainty is the core of probabilistic thinking, but it’s a tricky thing to measure. To do that, we took advantage of a distinction that philosophers have proposed between epistemic and aleatory uncertainty. Epistemic uncertainty is something you don’t know but is, at least in theory, knowable. If you wanted to predict the workings of a mystery machine, skilled engineers could, in theory, pry it open and figure it out. Mastering mechanisms is a prototypical clocklike forecasting challenge. Aleatory uncertainty is something you not only don’t know; it is unknowable. No matter how much you want to know whether it will rain in Philadelphia one year from now, no matter how many great meteorologists you consult, you can’t outguess the seasonal averages. You are dealing with an intractably cloud-like problem, with uncertainty that it is impossible, even in theory, to eliminate. Aleatory uncertainty ensures life will always have surprises, regardless of how carefully we plan. Superforecasters grasp this deep truth better than most. When they sense that a question is loaded with irreducible uncertainty—say, a currency-market question—they have learned to be cautious, keeping their initial estimates inside the shades-of-maybe zone between 35% and 65% and moving out tentatively. They know the cloudier the outlook, the harder it is to beat that dart-throwing chimpanzee
That’s a big improvement on a two- or three-setting dial—but it still falls short of what the most committed superforecasters can achieve on many questions.
The reward, remember, is a clearer perception of the future. And that’s invaluable in the ass-kicking contest of life
BUT WHAT DOES IT ALL MEAN?
But as psychologically beneficial as this thinking may be, it sits uneasily with a scientific worldview. Science doesn’t tackle why questions about the purpose of life. It sticks to how questions that focus on causation and probabilities. Snow building up on the side of a mountain may slip and start an avalanche, or it may not. Until it happens, or it doesn’t, it could go either way. It is not predetermined by God or fate or anything else. It is not meant to be. It has no meaning. Maybe suggests that, contra Einstein, God does play dice with the cosmos. Thus, probabilistic thinking and divine-order thinking are in tension. Like oil and water, chance and fate do not mix. And to the extent that we allow our thoughts to move in the direction of fate, we undermine our ability to think probabilistically.
Most people tend to prefer fate. With the psychologist Laura Kray and other colleagues, I tested the effect of counterfactual thinking, which is thinking about how something might have turned out differently than it actually did
For both the superforecasters and the regulars, we also compared individual fate scores with Brier scores and found a significant correlation—meaning the more a forecaster inclined toward it-was-meant-to-happen thinking, the less accurate her forecasts were. Or, put more positively, the more a forecaster embraced probabilistic thinking, the more accurate she was
So finding meaning in events is positively correlated with wellbeing but negatively correlated with foresight. That sets up a depressing possibility: Is misery the price of accuracy?
I don’t know. But this book is not about how to be happy. It’s about how to be accurate, and the superforecasters show that probabilistic thinking is essential for that. I’ll leave the existential issues to others.
7 Supernewsjunkies?
Superforecasting isn’t a paint-by-numbers method but superforecasters often tackle questions in a roughly similar way—one that any of us can follow: Unpack the question into components. Distinguish as sharply as you can between the known and unknown and leave no assumptions unscrutinized. Adopt the outside view and put the problem into a comparative perspective that downplays its uniqueness and treats it as a special case of a wider class of phenomena. Then adopt the inside view that plays up the uniqueness of the problem. Also explore the similarities and differences between your views and those of others—and pay special attention to prediction markets and other methods of extracting wisdom from crowds. Synthesize all these different views into a single vision as acute as that of a dragonfly. Finally, express your judgment as precisely as you can, using a finely grained scale of probability.
Done well, this process is as demanding as it sounds, taking a lot of time and mental energy. And yet it is literally just the beginning
Forecasts aren’t like lottery tickets that you buy and file away until the big draw. They are judgments that are based on available information and that should be updated in light of changing information. If new polls show a candidate has surged into a comfortable lead, you should boost the probability that the candidate will win
I didn’t like his attitude but I got the point. Superforecasters do monitor the news carefully and factor it into their forecasts, which is bound to give them a big advantage over the less attentive. If that’s the decisive factor, then superforecasters’ success would tell us nothing more than it helps to pay attention and keep your forecast up to date—which is about as enlightening as being told that when polls show a candidate surging into a comfortable lead he is more likely to win
But that’s not the whole story. For one thing, superforecasters’ initial forecasts were at least 50% more accurate than those of regular forecasters. Even if the tournament had asked for only one forecast, and did not permit updating, superforecasters would have won decisively.
More important, it is a huge mistake to belittle belief updating. It is not about mindlessly adjusting forecasts in response to whatever is on CNN. Good updating requires the same skills used in making the initial forecast and is often just as demanding. It can even be more challenging
THE OVER-UNDER
So there are two dangers a forecaster faces after making the initial call. One is not giving enough weight to new information. That’s underreaction. The other danger is overreacting to new information, seeing it as more meaningful than it is, and adjusting a forecast too radically
Both under- and overreaction can diminish accuracy. Both can also, in extreme cases, destroy a perfectly good forecast
UNDER
Underreaction can happen for any number of reasons, some of them prosaic
Commitment can come in many forms, but a useful way to think of it is to visualize the children’s game Jenga, which starts with building blocks stacked one on top of another to form a little tower. Players take turns removing building blocks until someone removes the block that topples the tower. Our beliefs about ourselves and the world are built on each other in a Jenga-like fashion
When a block is at the very base of the tower, there’s no way to remove it without bringing everything crashing down
This suggests that superforecasters may have a surprising advantage: they’re not experts or professionals, so they have little ego invested in each forecast
OVER
Psychologists call this the dilution effect, and given that stereotypes are themselves a source of bias we might say that diluting them is all to the good. Yes and no. Yes, it is possible to fight fire with fire, and bias with bias, but the dilution effect remains a bias. Remember what’s going on here. People base their estimate on what they think is a useful tidbit of information. Then they encounter clearly irrelevant information—meaningless noise—which they indisputably should ignore. But they don’t. They sway in the wind, at the mercy of the next random gust of irrelevant information.
CAPTAIN MINTO
It may look as though Captain Minto is sailing straight for the Charybdis of overreaction. But I haven’t yet mentioned the magnitude of his constant course corrections. In almost every case they are small. And that makes a big difference
A forecaster who doesn’t adjust her views in light of new information won’t capture the value of that information, while a forecaster who is so impressed by the new information that he bases his forecast entirely on it will lose the value of the old information that underpinned his prior forecast. But the forecaster who carefully balances old and new captures the value in both—and puts it into her new forecast. The best way to do that is by updating often but bit by bit
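Here is a minimal sketch of that idea in Python, assuming a simple weighted blend of the old forecast and the estimate implied by the new information—an illustration of incremental updating, not the superforecasters’ actual procedure:

# Minimal sketch (not the superforecasters' actual method): nudge a probability
# a little at a time toward what the new information alone would suggest.
# The weight on new information is an assumed tuning knob.
def update_forecast(old_prob, new_info_prob, weight_on_new=0.2):
    """Move the forecast part of the way toward the estimate implied by new information."""
    return (1 - weight_on_new) * old_prob + weight_on_new * new_info_prob

forecast = 0.60                      # prior estimate that the candidate wins
news = [0.75, 0.70, 0.80, 0.40]      # estimates implied by successive pieces of news
for implied in news:
    forecast = update_forecast(forecast, implied)
    print(f"updated forecast: {forecast:.2f}")

Set the weight too close to zero and you underreact; set it too close to one and you overreact—the two dangers described above.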
8 Perpetual Beta
The psychologist Carol Dweck would say Simpson has a growth mindset, which Dweck defines as believing that your abilities are largely the product of effort—that you can grow to the extent that you are willing to work hard and learn.2 Some people might think that’s so obviously true it scarcely needs to be said. But as Dweck’s research has shown, the growth mindset is far from universal. Many people have what she calls a fixed mindset—the belief that we are who we are, and abilities can only be revealed, not created and developed. People with the fixed mindset say things like I’m bad at math and see that as an immutable feature of who they are, like being left-handed or female or tall. This has serious consequences. The person who believes he is bad at math, and always will be, won’t try hard to improve, because that would be pointless, and if he is compelled to study math—as we all are in school—he will take any setback as further proof that his limits have been revealed and he should stop wasting his time as soon as possible. Whatever potential he had for improvement will never be realized. Thus, the belief I am bad at math becomes self-fulfilling
Even when the fixed-minded try, they don’t get as much from the experience as those who believe they can grow. In one experiment, Dweck scanned the brains of volunteers as they answered hard questions, then were told whether their answers were right or wrong and given information that could help them improve
To be a top-flight forecaster, a growth mindset is essential. The best illustration is the man who is reputed to have said—but didn’t—When the facts change, I change my mind.
CONSISTENTLY INCONSISTENT
Famous today only for his work on macroeconomic theory, one of John Maynard Keynes’s many remarkable accomplishments was his success as an investor
Keynes was breathtakingly intelligent and energetic, which certainly contributed to his success, but more than that he was an insatiably curious man who loved to collect new ideas—a habit that sometimes required him to change his mind. He did so ungrudgingly.
For Keynes, failure was an opportunity to learn—to identify mistakes, spot new alternatives, and try again
The one consistent belief of the consistently inconsistent John Maynard Keynes was that he could do better. Failure did not mean he had reached the limits of his ability. It meant he had to think hard and give it another go. Try, fail, analyze, adjust, try again: Keynes cycled through those steps ceaselessly
We learn new skills by doing. We improve those skills by doing more. These fundamental facts are true of even the most demanding skills. Modern fighter jets are enormously complex flying computers but classroom instruction isn’t enough to produce a qualified pilot. Not even time in advanced flight simulators will do. Pilots need hours in the air, the more the better. The same is true of surgeons, bankers, and business executives.
TRY
To demonstrate the limits of learning from lectures, the great philosopher and teacher Michael Polanyi wrote a detailed explanation of the physics of riding a bicycle: The rule observed by the cyclist is this. When he starts falling to the right he turns the handlebars to the right, so that the course of the bicycle is deflected along a curve towards the right. This results in a centrifugal force pushing the cyclist to the left and offsets the gravitational force dragging him down to the right. It continues in that vein and closes: A simple analysis shows that for a given angle of unbalance the curvature of each winding is inversely proportional to the square of the speed at which the cyclist is proceeding. It is hard to imagine a more precise description. But does this tell us exactly how to ride a bicycle? Polanyi asked. No. You obviously cannot adjust the curvature of your bicycle’s path in proportion to the ratio of your unbalance over the square of your speed; and if you could you would fall off the machine, for there are a number of other factors to be taken into account in practice which are left out in the formulation of this rule.
The knowledge required to ride a bicycle can’t be fully captured in words and conveyed to others. We need tacit knowledge, the sort we only get from bruising experience. To learn to ride a bicycle, we must try to ride one. It goes badly at first. You fall to one side, you fall to the other. But keep at it and with practice it becomes effortless—although if you then had to explain to others how to stay upright, so they could skip the ordeal you just went through, you would succeed no better than Polanyi
That is blindingly obvious. It should be equally obvious that learning to forecast requires trying to forecast. Reading books on forecasting is no substitute for the experience of the real thing.
FAIL
But not all practice improves skill. It needs to be informed practice. You need to know which mistakes to look out for—and which best practices really are best. So don’t burn your books. As noted earlier, randomized controlled experiments have shown that mastering the contents of just one tiny booklet, our training guidelines (see the appendix), can improve your accuracy by roughly 10%
That is essential. To learn from failure, we must know when we fail. The baby who flops backward does. So does the boy who skins his knee when he falls off the bike. And the accountant who puts an easy putt into a sand trap. And because they know, they can think about what went wrong, adjust, and try again
Unfortunately, most forecasters do not get the high-quality feedback that helps meteorologists and bridge players improve. There are two main reasons why
Ambiguous language is a big one. As we saw in chapter 3, vague terms like probably and likely make it impossible to judge forecasts. When a forecaster says something could or might or may happen, she could or might or may be saying almost anything. The same is true of countless other terms—like Steve Ballmer’s reference to significant market share—that may sound precise but on close inspection prove as fuzzy as fog
The second big barrier to feedback is time lag. When forecasts span months or years, the wait for a result allows the flaws of memory to creep in. You know how you feel now about the future. But as events unfold, will you be able to recall your forecast accurately? There is a good chance you won’t
ANALYZE AND ADJUST
Whenever a question closes, it’s obvious that superforecasters—in sharp contrast to Carol Dweck’s fixed-mindset study subjects—are as keen to know how they can do better as they are to know how they did
Often, postmortems are as careful and self-critical as the thinking that goes into making the initial forecast
GRIT
The analogy between forecasting and bicycling is pretty good but, as with all analogies, the fit isn’t perfect. With bike riding, the try, fail, analyze, adjust, and try again cycle typically takes seconds. With forecasting, it can take months or years. Plus there is the bigger role of chance in forecasting. Cyclists who follow best cycling practices can usually expect excellent outcomes but forecasters should be more tentative. Following best practices improves their odds of winning but less reliably so than in games where chance plays smaller roles.15 Even with a growth mindset, the forecaster who wants to improve has to have a lot of what my colleague Angela Duckworth dubbed grit.
Elizabeth Sloane has plenty of grit. Diagnosed with brain cancer, Elizabeth endured chemotherapy, a failed stem-cell transplant, recurrence, and two more years of chemo. But she never relented
Grit is passionate perseverance of long-term goals, even in the face of frustration and failure. Married with a growth mindset, it is a potent force for personal progress.
There is always more trying, more failing, more analyzing, more adjusting, and trying again. Computer programmers have a wonderful term for a program that is not intended to be released in a final version but will instead be used, analyzed, and improved without end. It is perpetual beta.
Superforecasters are perpetual beta.
PULLING IT ALL TOGETHER
We have learned a lot about superforecasters, from their lives to their test scores to their work habits. Taking stock, we can now sketch a rough composite portrait of the modal superforecaster
In philosophic outlook, they tend to be:
In their abilities and thinking styles, they tend to be:
In their methods of forecasting they tend to be:
In their work ethic, they tend to have:
I paint with a broad brush here. Not every attribute is equally important. The strongest predictor of rising into the ranks of superforecasters is perpetual beta, the degree to which one is committed to belief updating and self-improvement. It is roughly three times as powerful a predictor as its closest rival, intelligence. To paraphrase Thomas Edison, superforecasting appears to be roughly 75% perspiration, 25% inspiration
And not every superforecaster has every attribute. There are many paths to success and many ways to compensate for a deficit in one area with strength in another. The predictive power of perpetual beta does suggest, though, that no matter how high one’s IQ it is difficult to compensate for lack of dedication to the personal project of growing one’s synapses.
All that said, there is another element that is missing entirely from the sketch: other people. In our private lives and our workplaces, we seldom make judgments about the future entirely in isolation. We are a social species. We decide together. This raises an important question.
What happens when superforecasters work in groups?
9 Superteams
TO TEAM OR NOT TO TEAM?
In the IARPA tournament, our goal was accuracy. Would putting forecasters on teams help? We saw strong arguments for both yes and no. On the negative side, the research literature—as well as my decades of experience on university committees—suggested that teams might foster cognitive loafing. Why labor to master a complex problem when others will do the heavy lifting? When this attitude is widespread it can sink a team. Worse, forecasters can become too friendly, letting groupthink set in. These two tendencies can reinforce each other. We all agree, so our work is done, right? And unanimity within a group is a powerful force. If that agreement is ill-founded, the group slips into self-righteous complacency.
But groups also let people share information and perspectives. That’s good. It helps make dragonfly eye work, and aggregation is critical to accuracy. Of course aggregation can only do its magic when people form judgments independently, like the fairgoers guessing the weight of the ox. The independence of judgments ensures that errors are more or less random, so they cancel each other out. When people gather and discuss in a group, independence of thought and expression can be lost
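A toy simulation makes the point concrete. In the sketch below (Python, with made-up numbers), fifty guessers estimate a quantity; when their errors are independent they largely cancel in the average, but when a shared bias creeps in—as it can when a group talks itself into agreement—averaging no longer helps much:

# Toy simulation of the point above: independent errors tend to cancel when averaged,
# shared (correlated) errors do not. All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
true_value = 1198.0          # e.g., the ox's true weight
n_guessers, n_trials = 50, 2000

indep_err, shared_err = [], []
for _ in range(n_trials):
    # Independent judgments: each guesser's error is their own.
    independent = true_value + rng.normal(0, 100, n_guessers)
    indep_err.append(abs(independent.mean() - true_value))

    # Group discussion gone wrong: everyone shares a common bias plus small private noise.
    common_bias = rng.normal(0, 100)
    groupthink = true_value + common_bias + rng.normal(0, 20, n_guessers)
    shared_err.append(abs(groupthink.mean() - true_value))

print(f"avg error of the crowd, independent judgments: {np.mean(indep_err):.1f}")
print(f"avg error of the crowd, correlated judgments:  {np.mean(shared_err):.1f}")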
But loss of independence isn’t inevitable in a group, as JFK’s team showed during the Cuban missile crisis. If forecasters can keep questioning themselves and their teammates, and welcome vigorous debate, the group can become more than the sum of its parts.
There were two reasons to find out. First, in the real world, people seldom make important forecasts without discussing them with others, so getting a better understanding of forecasting in the real world required a better understanding of forecasting in groups. The other reason? Curiosity. We didn’t know the answer and we wanted to, so we took Archie Cochrane’s advice and ran an experiment.
We also gave teams a primer on teamwork based on insights gleaned from research in group dynamics. On the one hand, we warned, groupthink is a danger. Be cooperative but not deferential. Consensus is not always good; disagreement not always bad. If you do happen to agree, don’t take that agreement—in itself—as proof that you are right. Never stop doubting. Pointed questions are as essential to a team as vitamins are to a human body
On the other hand, the opposite of groupthink—rancor and dysfunction—is also a danger. Team members must disagree without being disagreeable, we advised. Practice constructive confrontation, to use the phrase of Andy Grove, the former CEO of Intel
SUPERTEAMS
At the end of the year, the results were unequivocal: on average, teams were 23% more accurate than individuals
The results speak for themselves. On average, when a forecaster did well enough in year 1 to become a superforecaster, and was put on a superforecaster team in year 2, that person became 50% more accurate. An analysis in year 3 got the same result. Given that these were collections of strangers tenuously connected in cyberspace, we found that result startling
Even more surprising was how well superteams did against prediction markets
The results were clear-cut each year. Teams of ordinary forecasters beat the wisdom of the crowd by about 10%. Prediction markets beat ordinary teams by about 20%. And superteams beat prediction markets by 15% to 30%.
A team like that should promote the sort of actively open-minded thinking that is so critical to accurate forecasting, as we saw in chapter 5. So just as we surveyed individuals to test their active open-mindedness (AOM), we surveyed teams to probe their attitudes toward the group and patterns of interaction within the group—that is, we tested the team’s AOM. As expected, we found a correlation between a team’s AOM and its accuracy. Little surprise there. But what makes a team more or less actively open-minded? You might think it’s the individuals on the team. Put high-AOM people in a team and you’ll get a high-AOM team; put lower-AOM people in a team and you’ll get a lower-AOM team. Not so, as it turns out. Teams were not merely the sum of their parts. How the group thinks collectively is an emergent property of the group itself, a property of communication patterns among group members, not just the thought processes inside each member. A group of open-minded people who don’t care about one another will be less than the sum of its open-minded parts. A group of opinionated people who engage one another in pursuit of the truth will be more than the sum of its opinionated parts.
How much the wisdom-of-the-crowd estimate could be improved depends on the diversity of his team. The more diverse his team, the greater the chance that some advisers will possess scraps of information that others don’t. And since these scraps mostly point toward ‘it’s bin Laden,’ if all the advisers were given all the scraps they don’t have, they would individually raise their estimate. And that would boost the wisdom of the crowd figure—maybe to 80% or 85%.
10 The Leader’s Dilemma
Leaders must decide, and to do that they must make and use forecasts. The more accurate those forecasts are, the better, so the lessons of superforecasting should be of intense interest to them. But leaders must also act and achieve their goals. In a word, they must lead. And anyone who has led people may have doubts about how useful the lessons of superforecasting really are for leaders
Ask people to list the qualities an effective leader must have, or consult the cottage industry devoted to leadership coaching, or examine rigorous research on the subject, and you will find near-universal agreement on three basic points. Confidence will be on everyone’s list. Leaders must be reasonably confident, and instill confidence in those they lead, because nothing can be accomplished without the belief that it can be. Decisiveness is another essential attribute. Leaders can’t ruminate endlessly. They need to size up the situation, make a decision, and move on. And leaders must deliver a vision—the goal that everyone strives together to achieve
And consider how the superteams operated. They were given guidance on how to form an effective team, but nothing was imposed. No hierarchy, no direction, no formal leadership. These little anarchist cells may work as forums for the endless consideration and reconsideration superforecasters like to engage in but they’re hardly organizations that can pull together and get things done. That takes structure—and a leader in charge
This looks like a serious dilemma. Leaders must be forecasters and leaders but it seems that what is required to succeed at one role may undermine the other
Fortunately, the contradiction between being a superforecaster and a superleader is more apparent than real. In fact, the superforecaster model can help make good leaders superb and the organizations they lead smart, adaptable, and effective
The key is an approach to leadership and organization first articulated by a nineteenth-century Prussian general, perfected by the German army of World War II, made foundational doctrine by the modern American military, and deployed by many successful corporations today. You might even find it at your neighborhood Walmart.
MOLTKE’S LEGACY
In war, everything is uncertain, wrote Helmuth von Moltke.1 In the late nineteenth century, Moltke was famous the world over after he led Prussian forces to victory against Denmark in 1864, Austria in 1866, and France in 1871—victories that culminated in the unification of Germany. His writings on war—which were themselves influenced by the great theorist Carl von Clausewitz—profoundly shaped the German military that fought the two world wars. But Moltke was no Napoleon. He never saw himself as a visionary leader directing his army like chess pieces. His approach to leadership and organization was entirely different
The Prussian military had long appreciated uncertainty—they had invented board games with dice to introduce the element of chance missing from games like chess—but everything is uncertain was for Moltke an axiom whose implications needed to be teased out
The acceptance of criticism went beyond the classroom, and under extraordinary circumstances more than criticism was tolerated
So a leader must possess unwavering determination to overcome obstacles and accomplish his goals—while remaining open to the possibility that he may have to throw out the plan and try something else. That’s a lot to ask of anyone, but the German military saw it as the essence of the leader’s role. Once a course of action has been initiated it must not be abandoned without overriding reason, the Wehrmacht manual stated. In the changing situations of combat, however, inflexibly clinging to a course of action can lead to failure. The art of leadership consists of the timely recognition of circumstances and of the moment when a new decision is required.
What ties all of this together—from nothing is certain to unwavering determination—is the command principle of Auftragstaktik. Usually translated today as mission command, the basic idea is simple. War cannot be conducted from the green table, Moltke wrote, using an expression that referred to top commanders at headquarters. Frequent and rapid decisions can be shaped only on the spot according to estimates of local conditions. Decision-making power must be pushed down the hierarchy so that those on the ground—the first to encounter surprises on the evolving battlefield—can respond quickly. Of course those on the ground don’t see the bigger picture. If they made strategic decisions the army would lose coherence and become a collection of tiny units, each seeking its own ends
How skillfully leaders perform this balancing act determines how successfully their organizations can cultivate superteams that can replicate the balancing act down the chain of command. And this is not something that one isolated leader can do on his own. It requires a wider willingness to hear unwelcome words from others—and the creation of a culture in which people feel comfortable speaking such words. What was done to the young Dwight Eisenhower was a serious mistake, Petraeus says. You have to preserve and promote the out-of-the-box thinkers, the iconoclasts.
AUFTRAGSTAKTIK IN BUSINESS
Armies are unusual organizations, but bosses everywhere feel the tension between control and innovation, which is why Moltke’s spirit can be found in organizations that have nothing to do with bullets and bombs
We let our people know what we want them to accomplish. But—and it is a very big ‘but’—we do not tell them how to achieve those goals.25 That is a near-perfect summary of mission command. The speaker is William Coyne, who was senior vice president of research and development at 3M, the famously innovative manufacturing conglomerate
Have backbone; disagree and commit is one of Jeff Bezos’s fourteen leadership principles drilled into every new employee at Amazon. It continues: Leaders are obligated to respectfully challenge decisions when they disagree, even when doing so is uncomfortable or exhausting. Leaders have conviction and are tenacious. They do not compromise for the sake of social cohesion. Once a decision is determined, they commit wholly.26 The language is a little blunt for Moltke, but it wouldn’t look out of place in the Wehrmacht command manual or in my conversation with David Petraeus.
A PECULIAR TYPE OF HUMILITY
But there’s still the vexing question of humility
No one ever called Winston Churchill or Steve Jobs humble. Same with David Petraeus. From West Point cadet onward, Petraeus believed he had the right stuff to become a top general
So how do we square all that with the apparently critical need for a forecaster to be humble? The answer lies in something Annie Duke told me
We met Duke earlier. She believes she is one of the world’s best poker players, which is no small claim, but she has a long record of accomplishments—including a World Series of Poker championship—that suggest her confidence is reasonable. But Duke also knows there is danger in confidence. When making a decision, a smart person like Duke is always tempted by a simple cognitive shortcut: I know the answer. I don’t need to think long and hard about it. I am a very successful person with good judgment. The fact that I believe my judgment is correct proves it is. Make decisions like that and you are only looking at reality from the tip of your nose. That’s a dangerous way to make decisions, no matter who you are. To avoid this trap, Duke carefully distinguishes what she is confident about—and what she’s not
You have to have tremendous humility in the face of the game because the game is extremely complex, you won’t solve it, it’s not like tic-tac-toe or checkers, she says. It’s very hard to master and if you’re not learning all the time, you will fail. That being said, humility in the face of the game is extremely different than humility in the face of your opponents. Duke feels confident that she can compete with most people she sits down with at a poker table. But that doesn’t mean I think I’ve mastered this game.
The humility required for good judgment is not self-doubt—the sense that you are untalented, unintelligent, or unworthy. It is intellectual humility. It is a recognition that reality is profoundly complex, that seeing things clearly is a constant struggle, when it can be done at all, and that human judgment must therefore be riddled with mistakes. This is true for fools and geniuses alike. So it’s quite possible to think highly of yourself and be intellectually humble. In fact, this combination can be wonderfully fruitful. Intellectual humility compels the careful reflection necessary for good judgment; confidence in one’s abilities inspires determined action.
POSTSCRIPT
There is a dangling question: Did I have to choose the Wehrmacht? Other organizations illustrate how thinking like a superforecaster can improve leader performance. So why make the point with an army that served the most evil cause in modern history?
Understanding what worked in the Wehrmacht requires engaging in the toughest of all forms of perspective taking: acknowledging that something we despise possesses impressive qualities. Forecasters who can’t cope with the dissonance risk making the most serious possible forecasting error in a conflict: underestimating your opponent.
It’s a challenge. Even superforecasters who excel at aggressive self-criticism sometimes conflate facts and values
11 Are They Really So Super?
So in the summer of 2014, when it was clear that superforecasters were not merely superlucky, Kahneman cut to the chase: Do you see them as different kinds of people, or as people who do different kinds of things?
My answer was, A bit of both. They score higher than average on measures of intelligence and open-mindedness, although they are not off the charts.
My sense is that some superforecasters are so well practiced in System 2 corrections—such as stepping back to take the outside view—that these techniques have become habitual. In effect, they are now part of their System 1. That may sound bizarre but it’s not an unusual process
So how long can superforecasters defy the laws of psychological gravity? The answer to that depends on how heavy their cognitive loads are. Turning a self-conscious System 2 correction into an unconscious System 1 operation may lighten the load considerably. So may the software tools some superforecasters have developed—like Doug Lorch’s news-source selection program designed to correct the System 1 bias in favor of the like-minded.
But still, superforecasting remains hard work. Those who do it well appreciate the fragility of their success. They expect to stumble. And when they do, they will get up, try to draw the right lessons, and keep forecasting.
But another friend and colleague is not as impressed by the superforecasters as I am. Indeed, he suspects this whole research program is misguided.
ENTER THE BLACK SWAN
Nassim Taleb is a former Wall Street trader whose thinking about uncertainty and probability has produced three enormously influential books and turned black swan into a common English phrase
The black swan is therefore a brilliant metaphor for an event so far outside experience we can’t even imagine it until it happens.
If forecasters make hundreds of forecasts that look out only a few months, we will soon have enough data to judge how well calibrated they are. But by definition, highly improbable events almost never happen. If we take highly improbable to mean a 1% or 0.1% or 0.0001% chance of an event, it may take decades or centuries or millennia to pile up enough data. And if these events have to be not only highly improbable but also impactful, the difficulty multiplies. So the first-generation IARPA tournament tells us nothing about how good superforecasters are at spotting gray or black swans. They may be as clueless as anyone else—or astonishingly adept. We don’t know, and shouldn’t fool ourselves that we do.
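A back-of-the-envelope calculation shows why. The sketch below (Python, purely illustrative) counts how many such events a perfectly calibrated forecaster should expect to see come true among a given number of forecasts—at 0.0001%, even a hundred thousand forecasts yield essentially none, so there is nothing to check calibration against:

# Back-of-the-envelope illustration of the data problem described above.
# For a well-calibrated forecaster who assigns probability p to each of n independent
# events, the expected number that actually occur is n * p. With tiny p, you need an
# enormous n (and therefore a very long wait) before calibration can be judged at all.
for p in (0.01, 0.001, 0.000001):
    for n in (100, 1_000, 100_000):
        expected = n * p
        print(f"p = {p:>8}: among {n:>7} forecasts, expect about {expected:g} occurrences")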
There is another, big reason not to dismiss forecasting tournaments. What elevates a mere surprise to black swan status are the event’s consequences. But consequences take time to develop. On July 14, 1789, a mob took control of a prison in Paris known as the Bastille, but we mean something much bigger than what happened that day when we refer today to the storming of the Bastille. We mean the actual event plus the events it triggered that spiraled into the French Revolution
We may have no evidence that superforecasters can foresee events like those of September 11, 2001, but we do have a warehouse of evidence that they can forecast questions like: Will the United States threaten military action if the Taliban don’t hand over Osama bin Laden? Will the Taliban comply? Will bin Laden flee Afghanistan prior to the invasion? To the extent that such forecasts can anticipate the consequences of events like 9/11, and these consequences make a black swan what it is, we can forecast black swans.
ALL THAT SAID…
I see Kahneman’s and Taleb’s critiques as the strongest challenges to the notion of superforecasting. We are far enough apart empirically and close enough philosophically to make communication, even collaboration, possible.
Kahneman, Taleb, and I agree on that much. But I also believe that humility should not obscure the fact that people can, with considerable effort, make accurate forecasts about at least some developments that really do matter. To be sure, in the big scheme of things, human foresight is puny, but it is nothing to sniff at when you live on that puny human scale
12 What’s Next?
Tournaments help researchers learn what improves forecasting and help forecasters sharpen their skills with practice and feedback. Tournaments could help society, too, by providing tools for structuring our thinking about what is likely to happen if we venture down one policy path or another. Vague expectations about indefinite futures are not helpful. Fuzzy thinking can never be proven wrong. And only when we are proven wrong so clearly that we can no longer deny it to ourselves will we adjust our mental models of the world—producing a clearer picture of reality. Forecast, measure, revise: it is the surest path to seeing better
I size up the strongest force resisting change, and why, despite it, the status quo may be in for a shock. Then I look at something I can control—my future research. Whether it will be conducted amid a tumultuous change, as I hope, or a stagnant status quo, as I fear, will be decided by the people whom political scientists call the attentive public. I’m modestly optimistic. But readers can make the final forecast.
APPENDIX
Ten Commandments for Aspiring Superforecasters
Focus on questions where your hard work is likely to pay off. Don’t waste time either on easy clocklike questions (where simple rules of thumb can get you close to the right answer) or on impenetrable cloud-like questions (where even fancy statistical models can’t beat the dart-throwing chimp). Concentrate on questions in the Goldilocks zone of difficulty, where effort pays off the most
Superforecasters see Fermi-izing as part of the job. How else could they generate quantitative answers to seemingly impossible-to-quantify questions about Arafat’s autopsy, bird-flu epidemics, oil prices, Boko Haram, the Battle of Aleppo, and bond-yield spreads?
We find this Fermi-izing spirit at work even in the quest for love, the ultimate unquantifiable
Superforecasters know that there is nothing new under the sun. Nothing is 100% unique. Language purists be damned: uniqueness is a matter of degree. So superforecasters conduct creative searches for comparison classes even for seemingly unique events, such as the outcome of a hunt for a high-profile terrorist (Joseph Kony) or the standoff between a new socialist government in Athens and Greece’s creditors. Superforecasters are in the habit of posing the outside-view question: How often do things of this sort happen in situations of this sort?
Belief updating is to good forecasting as brushing and flossing are to good dental hygiene. It can be boring, occasionally uncomfortable, but it pays off in the long term. That said, don’t suppose that belief updating is always easy because it sometimes is
For every good policy argument, there is typically a counterargument that is at least worth acknowledging. For instance, if you are a devout dove who believes that threatening military action never brings peace, be open to the possibility that you might be wrong about Iran. And the same advice applies if you are a devout hawk who believes that soft appeasement policies never pay off. Each side should list, in advance, the signs that would nudge them toward the other.
Few things are either certain or impossible. And ‘maybe’ isn’t all that informative. So your uncertainty dial needs more than three settings. Nuance matters. The more degrees of uncertainty you can distinguish, the better a forecaster you are likely to be
Superforecasters understand the risks both of rushing to judgment and of dawdling too long near maybe. They routinely manage the trade-off between the need to take decisive stands (who wants to listen to a waffler?) and the need to qualify their stands (who wants to listen to a blowhard?). They realize that long-term accuracy requires getting good scores on both calibration and resolution—which requires moving beyond blame-game ping-pong. It is not enough just to avoid the most recent mistake. They have to find creative ways to tamp down both types of forecasting errors—misses and false alarms—to the degree a fickle world permits such uncontroversial improvements in accuracy.
Don’t try to justify or excuse your failures. Own them! Conduct unflinching postmortems: Where exactly did I go wrong? And remember that although the more common error is to learn too little from failure and to overlook flaws in your basic assumptions, it is also possible to learn too much (you may have been basically on the right track but made a minor technical mistake that had big ramifications). Also don’t forget to do postmortems on your successes too
Master the fine arts of team management, especially perspective taking (understanding the arguments of the other side so well that you can reproduce them to the other’s satisfaction), precision questioning (helping others to clarify their arguments so they are not misunderstood), and constructive confrontation (learning to disagree without being disagreeable). Wise leaders know how fine the line can be between a helpful suggestion and micromanagerial meddling or between a rigid group and a decisive one or between a scatterbrained group and an open-minded one
Implementing each commandment requires balancing opposing errors. Just as you can’t learn to ride a bicycle by reading a physics textbook, you can’t become a superforecaster by reading training manuals. Learning requires doing, with good feedback that leaves no ambiguity about whether you are succeeding—I’m rolling along smoothly!—or whether you are failing—crash!
It is impossible to lay down binding rules, Helmuth von Moltke warned, because two cases will never be exactly the same.5 As in war, so in all things. Guidelines are the best we can do in a world where nothing is certain or exactly repeatable. Superforecasting requires constant mindfulness, even when—perhaps especially when—you are dutifully trying to follow these commandments