As we sift through more information than ever before, I would argue it’s becoming easier to commit an error in thinking known as the Texas sharpshooter fallacy. Imagine the stereotype of a Texan cowboy who randomly shoots up the side of a barn, goes up to it and traces a bullseye around the tightest cluster of bullet holes. It’s easy to declare yourself Texas’ sharpest shooter when you draw the bullseye after the fact.
This fallacious reasoning can be seen in communities where people start obsessing over numbers. If we hear that “11:11” is a number that carries special significance, we may begin to obsess over it and to notice it everywhere: it’s on our alarm clocks, our watches, on the calendar once a year, in the error code when a particular piece of software crashes, in someone’s Twitter handle, on the news. We have not hit the bullseye; rather, there are numbers all around us, and we have decided to trace a bullseye around the instances of “11:11” we see and not around the many examples of “23:23” or “4:56”. We have failed to show that “11:11” is more special than any other number combination.
Scientific research, unfortunately, does not escape the Texas sharpshooter fallacy. The fact that many scientists now have to parse through massive data sets makes this fallacy all the more tempting. When I was a graduate student, we tested a few blood samples from cancer patients to measure their levels of over 200 tiny molecules known as microRNAs. Was each of these 200 molecules present in the expected amount, or was it overexpressed or underexpressed? Sure enough, we found a small number that were present in much larger or much smaller amounts than anticipated. We could have stopped there and called it a signature for this particular type of cancer. Want to know if your patient has this cancer? Just test the expression of these specific microRNAs in their blood and if you find our signature, they have this cancer!
But that would have been the equivalent of drawing a bullseye around our findings, which could have been (and, as it turns out, were) spurious. The more things you look for, the more likely you are to find something that makes your detector go “ding-ding-ding.” So you need to validate your preliminary results, which were gathered without knowing what you were looking for, in a different set of samples. You need to put your new hypothesis through the wringer, not glorify it with an unearned crown. Failing to do so is to commit the Texas sharpshooter fallacy which, in scientific research, has a special name: HARKing, or Hypothesizing After the Results are Known.
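To see how easily chance makes the detector go “ding-ding-ding,” here is a small simulation (a toy sketch, not our actual analysis): imagine testing 200 molecules that, in reality, have no link whatsoever to the disease. Under that assumption each test’s p-value is uniformly distributed, so some will cross the conventional p < 0.05 threshold purely by luck.

```python
import random

def chance_discoveries(n_tests=200, alpha=0.05, trials=1000, seed=42):
    """Simulate repeatedly testing n_tests molecules that truly have no
    effect. Under the null hypothesis each p-value is uniform on [0, 1],
    so each test crosses the alpha threshold by chance with
    probability alpha. Returns the average number of false 'hits'."""
    rng = random.Random(seed)
    total_hits = 0
    for _ in range(trials):
        p_values = [rng.random() for _ in range(n_tests)]
        total_hits += sum(p < alpha for p in p_values)
    return total_hits / trials

avg_hits = chance_discoveries()
print(f"Average 'significant' molecules out of 200: {avg_hits:.1f}")  # roughly 10
```

With 200 tests at a 5% threshold, you expect about 200 × 0.05 = 10 “discoveries” even when nothing real is going on, which is exactly why a promising-looking signature must be re-tested before it is believed.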
In the world outside of academic research, this erroneous logic can manufacture fears that never really go away. A massive study out of Sweden, published in 1992, seemed to show beyond a shadow of a doubt that living near high-voltage power lines significantly increased the odds of developing childhood leukemia, a type of blood cancer. But when scientists dug into the complete report, they found that its authors had not simply looked at childhood leukemia risk: they had measured 800 risk ratios. With that many comparisons, some outcomes will look elevated by chance alone, and childhood leukemia happened to be the one that people living close to the power lines had more of than people living further away. The authors drew their bullseye around this particular disease and declared victory. Sweden, which had considered making policy changes based on this presumed link between power lines and cancer, eventually decided not to, citing the lack of good evidence.
The bottom line is that it’s OK to build a hypothesis from a data set, but you cannot draw a conclusion from it as well. You need to validate your hypothesis in a new data set. Otherwise, you will start discovering associations that are simply not true.
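The discover-then-validate workflow can be sketched the same way (again a toy model, assuming none of the molecules truly matters): take the “hits” from one data set, then re-test only those hits in a fresh, independent data set and count how many survive.

```python
import random

def discover_then_validate(n_tests=200, alpha=0.05, trials=1000, seed=7):
    """Toy model of two-stage testing under the null (no real effects).
    Stage 1 'discovers' molecules with p < alpha by chance; stage 2
    re-tests only those hits against independent data. Returns the
    average number of stage-1 hits and of hits that also pass stage 2."""
    rng = random.Random(seed)
    total_hits = total_survivors = 0
    for _ in range(trials):
        discovery_p = [rng.random() for _ in range(n_tests)]
        hits = [i for i, p in enumerate(discovery_p) if p < alpha]
        # Fresh, independent p-values for the hits only
        survivors = sum(rng.random() < alpha for _ in hits)
        total_hits += len(hits)
        total_survivors += survivors
    return total_hits / trials, total_survivors / trials

avg_hits, avg_survivors = discover_then_validate()
print(f"Chance hits per run: {avg_hits:.1f}, "
      f"surviving validation: {avg_survivors:.2f}")
```

Nearly all of the chance discoveries vanish on re-testing, while a genuine signature would keep showing up, which is the whole point of demanding validation in a new data set.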