
Puppies Can Help Explain How Scientists Get Things Wrong

P-hacking in science is bad. Using dog owners as an example, we can understand what it means without involving math.

If you have an interest in scientific research, you may have heard a weird term that sounds traumatizing from a urological perspective: p-hacking. Apparently, you’ve read, some scientists use p-hacking to get the results they want. And this “hack,” this way of gaining unauthorized access to glory, revolves around a mysterious entity known as the p-value.

I want to explain what p-hacking is, because having a clear view of how science is conducted is important. Scientific results should not be blindly accepted on faith, nor should they be denied out of some exaggerated sense that all of science is corrupt and unworthy of trust. Science is a human enterprise. It is done within a system that contains bad incentives. Some of these bad incentives encourage p-hacking.

So what exactly is p-hacking and how does it happen?

The dogged pursuit of happiness

I am well aware that most people are allergic to mathematics, so I will stay away from numbers and technical jargon. P-hacking is a problem that arises from choice, and we can understand it well enough by putting ourselves in the shoes of a researcher.

Since people are allergic to math, we will focus instead on dogs. Do not worry: our imaginary dogs are hypoallergenic. You will not need your antihistamine while reading this.

As someone who loves dogs, I can easily imagine that owning a dog makes everyone happier. I thus decide to study this and gather data to see if this is true. Will my intuition be backed up by science?

Once my project has enough funding and has received clearance from my institutional review board, I go about recruiting dog owners and people who do not own dogs. Maybe I post flyers all over town, or maybe I place a recruitment ad in public Facebook groups dedicated to dog lovers and cat lovers.

What these participants end up doing is filling out a lot of questionnaires. I found three different questionnaires that assess happiness, so I have them fill them all out. I also want to know about any pet they own, of course.

As participants fill out their questionnaires, they are giving me data. I need to analyze this data in light of my hypothesis. You may remember from your science classes that, typically, a scientific question contains two hypotheses: the null hypothesis and the alternative hypothesis.

The null hypothesis is, essentially, the boring hypothesis. In our case, it’s that owning a dog does not make you any happier. The alternative hypothesis is that it does, and unfortunately our entire scientific enterprise is geared toward getting really excited when data support this alternative hypothesis (dogs make you happier! this drug successfully treats cancer! stretching really does reduce injuries!) and really sad when the data do not (dogs do not make you happier… and this drug does not successfully treat cancer… and stretching really does not reduce injuries).

How do we know if the data support this alternative hypothesis? Very often, for better or for worse, it comes down to a specific result that a statistical test spits out. It’s a number, and scientists really want that number to clear a certain threshold. The number is called the p-value.

You would find it weird if football players, their coaches, and their biggest fans didn’t really understand what the numbers meant on the scoreboard, but simply knew that if their team’s number was higher than the other’s, it meant they had won. And yet, without trying to be simplistic or alarmist, we find ourselves in a similar situation when it comes to science. A shocking study coming out of Germany showed that even instructors who teach the p-value to students in university misunderstand it. Suffice it to say, for our purpose here, that the p-value is often misconstrued and yet so much of science hangs on whether its value in an experiment is deemed significant or not.

A hundred years ago or so, it was proposed, quite arbitrarily, that 0.05 would be the threshold the p-value would have to clear. If your experiment yields a p-value bigger than 0.05, it means, in scientific terms, that you cannot reject your null hypothesis. It may very well be that puppies do not make their owners happier. In common terms, it means a disappointment. Your results were not “statistically significant.” They didn’t clear this arbitrary hurdle.

If your experiment results in a p-value smaller than 0.05, you have arrived at scientific nirvana. You are sitting on a statistically significant result, and scientific journals love to publish those. A p-value smaller than 0.05 has a shine to it. It radiates novelty and surprise. It feels important, though it is crucial to note that it does not mean the effect you have found is great in scope or relevant in the real world. But it is easy to start to think that it does.
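For readers who do want to peek under the hood, a p-value can be estimated with nothing fancier than shuffling. The sketch below is a hypothetical illustration in plain Python: the happiness scores are made up, and the function is a generic permutation test, not the analysis from any real study. It asks how often randomly relabeling who is a "dog owner" produces a gap in average happiness at least as large as the one actually observed.

```python
import random

def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns an estimated p-value: the fraction of random relabelings
    that yield a mean difference at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = group_a + group_b
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a
                   - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm

# Hypothetical happiness scores (0 to 10) for illustration only
dog_owners = [7, 8, 6, 9, 7, 8, 7, 6, 8, 7]
non_owners = [6, 7, 6, 8, 7, 6, 7, 5, 7, 6]
p = permutation_test(dog_owners, non_owners)
print(f"p = {p:.3f}")
```

If very few shuffles beat the observed gap, the p-value is small; once it dips below 0.05, tradition lets you call the result significant.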

So what do you do when an entire system is geared toward accepting, publishing, and rewarding results with statistically significant p-values?

You may find yourself, maliciously or innocently, hacking your data to get the p-value that is cause for celebration.

Hounding the data

Going back to my research project on dog ownership and happiness, I can try running my data through different statistical tests. It’s sometimes the equivalent of toasting bread by using a toaster versus a stovetop pan versus a blowtorch. They will all toast your bread, but only one method has clearly been designed for this purpose. If I get a significant result on the third try, I could choose to present this result as if it came from the only test I ran. I would sweep the other tests and their negative results under the rug. That would be p-hacking.

I could also decide to split my data into age groups, even though I did not set out to do that. I could look to see if dog owners between the ages of 20 and 29 are happier, and then look at people in their thirties, forties, and so forth, and only report the age groups for which I get a positive association.

And when I say “happier,” I would have to ask myself, “happier than whom?” Do I compare the dog owners’ happiness to the happiness of cat owners? Or to that of owners of pets that aren’t dogs? Or to that of people who do not own any pets? Or to how dog owners remember their own happiness before Fido came into their lives? That’s a lot of comparisons, and like placing multiple bets on a roulette table all at once (red, even, and a column bet), I increase the odds that one of my bets will pay off if I do all of these comparisons, especially since running all these tests, unlike betting money, doesn’t cost me anything.

I also mentioned I had three questionnaires assessing happiness. Maybe one of them gives me a p-value smaller than 0.05, but not the other two. I could then frame my data presentation around this one questionnaire and give the other two less attention… or perhaps, not even mention them.

I could also wait until I have recruited 25 participants and analyze the data then. If my results are not significant, I recruit another 25, then reanalyze. And I keep doing this stop-and-go recruitment until I get a significant result, then permanently halt my experiment.
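This stop-and-go strategy is known as optional stopping, and it inflates false positives even when there is truly nothing to find. The simulation below is a hedged sketch under invented assumptions: "happiness" scores are random noise with no real group difference, the test is a simple two-sample z-test with known spread, and recruitment happens in batches of 25. Peeking after every batch, and stopping the moment p dips below 0.05, produces "significant" results far more often than the advertised 5% of the time.

```python
import math
import random

rng = random.Random(42)

def two_sided_p(z):
    # Two-sided p-value for a standard-normal test statistic
    return math.erfc(abs(z) / math.sqrt(2))

def study_with_peeking(batch=25, max_batches=8):
    """Simulate one study in which the null is TRUE (no dog effect),
    peeking at the data after every batch of recruits and stopping
    as soon as p < 0.05."""
    a, b = [], []
    for _ in range(max_batches):
        a += [rng.gauss(0, 1) for _ in range(batch)]
        b += [rng.gauss(0, 1) for _ in range(batch)]
        n = len(a)
        diff = sum(a) / n - sum(b) / n
        z = diff / math.sqrt(2 / n)   # both groups have known sd = 1
        if two_sided_p(z) < 0.05:
            return True               # "significant": a false positive
    return False                      # honest null result

trials = 2000
false_pos = sum(study_with_peeking() for _ in range(trials)) / trials
print(f"False-positive rate with peeking: {false_pos:.1%}")
```

A single test at a fixed, pre-planned sample size would trip up only about 5% of the time; repeated peeking gives chance many extra opportunities to look like a discovery.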

All of these decisions represent p-hacking. They arise because of the need for me to make choices. Research offers me a “garden of forking paths,” as first described by Andrew Gelman and Eric Loken. Do I cut my data this way or that way? Do I compare group 1 to group 2 or to group 3? And what do I do with extreme data points called outliers? Do I keep them in or remove them?

The garden of forking paths offers researchers “degrees of freedom.” In creating a checklist to avoid making bad choices, a team of Dutch scientists found 34 of those degrees of freedom, choices that researchers make and that can either lead to an honest result or to an opportunistic bid toward fake success.

The reason why p-hacking is frowned upon in good scientific circles—and this is the core of the issue—is that the more tests you conduct, the higher the odds of getting a positive result by chance alone if you do not correct for this. It’s called torturing the data until it confesses. This is how we end up with false positive results that cannot be reproduced. Or, at least, it is one of the ways.
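The arithmetic behind "chance alone" is mercifully short. If every test is run on pure noise, each has a 5% chance of crossing the 0.05 threshold, so the chance that at least one of n independent tests does is 1 − 0.95ⁿ:

```python
# Probability of at least one "significant" result by chance alone,
# when every one of n independent tests is run on pure noise
for n_tests in (1, 5, 10, 20):
    p_any = 1 - 0.95 ** n_tests
    print(f"{n_tests:2d} tests -> {p_any:.0%} chance of a false positive")
```

Twenty comparisons on random noise will produce at least one false positive well over half the time, which is why honest analyses that run many tests apply a correction, such as the Bonferroni adjustment that divides the threshold by the number of tests.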

We have become aware of a reproducibility crisis in science: the novel and surprising results of many studies cannot be replicated by independent teams of scientists. Is p-hacking to blame? In part, yes, but it is hard to truly know the extent of the problem, because p-hacking is difficult to spot. After all, if I run my data through seven different statistical tests before publishing the results of the one that gives me a significant p-value, you have no way of knowing about the six other tests.

There are ways in which large chunks of the literature can be visualized by plotting their p-values onto a graph and looking at their distribution, with a certain skew pointing the finger at p-hacking. But it’s not an easy thing to quantify, and some scientific disciplines may contain more or fewer p-hacked results than others. The evidence for widespread p-hacking, reassuringly, appears to be “ambiguous at best,” and its effect on the reproducibility crisis is “unlikely to be massive,” contrary to what was initially feared. Still, it’s a problem that requires a solution.

Teaching old dogs old and new tricks

You may believe that the p-hacking of my dog owner happiness data was deliberate and self-serving, but it doesn’t have to be. Too many scientists are not taught the proper way to use statistics and don’t realize they are torturing their data to extract a positive result out of it. An infamous example is Brian Wansink, once heralded as a food psychology hero, who blogged about his student p-hacking their data. He was actually proud of it. But repeatedly putting your data through the wringer simply squeezes illusions out of it.

There are solutions to the p-hacking problem. Some have argued that we need to get rid of p-values altogether, banishing temptation as it were. Others have pushed back, calling this “scapegoating” and pointing out that alternatives to p-values have their own problems. Another possibility: stop using 0.05 as a threshold. In fact, stop using any threshold and discourage scientists from calling their results either significant or nonsignificant. But given that bashing p-values in scientific circles has been going on for a hundred years and that we prominently report these values now more than ever, this kind of change is unlikely to come about. We just love our p-values too much.

The key, as it often is, is better education and better regulation. Researchers often do not appreciate that p-hacking should be seen as a form of misconduct. We should teach them. As for regulation, we can metaphorically tie a scientist’s hands behind their back by encouraging them to publicly tell the world how they will analyze their data once they have it, in a way that is specific, precise, and exhaustive. This is called preregistering your study and is becoming somewhat more common. And if you are tempted to turn your new data set up, down, and sideways, looking for anything interesting, being honest about the preliminary nature of anything you find is the right thing to do. Exploratory science (or “fishing expeditions”) is fine, so long as it is honestly labelled as such.

Scientists are not robots. They do not perform experiments and analyze their data in a completely detached way. When the road forks and a choice must be made, the incentives of their institutions, grant organizations, and scientific journals—all looking for novelty and surprise—can subconsciously push them into exploiting their degrees of freedom in their chase for statistical significance.

As the quip often attributed to Mark Twain goes, “Facts are stubborn things, but statistics are pliable.”

P.S.: If you really want to know how a p-value is defined—what it is and what it isn’t—click here. Warning: it will do a number on your brain.

Take-home message:
- Scientists doing research must make choices on how they gather and analyze their data, and some of these choices increase the odds of getting a false-positive result
- They can select to only publish the analyses that gave them a positive result and hide those that were negative, a process known as p-hacking
- Potential solutions to this problem include moving away from declaring results either “significant” or “nonsignificant,” making scientists aware that p-hacking is bad practice, and encouraging them to declare at the beginning of a research project exactly how they will analyze their data and ensuring that they stick to their plan

