Publication bias in product development
The same forces that lead to publication bias in academic journals also lead to ship decision bias in product development organizations.
Publication bias
Publication bias causes academic journals to overrepresent fluke findings. In some fields, this has caused replication crises, where published results don’t hold up when the experiments are repeated. To get a sense of how serious a problem this is [source]:
- In a 2015 study, 100 experimental psychology results were re-tested, and 60% of them failed to replicate.
- In a 2016 study, 18 experimental economics results from two of the field’s top journals were re-tested, and 39% of them (i.e., 7) failed to replicate.
Those rates of replication failure are uncomfortably high.
How does publication bias work again?
Statistical noise
The central problem is that statistical tests are typically run with a 5% false positive threshold (i.e., p-value < 0.05).
Imagine an extreme situation in which every single experiment run in your field is actually a control-versus-control test under the hood, a 5% significance threshold is standard, and journals publish only significant findings. Because there is no real effect to detect, roughly 5% of experiments would come back ‘significant’ purely by chance, and those would be the only ones published. In this situation, every single journal publication would represent a fluke result.
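To make the hypothetical concrete, here is a minimal simulation sketch (the sample sizes and effect-free distributions below are illustrative assumptions, not from any cited study): it runs many control-versus-control tests and counts how many clear the 5% threshold anyway.

```python
# Sketch of the all-null hypothetical: every experiment is control vs. control,
# yet only p < 0.05 results get "published".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 10_000   # experiments run in the field
n_per_arm = 1_000        # samples per arm
alpha = 0.05             # standard significance threshold

published = 0
for _ in range(n_experiments):
    # Both arms come from the same distribution: there is no real effect.
    control = rng.normal(loc=0.0, scale=1.0, size=n_per_arm)
    treatment = rng.normal(loc=0.0, scale=1.0, size=n_per_arm)
    _, p_value = stats.ttest_ind(control, treatment)
    if p_value < alpha:
        published += 1  # a "significant" finding that is, by construction, a fluke

print(f"'Published' results: {published} of {n_experiments} "
      f"({published / n_experiments:.1%}), every one of them a false positive")
```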
The reality, of course, is not as harsh as this hypothetical. But because a disproportionate fraction of the hypotheses being tested are false, the pool of ‘significant’ findings ends up with a disproportionate share of false positives. And a 5% significance threshold is evidently only stringent enough, in fields like economics and psychology, to yield 40%-60% replicability rates.
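As a back-of-the-envelope version of that argument (the 10% base rate of true hypotheses and 80% statistical power below are assumed purely for illustration), you can compute what share of ‘significant’ findings would be false positives:

```python
# Assumed-numbers sketch: the rarer true hypotheses are, the larger the share
# of false positives among "significant" findings.
def false_positive_share(base_rate: float, power: float, alpha: float) -> float:
    """Fraction of significant results that are false positives."""
    true_positives = base_rate * power          # true hypotheses correctly detected
    false_positives = (1 - base_rate) * alpha   # false hypotheses that fluke past the threshold
    return false_positives / (true_positives + false_positives)

# If only 10% of tested hypotheses are actually true, with 80% power and a 5%
# threshold, roughly a third of significant findings are flukes: 0.045 / 0.125.
print(f"{false_positive_share(base_rate=0.10, power=0.80, alpha=0.05):.0%}")  # ~36%
```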
Incentives
Academics like to find significance because it helps them publish. Tech employees like to find significance because proof of impact helps them get promoted. Obviously, nobody (or almost nobody) is deliberately claiming that falsehoods are true, but incentives can subtly shape people’s behavior in ways that worsen the problem.
What can we do?
In product development organizations, tens, hundreds, or even thousands of A/B tests might be running in production at any given time. By deciding which experiments to ship and which to hold back, builders are effectively reproducing academia’s publication process. To combat reproducibility issues, you can:
- Check whether your secondary metrics tell a consistent story.
- Test at a higher-than-95% confidence level, if you have the sample volume for it (a rough sketch combining both checks follows this list).
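As one way to operationalize both checks (the metric names, thresholds, and helper below are illustrative assumptions, not a standard API), a ship decision might require a stricter threshold on the primary metric plus no contradicting secondary metrics:

```python
# Illustrative ship-decision helper: stricter alpha on the primary metric,
# and no significantly negative secondary metric.
from dataclasses import dataclass

@dataclass
class MetricResult:
    name: str
    lift: float      # observed relative lift (positive = improvement)
    p_value: float   # two-sided p-value from the experiment's test

def should_ship(primary: MetricResult,
                secondaries: list[MetricResult],
                strict_alpha: float = 0.01) -> bool:
    """Ship only if the primary metric clears a stricter-than-usual threshold
    and the secondary metrics tell a consistent (non-contradictory) story."""
    primary_ok = primary.lift > 0 and primary.p_value < strict_alpha
    # Treat any significantly negative secondary metric as a contradiction.
    secondaries_ok = all(
        not (m.p_value < 0.05 and m.lift < 0) for m in secondaries
    )
    return primary_ok and secondaries_ok

# Example: a conversion win backed by neutral-to-positive supporting metrics.
print(should_ship(
    MetricResult("conversion", lift=0.021, p_value=0.004),
    [MetricResult("retention", lift=0.003, p_value=0.40),
     MetricResult("engagement", lift=0.010, p_value=0.12)],
))  # True
```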
Most importantly, do the obvious thing and build features that are backed by user research and sound design thinking. This way, even if a test misfires, you may still be choosing to ship a bona fide product improvement.