P-hacking is prevalent in reality but absent from classical hypothesis testing theory. As a consequence, significant results are much more common than they are supposed to be when the null hypothesis is in fact true. In this paper, we build a model of hypothesis testing with p-hacking. From the model, we construct critical values such that, if the values are used to determine significance, and if scientists’ p-hacking behavior adjusts to the new significance standards, significant results occur with the desired frequency. Such robust critical values allow for p-hacking so they are larger than classical critical values. To illustrate the amount of correction required by p-hacking, we calibrate the model using evidence from the medical sciences. In the calibrated model the robust critical value for any test statistic is the classical critical value for the same test statistic with one fifth of the significance level.
Scientific journals prefer publishing significant results. Publications, in turn, determine a scientist’s career path: promotions, salary, and honors. Scientists therefore have strong incentives to hunt for statistical significance. Such p-hacking reduces the informativeness of hypothesis tests, threatening the credibility of science—leading for instance to the current replication crisis. In this paper, we develop a model of hypothesis testing with p-hacking and use it to construct critical values robust to p-hacking, which guarantee that significant results occur with the desired frequency. As an illustration, we calibrate the model to the medical sciences. For a common two-sided z-test with significance level of 5%, the robust critical value is 2.58—somewhat higher than the classical critical value of 1.96.
Figure 3: Critical values robust to p-hacking for z-tests
McCloskey, Adam, and Pascal Michaillat. 2023. “Critical Values Robust to P-hacking.” arXiv:2005.04141. https://doi.org/10.48550/arXiv.2005.04141 .