P-hacking occurs when researchers engage in various behaviors that increase their chances of reporting statistically significant results. P-hacking is problematic because it reduces the informativeness of hypothesis tests—by making significant results much more common than they are supposed to be in the absence of true significance. Despite its prevalence, p-hacking is not taken into account in hypothesis testing theory: the critical values used to determine significance assume no p-hacking. To address this problem, we build a model of p-hacking and use it to construct critical values such that, if these values are used to determine significance, and if researchers adjust their p-hacking behavior to these new significance standards, then significant results occur with the desired frequency. Because such robust critical values allow for p-hacking, they are larger than classical critical values. As an illustration, we calibrate the model with evidence from the social and medical sciences. We find that the robust critical value for any test is the classical critical value for the same test with one fifth of the significance level—a form of Bonferroni correction. For instance, for a $z$-test with a significance level of 5%, the robust critical value is 2.31 instead of 1.65 if the test is one-sided and 2.57 instead of 1.96 if the test is two-sided.

Figure 3: Critical values robust to p-hacking for one-sided and two-sided $z$-tests


McCloskey, Adam, and Pascal Michaillat. 2022. “Critical Values Robust to P-Hacking.” arXiv:2005.04141. .