Revenge of the Statisticians: Where Research Goes Wrong
Scientists have recently taken the first-ever photograph of a black hole, sequenced the DNA of a woolly mammoth and discovered that although the earth may not be flat, the deep-space rock Arrokoth sort of is. Another breakthrough has come from the field of statistics. In 2019, The American Statistician published a special issue in which forty-three prominent scholars debunked a procedure known as the p<0.05 test. (1) This may change our understanding of our world. It certainly should.
If you are unsure what a p<0.05 test is, you are in good company. As Beatrice Grabowski discussed in the Journal of the National Cancer Institute, many scholars who rely on this test in their research seem unclear about its nature. (2) Those seeking a precise definition of p<0.05 may consult Grabowski’s article, which is available on-line. A rougher explanation is that the test assesses the probability that, under the assumptions of a specified statistical model, a result at least as extreme as the one observed could have occurred by chance.
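To make that rough explanation concrete, here is a minimal simulation in Python. The coin-flip scenario and every number in it are my own illustration, not drawn from Grabowski’s article: it asks how often a fair coin, flipped one hundred times over and over, would produce a result at least as lopsided as sixty heads by chance alone.

```python
# A p-value by brute force: how often would a fair coin, flipped 100 times,
# come up at least as lopsided as 60 heads purely by chance? (All numbers
# here are illustrative, not taken from Grabowski's article.)
import random

FLIPS, OBSERVED_HEADS, TRIALS = 100, 60, 100_000

at_least_as_extreme = 0
for _ in range(TRIALS):
    heads = sum(random.random() < 0.5 for _ in range(FLIPS))
    # Two-sided test: count outcomes at least as far from 50 heads
    # as the observed result.
    if abs(heads - FLIPS / 2) >= abs(OBSERVED_HEADS - FLIPS / 2):
        at_least_as_extreme += 1

print(f"estimated p-value: {at_least_as_extreme / TRIALS:.3f}")  # about 0.057
```

Since 0.057 sits just above the 0.05 cutoff, this result would conventionally be declared “not significant,” which hints at how arbitrary the threshold is.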
When research produces outcomes that are unlikely to have occurred randomly, it appears persuasive. Scholars therefore use the term “statistical significance” for results that are unlikely to be products of chance alone (conventionally, those with a p-value below 0.05). Academics in many fields, notably psychology, the so-called social sciences and medicine, have behaved as if studies which generate statistically significant findings self-evidently strengthen the researchers’ hypotheses. The flip side is that they have behaved as if studies which did not achieve statistical significance, or perhaps did not even try, were failures.
Since the scientific method revolves around hypothesis testing, this has made p<0.05 supremely important. Much of what scholars in statistically based fields believe they know rests on it. Unfortunately, an embarrassing amount of this putative knowledge is turning out to be wrong. Another essential tenet of the scientific method is that when scholars perform research, others working independently should be able to do the same things and get the same results. As scholars have double-checked each other’s findings, they have found that perhaps half the studies which produced impeccable p<0.05 values for the original researchers do not work the second time around. (3)
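The mechanism behind such failures is easy to sketch. The simulation below is my own illustration, using a conventional two-sample t-test rather than any particular study’s method: it tests thousands of effects that do not actually exist, keeps the ones that clear the p<0.05 bar by luck, and reruns only those “successes.”

```python
# Why chance findings do not replicate: simulate studies of an effect that
# is actually zero, keep those that reach p < 0.05, and rerun them.
# (An illustration of the mechanism only; real replication rates also
# depend on true effects, sample sizes and publication practices.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def study(n=30):
    """One study of a nonexistent effect: two groups, same distribution."""
    a, b = rng.normal(size=n), rng.normal(size=n)
    return stats.ttest_ind(a, b).pvalue

first_round = [study() for _ in range(10_000)]
winners = sum(p < 0.05 for p in first_round)
replicated = sum(study() < 0.05 for _ in range(winners))

print(f"'significant' by luck: {winners / len(first_round):.1%}")  # ~5%
print(f"of those, replicated:  {replicated / winners:.1%}")        # ~5% again
```

Roughly five percent of the nonexistent effects pass the test the first time, and roughly ninety-five percent of those lucky winners evaporate on the second attempt.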
This is not a case of new discoveries making older assumptions obsolete. Statisticians have learned nothing about the p<0.05 test that they did not already know. What is more, the test itself performs as advertised. Few, if any, question its mathematical validity.
The problem is not that testing for statistical significance is wrong. It is that the way scholars have been using the test is wrong-headed. A correlation between two variables is only as important as the variables themselves. By focusing excessively on p<0.05 and brushing aside larger questions about what the numbers represent, researchers have produced “significant” measurements of meaningless things.
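A quick sketch shows how a “significant” result can measure something negligible. The numbers below are mine and purely illustrative: with half a million subjects per group, a difference of two hundredths of a standard deviation, far too small to matter in most practical settings, sails comfortably under the p<0.05 bar.

```python
# Statistical significance is not practical importance: with enough data,
# a trivially small difference yields a tiny p-value. (Illustrative
# numbers of my own choosing.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two groups whose true means differ by only 0.02 standard deviations.
a = rng.normal(0.00, 1.0, size=500_000)
b = rng.normal(0.02, 1.0, size=500_000)

result = stats.ttest_ind(a, b)
print(f"p-value:             {result.pvalue:.1e}")        # far below 0.05
print(f"difference in means: {b.mean() - a.mean():+.3f}") # about +0.02
```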
No one should be surprised. Grabowski documents scholarly arguments and court cases going back decades which warned about this issue. I participated in the debate in the 1990s and 2000s; those interested in my thoughts on related matters may consult the article that my colleague Dr. Kevin Falk, then a U.S. Army captain, and I published in Parameters, the journal of the U.S. Army War College. (4) If I may be permitted a comment in the aftermath of the American Statistician special issue, it is this: I hope researchers will reconsider their methods in light of the statisticians’ points, but it will be even more important for them to reconsider the aspects of academic culture that allowed a plainly dubious approach to reign at the expense of all others for so long.
1. The American Statistician, Vol. 73, Supplement 1 (2019), available on-line, accessed January 9, 2022.
2. Beatrice Grabowski, “P<0.05 Might Not Mean What You Think: American Statistical Association Clarifies P Values,” Journal of the National Cancer Institute (August 10, 2016), available on-line, accessed January 9, 2022.
3. See, for instance, Marc N. Branch, “The ‘Reproducibility Crisis’: Might the Methods Used Frequently in Behavior-Analysis Research Help?,” Perspectives on Behavior Science (June 4, 2018), available on-line, accessed January 9, 2022.
4. Kevin S. Falk and Thomas M. Kane, “The Maginot Mentality in International Relations Models,” Parameters, Vol. 28, No. 2 (Summer 1998), pp. 80–92, available on-line, accessed January 10, 2022.