P Values
Last updated: January 18, 2021
Consensus in the literature that p values are a poor measure of evidence
- Amrhein, Valentin, Sander Greenland, and Blake McShane. “Scientists Rise up against Statistical Significance.” Nature 567, no. 7748 (March 2019): 305–7. https://doi.org/10.1038/d41586-019-00857-9.
- Hubbard, Raymond, and R. Murray Lindsay. “Why P Values Are Not a Useful Measure of Evidence in Statistical Significance Testing.” Theory & Psychology 18, no. 1 (February 2008): 69–88. https://doi.org/10.1177/0959354307086923.
- Wasserstein, Ronald L., Allen L. Schirm, and Nicole A. Lazar. “Moving to a World Beyond ‘p < 0.05.’” The American Statistician 73, no. sup1 (March 29, 2019): 1–19. https://doi.org/10.1080/00031305.2019.1583913.
Examples of misuse of p values
David Spiegelhalter on Twitter:
This paper motivates the call for the end of significance. A 25% mortality reduction, but because P=0.06 (two-sided), they declare it ‘did not reduce’ mortality. Appalling.
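A quick way to see why the "did not reduce" conclusion is misleading: back out an approximate confidence interval from the reported effect and p value. The sketch below is illustrative only; it takes the 25% reduction (risk ratio ≈ 0.75) and the two-sided p = 0.06 from the quote and assumes a normal approximation on the log scale, since the trial's actual numbers are not in this note.

```python
# Illustrative sketch: recover an approximate 95% CI from a reported
# effect estimate and two-sided p value, assuming a normal approximation
# on the log scale. RR = 0.75 ("25% mortality reduction") and p = 0.06
# come from the quote; everything else is an assumption.
from math import log, exp
from scipy.stats import norm

rr = 0.75           # reported point estimate (25% reduction)
p_two_sided = 0.06  # reported two-sided p value against RR = 1

# z statistic implied by the p value, then the implied SE of log(RR)
z = norm.isf(p_two_sided / 2)    # ≈ 1.88
se_log_rr = abs(log(rr)) / z     # ≈ 0.15

# Approximate 95% confidence interval for the risk ratio
lo = exp(log(rr) - norm.isf(0.025) * se_log_rr)
hi = exp(log(rr) + norm.isf(0.025) * se_log_rr)
print(f"RR = {rr:.2f}, 95% CI ≈ ({lo:.2f}, {hi:.2f})")
# -> roughly (0.56, 1.01): the data are compatible with anything from a
#    ~44% reduction to essentially no effect, which "did not reduce
#    mortality" completely hides.
```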
Response to journals requiring p values
That brings me back to Dan Scharfstein’s query on what to do about journals and coauthors obsessed with significance testing. What I’ve been doing and teaching for 40 years now is reporting the CI and precise P-value and never using the word “significant” in scientific reports. When I get a paper to edit I delete all occurrences of “significant” and replace all occurrences of inequalities like “(P<0.05)” with “(P=p)” where p is whatever the P-value is (e.g., 0.03), unless p is so small that it’s beyond the numeric precision of the approximation used to get it (which means we may end up with “P<0.0001”). And of course I include or request interval estimates for the measures under study.
Only once in 40 years and about 200 reports have I had to remove my name from a paper because the authors or editors would not go along with this type of editing. And in all those battles I did not even have the 2016 ASA Statement and its Supplement 1 to back me up! Although I did supply recalcitrant coauthors and editors copies of articles advising display and focus on precise P-values. One strategy I’ve since come up with to deal with those hooked on “the crack pipe of significance testing” (as Poole once put it) is to add alongside every p value for the null a p value for a relevant alternative, so that for example their “estimated OR=1.7 (p=0.06, indicating no significant effect)” would become “estimated OR=1.7 (p=0.06 for OR=1, p=0.20 for OR=2, indicating inconclusive results).” So far every time they cave to showing just the CI in parens instead, with no “significance” comment.
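The "p value for a relevant alternative" trick in the quote is easy to operationalize with a Wald test on the log odds ratio: compute one p value against the null OR = 1 and another against a substantively interesting alternative such as OR = 2. The sketch below assumes a hypothetical standard error for the log OR; the quote's exact figures (p = 0.06, p = 0.20) need not come from this particular calculation.

```python
# Minimal sketch of reporting a p value against the null (OR = 1) and
# against a relevant alternative (OR = 2), as described in the quote.
# The standard error below is hypothetical.
from math import log, exp
from scipy.stats import norm

def wald_p(log_or_hat: float, se: float, or_ref: float) -> float:
    """Two-sided Wald p value for the hypothesis OR = or_ref."""
    z = (log_or_hat - log(or_ref)) / se
    return 2 * norm.sf(abs(z))

log_or_hat = log(1.7)   # estimated OR = 1.7, as in the quote
se = 0.30               # hypothetical standard error of the log OR

for or_ref in (1.0, 2.0):
    print(f"p for OR = {or_ref:g}: {wald_p(log_or_hat, se, or_ref):.2f}")

# The 95% CI carries the same information in one interval:
ci = (exp(log_or_hat - 1.96 * se), exp(log_or_hat + 1.96 * se))
print(f"OR = 1.7, 95% CI ≈ ({ci[0]:.2f}, {ci[1]:.2f})")
# Showing both p values (or just the CI) makes "inconclusive" visible,
# where "p = 0.06 vs OR = 1" alone invites a claim of "no effect".
```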