Stuart H. Hurlbert, Celia M. Lombardi
Annales Zoologici Fennici 46 (5), 311-349, (1 October 2009) https://doi.org/10.5735/086.046.0501
This essay grew out of an examination of one-tailed significance testing. One-tailed tests were little advocated by the founders of modern statistics but are widely used and recommended nowadays in the biological, behavioral and social sciences. The high frequency of their use in ecology and animal behavior and their logical indefensibility have been documented in a companion review paper. In the present one, we trace the roots of this problem and counter some attacks on significance testing in general. Roots include: the early but irrational dichotomization of the P scale and adoption of the ‘significant/non-significant’ terminology; the mistaken notion that a high P value is evidence favoring the null hypothesis over the alternative hypothesis; and confusion over the distinction between statistical and research hypotheses. Resultant widespread misuse and misinterpretation of significance tests have also led to other problems, such as unjustifiable demands that reporting of P values be disallowed or greatly reduced and that reporting of confidence intervals and standardized effect sizes be required in their place. Our analysis of these matters thus leads us to a recommendation that for standard types of significance assessment the paleoFisherian and Neyman-Pearsonian paradigms be replaced by a neoFisherian one. The essence of the latter is that a critical α (probability of type I error) is not specified, that the terms ‘significant’ and ‘non-significant’ are abandoned, that high P values lead only to suspended judgments, and that the so-called “three-valued logic” of Cox, Kaiser, Tukey, Tryon and Harris is adopted explicitly. Confidence intervals and bands, power analyses, and severity curves remain useful adjuncts in particular situations. Analyses conducted under this paradigm we term neoFisherian significance assessments (NFSA). Their role is assessment of the existence, sign and magnitude of statistical effects.
The common label of null hypothesis significance tests (NHST) is retained for paleoFisherian and Neyman-Pearsonian approaches and their hybrids. The original Neyman-Pearson framework has no utility outside quality control type applications. Some advocates of Bayesian, likelihood and information-theoretic approaches to model selection have argued that P values and NFSAs are of little or no value, but those arguments do not withstand critical review. Champions of Bayesian methods in particular continue to overstate their value and relevance.
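As a rough illustration of the reporting style described above (no critical α, no ‘significant/non-significant’ label, only an estimate, its sign and a two-tailed P value), the following minimal Python sketch may help. It is not taken from the paper: the function name `nfsa_report`, the choice of an exact permutation test on a difference of means, and the example data are all assumptions made for illustration only.

```python
from itertools import combinations
from statistics import mean


def nfsa_report(group_a, group_b):
    """Return the effect estimate, its sign, and a two-tailed exact
    permutation P value for a difference of means.

    Hypothetical helper: no critical alpha is specified and no
    'significant'/'non-significant' verdict is produced.
    """
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    as_extreme = 0
    n_splits = 0
    # Enumerate every way of assigning n_a of the pooled values to group A.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = sum_a / n_a - (total - sum_a) / n_b
        n_splits += 1
        # Two-tailed: count splits at least as extreme in either direction
        # (small tolerance guards against floating-point ties).
        if abs(diff) >= abs(observed) - 1e-12:
            as_extreme += 1

    if observed > 0:
        sign = "positive"
    elif observed < 0:
        sign = "negative"
    else:
        sign = "zero"

    return {
        "estimate": observed,
        "sign": sign,
        "p_two_tailed": as_extreme / n_splits,
    }


# Made-up data, for illustration only.
report = nfsa_report([5, 6, 7], [1, 2, 3])
print(report)
```

Under the three-valued logic, a reader of such a report would conclude that the effect is positive, conclude that it is negative, or suspend judgment, weighing the P value as graded evidence rather than comparing it with a fixed cutoff. Exhaustive enumeration is practical only for small samples.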