Title: Clinical and practical importance versus statistical significance: limitations of conventional statistical inference.
Submission Type: Research methodology
Authors: Michael Wilkinson
Affiliation: Faculty of Health and Life Sciences
Northumbria University
Correspondence address: Dr Michael Wilkinson
Department of Sport, Exercise and Rehabilitation
Northumbria University
Northumberland Building
Newcastle-upon-Tyne
NE1 8ST
ENGLAND
Email: mic.wilkinson@northumbria.ac.uk
Phone: +44 (0)191-243-7097
Abstract word count: 193
Text only word count: 4727
Abstract
Decisions about support for therapies in light of data are made using statistical inference. The dominant approach is null-hypothesis-significance testing. Applied correctly, it provides a procedure for making dichotomous decisions about zero-effect null hypotheses with known and controlled error rates. Type I and type II error rates must be specified in advance and the latter controlled by a priori sample size calculation. This approach does not provide the probability of hypotheses or the strength of support for hypotheses in light of data. Outcomes allow conclusions only about the existence of non-zero effects, and provide no information about the likely size of true effects or their practical / clinical value. Magnitude-based inference allows scientists to estimate the true / large-sample magnitude of effects with a specified likelihood, and how likely they are to exceed an effect magnitude of practical / clinical importance. Magnitude-based inference integrates elements of subjective judgement central to clinical practice into formal analysis of data. This allows enlightened interpretation of data and avoids rejection of possibly highly beneficial therapies that are not statistically significant. This approach is gaining acceptance, but progress will be hastened if the shortcomings of null-hypothesis-significance testing are understood.
Introduction
The scientific method is characterised by the formulation of theories and the evaluation of specific predictions derived from those theories against experimental data. Decisions about whether predictions and their parent theories are supported by data are made using statistical inference. Thus the examination of theories, the evaluation of therapies in light of data and the progression of knowledge hinge directly upon how well inferential procedures are used and understood. The dominant approach to statistical inference is null-hypothesis-significance testing (NHST). NHST has a particular underpinning logic that requires strict application if its use is to be of any value at all. Even when this strict application is followed, it has been argued that the underpinning yes-or-no decision logic and the value of the sizeless outcomes produced by NHST are at best questionable and at worst can hinder scientific progress (Ziliak and McCloskey, 2008, Batterham and Hopkins, 2006, Krantz, 1999, Sterne and Smith, 2001). The failure to understand and apply methods of statistical inference correctly can lead to mistakes in the interpretation of results and subsequently to bad research decisions. Misunderstandings have a practical impact on how research is interpreted and on what future research is conducted, so they affect not only researchers but any consumer of research. This paper will clarify NHST logic, highlight limitations of this approach and suggest an alternative approach to statistical inference that can provide more useful answers to research questions while being more rational and intuitive.
Scientific inference
With a clear picture of scientific inference, it is easier to understand how different statistical approaches fit what we wish the scientific method to achieve. Science is a way of working involving the formulation of theories or guesses about how the world works, calculation of the specific consequences of those guesses (i.e. hypotheses about what should be observed if the theory is correct), and the comparison of actual observations to those predictions (Chalmers, 1999). This description of the scientific method is not contentious. How observations are used to establish the truth of theories is more contentious. The nature of philosophy is such that there will never be wholesale agreement, but a philosophy of science proposed by Sir Karl Popper is generally considered an ideal to strive towards. Popper wrote that theories must make specific predictions and, importantly, those predictions should be potentially falsifiable through experiment (Popper, 1972b). It was Popper's falsifiability criterion that differentiated his philosophy from the consensus approach of truth by verification that predominated previously, while simultaneously overcoming the problem of inductive reasoning highlighted by the Scottish philosopher David Hume (Hume, 1963). In short, Popper showed that it was impossible to prove a theory no matter how many observations verified it, but that a single contrary observation could disprove or falsify a theory. This thesis is often explained using the white swan example. Imagine a hypothesis that all swans are white. No amount of observations of white swans could prove the hypothesis true, as this assumes all other swans yet to be observed will also be white and relies on inductive reasoning. A single observation of a black (or other non-white) swan could, however, by deductive reasoning, disprove the hypothesis (Ladyman, 2008). In Popper's philosophy, scientists should derive specific hypotheses from general theories and design experiments to attempt to falsify those hypotheses. If a hypothesis withstands attempted falsification, it and the parent theory are not proven, but have survived to face further falsification attempts. Theories that generate more falsifiable and more specific predictions are to be preferred to theories whose falsifiable predictions are fewer in number and vague. This latter point is particularly important in relation to NHST and will be expanded upon later.
Truth, variability and probability
Critics of Popper argue that, in reality, scientists would never reject a theory on the basis of a single falsifying observation and that there is no absolute truth that more successful theories move towards (Kuhn, 1996). Popper agreed and acknowledged that it would be an accumulation of falsifying evidence that, on balance of probability, would lead to the conclusion that a theory had been disproven (Popper, 1972b). Herein lie two important links between statistical and scientific inference: the results of different experiments on the same theory vary, and probability must therefore be the basis for conclusions about theories. Uncertainty is inescapable, but statistics can allow quantification of uncertainty in the light of variability. British polymath Sir Ronald Fisher first suggested a method of using probability to assess strength of evidence in relation to hypotheses (Fisher, 1950, Fisher, 1973). Fisher's contributions to statistics include the introduction of terms such as null hypothesis (denoted as H0) and significance, the concept of degrees of freedom, random allocation to experimental conditions and the distinction between populations and samples (Fisher, 1950, Fisher, 1973). He also developed techniques including analysis of variance. He is perhaps better known for suggesting a p (probability) of 0.05 as an arbitrary threshold for decisions about H0, a threshold that has now achieved unjustified, sacrosanct status (Fisher, 1973).
Fisher's null
Fisher's definition of the null hypothesis was very different from what we currently understand it to mean and is possibly the root cause of the philosophical and practical problems with NHST that will be discussed in this paper. In Fisher's work, the null was simply the hypothesis we attempt to nullify, or in other words falsify (Fisher, 1973). With this understanding, he was actually referring to what we now call the experimental hypothesis (denoted as H1), and his procedures were well aligned with Popper's falsification approach. The conventional zero-point null hypothesis and the procedures for making decisions about H0 (i.e. retain or reject) that predominate today were created by Polish mathematician Jerzy Neyman and British statistician Egon Pearson (Neyman and Pearson, 1933). Despite p < 0.05 being attributed to Fisher as a threshold for making a decision about (his version of) H0, he was opposed to the idea of using threshold probabilities and argued vigorously in the literature with Neyman and Pearson about this (Ziliak and McCloskey, 2008). Instead, Fisher argued that probability could be used as a continuous measure of strength of evidence against the null hypothesis (Fisher, 1973), a point about which, despite his genius, he was gravely mistaken.
Defining probability
Generally speaking, there are two interpretations of probability in statistics: the first subjective, the second objective. Subjective probability is the more intuitive and describes a personal degree of belief that an event will occur. It also forms the basis of the Bayesian method of inference. In contrast, the objective interpretation of probability is that probabilities are not personal but exist independent of our beliefs. The NHST approach and Fisher's ideas are based on an objective interpretation of probability proposed by Richard von Mises (von Mises, 1928). This interpretation is best illustrated using a coin-toss example. For a fair coin, the probability of heads is 0.5 and reflects the proportion of times we expect the coin to land on heads. However, it cannot be the proportion of times it lands on heads in any finite number of tosses (e.g. if in 10 tosses we see 7 heads, the probability of heads is not 0.7). Instead, the probability refers to an infinite number of hypothetical coin tosses, referred to as a collective or, in more common terms, a population of scores of which the real data are assumed to be a sample. The population must be clearly defined. In this example, it could be all hypothetical sets of 10 tosses of a fair coin using a precise method under standard conditions. Clearly, 7 heads from 10 tosses is perfectly possible even with a fair coin, but the more times we toss the coin, the more we would expect the proportion of heads to approach 0.5. The important point is that the probability applies to the hypothetical-infinite collective and not to a single toss or even a finite number of tosses. It follows that objective probabilities also do not apply to hypotheses: a hypothesis in the NHST approach is simply retained or rejected, in the same way that a single event either happens or does not, and has no associated population to which an objective probability can be assigned. Most scientists believe a p value from a significance test reveals something about the probability of the hypothesis being tested (generally the null). In fact, a p value in NHST says nothing about the likelihood of H0 or H1 or the strength of evidence for or against either one. It is the probability of data as extreme or more extreme than that collected occurring in a hypothetical-infinite series of repeats of an experiment if H0 were true (Oakes, 1986). In other words, the truth of H0 is assumed and fixed; p refers to all data from a hypothetical distribution probable under, or consistent with, H0. It is the conditional probability of the observed data assuming the null hypothesis is true, written as p(D|H0).
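To make the distinction concrete, the short sketch below (an illustration added here, not an analysis from the cited sources) computes the probability of observing data as extreme or more extreme than 7 heads in 10 tosses, assuming the null hypothesis of a fair coin. It is this conditional probability, p(D|H0), that a significance test reports; it is not the probability that the coin is fair.

    # Sketch: p-value for 7 heads in 10 tosses under H0: P(heads) = 0.5
    from math import comb

    n, observed_heads, p_heads = 10, 7, 0.5

    # One-sided p-value: P(X >= 7 | H0), summing binomial probabilities for 7..10 heads
    p_one_sided = sum(
        comb(n, k) * p_heads**k * (1 - p_heads)**(n - k)
        for k in range(observed_heads, n + 1)
    )

    # Two-sided p-value: outcomes at least as far from the expected 5 heads in
    # either direction (i.e. 3 or fewer, or 7 or more, heads)
    p_two_sided = sum(
        comb(n, k) * p_heads**k * (1 - p_heads)**(n - k)
        for k in range(n + 1)
        if abs(k - n * p_heads) >= abs(observed_heads - n * p_heads)
    )

    print(f"P(data this extreme | H0), one-sided: {p_one_sided:.3f}")  # about 0.172
    print(f"P(data this extreme | H0), two-sided: {p_two_sided:.3f}")  # about 0.344

Neither tail probability falls below 0.05, consistent with the point above that 7 heads in 10 tosses is perfectly possible with a fair coin; the calculation says nothing about the probability that the coin is fair.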
Null-Hypothesis-Significance Testing logic
Based on the objective interpretation of probability, the NHST approach was designed to provide a dichotomous decision-making procedure with known and controlled long-run error rates. Neyman and Pearson were clear about this and, in the introduction to their classic Royal Society paper, stated that 'as far as a particular hypothesis is concerned, no test based on the (objective) theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis' (Neyman and Pearson, 1933, p. 291). Instead, they set about defining rules to govern decisions about retaining or rejecting hypotheses such that wrong decisions would not often be made, but the probability of making them in the long run would be known.
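The long-run character of these error rates can be illustrated with a simple simulation (a sketch using simulated data, added here for illustration): when a zero-effect null is in fact true and the type I error rate is set at 0.05, roughly 5% of repeated experiments reject H0. That controlled long-run rate is the property the procedure guarantees; it is not a statement about any single hypothesis.

    # Sketch: long-run type I error rate when H0 (equal population means) is true
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    alpha, n_experiments, n_per_group = 0.05, 10_000, 30
    rejections = 0

    for _ in range(n_experiments):
        # Both groups are drawn from the same population, so the zero-effect H0 is true
        therapy = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
        control = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
        if stats.ttest_ind(therapy, control).pvalue < alpha:
            rejections += 1

    print(f"Long-run rejection rate under a true null: {rejections / n_experiments:.3f}")
    # Expected to be close to alpha (0.05)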
The starting point of the N-P approach is a pair of contrasting hypotheses (H0 and H1). For example, H0 could be that I (population mean ankle dorsi-flexion angle given therapy) = C (population mean ankle dorsi-flexion angle given no therapy, i.e. control group), or, to put it another way, that the difference between I and C is zero. The alternative (H1) is then generally of the form I ≠ C, i.e. the population means of the therapy and control groups will not be equal / will differ. Here we have the first philosophical issue with the conventional use of NHST. Under the philosophy of Popper, a hypothesis should be a specific prediction such that it is highly falsifiable. Popper argued that a theory that allows everything explains nothing (Popper, 1972a); falsifying a null of no difference simply allows for any magnitude of difference in any direction, hardly a severe test of a theory! Furthermore, the hypothesis under consideration (i.e. a zero-effect null) is not actually the hypothesis of interest, but is simply a straw man that the researcher does not believe, or they would not be performing the experiment. Not surprisingly, Popper was not a supporter of NHST (Dienes, 2008). A practical issue is also raised here. Ignoring the philosophical problem, if a null of no difference is rejected, a question remains: how big is the effect? It is generally the size of the effect of a therapy versus a control condition / group that is of real interest, not simply that the effect differs from zero in some unspecified amount and direction.
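The practical issue can be illustrated with a simulation (an illustrative sketch with made-up numbers, not data from any study): given a large enough sample, a difference far too small to be clinically meaningful still yields a 'significant' rejection of the zero-effect null, which is why the size of the effect, and not merely its existence, should be the focus.

    # Sketch: a statistically significant but trivially small effect
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n = 2_000  # participants per group (hypothetical)

    # Hypothetical numbers: a tiny true benefit of 0.5 units against a between-subject SD of 5
    therapy = rng.normal(loc=0.5, scale=5.0, size=n)
    control = rng.normal(loc=0.0, scale=5.0, size=n)

    result = stats.ttest_ind(therapy, control)
    mean_diff = therapy.mean() - control.mean()
    pooled_sd = np.sqrt((therapy.var(ddof=1) + control.var(ddof=1)) / 2)

    print(f"p = {result.pvalue:.4f}")  # typically < 0.05 with samples this large
    print(f"mean difference = {mean_diff:.2f}, standardised effect = {mean_diff / pooled_sd:.2f}")
    # H0 is rejected, yet the estimated effect is small; significance alone says
    # nothing about whether the difference is big enough to matter clinically.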
The illogic of NHST
Because H0 and H1 are mutually exclusive, if H0 is rejected then by deduction H1 is assumed true and, vice versa, if H0 is not rejected, H1 is assumed false. However, statistical inference, and indeed science, does not deal in absolute proofs, truths or falsehoods; there is always uncertainty. If this uncertainty is extended to the example, we have: if H0 is true, then data consistent with H1 are improbable; data consistent with H1 arise; therefore H0 is probably false. This logic has been challenged. Pollard and Richardson (1987) highlight the flaw using the following example: if a person is American, they are probably not a member of Congress; person X is a member of Congress; therefore person X is probably not American. Furthermore, Oakes