Spinal Cord (2014) 52, 2; © 2014 International Spinal Cord Society. All rights reserved 1362-4393/14. www.nature.com/sc

EDITORIAL

Statistical power calculations reflect our love affair with P-values and hypothesis testing: time for a fundamental change

Spinal Cord (2014) 52, 2; doi:10.1038/sc.2013.117

Statistical power calculations are still considered essential when planning trials in spinal cord injury (SCI), and some may think it heresy to suggest otherwise.1 Power calculations are reported in grant applications, ethics submissions and journal publications. The power of a trial refers to the probability (usually set at 80%) that the trial will reach statistical significance if there truly is the specified difference between two groups.2 Power calculations rely on estimating variability and nominating an α-level (usually 0.05); as variability increases, power decreases. Power is only a relevant concept during the planning stages of a trial. Once a trial is completed, power is meaningless.2,3 Determining whether a completed trial had a sufficient sample size requires examination of the precision of the treatment estimates (for example, the 95% confidence intervals of the mean between-group differences).

Power calculations are problematic for one fundamental reason: they perpetuate an outdated focus on the hypothesis-testing approach to statistical analysis.2,4 The hypothesis-testing approach dichotomises results as significant or not significant, making interpretation misleadingly simplistic. Clinical trialists and 'statisticians have for a long time argued that this [approach] is inappropriate and that trials should be used to estimate treatment effects and not to test hypotheses' (p. 806).5 This might go against everything some have been taught. However, the push away from P-values and hypothesis testing is not new, and journals are increasingly trying to encourage researchers and readers to focus on estimates of treatment effects6 (the uncertainty of those estimates can be quantified with confidence intervals). One leading journal tried abolishing P-values altogether, and a number of others insist that estimates of treatment effects be reported in an effort to stop researchers and readers simply dichotomising results as significant or not.6

An alternative to statistical power calculations is 'precision for planning' (p. 372).3 This involves calculating the sample size required to ensure a pre-set level of precision in the estimates: precise estimates require larger sample sizes than imprecise estimates. This approach still requires predicting variability, which can be problematic, but, importantly, it does not focus on P-values.
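To make the contrast concrete, the sketch below (in Python) shows the two ways of choosing a per-group sample size for a two-arm trial comparing group means. The inputs are illustrative assumptions, not values from any SCI trial: an outcome standard deviation of 10 points, a smallest worthwhile difference of 5 points and a target half-width of 4 points for the 95% confidence interval. Both formulas use the normal approximation and omit the 'assurance' adjustment described by Cumming.3

# A minimal sketch contrasting power-based and precision-based sample size
# planning for a two-arm trial comparing group means. All inputs are
# hypothetical assumptions, not values drawn from this editorial.

import math
from scipy.stats import norm  # standard normal quantiles

def n_for_power(diff, sd, alpha=0.05, power=0.80):
    # Per-group n so the trial has `power` to detect `diff` at a two-sided
    # alpha level (normal approximation, equal groups, common sd).
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sd / diff) ** 2)

def n_for_precision(half_width, sd, alpha=0.05):
    # Per-group n so the (1 - alpha) confidence interval for the mean
    # between-group difference has roughly the target half-width
    # (normal approximation; no assurance adjustment).
    z_alpha = norm.ppf(1 - alpha / 2)
    return math.ceil(2 * (z_alpha * sd / half_width) ** 2)

sd = 10.0  # assumed standard deviation of the outcome (hypothetical)
print(n_for_power(diff=5.0, sd=sd))            # about 63 per group
print(n_for_precision(half_width=4.0, sd=sd))  # about 49 per group

Under these assumptions the precision target asks for roughly 49 participants per group and the conventional power calculation roughly 63; the particular numbers matter less than the framing, which is the expected width of the confidence interval rather than the chance of crossing a significance threshold.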

Before we see a shift to 'precision for planning', there will need to be a fundamental change in attitudes and understanding. Trials will need to be valued for their ability to provide estimates of treatment effects rather than their ability to provide statistically significant findings. This shift in attitudes and understanding is important not only for how we determine sample size requirements for trials but also for how we conduct and interpret clinical trials, systematic reviews and clinical practice guidelines. We need this shift to advance evidence-based care for people with SCI.

CONFLICT OF INTEREST

The author declares no conflict of interest.

ACKNOWLEDGEMENTS

Thanks to Geoff Cumming, Joanne Glinsky and Ian Cameron for their valuable feedback.

LA Harvey
Sydney Medical School/Northern, University of Sydney, Sydney, New South Wales, Australia
E-mail: [email protected]

REFERENCES

1 Schulz KF, Grimes DA. Sample size calculations in randomised trials: mandatory and mystical. Lancet 2005; 365: 1348–1353.
2 Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med 1994; 121: 200–206.
3 Cumming G. Understanding the New Statistics: Effect Sizes, Confidence Intervals and Meta-Analysis. Routledge: London, UK, 2012.
4 Schmidt F, Hunter J. Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In: Harlow L, Mulaik S, Steiger J (eds). What If There Were No Significance Tests? Erlbaum: Mahwah, NJ, 1997.
5 Edwards SJL, Lilford RJ, Braunholtz D, Jackson J. Why 'underpowered' trials are not necessarily unethical. Lancet 1997; 350: 804–807.
6 Fidler F, Thomason N, Cumming G, Finch S, Leeman J. Editors can lead researchers to confidence intervals, but can't make them think: statistical reform lessons from medicine. Psychol Sci 2004; 15: 119–126.
