CORRESPONDENCE

Solutions for quantifying P-value uncertainty and replication power

To the Editor: We congratulate Halsey et al. for their important discussion of the random nature of P values in an earlier issue of this journal1. P values from identical experiments can differ greatly in a way that is surprising to many2–6. The failure to appreciate this wide variability can lead researchers to expect, without adequate justification, that statistically significant findings will be replicated, only to be disappointed later. Thus, we agree with the authors that "Data … interpretation must incorporate the uncertainty embedded in a P value." Herein, by describing novel statistical intervals to assess P-value uncertainty, we provide the means to easily do just that.

The Excel calculator (Supplementary Software) presented here uses the observed P value from an initial study to provide information about the uncertainty of the initial P value itself, as well as about the future P value and power of a follow-up or replication study that is identical except for sample size. We have previously published prediction intervals, based on initial P values, for follow-up or replication P values5. Prediction intervals are useful for designing studies, but they are unsuited for some purposes because they combine uncertainty from two different sources, the initial and follow-up P values, and vary according to the ratio of the two sample sizes. To characterize the uncertainty of the initial P value alone, we introduce P-value confidence intervals for the "true population P value" or p value, which we define as the value of P when parameter estimates equal their unknown population values. The unobserved p value reflects the statistical evidence available in the population and as such contrasts with the P value, which reflects the statistical evidence present in the sample. Lastly, we introduce confidence intervals for the de facto power of a study, which is the probability of rejecting the null hypothesis given the true, but unknown, population p value. This third type of interval allows an empirical assessment of power based on a P value alone. These three types of intervals can be used for a separate analysis of the follow-up or replication sample, and also for a 'meta-analysis' combining both samples. Intervals for P values from both one- and two-sided tests are included. Note that these intervals do not correct for selection bias or multiple testing.

It is striking that all the proposed intervals can be calculated from the original P value alone, without other data. Importantly, they do not depend on the sample size of the study. For example, suppose a published study reports a two-sided test with P = 0.049. Using the calculator, the 95% confidence interval for p is found to be [8.5 × 10^−5, 0.99] and barely excludes 1, which is the null value of p for a two-sided test. This interval and the variability it reflects are exactly the same whether the sample size is 100 or 100,000. The prediction and power intervals require specification of the relative size of the two studies, but not the absolute sample sizes. Suppose a replication study of the published finding above was planned using the same sample size. The 95% prediction interval for P, which will cover the actual follow-up P value 95% of the time, is [2.1 × 10^−6, 1], demonstrating how uncertain the outcome of the same-sized replication study would be. The 90% confidence interval for de facto power is [6.2%, 95.1%] and conveys a similar message (Fig. 1). (A 90% interval can be interpreted as giving 95% total confidence that the power is greater than the lower end of the interval.) For an initial finding of P = 0.049 with any sample size, there can be little confidence in the power of a same-sized replication study. If, instead, the initial finding is P = 0.001, the 95% p interval is [1.5 × 10^−7, 0.18]; the 95% prediction interval is [1.3 × 10^−9, 0.60]; and the 90% power interval is [37.7%, 99.9%]. Despite the fact that P = 0.001 is usually thought to be a highly significant finding, nonsignificant results in a same-sized replication study cannot be considered surprising. In fact, an initial finding of P < 0.00001 is needed to have 95% confidence that a same-sized replication study will have 80% power; P < 0.0003 is needed if the replication study is twice the size of the original study (Fig. 1). Furthermore, trying to replicate an initial test result of P = 0.05 with the same sample size gives 50% confidence of having 50% power to detect an effect in the same direction, but a sample 79 times that size is needed to ensure 95% confidence of having 80% power. See the calculator (Supplementary Software) and user guide (Supplementary Note) for additional information.
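
The Excel calculator itself is distributed as Supplementary Software and is not reproduced here, but the constancy of the intervals makes them easy to sketch. The Python snippet below is our illustration, not the authors' code: it assumes the standard normal-approximation framework of refs. 4 and 5, in which the observed two-sided P value is converted to a test statistic z1 that estimates its population counterpart with unit standard error, and the function names are ours. Under that assumption it reproduces the intervals quoted above.

```python
# Sketch of P-value confidence, prediction and power intervals computed
# from an observed two-sided P value alone, assuming the normal-statistic
# model z1 ~ N(z_pop, 1) of refs. 4 and 5. Illustrative only; function
# names are ours, not those of the authors' Excel calculator.
from scipy.stats import norm

def z_from_p(p):
    """|z| corresponding to a two-sided P value."""
    return norm.isf(p / 2)

def p_from_z(z):
    """Two-sided P value corresponding to |z|."""
    return 2 * norm.sf(abs(z))

def p_confidence_interval(p1, level=0.95):
    """Confidence interval for the population 'p value'."""
    z1 = z_from_p(p1)
    zc = norm.isf((1 - level) / 2)
    return p_from_z(z1 + zc), p_from_z(max(z1 - zc, 0.0))

def p_prediction_interval(p1, c=1.0, level=0.95):
    """Prediction interval for the replication P value; c = n2/n1."""
    z1 = z_from_p(p1)
    zc = norm.isf((1 - level) / 2)
    lo = z1 * c**0.5 - zc * (1 + c)**0.5
    hi = z1 * c**0.5 + zc * (1 + c)**0.5
    # If z2 = 0 lies inside the interval, the replication P can reach 1.
    upper = 1.0 if lo <= 0.0 <= hi else p_from_z(min(abs(lo), abs(hi)))
    return p_from_z(max(abs(lo), abs(hi))), upper

def de_facto_power(z_pop, c=1.0, alpha=0.05):
    """Power of a two-sided level-alpha test given the population z."""
    za = norm.isf(alpha / 2)
    return norm.cdf(z_pop * c**0.5 - za) + norm.cdf(-z_pop * c**0.5 - za)

def power_confidence_interval(p1, c=1.0, level=0.90, alpha=0.05):
    """Confidence interval for the de facto power of study 2."""
    z1 = z_from_p(p1)
    zc = norm.isf((1 - level) / 2)
    return de_facto_power(z1 - zc, c, alpha), de_facto_power(z1 + zc, c, alpha)

print(p_confidence_interval(0.049))      # ~(8.5e-05, 0.99)
print(p_prediction_interval(0.049))      # ~(2.1e-06, 1.0)
print(power_confidence_interval(0.049))  # ~(0.062, 0.951)
print(power_confidence_interval(1e-5))   # lower limit ~0.79, i.e. near 80%
```

The last line illustrates the replication-power claim: only once the initial P value falls below roughly 0.00001 does the lower limit of the 90% power interval for a same-sized replication reach about 80%.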

The constancy of the proposed intervals reflects a remarkable property of P values as compared to possible alternatives: a P value's scale and interpretation do not depend on context. As a result, P values can be compared across different types of studies, providing a concise, universal summary of different statistical tests. We believe that no other statistic fills this particular niche. Moreover, the alternatives (such as estimates, confidence intervals, false discovery rates, etc.) are also subject to random variation and, like P values, can behave badly if experiments are poorly designed or implemented. Nonetheless, we do not suggest that researchers rely on P values alone. Parameter estimates and confidence intervals, in particular, can describe data in a more detailed, contextual way. The P value gained its unique prominence because it is simple and interpretable across a variety of settings, despite the fact that it is sometimes misunderstood. P values are variable, but this variability reflects the real uncertainty inherent in statistical results. Thus, we believe P values will continue to have an important role in research, but an explicit understanding of P-value uncertainty can improve their interpretation.

Figure 1 | Confidence limits for power based on observed P. Estimated de facto power (solid blue curve) and 90% confidence limits (dashed blue curves) for study 2 with sample size N2, based on an observed P value from study 1 with sample size N1; left panel, N2 equals N1; right panel, N2 is twice N1. Each panel plots power (%) against observed P on a log scale from 1 to 0.00001. Assumes a two-sided test at significance level α = 0.05. Dashed red vertical line is at P = 0.05, and solid gray horizontal lines are at 80% and 90% power.

Note: Any Supplementary Information and Source Data files are available in the online version of the paper (doi:10.1038/nmeth.3741).

ACKNOWLEDGMENTS
We thank the reviewers for their helpful comments and suggestions. This work was supported by the US National Institutes of Health (grant MH086135 to L.C.L. as part of the Consortium on the Genetics of Schizophrenia (COGS)) and the Cooperative Studies Program of the US Department of Veterans Affairs.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Laura C Lazzeroni1, Ying Lu2,3 & Ilana Belitskaya-Lévy2
1Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, California, USA. 2Cooperative Studies Program Palo Alto Coordinating Center, Department of Veterans Affairs, Mountain View, California, USA. 3Department of Health Research and Policy, Stanford University School of Medicine, Stanford, California, USA. Correspondence should be addressed to L.C.L. ([email protected]).

1. Halsey, L.G., Curran-Everett, D., Vowler, S.L. & Drummond, G.B. Nat. Methods 12, 179–185 (2015).
2. Goodman, S.N. Stat. Med. 11, 875–879 (1992).
3. Cumming, G. Perspect. Psychol. Sci. 3, 286–300 (2008).
4. Boos, D.D. & Stefanski, L.A. Am. Stat. 65, 213–221 (2011).
5. Lazzeroni, L.C., Lu, Y. & Belitskaya-Lévy, I. Mol. Psychiatry 19, 1336–1340 (2014).
6. Nuzzo, R. Nature 506, 150–152 (2014).

Halsey et al. reply: We agree with Lazzeroni et al. that researchers often believe P values are infallible1. If the intervals Lazzeroni et al. propose were obligatory with each presentation of P, the unthinking use of the unqualified P value would be undermined2. In theory this would be an excellent outcome. However, in practice, simply providing tools for quantifying the fickleness of P will highlight an endemic problem without offering any treatment. Whereas Lazzeroni et al. suggest providing information to support P, we have suggested using measures that supersede P for interpreting data3,4. Effect sizes can be standardized, are not based on dichotomous decision making (the flaws of which severely limit the value of statistical power5) and address the more natural research question of how big the effect is, rather than simply asking whether there is an effect3,6. And 95% confidence intervals for the effect size provide a more consistent indication of the true (population-level) condition than does P. Thus comparing the effect sizes and confidence intervals of several similar studies typically uncovers a coherent pattern that is masked when only the P values of those studies are compared2. Furthermore, and crucially, the sample effect sizes and confidence limits of multiple studies can be combined for meta-analysis, enabling researchers to home in on the true effect.

Although we do not encourage the use of power analysis, Lazzeroni et al.'s figure supports our own illustration of the variability in P. As both our models7 and Lazzeroni et al.'s models demonstrate, unless the results of an experiment show a very marked pattern in the data, the reported P value will be accompanied by limits so broad as to render P uninterpretable. Put simply, P is untrustworthy unless the statistical power is very high (above 90%), which offsets advantages of P such as its simplicity. As researchers better appreciate the typically artificial nature of the null hypothesis3 and the limited capacity of P to support hypothesis testing, we believe that P will become much less highly valued.
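
To make concrete the kind of combination the reply appeals to, the sketch below pools per-study effect sizes with inverse-variance weights. This is our illustration of generic fixed-effect meta-analysis, not a procedure given by Halsey et al., and the input numbers are invented.

```python
# Sketch: fixed-effect (inverse-variance) meta-analysis of effect sizes.
# Our illustration of a standard method; the example inputs are invented.
import numpy as np

def fixed_effect_pool(effects, ses, z=1.96):
    """Pool per-study effect sizes using inverse-variance weights."""
    effects = np.asarray(effects, dtype=float)
    weights = 1.0 / np.asarray(ses, dtype=float) ** 2
    pooled = np.sum(weights * effects) / weights.sum()
    pooled_se = np.sqrt(1.0 / weights.sum())
    return pooled, (pooled - z * pooled_se, pooled + z * pooled_se)

# Three hypothetical studies reporting the same mean difference.
est, (lo, hi) = fixed_effect_pool([0.8, 0.5, 1.1], [0.40, 0.35, 0.50])
print(f"pooled effect = {est:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

As more studies are added, the pooled interval narrows around the population effect, which is the 'homing in' the reply describes.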

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Lewis G Halsey1, Douglas Curran-Everett2,3, Sarah L Vowler4 & Gordon B Drummond5
1Department of Life Sciences, University of Roehampton, London, UK. 2Division of Biostatistics and Bioinformatics, National Jewish Health, Denver, Colorado, USA. 3Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver, Denver, Colorado, USA. 4Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK. 5University of Edinburgh, Edinburgh, UK. e-mail: [email protected]

1. Open Science Collaboration. Science 349, aac4716 (2015).
2. Cumming, G. Perspect. Psychol. Sci. 3, 286–300 (2008).
3. Cohen, J. Am. Psychol. 49, 997–1003 (1994).
4. Johnson, D. J. Wildl. Mgmt. 63, 763–772 (1999).
5. Rosnow, R. & Rosenthal, R. Am. Psychol. 44, 1276–1284 (1989).
6. Nakagawa, S. & Cuthill, I. Biol. Rev. Camb. Philos. Soc. 82, 591–605 (2007).
7. Halsey, L.G., Curran-Everett, D., Vowler, S. & Drummond, G. Nat. Methods 12, 179–185 (2015).

Estimation statistics should replace significance testing

To the Editor: For more than 40 years, null-hypothesis significance testing and P values have been questioned by statistical commentators, their utility criticized on philosophical and practical grounds1. Luckily, a preferable statistical methodology is accessible with modest retraining. An obstacle to the adoption of this alternative seems to be the lack of a widely used name; we suggest the term 'estimation statistics' to describe the group of methods that focus on the estimation of effect sizes (point estimates) and their confidence intervals (precision estimates). Estimation statistics offers several key benefits with respect to current methods.

Estimation is an informative way to analyze and interpret data. For example, for an experiment with two independent groups, the estimation counterpart to a t-test is calculation of the mean difference (MD) and its confidence interval2. One calculates the MD by subtracting the mean for one group from the mean for the other, and its confidence interval falls between MD − (1.96 × SEMD) and MD + (1.96 × SEMD), where SEMD is the pooled standard error of the MD3.
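
A minimal sketch of the MD calculation just described is given below; the example data are invented, and the 1.96 normal-approximation form stated in the letter is kept (for small samples one would normally substitute the appropriate t quantile).

```python
# Sketch: mean difference (MD) and its ~95% confidence interval for two
# independent groups, per the formula above; example data are invented.
import numpy as np

def mean_difference_ci(a, b, z=1.96):
    """Return MD = mean(a) - mean(b) and MD -/+ z * SE_MD (pooled SE)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    md = a.mean() - b.mean()
    # Pooled variance across both groups, then the standard error of MD.
    pooled_var = (((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                  / (len(a) + len(b) - 2))
    se_md = np.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return md, (md - z * se_md, md + z * se_md)

group1 = [5.1, 4.8, 6.2, 5.5, 5.0]
group2 = [4.2, 4.6, 4.1, 4.9, 4.4]
md, (lo, hi) = mean_difference_ci(group1, group2)
print(f"MD = {md:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```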

For quantitative science, it is more useful to know and think about the magnitude and precision of an effect than it is to contemplate the probability of observing data of at least that extremity, assuming absolutely no effect. An old joke about study-
