CORRESPONDENCE

no other statistic fills this particular niche. Moreover, the alternatives (such as estimates, confidence intervals, false discovery rates, etc.) are also subject to random variation and, like P values, can behave badly if experiments are poorly designed or implemented. Nonetheless, we do not suggest that researchers rely on P values alone. Parameter estimates and confidence intervals, in particular, can describe data in a more detailed, contextual way. The P value gained its unique prominence because it is simple and interpretable across a variety of settings, despite the fact that it is sometimes misunderstood. P values are variable, but this variability reflects the real uncertainty inherent in statistical results. Thus, we believe P values will continue to have an important role in research, but an explicit understanding of P-value uncertainty can improve their interpretation.


© 2016 Nature America, Inc. All rights reserved.

Note: Any Supplementary Information and Source Data files are available in the online version of the paper (doi:10.1038/nmeth.3741).

ACKNOWLEDGMENTS
We thank the reviewers for their helpful comments and suggestions. This work was supported by the US National Institutes of Health (grant MH086135 to L.C.L. as part of the Consortium on the Genetics of Schizophrenia (COGS)) and the Cooperative Studies Program of the US Department of Veterans Affairs.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Laura C Lazzeroni1, Ying Lu2,3 & Ilana Belitskaya-Lévy2

studies can be combined for meta-analysis, enabling researchers to home in on the true effect. Although we do not encourage the use of power analysis, Lazzeroni et al.’s figure supports our own illustration of the variability in P. As both our models7 and Lazzeroni et al.’s models demonstrate, unless the results of an experiment show a very marked pattern in the data, the reported P value will be accompanied by limits so broad as to render P uninterpretable. Put simply, P is untrustworthy unless the statistical power is very high (above 90%), a requirement that offsets such advantages of P as its simplicity. As researchers better appreciate the typically artificial nature of the null hypothesis3 and the limited capacity of P to support hypothesis testing, we believe that P will become much less highly valued.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Lewis G Halsey1, Douglas Curran-Everett2,3, Sarah L Vowler4 & Gordon B Drummond5 1Department of Life Sciences, University of Roehampton, London, UK. 2Division

of Biostatistics and Bioinformatics, National Jewish Health, Denver, Colorado, USA. 3Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver, Denver, Colorado, USA. 4Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK. 5University of Edinburgh, Edinburgh, UK. e-mail: [email protected]

1. Open Science Collaboration. Science 349, aac4716 (2015).
2. Cumming, G. Perspect. Psychol. Sci. 3, 286–300 (2008).
3. Cohen, J. Am. Psychol. 49, 997–1003 (1994).
4. Johnson, D. J. Wildl. Mgmt. 63, 763–772 (1999).
5. Rosnow, R. & Rosenthal, R. Am. Psychol. 44, 1276–1284 (1989).
6. Nakagawa, S. & Cuthill, I. Biol. Rev. Camb. Philos. Soc. 82, 591–605 (2007).
7. Halsey, L.G., Curran-Everett, D., Vowler, S.L. & Drummond, G.B. Nat. Methods 12, 179–185 (2015).

of Medicine, Stanford, California, USA. 2Cooperative Studies Program Palo Alto Coordinating Center, Department of Veterans Affairs, Mountain View, California, USA. 3Department of Health Research and Policy, Stanford University School of Medicine, Stanford, California, USA. Correspondence should be addressed to L.C.L. ([email protected]).


1. Halsey, L.G., Curran-Everett, D., Vowler, S.L. & Drummond, G.B. Nat. Methods 12, 179–185 (2015).
2. Goodman, S.N. Stat. Med. 11, 875–879 (1992).
3. Cumming, G. Perspect. Psychol. Sci. 3, 286–300 (2008).
4. Boos, D.D. & Stefanski, L.A. Am. Stat. 65, 213–221 (2011).
5. Lazzeroni, L.C., Lu, Y. & Belitskaya-Lévy, I. Mol. Psychiatry 19, 1336–1340 (2014).
6. Nuzzo, R. Nature 506, 150–152 (2014).

Estimation statistics should replace significance testing

Halsey et al. reply: We agree with Lazzeroni et al. that researchers often believe P values are infallible1. If the intervals Lazzeroni et al. propose were obligatory with each presentation of P, the unthinking use of the unqualified P value would be undermined2. In theory this would be an excellent outcome. However, in practice, simply providing tools for quantifying the fickleness of P will highlight an endemic problem without offering any treatment. Whereas Lazzeroni et al. suggest providing information to support P, we have suggested using measures that supersede P for interpreting data3,4. Effect sizes can be standardized, are not based on dichotomous decision making (the flaws of which severely limit the value of statistical power5) and address the more natural research question of how big the effect is, rather than simply asking whether there is an effect3,6. And 95% confidence intervals for the effect size provide a more consistent indication of the true (population-level) condition than does P. Thus comparing the effect sizes and confidence intervals of several similar studies typically uncovers a coherent pattern that is masked when only the P values of those studies are compared2. Furthermore, and crucially, the sample effect sizes and confidence limits of multiple

To the Editor: For more than 40 years, null-hypothesis significance testing and P values have been questioned by statistical commentators, their utility criticized on philosophical and practical grounds1. Luckily, the preferred statistical methodology is accessible with modest retraining. An obstacle to the adoption of this alternative seems to be the lack of a widely used name; we suggest the term ‘estimation statistics’ to describe the group of methods that focus on the estimation of effect sizes (point estimates) and their confidence intervals (precision estimates). Estimation statistics offers several key benefits with respect to current methods. Estimation is an informative way to analyze and interpret data. For example, for an experiment with two independent groups, the estimation counterpart to a t-test is calculation of the mean difference (MD) and its confidence interval2. One calculates the MD by subtracting the mean for one group from the mean for the other, and its confidence interval falls between MD – (1.96 × SEMD) and MD + (1.96 × SEMD), where SEMD is the pooled standard error of the MD3. For quantitative science, it is more useful to know and think about the magnitude and precision of an effect than it is to contemplate the probability of observing data of at least that extremity, assuming absolutely no effect. An old joke about study-
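The mean-difference calculation described above can be sketched in code; the function name and interface here are my own, and the 1.96 multiplier is the large-sample normal critical value quoted in the text (an exact small-sample interval would use a t critical value instead):

```python
import math

def mean_diff_ci(group_a, group_b, z=1.96):
    """Mean difference (MD) between two independent groups and its CI.

    A minimal sketch of the estimation counterpart to a t-test:
    MD +/- z * SE_MD, where SE_MD is the pooled standard error of the MD.
    """
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    # Unbiased sample variances
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    # Pooled variance, then the standard error of the mean difference
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se_md = math.sqrt(sp2 * (1.0 / na + 1.0 / nb))
    md = ma - mb
    return md, (md - z * se_md, md + z * se_md)
```

For example, `mean_diff_ci([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])` gives MD = 1.0 with an interval of roughly (−0.6, 2.6); the width of the interval makes the imprecision of the estimate explicit, which is exactly the information a lone P value hides.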

1Department of Psychiatry and Behavioral Sciences, Stanford University School

108 | VOL.13 NO.2 | FEBRUARY 2016 | NATURE METHODS



ing metal springs: estimation reveals the proportionality between force and extension, Hooke’s law; P tells you, “When you pull on it, it gets longer”4.

Medical research has led the way in adopting estimation statistics. Using the effect size in the clinical research context rightfully places the focus on the magnitude of a treatment’s benefit, a perspective that has greatly advanced clinical decision making. For basic research, adopting effect sizes would better facilitate quantitative comparisons and models (such as Hooke’s law). Importantly, thinking about effect sizes during data interpretation encourages an analyst to have greater awareness of the metrics being used and how they relate to the natural processes under study.

The second key benefit of estimation statistics is that it allows for the synthesis of data from published sources by means of systematic literature review and meta-analysis. Meta-analysis is a way to average effect sizes from different studies, which produces a more precise overall estimate (that is, a narrower confidence interval). In medical research, meta-analytic studies of randomized controlled clinical trials are considered the strongest form of medical evidence; they are used to reconcile discordant results, produce precise estimates of treatment effects, identify knowledge gaps, guide clinical practice and inform further investigation. Meta-analyses are published in numerous medical journals as a quantitative alternative to the conventional ‘he said/she said’ narrative review. Meta-analytic studies are now also being used in preclinical research; for example, a recent study showed that the animal-model literature on stroke overstates efficacy5.

Estimation statistics’ third important benefit is its use of model construction to quantify trends in heterogeneous primary or published data.
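As an illustration of the pooling step described above, a minimal fixed-effect (inverse-variance) meta-analysis might look like the following; the function name and interface are hypothetical, and a real meta-analysis would typically also assess heterogeneity (for example, with a random-effects model):

```python
import math

def fixed_effect_meta(effects, ses, z=1.96):
    """Fixed-effect (inverse-variance) meta-analysis, as a minimal sketch.

    Each study contributes an effect size and its standard error; studies
    are weighted by 1 / SE^2, so more precise studies count for more.
    Returns the pooled effect and its CI, which is narrower than any
    individual study's interval.
    """
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se_pooled = math.sqrt(1.0 / sum(weights))
    return pooled, (pooled - z * se_pooled, pooled + z * se_pooled)
```

Pooling two studies that each report an effect of 2.0 with standard error 1.0 returns the same effect with a pooled standard error of about 0.71, i.e. a confidence interval roughly 30% narrower than either study alone — the "more precise overall estimate" referred to in the text.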
Models can be basic or more advanced, such as multivariate meta-regression, a method that accounts for sources of experimental heterogeneity in complex data. Like clinical data, basic research results are well suited to the use of multivariate models to analyze both primary and pooled published data from complex experimental designs. When the data have high integrity, such models can resolve discordance and misinterpretation caused by significance tests6. The use of estimation statistics remains rare in basic research. We suggest that as researchers become increasingly aware of the limitations of significance testing, they should use estimation in its place.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Adam Claridge-Chang1–3 & Pryseley N Assam4,5 1Program in Neuroscience and Behavioral Disorders, Duke-NUS Medical School,

Singapore. 2Institute for Molecular and Cell Biology, Singapore. 3Department of Physiology, National University of Singapore, Singapore. 4Singapore Clinical Research Institute, Singapore. 5Centre for Quantitative Medicine, Duke-NUS Graduate Medical School, Singapore. e-mail: [email protected]

1. Halsey, L.G., Curran-Everett, D., Vowler, S.L. & Drummond, G.B. Nat. Methods 12, 179–185 (2015).
2. Cumming, G. Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis (Routledge, 2012).
3. Altman, D.G., Machin, D., Bryant, T.N. & Gardner, M.J. (eds) Statistics with Confidence: Confidence Intervals and Statistical Guidelines (BMJ Books, 2000).
4. Tukey, J.W. Am. Psychol. 24, 83–91 (1969).
5. Sena, E.S., van der Worp, H.B., Bath, P.M.W., Howells, D.W. & Macleod, M.R. PLoS Biol. 8, e1000344 (2010).
6. Yildizoglu, T. et al. PLoS Genet. 11, e1005718 (2015).

The mutation significance cutoff: gene-level thresholds for variant predictions

To the Editor: Next-generation sequencing (NGS) identifies about 20,000 variants per exome, of which only a few may underlie genetic diseases. Variant-level methods such as PolyPhen-2 (polymorphism phenotyping version 2), SIFT (sorting intolerant from tolerant) and CADD (combined annotation–dependent depletion) attempt to predict whether a given variant is benign or deleterious1–3. These methods are commonly interpreted in a binary manner as a means of filtering out benign variants from NGS data, with a single significance cutoff value across all genes. CADD developers propose (but do not recommend for categorical usage) a fixed cutoff value between 10 and 20 on a scale of 1–99, with 99 being the most deleterious. Gene-level methods, including RVIS (residual variation intolerance score, which applies combined fixed gene- and variant-level cutoffs), de novo excess and GDI (gene damage index), are also useful4–6. However, a uniform cutoff is unlikely to be accurate genome-wide (see Supplementary Note). Here we describe the mutation significance cutoff (MSC), a quantitative approach that provides gene-level and gene-specific phenotypic impact cutoff values to improve the use of existing variant-level methods, and a public server for utilizing it (http://lab.rockefeller.edu/casanova/MSC). We first showed that with fixed cutoffs CADD outperformed PolyPhen-2 and SIFT (Supplementary Fig. 1a). We found that 40.84% of Human Gene Mutation Database (HGMD)7 curated disease-associated mutations were not missense (but rather nonsense, frameshift, regulatory, etc.) (Fig. 1a), contributing to low true positive predictions with PolyPhen-2 and SIFT. The 95% confidence interval (CI) of CADD scores for disease-associated mutations of a given HGMD gene overlapped, on average, with only 37.63% (41.89% median) of the 95% mutation CIs of all other HGMD genes (Fig. 1b).
The CADD scores of private disease-associated mutations were significantly higher than those of non-private disease-associated mutations (P < 10–300, Supplementary Fig. 1b), resulting in lower overall impact prediction scores when the allele frequency of a mutation was considered (Supplementary Fig. 2). We defined the MSC of a gene as the lower limit of the CI (90%, 95% or 99%) for the CADD, PolyPhen-2 or SIFT score of all its high-quality mutations described as pathogenic in HGMD or the ClinVar database8 (see Supplementary Methods). The MSC values (Fig. 1c, Supplementary Tables 1–9) varied considerably from gene to gene. We estimated the MSC values of the remaining protein-coding genes by an extrapolation from their rare nonsynonymous 1,000 Genomes Project9 alleles and validated these values by bootstrapping simulations (see Supplementary Methods, Supplementary Fig. 3, Supplementary Tables 1–9 for MSC based on CADD, PolyPhen-2 and SIFT with 90%, 95% and 99% CIs, respectively, and Supplementary Fig. 4a for 95% CI MSC scores). We found significant correlations between MSC and GDI values (P < 10–5, Supplementary Fig. 4b)6 and between MSC and purifying selection values (P < 10–5, Supplementary Fig. 4c). Low-MSC genes were associated with immune system pathways, whereas genes with high MSC values were enriched in ribosomal function (Supplementary Figs. 4d,e and Supplementary Table 10). MSC showed significantly better performance in distinguishing benign from deleterious alleles compared with CADD scores
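The gene-level cutoff defined above — the lower limit of a CI over a gene’s known pathogenic-mutation scores — can be sketched as follows. The function name is mine, and this assumes a normal-approximation CI on the mean score, which may differ from the published construction; treat it as an illustration of the idea, not the reference implementation.

```python
import math

def msc_lower_bound(pathogenic_scores, z=1.96):
    """Sketch of a gene-level mutation significance cutoff (MSC).

    Takes the impact-prediction scores (e.g. CADD) of a gene's known
    pathogenic mutations and returns the lower limit of an approximate
    95% CI on their mean, used as that gene's cutoff.
    """
    n = len(pathogenic_scores)
    m = sum(pathogenic_scores) / n
    # Unbiased sample standard deviation
    sd = math.sqrt(sum((x - m) ** 2 for x in pathogenic_scores) / (n - 1))
    return m - z * sd / math.sqrt(n)
```

Under this scheme, a candidate variant scoring below its gene’s cutoff would be filtered as likely benign, while variants at or above it are retained as potentially deleterious — replacing the single genome-wide threshold with one threshold per gene.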
