Ten Statistics Commandments That Almost Never Should Be Broken

Thomas R. Knapp, Jean K. Brown

Correspondence to Thomas R. Knapp. E-mail: [email protected]

Thomas R. Knapp, Professor Emeritus, University of Rochester and The Ohio State University, 78-6800 Alii Dr #10, Kailua-Kona, HI 96740

Abstract: Quantitative researchers must choose among a variety of statistical methods when analyzing their data. In this article, the authors identify ten common errors in how statistical techniques are used and reported in clinical research and recommend stronger alternatives. Useful references to the methodological research literature in which such matters are discussed are provided. © 2014 Wiley Periodicals, Inc.

Keywords: statistics; statistical significance; measurement; research design

Research in Nursing & Health, 2014, 37, 347–351. Accepted 4 June 2014.

Jean K. Brown, Professor and Dean Emerita, University at Buffalo, State University of New York, Buffalo, NY

DOI: 10.1002/nur.21605 Published online 29 June 2014 in Wiley Online Library (wileyonlinelibrary.com).

The realities of clinical nursing research are often at odds with the assumptions and peer expectations for statistical analyses. Random sampling is often impossible, so many clinical researchers employ sequential convenience sampling over several years and rely on multiple clinical sites for participant accrual. Many sample sizes are smaller than those determined by a priori power analysis because of funding limitations, changes in clinical practice, and "publish or perish" pressure. In addition, ordinal rather than interval-level measures are used for many study variables. Last, peer reviewers and journal editors sometimes make demands that are not always consistent with ideal statistical desiderata. So what is the clinical researcher to do? In the spirit of Knapp and Brown (1995), "Ten measurement commandments that often should be broken," we present ten statistics commandments that almost never should be broken. These are not the only statistics commandments that should be obeyed, but they are the ones that often have serious consequences when not followed.

The Ten Commandments

Thou Shalt Not Carry Out Significance Tests for Baseline Comparability in Randomized Clinical Trials

It has become increasingly common in randomized clinical trials to test the baseline comparability of the participants who have been randomly assigned to the various arms. There are several problems with this:

a. It reflects a mistrust of probability for providing balance across study arms. The significance test or confidence interval for the principal analysis takes into account pre-experimental differences that might be attributable to chance.

b. It involves the selection of a significance level that is often not based upon the consequences of making a Type I error and therefore is arbitrary.

c. If several baseline variables are involved in the test, it is likely that there will be at least one variable for which the difference(s) is (are) statistically significant. If that were to happen, should the researcher ignore it? Use such variables as covariates in the principal analysis after the fact? Both of those are poor scientific practices. Covariates should be chosen based upon their relationships to the dependent variable, not because of their imbalance at baseline.

For more on this matter, see Senn (1994) and Assmann, Pocock, Enos, and Kasten (2000). But why do some people break this commandment? Because of small samples, for which random assignment might not produce pre-experimental equivalence? Because editors and/or reviewers require it? A better strategy is to create blocks of participants in small sequential groups and then randomly assign participants within each block, in order to improve the comparability of the study arms (see Efird [2011]). For example, a randomized clinical trial with a desired sample of 100 could be divided into 10 blocks of 10 participants each. Within each block of 10, 5 participants would be randomly assigned to the experimental treatment group and 5 to the usual care group.
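As a concrete sketch of this blocking strategy (not from the original article; the two-arm labels, block size, and seed are illustrative assumptions), the following Python snippet assigns 100 participants to two arms in balanced blocks of 10.

```python
# A minimal sketch of blocked random assignment with two arms and
# equal-sized blocks, as in the 100-participant example above.
import random

def blocked_assignment(n_participants=100, block_size=10, seed=2014):
    """Assign participants to 'experimental' or 'usual care' in balanced blocks."""
    rng = random.Random(seed)
    assignments = []
    for block_start in range(0, n_participants, block_size):
        # Each block contains exactly half experimental and half usual-care slots.
        block = (["experimental"] * (block_size // 2)
                 + ["usual care"] * (block_size // 2))
        rng.shuffle(block)  # randomize the order of assignment within the block
        assignments.extend(block)
    return assignments

arms = blocked_assignment()
print(arms[:10])                   # assignments for the first block of 10
print(arms.count("experimental"))  # 50: balanced within every block and overall
```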

Thou Shalt Not Pool Data Across Research Sites Without First Testing for "Poolability"

In a multi-site investigation, it is common for researchers to combine the data across sites in order to get a larger sample size, without first determining the extent to which it is appropriate to do so. For example, in a study of the relationship between age and pulse rate, it is possible that the relationship between those two variables might be quite different at Site A than at Site B. The subjects at Site A might be older and have higher pulse rates than the subjects at Site B. If the data are pooled across sites, the relationship could be artificially inflated. (The same thing happens if you pool the data for females and for males in investigating the relationship between height and weight. Males are generally taller and heavier, so pooling stretches out and thins out the pattern on a scatter diagram, giving a better elliptical fit.) In randomized clinical trials involving two or more sites, the researcher must first test the treatment-by-site interaction effect. Only if that effect is small is it justified to pool the data across sites and test the main effect of treatment (see Kraemer [2000]).
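One way to check the treatment-by-site interaction before pooling is an ordinary-least-squares model with an interaction term. The sketch below uses simulated data; the variable names, effect sizes, and the use of statsmodels are illustrative assumptions, not part of the original article.

```python
# A hedged sketch of examining the treatment-by-site interaction before
# pooling data across sites, using simulated two-site trial data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_per_cell = 30
df = pd.DataFrame({
    "site": np.repeat(["A", "B"], 2 * n_per_cell),
    "treatment": np.tile(np.repeat(["experimental", "control"], n_per_cell), 2),
})
# Outcome with a treatment main effect and (for illustration) no site interaction.
effect = np.where(df["treatment"] == "experimental", 2.0, 0.0)
df["outcome"] = 50 + effect + rng.normal(0, 5, len(df))

model = smf.ols("outcome ~ C(treatment) * C(site)", data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)  # inspect the C(treatment):C(site) row before pooling sites
```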

Thou Shalt Not Report a Variety of p Values After Having Chosen A Priori a Particular Significance Level or Confidence Level in Order to Determine Sample Size

Cohen (1988, 1992) initially urged researchers to use α (a tolerable probability of making a Type I error, usually .05), desired power (1 − β, the probability of not making a Type II error, usually .80), and the alternatively hypothesized effect size (usually "medium" and clinically observable) in order to determine the appropriate sample size for a given study. Having carried out such a study, researchers need only compare their obtained p value with the pre-specified α in order to claim statistical significance (if p is less than α) or statistical non-significance (if it is not). Later, Cohen (1994) and others regretted putting so much emphasis on significance testing and argued that confidence intervals (usually 95%) should be used instead. (There is an unfortunate error in that article: Cohen claimed that many people think the p value is the probability that the null hypothesis is false. Not so; they think the p value is the probability that the null hypothesis is true. It is neither. The p value is the probability of the obtained result, or anything more extreme, if the null hypothesis is true.) The two approaches are very closely related. Statistical significance (or non-significance) can be read directly from a confidence interval: if the null hypothesized parameter is inside the 95% confidence interval, the finding is not statistically significant at the .05 level, and if the null hypothesized parameter is outside the 95% confidence interval, the finding is statistically significant at the .05 level. See Cumming and Finch (2005) for guidelines regarding the proper use of confidence intervals.

The confidence interval approach was slow to catch on in the nursing research literature, but its use has noticeably increased. More recently, there has been a tendency to report 95% confidence intervals along with the actual magnitudes of the p values. For example, the 95% confidence interval for a population Pearson correlation coefficient might be said to be from .45 to .55, and p might be expressed as less than .01, or equal to .0029 or another number. There are several reasons why this is bad practice. First, if a particular value of α (for hypothesis testing) or 1 − α (for interval estimation) has been specified in order to determine an appropriate sample size, there is no need to be concerned about the size of p, with or without the use of single, double, and triple asterisks (see Slakter, Wu, and Suzuki-Slakter [1991]). Second, p is not a measure of the strength of an effect. A correlation coefficient or a difference between means is appropriate for that and should always be reported. Third, if the credibility of the inference depends upon .05 (for 95% confidence) on the one hand and upon .01 (for statistical significance) on the other hand, which should the reader care about? Finally, if more than one inferential procedure has been carried out, with only 95% confidence intervals but with p values that are all over the place, the reader is subjected to information overload. The actual magnitude of the p value is not important. All that matters is whether or not it is less than α. (Claiming that a finding "approached statistical significance," "just missed being statistically significant," or the like is simply indefensible.) The strength of the effect is indicated by an effect-size measure, such as the correlation coefficient or the difference between means, not by p.
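The correspondence between the 95% confidence interval and the .05-level test can be verified numerically. The following sketch uses simulated data and an assumed one-sample t test (my example, not the authors'): the null value falls outside the 95% interval exactly when the two-sided p value is below .05.

```python
# A small numerical check of the CI / significance-test correspondence
# described above, using invented data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=0.6, scale=2.0, size=40)   # illustrative data
null_value = 0.0

t_stat, p_value = stats.ttest_1samp(sample, popmean=null_value)

mean = sample.mean()
se = stats.sem(sample)
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=se)

outside_ci = null_value < ci_low or null_value > ci_high
print(f"p = {p_value:.4f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
# The two conclusions always agree for this test:
print("Null value outside CI:", outside_ci, "| p < .05:", p_value < 0.05)
```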

Thou Shalt Not Use the Word "Significant" in the Statement of a Hypothesis

In the nursing research literature, it is common to find the phrasing of a hypothesis as "There is no significant relationship between X and Y" (null) or "There is a significant relationship between X and Y" (alternative). Both are wrong. Hypotheses are about populations and their parameters, even when they are tested in a given study by using statistical results for samples. Statistical significance or non-significance is a consequence of choice of significance level and sample size as well as the actual magnitude of the test statistic. The primary interest is in the population, from which the sample has been drawn.

Example of poor wording: There is no significant relationship between height and weight.

Example of appropriate wording: There is no relationship between height and weight.

The relationship between height and weight in the sample might or might not be statistically significant. In the population, there is either a relationship between height and weight or there is not. See Polit and Beck (2008) for a good discussion of the wording of hypotheses.

Thou Shalt Not Refer to Power After a Study Has Been Completed

Power is an a priori concept, as is significance level. Before a study is carried out, researchers specify the power they desire, reflecting the desired probability of correctly rejecting the null hypothesis when it is false, and accrue a sample of the size that provides such power. After the study has been completed, whether or not the null hypothesis has been rejected, it is inappropriate to address the power that they actually had after the fact. If the null hypothesis is rejected, the researchers had sufficient power to do so. If it is not rejected, either there was not sufficient power or the null hypothesis is true. To calculate an observed (post hoc, retrospective) power is worthless. It is perfectly correlated (inversely) with the observed p value; that is, the higher the observed power, the smaller the p value associated with it. For excellent discussions of problems with observed power, see Zumbo and Hubley (1998) and Hoenig and Heisey (2001).
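To illustrate why observed power adds nothing beyond p, the sketch below computes both quantities for an assumed two-sided one-sample z test over an arbitrary grid of test statistics (my construction, not the authors'); one is simply a decreasing function of the other.

```python
# A sketch of the deterministic relationship between observed ("post hoc")
# power and the observed p value for a two-sided one-sample z test.
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)

for z_obs in [1.0, 1.5, 1.96, 2.5, 3.0]:
    p_value = 2 * (1 - norm.cdf(abs(z_obs)))
    # "Observed power" treats the observed effect as if it were the true effect.
    observed_power = (1 - norm.cdf(z_crit - abs(z_obs))) + norm.cdf(-z_crit - abs(z_obs))
    print(f"z = {z_obs:.2f}  p = {p_value:.4f}  observed power = {observed_power:.3f}")
# Larger z gives a smaller p and a larger observed power; one determines the other.
```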

Thou Shalt Not Use Descriptive Statistics Developed for Interval and Ratio Scales to Summarize Ordinal Scale Data

Consider the (arithmetic) mean, the standard deviation, and the Pearson product-moment correlation coefficient. These three ways of summarizing data, whether for a population or for a sample drawn from a population, are used very often in quantitative research. All of them require that the scale of measurement be at least interval (they are also fine for ratio scales), because they all involve the addition and/or the multiplication of various quantities. Such calculations are meaningless for nominal and ordinal scales. For example, you should not add a 1 for "strongly agree" to a 5 for "strongly disagree," divide by 2 (i.e., multiply by 1/2), and get a 3 for "undecided." Some people see nothing wrong with describing ordinal data through the use of means, standard deviations, and Pearson rs. Others are vehemently opposed to doing so. There have been heated arguments about this since Stevens (1946) proposed his nominal, ordinal, interval, and ratio taxonomy [see, for example, Gaito (1980) and Townsend and Ashby (1984)], but cooler heads soon prevailed (Moses, Emerson, & Hosseini, 1984). An entire book (Agresti, 2010) has been written about the proper analysis of ordinal data. For one of the best discussions of the inappropriateness of using traditional descriptive statistics with ordinal scales, see Marcus-Roberts and Roberts (1987).

Because of the prevalence of ordinal measurement in clinical research and the favoring of parametric over non-parametric inference, it is difficult to forgo treating ordinal scales like interval scales, given that parametric tests usually have greater power than non-parametric tests. However, the latter can have greater power when the level-of-measurement assumptions (i.e., interval or ratio) of parametric tests are not satisfied [see, for example, Sawilowsky (2005)].
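For illustration only (the Likert-type responses below are invented), rank-based summaries such as the median, the interquartile range, and the Spearman rank correlation respect the ordinal character of such data where the mean, standard deviation, and Pearson r do not.

```python
# Ordinal-appropriate summaries for invented Likert-type items
# (1 = strongly agree ... 5 = strongly disagree; codes, not true intervals).
import numpy as np
from scipy import stats

item_a = np.array([1, 2, 2, 3, 4, 4, 5, 5, 5, 3])
item_b = np.array([2, 1, 3, 3, 4, 5, 5, 4, 5, 2])

print("Median of item A:", np.median(item_a))   # ordinal-appropriate center
print("IQR of item A:",
      np.percentile(item_a, 75) - np.percentile(item_a, 25))

rho, p = stats.spearmanr(item_a, item_b)        # rank-based association
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```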

Thou Shalt Not Report Percentages That Do Not Add to 100 Without a Note Indicating Why They Do Not

When reading a journal article that contains percentages for a particular variable, it is instructive to check whether they add to 100, or very close to 100 (if they have been rounded). Why? First of all, if they do not add to 100, it may indicate a lack of care that the authors might have taken elsewhere in the article. Second, almost every study has missing data, and it is often the case that the percentages are taken out of the total sample size rather than the non-missing sample size. Finally, in an excellent article, Mosteller, Youtz, and Zahn (1967) show that the probability of rounded percentages adding to exactly 100 is 1 (certain) for variables with two categories, approximately 3/4 for three categories, approximately 2/3 for four categories, and approximately √(6/(cπ)) for c ≥ 5, where c is the number of categories and π is the well-known ratio of the circumference of a circle to its diameter (approximately 3.14). The greater the number of categories, the smaller the likelihood of an exact total of 100%.

A related matter is the accurate use and interpretation of percentages in cross-tabulations. For a contingency ("cross-tabs") table, for example, both SAS and SPSS output can provide three percentages (row, column, and total) for each of the cell frequencies. Usually only one of those is relevant to the research question. Using the wrong one can have a serious effect on the interpretation of the finding. See Garner (2010) for an interesting example of a study of the relationship between pet ownership and home location, in which using percentages the wrong way resulted in failure to answer the research question.
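The Mosteller, Youtz, and Zahn result is easy to check by simulation. In the sketch below, modeling "random" percentages with uniform Dirichlet proportions and simple rounding is my own assumption; √(6/(πc)) is the approximation quoted above for five or more categories.

```python
# Simulating the probability that rounded percentages add to exactly 100.
import math
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000

for c in [2, 3, 4, 5, 8]:
    proportions = rng.dirichlet(np.ones(c), size=n_trials) * 100
    rounded_sums = np.rint(proportions).sum(axis=1)
    simulated = np.mean(rounded_sums == 100)
    approximation = math.sqrt(6 / (math.pi * c))   # quoted approximation for c >= 5
    print(f"c = {c}: simulated = {simulated:.3f}, sqrt(6/(pi*c)) = {approximation:.3f}")
```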

Thou Shalt Not Use Pearson Product-Moment Correlation Coefficients Without First Testing for Linearity

This commandment is arguably broken more often than any of the others. The "Pearson r," as it is affectionately called, is a measure of the direction and the magnitude of the linear relationship between two variables, but it is unusual to read an article in which the researchers have provided a scatterplot or otherwise indicated that they tested the linearity of the relationship before calculating Pearson r. The principal offenders are those who report Cronbach's α as a measure of the internal consistency reliability of a multi-item test without exploring the linearity of the relationships involved. Non-linear inter-item relationships can seriously influence (upward or downward) that well-known indicator of reliability. (See Sijtsma [2009] for a critique of the uses and misuses of Cronbach's α.) In order to test for linearity, there are three choices: (1) "eyeball" the scatterplot to see whether or not the pattern looks linear; (2) use residual analysis (Verran & Ferketich, 1987); or (3) test for statistical significance the difference between the sample r² and the sample η². (This test has been available for over a century. See, for example, Blakeman [1905].) A brief sentence regarding linearity in the research report would assure the reader that the matter has been addressed.
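Option (3) can be sketched as follows for a predictor that takes a limited number of distinct values, where each value defines a group. This is a generic deviation-from-linearity F test in the spirit of Blakeman (1905); the simulated data and the function name are my own illustrative assumptions.

```python
# A hedged sketch of testing whether eta-squared exceeds r-squared by more
# than chance, for a predictor with discrete values (each value = one group).
import numpy as np
from scipy import stats

def linearity_test(x, y):
    """F test of deviation from linearity; returns (eta2, r2, F, p)."""
    n = len(y)
    r, _ = stats.pearsonr(x, y)
    r2 = r ** 2

    grand_mean = y.mean()
    ss_total = np.sum((y - grand_mean) ** 2)
    levels = np.unique(x)
    ss_between = sum(np.sum(x == lv) * (y[x == lv].mean() - grand_mean) ** 2
                     for lv in levels)
    eta2 = ss_between / ss_total          # >= r2 when groups are the x values

    k = len(levels)
    f = ((eta2 - r2) / (k - 2)) / ((1 - eta2) / (n - k))
    p = stats.f.sf(f, k - 2, n - k)
    return eta2, r2, f, p

rng = np.random.default_rng(3)
x = np.repeat(np.arange(1, 9), 25)                 # a predictor with 8 levels
y = 0.5 * x + rng.normal(0, 1, len(x))             # truly linear relationship
eta2, r2, f, p = linearity_test(x, y)
print(f"eta2 = {eta2:.3f}, r2 = {r2:.3f}, F = {f:.2f}, p = {p:.3f}")
# A large p suggests the linear description (and hence Pearson r) is adequate.
```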

Thou Shalt Not Dichotomize or Otherwise Categorize Continuous Variables Without a Very Compelling Reason for Doing So

Many people think that in order to test main effects and interaction effects of independent variables, those variables must always be categorical. Not so. Categorizing a continuous variable always throws away interesting information, violates the assumptions of many parametric statistics, and is not advisable (see, for example, MacCallum, Zhang, Preacher, and Rucker [2002], Streiner [2002], and Owen and Froman [2005]). A multiple regression approach to the analysis of variance permits testing of main effects and interaction effects of both continuous and categorical independent variables. A continuous variable that is frequently categorized is age. It could be argued that continuous age is always categorized, even for single years of age of 1, 2, 3, etc. But dichotomization is an extreme case of coarseness of categorization, and is rarely necessary. Age can be determined to the nearest year, month, week, day, or even minute, using statistical software such as SAS or SPSS. Just input the date of birth and the date that is of concern in the study, and age can easily be calculated. There is no need to transform actual ages into intervals such as 0–4, 5–9, 10–14, etc., or into popular categories such as "Silent Generation" (those born 1922–1945), "Baby Boomers" (those born 1946–1964), "Generation X" (those born 1965–1980), and "Millennial Generation" (those born 1981–2000), unless those categories are of direct interest in the study.

Thou Shalt Not Test for Main Effects Without First Testing for Interaction Effects in a Factorial Design

All combinations of significance of main effects and interaction effects are possible (both significant, both not, and one significant and the other not). When both are statistically significant, the interaction takes precedence in interpreting the results (the effect of Factor A depends upon the level of Factor B). Interaction effects constrain the interpretation of the findings, so there is no need to test the main effects when the interaction is statistically significant. Such findings are often theoretically undesirable if the aim is to answer definitively whether or not a treatment is effective, but they can be useful in practice. For example, an experimental intervention might be better for males than for females, whereas the corresponding control intervention might be better for females than for males. See the second commandment (above) for the special case of treatment-by-site interaction, and see Ameringer, Serlin, and Ward (2009) regarding the phenomenon known as Simpson's Paradox, whereby main and interaction effects can be contradictory.
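The regression approach mentioned in the dichotomization commandment addresses both of these points at once: age stays continuous, and the interaction term is examined before the main effects. The variable names, simulated data, and use of statsmodels in the sketch below are illustrative assumptions, not the authors' code.

```python
# A hedged sketch: keep age continuous and test the treatment-by-age
# interaction before interpreting main effects.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 200
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),    # 0 = control, 1 = experimental
    "age": rng.uniform(20, 80, n),         # continuous, not categorized
})
# Illustrative outcome in which the treatment effect grows with age.
df["outcome"] = (40 + 2 * df["treatment"] + 0.1 * df["age"]
                 + 0.05 * df["treatment"] * df["age"] + rng.normal(0, 3, n))

model = smf.ols("outcome ~ treatment * age", data=df).fit()
print(model.params)
print("Interaction p value:", model.pvalues["treatment:age"])  # examine this first
```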


Conclusion

Unlike the 10 measurement commandments (Knapp & Brown, 1995) that should be broken much of the time, these 10 statistics commandments should be kept most of the time, in order to be methodologically correct. The penalty for breaking them will not be eternal damnation, but might be erroneous interpretation of data and reduced scholarly recognition.

References

Agresti, A. (2010). Analysis of ordinal categorical data (2nd ed.). New York: Wiley.
Ameringer, S., Serlin, R. C., & Ward, S. (2009). Simpson's Paradox and experimental research. Nursing Research, 58, 123–127. doi: 10.1097/NNR.0b013e318199b517
Assmann, S., Pocock, S. J., Enos, L. E., & Kasten, L. E. (2000). Subgroup analysis and other (mis)uses of baseline data in clinical trials. The Lancet, 355, 1064–1069. doi: 10.1016/S0140-6736(00)02039-0
Blakeman, J. (1905). On tests for linearity of regression in frequency distributions. Biometrika, 4, 332–350.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003. doi: 10.1037/0003-066X.49.12.997
Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals and how to read pictures of data. American Psychologist, 60, 170–180.
Efird, J. (2011). Blocked randomization with randomly selected block sizes. International Journal of Environmental Research and Public Health, 8, 15–20. doi: 10.3390/ijerph8010015
Gaito, J. (1980). Measurement scales and statistics: Resurgence of an old misconception. Psychological Bulletin, 87, 564–567. doi: 10.1037/0033-2909.87.3.564
Garner, R. (2010). The joy of stats (2nd ed.). Toronto: University of Toronto Press.
Hoenig, J. M., & Heisey, D. M. (2001). The pervasive fallacy of power calculations for data analysis. The American Statistician, 55, 19–24.
Knapp, T. R., & Brown, J. K. (1995). Ten measurement commandments that often should be broken. Research in Nursing & Health, 18, 465–469. doi: 10.1002/nur.4770180511
Kraemer, H. C. (2000). Pitfalls of multisite randomized clinical trials of efficacy and effectiveness. Schizophrenia Bulletin, 26, 533–541.
MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19–40. doi: 10.1037/1082-989X.7.1.19
Marcus-Roberts, H. M., & Roberts, F. S. (1987). Meaningless statistics. Journal of Educational Statistics, 12, 383–394. doi: 10.3102/10769986012004383
Moses, L. E., Emerson, J. D., & Hosseini, H. (1984). Analyzing data from ordered categories. New England Journal of Medicine, 311, 442–448. doi: 10.1056/NEJM198408163110705
Mosteller, F., Youtz, C., & Zahn, D. (1967). The distribution of sums of rounded percentages. Demography, 4, 850–858.
Owen, S. V., & Froman, R. D. (2005). Why carve up your continuous data? Research in Nursing & Health, 28, 496–503. doi: 10.1002/nur.20107
Polit, D. F., & Beck, C. T. (2008). Nursing research: Generating and assessing evidence for nursing practice. Philadelphia: Lippincott.
Sawilowsky, S. S. (2005). Misconceptions leading to choosing the t test over the Wilcoxon Mann–Whitney U test for shift in location parameter. Journal of Modern Applied Statistical Methods, 4, 598–600.
Senn, S. (1994). Testing for baseline balance in clinical trials. Statistics in Medicine, 13, 1715–1726. doi: 10.1002/sim.4780131703
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach's alpha. Psychometrika, 74, 107–120. doi: 10.1007/s11336-008-9101-0
Slakter, M. J., Wu, Y.-W. B., & Suzuki-Slakter, N. S. (1991). *, **, and ***: Statistical nonsense at the .00000 level. Nursing Research, 40, 248–249.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680.
Streiner, D. L. (2002). Breaking up is hard to do: The heartbreak of dichotomizing continuous data. Canadian Journal of Psychiatry, 47, 262–266.
Townsend, J. C., & Ashby, F. G. (1984). Measurement scales and statistics: The misconception misconceived. Psychological Bulletin, 96, 394–401. doi: 10.1037/0033-2909.96.2.394
Verran, J. A., & Ferketich, S. L. (1987). Testing linear model assumptions: Residual analysis. Nursing Research, 36, 127–129.
Zumbo, B., & Hubley, A. M. (1998). A note on misconceptions concerning prospective and retrospective power. The Statistician, 47, 385–388. doi: 10.1111/1467-9884.00139
