The Journal of General Psychology, 2011, 138(4), 292–299
Copyright © 2011 Taylor & Francis Group, LLC
DOI: 10.1080/00221309.2011.604657

Correcting Overestimated Effect Size Estimates in Multiple Trials

WOLFGANG WIEDERMANN
BARTOSZ GULA
PAUL CZECH
DENISE MUSCHIK
University of Klagenfurt

ABSTRACT. In a simulation study, Brand, Bradley, Best, and Stoica (2011) showed that Cohen's d is notably overestimated when computed from data aggregated over multiple trials. Although the phenomenon is highly important for studies, and for meta-analyses of studies, structurally similar to the simulated scenario, the authors do not comprehensively address how the problem could be handled. In this comment, we first suggest a corrective term, d_c, that incorporates the number of trials and the inter-trial correlation. Next, the results of a simulation study provide evidence that the proposed d_c yields a more precise estimate of trial-level effects. We conclude that, in practice, d_c together with plausible estimates of the inter-trial correlation will produce a more precise effect size range than the one suggested by Brand and colleagues (2011).
Keywords: aggregation bias, Cohen's d, effect size, multiple trials

RECENTLY, BRAND, BRADLEY, BEST, AND STOICA (2011) addressed the important topic of highly overestimated effect size estimates obtained from data aggregated across multiple trials, henceforth referred to as aggregation bias. Via simulations, the authors showed that Cohen's d (Cohen, 1988), the standardized effect size of the mean difference between two independent samples, is generally overestimated if computed from the sums or means of repeated measurements, assuming that the effects of interest occur at the trial level. Under some circumstances (e.g., 30 independent trials per person), the population effect is overestimated by a factor of nearly six. The authors explain this by the fact that the pooled variance of aggregated trials decreases as the number of trials increases. They suggest the additional use of the more conservative split-plot ANOVA and the reporting of the average inter-trial correlation together with the number of trials.
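The mechanism is easy to demonstrate. The following minimal R sketch (ours, not part of the original simulations) computes Cohen's d from per-person means over k independent trials; the pooled standard deviation shrinks by a factor of √k, inflating d accordingly:

```r
## Illustration only: d computed from trial means inflates roughly with sqrt(k).
set.seed(1)
n <- 38; mu <- 10; sdev <- 2; d_true <- 0.5

d_aggregated <- function(k) {
  g1 <- matrix(rnorm(n * k, mu, sdev), n, k)                  # control group
  g2 <- matrix(rnorm(n * k, mu + d_true * sdev, sdev), n, k)  # treatment group
  m1 <- rowMeans(g1); m2 <- rowMeans(g2)  # mean-aggregation over trials
  sp <- sqrt((var(m1) + var(m2)) / 2)     # pooled SD shrinks as 1/sqrt(k)
  (mean(m2) - mean(m1)) / sp
}

round(sapply(c(1, 5, 10, 30), d_aggregated), 2)  # roughly 0.5, 1.1, 1.6, 2.7
```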


In this comment, (1) we argue that the suggested computation of an effect size range is opaque and of marginal practical value, (2) we propose a simple corrective term that attenuates the aggregation bias, and (3) we demonstrate the adequacy of the proposed correction method in a simulation study.

As a consequence of effect size overestimation, Brand and colleagues (2011, p. 9) suggest “. . . a liberal test and the more conservative split-plot design to gain a full understanding of a range of potential effect sizes.” Beyond this suggestion to incorporate trials as a separate factor in the analysis, the authors provide no further information on why and how this should be an improvement. Although many effect size measures for split-plot ANOVAs exist, such as ε², ω², or generalized ω² (Olejnik & Algina, 2003), η² is the one commonly reported in the psychological literature (Pierce, Block, & Aguinis, 2004). Partial η² is part of the standard output of statistical packages such as SPSS, which is likely to provoke imprecise reporting of η² values, with partial η² erroneously referred to as classic η² (Levine & Hullett, 2002; Pierce, Block, & Aguinis, 2004). Partial η² is defined as SS_effect / (SS_effect + SS_error), whereas classic η² is defined as SS_effect / SS_total, SS being the sum of squares. For single-trial studies, the two are identical because SS_total = SS_effect + SS_error. Both versions of η² can be transformed into the same metric as Cohen's d by applying

$$d = 2\sqrt{\frac{\eta^2}{1-\eta^2}}$$

(e.g., see Cohen, 1988).
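To make the distinction concrete, here is a small R sketch (ours; the sums of squares are hypothetical, not taken from the article) showing how partial and classic η² diverge in a multi-trial split-plot design and how each converts to the d metric:

```r
# Convert eta-squared to Cohen's d: d = 2 * sqrt(eta2 / (1 - eta2))
eta2_to_d <- function(eta2) 2 * sqrt(eta2 / (1 - eta2))

# Hypothetical sums of squares from a split-plot ANOVA with a trials factor
ss_effect <- 12   # between-groups (treatment) effect
ss_error  <- 88   # error term for the treatment effect
ss_other  <- 60   # remaining sources (trials, trials x group, ...)

partial_eta2 <- ss_effect / (ss_effect + ss_error)             # 0.12
classic_eta2 <- ss_effect / (ss_effect + ss_error + ss_other)  # 0.075

eta2_to_d(partial_eta2)  # ~0.74: tracks the inflated, aggregated d
eta2_to_d(classic_eta2)  # ~0.57: the more conservative estimate
```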

If partial η² is used, the overestimation will be the same as that for Cohen's d calculated from aggregated variables. This follows from the fact that both the aggregated Cohen's d and the partial η² for the main effect of treatment in the corresponding split-plot ANOVA rely on the same within-group sums of squares. Only the classic η² provides a more conservative estimate of the true underlying trial-level effect. However, even if an effect size range is computed from the classic η² (transformed into the Cohen's d metric) and the mean-aggregated Cohen's d, this interval will be of marginal value and misleading, for the following reasons: First, depending on the inter-trial correlation and the number of trials, the range may be too large to be informative. Second, should the classic η² provide an accurate estimate of the true effect, then the least-biased estimate of the effect is not located somewhere in the middle of the range but equals its lower bound. Moreover, it has been shown that such split-plot analyses lose power for the between-subjects effect as the number of trials and the inter-trial correlation increase (Bradley & Russell, 1998). Therefore, we suggest that the classic η² and the corrected Cohen's d proposed below be compared in future studies. In the present article, however, we focus on the improvement of Cohen's d, because it is more commonly reported in studies dealing with two-group comparisons.

A further conclusion the authors arrive at concerns a more sophisticated reporting practice, in which the number of trials as well as the average inter-trial correlation should routinely be reported to allow researchers to draw inferences about the strength of the results. Although we generally support the idea of a more sophisticated reporting routine in the presentation of empirical results, we doubt that this additional information allows for empirically sound conclusions about the actual inflation of effect size estimates. Instead of simply reporting these additional numbers, we suggest an empirical approach to assess the amount of overestimation, outlined in more detail in the next section.


A Simple Correction Method

The central limit theorem states that as the sample size n increases, the distribution of the mean of independent observations approaches a normal distribution with mean µ and variance σ²/n. Thus, for k independent trials, the individual means follow a normal distribution with mean µ and variance σ²/k, where σ² denotes the variance of each respondent's trial responses. Consequently, the bias in the effect size estimate d_a obtained from averaged responses depends only on the number of trials and can be corrected using

$$d_c = d_a / \sqrt{k}.$$

However, this approach requires all responses to be mutually independent, which seems rather implausible in practice. For correlated trials, the variance of the mean-aggregated measure of k trials is defined as

$$\frac{\sigma^2}{k} + \frac{k-1}{k}\,\rho\sigma^2,$$

with ρ and σ² being the inter-trial correlation and the trial variance (assuming equal variances across trials), respectively. The aggregated estimate divides the mean difference by this reduced standard deviation, inflating d by the factor k/√(k + ρ(k² − k)). Rearranging the equation above therefore shows that the mean-aggregated effect size estimate can be corrected using

$$d_c = d_a \Big/ \Big( k \Big/ \sqrt{k + \rho(k^2 - k)} \Big).$$

Here, knowledge of the true population correlation ρ is required. Therefore, we performed a simulation study to investigate the properties of this correction when the sample-based average inter-trial correlation is applied instead of the true population correlation.
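Expressed as code, the correction is a one-line R helper (a sketch of the formula above; the function name is ours):

```r
# Correct a mean-aggregated Cohen's d (d_a) for aggregation bias.
#   k   : number of trials
#   rho : inter-trial correlation; rho = 0 reduces to d_a / sqrt(k)
correct_d <- function(d_a, k, rho = 0) {
  d_a / (k / sqrt(k + rho * (k^2 - k)))
}

# Examples using aggregated estimates from Table 1 (k = 30):
correct_d(1.12, k = 30)             # ~0.20 (true d = 0.2, independent trials)
correct_d(1.72, k = 30, rho = 0.2)  # ~0.82 (true d = 0.8 plus small-sample bias)
```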

Methods

In essence, we adopted the simulation design of Brand and colleagues (2011)¹ in order to ensure comparability of results. However, instead of rearranging simulated values to obtain an average correlation between trials, two independent n × k matrices (n = number of participants, k = number of trials) were generated from a multivariate normal distribution. Each trial had a mean of 10 and a standard deviation of 2. Sample size was n = 38 per group. The covariance structures of the multivariate distributions were varied to obtain the average correlations ρ = 0, 0.2, 0.5, and 0.8.
The means of the second matrix were increased to obtain the three most common effect sizes, d = 0.20 (small), 0.50 (medium), and 0.80 (large). For each of 100,000 iterations, mean-aggregated variables were computed, and the Cohen's d estimate (d_a) as well as the corrected d_c were logged.

Because the proposed correction assumes knowledge of the true population correlation ρ, we additionally calculated

$$\tilde{d}_c = d_a \Big/ \Big( k \Big/ \sqrt{k + r(k^2 - k)} \Big),$$

replacing ρ with the averaged sample estimate r. We consider only the case of mean-aggregated measures, not aggregation based on sums, which Brand and colleagues (2011) also simulated; mean aggregation is more common in practice because missing values impair sum scores by leaving participants with unequal numbers of values. The simulation study was performed in R (R Development Core Team, 2011).
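For concreteness, the following R sketch reproduces one cell of the design (medium effect, ρ = 0.5, k = 30). It is our reconstruction of the procedure described above, not the code used in the study; the MASS package is assumed for multivariate normal generation.

```r
library(MASS)  # for mvrnorm()

set.seed(123)
n <- 38; k <- 30                  # participants per group; number of trials
mu <- 10; sdev <- 2; rho <- 0.5   # trial mean, SD, inter-trial correlation
d_true <- 0.5                     # true trial-level effect

# Equicorrelated covariance matrix shared by both groups
Sigma <- sdev^2 * (rho + diag(1 - rho, k))

g1 <- mvrnorm(n, mu = rep(mu, k), Sigma = Sigma)
g2 <- mvrnorm(n, mu = rep(mu + d_true * sdev, k), Sigma = Sigma)

m1 <- rowMeans(g1); m2 <- rowMeans(g2)  # mean-aggregation over trials
sp  <- sqrt((var(m1) + var(m2)) / 2)    # pooled SD (equal group sizes)
d_a <- (mean(m2) - mean(m1)) / sp       # aggregated, inflated estimate

# Sample-based correction: replace rho by the averaged inter-trial correlation
r_bar <- mean(c(cor(g1)[lower.tri(diag(k))], cor(g2)[lower.tri(diag(k))]))
d_c_tilde <- d_a / (k / sqrt(k + r_bar * (k^2 - k)))

round(c(d_a = d_a, d_c_tilde = d_c_tilde), 2)  # ~0.71 and ~0.53 on average (cf. Table 1)
```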

Results

Table 1 shows the amount of effect size inflation as a function of ρ and k for small, medium, and large true effects. For the uncorrected estimates (d_a), the results are virtually identical to those of Brand and colleagues (2011): d_a values become heavily inflated as the number of trials increases, and for uncorrelated trials this overestimation is more pronounced (with distortions ranging from 129% to 461%) than for correlated trials.

Correcting Cohen's d using the true population correlation (d_c) eliminates the inflation in all simulated scenarios. However, d_c values still lie slightly above the true population effect; across all scenarios, the overestimation ranged from 0% to 5%. This is attributable to the fact that the ordinary Cohen's d is itself biased in small samples (Hedges & Olkin, 1985). Correcting these values using Hedges and Olkin's formula

$$g = d\left(1 - \frac{3}{4N - 9}\right),$$

where N is the total sample size, would additionally mitigate the overestimation (not shown in Table 1 for reasons of space).

Because the true population correlation is unknown in practice, the rows of Table 1 labeled d̃_c refer to the corrected effect size measures based on the sample-based average inter-trial correlation. For independent trials (ρ = 0) and small true effects (d = 0.20), the distortion ranges from 0% to 15%; for medium true effects (d = 0.50), a maximum distortion of 68% is observed; and for large true effects (d = 0.80) with k > 1, the overestimation ranges from 28% to 130%. However, as Brand and colleagues (2011) already noted, a zero inter-trial correlation is extremely unrealistic, so the scenarios involving correlated trials are of greater practical interest. For small and medium true effects, the sample-based correction performs almost as well as the correction using the true population correlation: across all levels of trials and correlations, the maximum distortion is 12%. For large true effects (d = 0.80) combined with a rather small inter-trial correlation (ρ = 0.2), the inflation ranges from 15% to 24%, depending on the number of trials.

TABLE 1. Averaged Cohen's d (see text) for small, medium, and large true effects, derived from aggregation over k = 1, 5, 10, 20, and 30 trials. Values in parentheses represent the percentage of overestimation, computed from rounded estimates.

Cohen's d = 0.2 (small effect)

Correlation  Estimate  k = 1        k = 5         k = 10        k = 20        k = 30
0            d_a       0.20 (0.0)   0.46 (130.0)  0.65 (225.0)  0.92 (360.0)  1.12 (460.0)
             d_c       0.20 (0.0)   0.20 (0.0)    0.20 (0.0)    0.20 (0.0)    0.20 (0.0)
             d̃_c       0.20 (0.0)   0.21 (5.0)    0.21 (5.0)    0.22 (10.0)   0.23 (15.0)
0.2          d_a       0.20 (0.0)   0.34 (70.0)   0.39 (95.0)   0.42 (110.0)  0.43 (115.0)
             d_c       0.20 (0.0)   0.20 (0.0)    0.20 (0.0)    0.20 (0.0)    0.21 (5.0)
             d̃_c       0.20 (0.0)   0.21 (5.0)    0.21 (5.0)    0.21 (5.0)    0.21 (5.0)
0.5          d_a       0.20 (0.0)   0.27 (35.0)   0.28 (40.0)   0.28 (40.0)   0.28 (40.0)
             d_c       0.20 (0.0)   0.21 (5.0)    0.20 (0.0)    0.20 (0.0)    0.20 (0.0)
             d̃_c       0.20 (0.0)   0.21 (5.0)    0.21 (5.0)    0.21 (5.0)    0.21 (5.0)
0.8          d_a       0.20 (0.0)   0.22 (10.0)   0.23 (15.0)   0.23 (15.0)   0.23 (15.0)
             d_c       0.20 (0.0)   0.21 (5.0)    0.20 (0.0)    0.20 (0.0)    0.21 (5.0)
             d̃_c       0.20 (0.0)   0.21 (5.0)    0.20 (0.0)    0.21 (5.0)    0.21 (5.0)

Cohen's d = 0.5 (medium effect)

Correlation  Estimate  k = 1        k = 5         k = 10        k = 20        k = 30
0            d_a       0.51 (2.0)   1.15 (130.0)  1.62 (224.0)  2.29 (358.0)  2.80 (460.0)
             d_c       0.51 (2.0)   0.51 (2.0)    0.51 (2.0)    0.51 (2.0)    0.51 (2.0)
             d̃_c       0.51 (2.0)   0.57 (14.0)   0.63 (26.0)   0.75 (50.0)   0.84 (68.0)
0.2          d_a       0.51 (2.0)   0.85 (70.0)   0.97 (94.0)   1.05 (110.0)  1.07 (114.0)
             d_c       0.51 (2.0)   0.51 (2.0)    0.51 (2.0)    0.51 (2.0)    0.51 (2.0)
             d̃_c       0.51 (2.0)   0.54 (8.0)    0.55 (10.0)   0.56 (12.0)   0.56 (12.0)
0.5          d_a       0.51 (2.0)   0.66 (32.0)   0.69 (38.0)   0.70 (40.0)   0.71 (42.0)
             d_c       0.51 (2.0)   0.51 (2.0)    0.51 (2.0)    0.51 (2.0)    0.51 (2.0)
             d̃_c       0.51 (2.0)   0.52 (4.0)    0.52 (4.0)    0.52 (4.0)    0.53 (6.0)
0.8          d_a       0.51 (2.0)   0.56 (12.0)   0.57 (14.0)   0.57 (14.0)   0.57 (14.0)
             d_c       0.51 (2.0)   0.51 (2.0)    0.51 (2.0)    0.51 (2.0)    0.51 (2.0)
             d̃_c       0.51 (2.0)   0.51 (2.0)    0.52 (4.0)    0.51 (2.0)    0.52 (4.0)

Cohen's d = 0.8 (large effect)

Correlation  Estimate  k = 1        k = 5         k = 10        k = 20        k = 30
0            d_a       0.82 (2.5)   1.83 (128.8)  2.59 (223.8)  3.66 (357.5)  4.49 (461.3)
             d_c       0.82 (2.5)   0.82 (2.5)    0.82 (2.5)    0.82 (2.5)    0.82 (2.5)
             d̃_c       0.82 (2.5)   1.02 (27.5)   1.23 (53.8)   1.56 (95.0)   1.84 (130.0)
0.2          d_a       0.82 (2.5)   1.37 (71.3)   1.55 (93.8)   1.67 (108.8)  1.72 (115.0)
             d_c       0.82 (2.5)   0.82 (2.5)    0.82 (2.5)    0.82 (2.5)    0.82 (2.5)
             d̃_c       0.82 (2.5)   0.92 (15.0)   0.95 (18.8)   0.98 (22.5)   0.99 (23.8)
0.5          d_a       0.82 (2.5)   1.06 (32.5)   1.10 (37.5)   1.13 (41.3)   1.14 (42.5)
             d_c       0.82 (2.5)   0.82 (2.5)    0.82 (2.5)    0.82 (2.5)    0.82 (2.5)
             d̃_c       0.82 (2.5)   0.86 (7.5)    0.86 (7.5)    0.87 (8.75)   0.87 (8.75)
0.8          d_a       0.82 (2.5)   0.89 (11.3)   0.90 (12.5)   0.91 (13.8)   0.91 (13.8)
             d_c       0.82 (2.5)   0.82 (2.5)    0.82 (2.5)    0.82 (2.5)    0.82 (2.5)
             d̃_c       0.82 (2.5)   0.83 (3.8)    0.83 (3.8)    0.83 (3.8)    0.83 (3.8)


Conclusions

The simulation shows that the distortion of effect size estimates can be corrected if the true underlying inter-trial correlation is known. If, on the other hand, the true correlation is unknown, the sample estimate can be used to notably reduce the overestimation reported by Brand and colleagues (2011). In the more realistic case of ρ > 0 and the most extreme case of a true Cohen's d = 0.8 (ρ = 0.2, 30 trials), the overestimation was 24% (average d̃_c = 0.99), compared with the distortion of 115% obtained from the uncorrected measure. Moreover, the corrected estimate varies far less with the number of trials than the uncorrected Cohen's d. Hence, d̃_c will be less prone to inflation in meta-analyses in which studies with different numbers of trials are combined.

A related problem, and a correction approach for inflated effect sizes from dependent samples, has been discussed by Dunlap, Cortina, Vaslow, and Burke (1996). For meta-analyses in which the sample estimate of the correlation between two dependent variables is unavailable, they suggested using estimates from previous studies. Similarly, if in meta-analyses of studies employing independent samples neither the true nor the sample estimate of the inter-trial ρ is given, surrogate values should be derived from previous studies. These would still allow the application of the more precise d̃_c. For example, given a study with 30 trials, each trial drawn from a population with true d = 0.5 and ρ = 0.5 (so that d_a = 0.71; see Table 1), and previous studies suggesting a plausible range of correlations between 0.4 and 0.6, then d̃_c is biased by merely 8% to 12%:

$$\tilde{d}_c^{\,lower} = 0.71 \Big/ \Big( 30 \Big/ \sqrt{30 + 0.4(30^2 - 30)} \Big) = 0.46$$

$$\tilde{d}_c^{\,upper} = 0.71 \Big/ \Big( 30 \Big/ \sqrt{30 + 0.6(30^2 - 30)} \Big) = 0.56$$

This range still constitutes a considerable improvement over d_a. An important question for future research on meta-analytic methods is to what degree the effect sizes reported in practice are distorted, and how corrective terms such as the suggested d̃_c should be included in the methodological repertoire in order to obtain valid knowledge of the empirical phenomena of interest.
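The bounds in this example can be checked directly in R (using the correct_d() helper sketched earlier):

```r
correct_d <- function(d_a, k, rho) d_a / (k / sqrt(k + rho * (k^2 - k)))

d_a <- 0.71; k <- 30          # aggregated estimate; number of trials
correct_d(d_a, k, rho = 0.4)  # lower bound: ~0.46
correct_d(d_a, k, rho = 0.6)  # upper bound: ~0.56
```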

NOTE

1. The authors want to thank Andrew Brand for supplying the R code used in the study.

AUTHOR NOTES

The authors wish to thank Rainer Alexandrowicz and Reinhold Hatzinger for valuable comments on an earlier version of the article. Address correspondence to Wolfgang Wiedermann, University of Klagenfurt, Department of Psychology, Universitaetsstrasse 65–67, 9020 Klagenfurt, Austria; [email protected] (e-mail).

Wolfgang Wiedermann is a research associate at the Applied Psychology and Methods Research Unit, University of Klagenfurt, and at the Department of Health Care Management, Carinthia University of Applied Sciences. His research focuses on statistical methods, addiction research, and health-related cognition. Bartosz Gula is assistant professor at the Cognitive Psychology Unit, University of Klagenfurt. His main research focuses on judgment and decision making, learning, and memory. Paul Czech is a graduate student at the Applied Psychology and Methods Research Unit, University of Klagenfurt. His primary interests are statistical methods. Denise Muschik is a graduate student at the Applied Psychology and Methods Research Unit, University of Klagenfurt. Her primary interests are applied statistics and consumer research.

REFERENCES

Bradley, D. R., & Russell, R. L. (1998). Some cautions regarding statistical power in split-plot designs. Behavior Research Methods, Instruments, & Computers, 30, 462–477.
Brand, A., Bradley, M. T., Best, L. A., & Stoica, G. (2011). Multiple trials may yield exaggerated effect size estimates. The Journal of General Psychology, 138, 1–11.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Dunlap, W. P., Cortina, J. M., Vaslow, J. B., & Burke, M. J. (1996). Meta-analysis of experiments with matched groups or repeated measures designs. Psychological Methods, 1, 170–177.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.
Levine, T. R., & Hullett, C. R. (2002). Eta squared, partial eta squared, and misreporting of effect size in communication research. Human Communication Research, 28, 612–625.
Olejnik, S., & Algina, J. (2003). Generalized eta and omega squared statistics: Measures of effect size for some common research designs. Psychological Methods, 8, 434–447.
Pierce, C. A., Block, R. A., & Aguinis, H. (2004). Cautionary note on reporting eta-squared values from multifactor ANOVA designs. Educational and Psychological Measurement, 64, 916–924.
R Development Core Team (2011). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org

Original manuscript received May 9, 2011
Final version accepted July 8, 2011
