Got power? A systematic review of sample size adequacy in health professions education research David A. Cook • Rose Hatala

Received: 16 January 2014 / Accepted: 29 April 2014 / © Springer Science+Business Media Dordrecht 2014

Abstract Many education research studies employ small samples, which in turn lowers statistical power. We re-analyzed the results of a meta-analysis of simulation-based education to determine study power across a range of effect sizes, and the smallest effect that could be plausibly excluded. We systematically searched multiple databases through May 2011, and included all studies evaluating simulation-based education for health professionals in comparison with no intervention or another simulation intervention. Reviewers working in duplicate abstracted information to calculate standardized mean differences (SMDs). We included 897 original research studies. Among the 627 no-intervention-comparison studies the median sample size was 25. Only two studies (0.3 %) had ≥80 % power to detect a small difference (SMD > 0.2 standard deviations) and 136 (22 %) had power to detect a large difference (SMD > 0.8). Of the 110 no-intervention-comparison studies that failed to find a statistically significant difference, none excluded a small difference and only 47 (43 %) excluded a large difference. Among 297 studies comparing alternate simulation approaches the median sample size was 30. Only one study (0.3 %) had ≥80 % power to detect a small difference and 79 (27 %) had power to detect a large difference. Of the 128 studies that did not detect a statistically significant effect, 4 (3 %) excluded a small difference and 91 (71 %) excluded a large difference. In conclusion, most education research studies are powered only to detect effects of large magnitude. For most studies that do not reach statistical significance, the possibility of large and important differences still exists.

D. A. Cook (✉)
Division of General Internal Medicine, Mayo Clinic College of Medicine, Mayo 17, 200 First Street SW, Rochester, MN 55905, USA
e-mail: [email protected]

D. A. Cook
Mayo Clinic Online Learning and Mayo Clinic Multidisciplinary Simulation Center, Mayo Clinic College of Medicine, Rochester, MN, USA

R. Hatala
Department of Medicine, University of British Columbia, Vancouver, BC, Canada



Keywords Medical education · Data interpretation, statistical · Research design · Comparative effectiveness research · Noninferiority trials · Cohen's d

Introduction

Research in health professions education is booming. Education researchers increasingly employ relevant conceptual frameworks, strong study designs, and meaningful outcomes. However, prior reviews have suggested that in many education research studies the sample size is small (Issenberg et al. 2005; Lineberry et al. 2013), and that authors rarely consider the anticipated effect size (Cook 2012), plan the sample size in advance (Michalczyk and Lewis 1980), or interpret results in light of the actual precision (Cook et al. 2011b). All of this suggests a potential problem with the statistical power of education research studies. Although the question of statistical power has been addressed in clinical research (Chan and Altman 2005; Mills et al. 2005; Charles et al. 2009; Brody et al. 2013), we found only two studies specifically evaluating the power of research in health professions education (Michalczyk and Lewis 1980; Woolley 1983). These studies, published over 30 years ago, both examined a small sample of studies in a single journal and conducted retrospective power analyses showing that most studies had inadequate power to detect small effects. Given the paucity of prior work, we sought to empirically evaluate the issue of power in health professions education research.

Statistical power is defined as "the probability of rejecting the null hypothesis in the sample if the actual effect in the population equals the effect size" (Hulley et al. 2001, p. 57). Stated differently, power is the probability that a study will find a statistically significant effect if such an effect truly exists. Higher power is better, since studies with low power may fail to detect potentially important associations. Power >90 % is desirable, and power >80 % is generally accepted as a minimum.
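To make the relationship between power, sample size, and effect size concrete, the following minimal sketch (not part of the original analysis; the function name and the normal approximation to the t-test are our own simplifications) estimates two-group power for a given SMD:

```python
import math
from statistics import NormalDist

def power_two_group(smd, n_per_group, alpha=0.05):
    """Approximate power of a two-group comparison of means for a given
    standardized mean difference (SMD), using the normal approximation
    to the two-sample t-test. Illustrative only; real study planning
    should use dedicated power software."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)        # 1.96 when alpha = 0.05
    ncp = smd * math.sqrt(n_per_group / 2)   # expected z-statistic under the alternative
    # probability the test statistic falls in either rejection region
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)
```

For a study with about 12 participants per group (close to the median sample sizes reported below), this approximation gives roughly 8 % power for a small effect (SMD 0.2) and only about 50 % power for a large effect (SMD 0.8).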
Power is a function of the number of observations (the sample size), the magnitude of effect (the effect size), and the type I error probability (the chance of identifying a "significant" difference when in reality there is no difference, denoted alpha). Since alpha is usually fixed at 0.05 (i.e., the threshold of statistical significance), increased power requires either a larger sample size or a larger effect size. Clinicians and clinical researchers are familiar with the concept of effect size or magnitude of effect for therapeutic interventions. A study might show a statistically significant benefit, but if the benefit is extremely small (small effect size) then it will have trivial clinical importance. Clinical effect sizes are often reported as an absolute risk difference, relative risk difference, or number-needed-to-treat for events (e.g., death or stroke), or as the difference between means (e.g., mean reduction in blood pressure or depression score). When an outcome measure is not standardized or widely accepted (as is typically the case in education research) it is often helpful to translate the results into an effect size metric that can be intuitively interpreted and compared across studies. For example, when comparing the results of two studies using different multiple-choice knowledge tests we would want to account for the scale and difficulty of each test. The most common metric for comparative studies is the standardized mean difference (SMD), which expresses the magnitude of the difference relative to the variability in the measures. The SMD can be calculated using various formulas, including Cohen's d and Hedges' g. Calculating power requires one to specify the magnitude of an anticipated or actual effect size. Ideally, researchers prospectively determine a desired "educationally


important" effect, and estimate the sample size needed to detect this effect during the study planning stages. Once a study is complete, researchers and readers can use the confidence interval (CI) to interpret the results relative to the important effect threshold (critical value). Studies with higher power have narrower CIs and greater precision. Figure 1 illustrates how varying the power of a study (varying the width of the CI) can influence interpretations even when the effect size remains unchanged. Researchers occasionally calculate the power of a study after the fact ("retrospective" power analysis), but such estimations are less useful than CIs and are thus discouraged (Goodman and Berlin 1994; Hoenig and Heisey 2001).

To illustrate how these issues play out in practice, we sought to empirically answer three questions related to the adequacy of power and sample size in health professions education research:

1. Among published studies comparing two training interventions, what is the precision and statistical power?
2. How often do studies fail to find a statistically significant difference?
3. Of studies that did not find a statistically significant difference, what effect could be excluded with acceptable certainty?

To answer these questions we used data from a previously reported comprehensive review of simulation-based education (Cook et al. 2011a, 2013). While this constitutes a "convenience sample" in a focused field, it offers an opportunity to explore these questions with a robust sample of nearly 900 original studies.
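The CI-based interpretation described above (and illustrated in Fig. 1) can be expressed as a small decision rule. The rendering below is our own illustration, using the hypothetical critical value of 0.35 from the figure:

```python
def interpret(ci_low, ci_high, critical=0.35):
    """Classify a between-group comparison from its confidence interval.

    Returns (significant, conclusive):
    - significant: the CI excludes zero
    - conclusive:  the CI falls entirely on one side of the pre-specified
      "educationally important" critical value, so the result can be
      interpreted either as important or as unimportant
    """
    significant = ci_low > 0 or ci_high < 0
    if significant:
        # important difference confirmed only if even the confidence
        # limit nearest zero exceeds the critical value in magnitude
        conclusive = ci_low >= critical or ci_high <= -critical
    else:
        # "negative" study: conclusive only if it excludes an
        # important difference in either direction
        conclusive = ci_low > -critical and ci_high < critical
    return significant, conclusive
```

Applied to Fig. 1: example A3 (upper limit 0.3) is non-significant but conclusive, B2 (lower limit 0.2) is significant but inconclusive about importance, and B3 (lower limit 0.4) is both significant and conclusive.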

Methods

This project adhered to Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) standards (Moher et al. 2009). We briefly summarize the most salient methods here, and refer readers to the original publications for additional details (Cook et al. 2011a, 2013).

Original data collection

The data for the present study derive from a dataset of 897 original research studies. The methods for identifying these studies and extracting the standardized mean differences (SMDs), including trial flow diagrams and full search strategies, have been reported previously (Cook et al. 2011a, 2013). Very briefly, with the assistance of a research librarian we searched multiple literature databases for potentially relevant articles, and then two independent reviewers used predefined inclusion criteria to select articles for full review. The last date of search was May 11, 2011. For each outcome of each included study we calculated a Hedges' g SMD. For studies with >1 comparison (e.g., a three-arm study) an SMD was calculated for each comparison. For studies with multiple outcomes or comparisons, we selected the SMD of greatest magnitude (i.e., the between-intervention difference most likely to demonstrate a statistically significant effect). The Hedges' g SMD standardizes the difference between two means (i.e., the effect of the intervention relative to the comparison) to a standard deviation of 1, so that this difference (the effect size) can be compared across studies using different outcome measures. An SMD of 1.0 thus represents a difference between intervention and comparison of exactly 1 standard deviation. Cohen (1988) proposed classifications for


Fig. 1 Use of confidence intervals in interpreting study results. This figure shows mock results (six different sets of data) from a hypothetical study comparing interventions X and Y. The dark boxes represent the difference between groups (the standardized mean difference effect size), and the horizontal lines represent the confidence interval (CI) around the mean. For this study, the investigators determined during the planning stages that an effect size ≥0.35 (dashed vertical line) would be "educationally important" (the pre-specified "critical value"). In examples A1–3, the effect size of 0.1 is small and not statistically significantly different from zero. However, for A1 and A2 the CI is wide enough that it does not exclude an important difference; we cannot be certain that there is not a difference larger than 0.35 (i.e., for A1, the upper confidence limit is 0.8, suggesting that the true effect could be as high as 0.8, which would be considered large). A3, however, has a narrower CI (upper limit 0.3, which is smaller than the pre-specified critical value of 0.35), and thus the true effect is unlikely to be educationally important. In examples B1–3, the effect size of 0.6 is moderate. For B1 the CI crosses 0, indicating that the difference is not statistically significantly different from zero. The CI for B2 suggests that the difference is statistically significant, but the CI is wide enough that it does not exclude an unimportant difference (i.e., the lower confidence limit is 0.2, suggesting that the true effect could be as small as 0.2, which would be considered small and unimportant). B3, however, has a narrower CI (lower limit 0.4, which is larger than the pre-specified critical value of 0.35), and thus the true effect is unlikely to be educationally unimportant. The wide CIs for A1, A2, B1, and B2 suggest that all of these studies were inadequately powered for the pre-specified educationally important effect

effect sizes of this type, suggesting that <0.2 is negligible, 0.2–0.49 is small, 0.5–0.79 is moderate, and ≥0.8 is large.

Analyses

All analyses use SMDs (with standard deviation of 1) and assume an alpha error probability of 0.05. Analyses were conducted separately for studies making comparison with no intervention and studies making comparison with an alternate simulation-based intervention. We did not distinguish outcomes (e.g., skills or behaviors) in these analyses, since the focus was on the statistical power rather than the result itself. For each study we calculated the 95 % confidence interval (CI) around the SMD as ±1.96 times the estimated standard error. The standard error was estimated using formulas appropriate for the study design (single-group pre-post, two-group post-only, two-group pre-post, or crossover) (Curtin et al. 2002; Morris and DeShon 2002; Hunter and Schmidt 2004; Borenstein 2009). These estimations occasionally differ slightly from the results of statistical tests reported in the original study, and yield slightly lower variance estimates for studies with smaller SMDs. We report the precision (CI half-width) of each study in Fig. 2. We looked at the issue of power from three perspectives. First, we estimated the power, for varying SMD critical values, that would have been anticipated for each study based on the sample size (see Fig. 3).
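For concreteness, the Hedges' g SMD used in these analyses can be computed from group summary statistics. A minimal sketch for the two-group post-only case (the function name and example numbers are ours, purely for illustration):

```python
import math

def hedges_g(m1, s1, n1, m2, s2, n2):
    """Hedges' g for a two-group post-only design: Cohen's d computed
    with the pooled standard deviation, multiplied by a small-sample
    bias-correction factor."""
    df = n1 + n2 - 2
    s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    d = (m1 - m2) / s_pooled      # Cohen's d
    j = 1 - 3 / (4 * df - 1)      # Hedges' small-sample correction
    return j * d
```

For example, an intervention group scoring mean 82 (SD 8, n = 15) versus a comparison group scoring mean 74 (SD 8, n = 15) gives g just under 1.0, a "large" effect in Cohen's classification.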


Fig. 2 Sample size and precision of included studies. Precision is indicated by the half-width of the confidence interval (CI) for the standardized mean difference effect size (e.g., if the CI is 0.2–0.8 the half-width would be 0.3). A narrower CI (smaller half-width) indicates higher precision. a Studies comparing simulation training with no intervention (N = 627). b Studies comparing simulation training with an alternate simulation approach (N = 297)

Second, we identified all studies that actually detected a "statistically significant" difference. Because the included studies used widely varying approaches (and many did not report p values), we determined statistical significance if the estimated CI excluded 0. Third, for all studies that did not detect a statistically significant difference, we determined what SMD the study was adequately powered to detect (see Fig. 4). We compared the CI with varying SMD critical values. Power was deemed adequate for a given critical value SMD if the SMD was larger than one-half the CI [e.g., for a critical value SMD of 0.2, a CI of (0.38–0.56) would indicate adequate power, whereas a CI of (0.37–0.57) would not].
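These steps can be sketched in code. The variance formula below is the standard two-group post-only expression from the cited effect-size literature (Borenstein 2009); the exclusion check follows the logic described for Fig. 4. Function names are ours:

```python
import math

def smd_ci(d, n1, n2, z=1.96):
    """95 % CI for a two-group post-only SMD.

    Variance = (n1+n2)/(n1*n2) + d^2/(2*(n1+n2)); note the d^2 term,
    which is why smaller SMDs receive slightly lower variance estimates.
    """
    half_width = z * math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d - half_width, d + half_width

def excludes_effect(ci_low, ci_high, critical):
    """True if a 'negative' study's CI lies entirely inside
    (-critical, +critical), i.e., the study excludes effects that large."""
    return -critical < ci_low and ci_high < critical
```

For a median-sized study (about 25 participants split 13/12) reporting an SMD of 0.1, the CI spans roughly −0.69 to 0.89: wide enough that even a large effect (0.8) cannot be excluded.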

Results

We included 897 original research studies, including 627 studies making comparison with no intervention and 297 studies making comparison with another simulation-based


Fig. 3 Power of included studies to detect effects of varying magnitude. We estimated power for each study across a range of standardized mean difference (SMD) effect sizes (x axis). Columns indicate the percentage of studies with the specified power for a given SMD. The different colors/patterns indicate different levels of power: power >90 % is desirable, and power >80 % is generally accepted as a minimum. For example, in panel A, only two studies (0.3 %) had ≥90 % power to detect an SMD of 0.2; the other 622 studies had <80 % power to detect an SMD of this magnitude. a Studies comparing simulation training with no intervention (N = 624). Three studies enrolled only two participants each, and formal power analyses were not possible for these studies. b Studies comparing simulation training with an alternate simulation approach (N = 297)

intervention (some studies reported >1 comparison). The sample size for no-intervention-comparison studies ranged from 2 to 1,333, with a median (interquartile range) of 25 (15, 47). The sample size for active-comparison studies ranged from 4 to 817, with a median (interquartile range) of 30 (20, 53). Figure 2 depicts the number of participants in these studies, as well as the precision of these studies. As would be expected, smaller studies nearly always have lower precision, but there are exceptions for certain study designs (e.g., crossover studies and other studies with repeated measures have higher precision for a given sample size). Also, our methods for estimating the standard error (Curtin et al. 2002; Morris and DeShon 2002; Hunter and Schmidt 2004; Borenstein 2009) ascribe


Fig. 4 Adequacy of "negative" studies to exclude effects of varying magnitude. The "negative" studies included in this figure found no statistically significant difference between interventions (p ≥ 0.05). Each column in this figure indicates the percentage of studies that were adequately powered to exclude the corresponding standardized mean difference (SMD) effect size shown on the x axis. "Adequate" indicates that the confidence interval is narrow enough that an SMD this large or larger would have been detected with 95 % probability (and "inadequate" suggests that the study was underpowered to exclude this SMD, or in other words, that the study might be falsely negative for the specified effect size). We repeated the calculation for each SMD magnitude, i.e., each column in each panel represents the same studies but the SMD critical value varied. a Studies comparing simulation training with no intervention (N = 110). b Studies comparing simulation training with an alternate simulation approach (N = 128)

slightly higher precision to studies with smaller effect sizes (although this is not always true in original research). In order to illustrate the relationship between effect size and power we have calculated the adequacy of each study to detect SMDs of varying magnitude. Figure 3 shows the power for all studies across a range of SMDs, while Fig. 4 illustrates the adequacy to exclude a given effect, again across a range of SMDs. As shown in Fig. 3, among the 627 no-intervention-comparison studies, only 2 (0.3 %) were powered to detect a small effect (SMD 0.2), 41 (7 %) had ≥80 % power to detect a moderate effect (SMD 0.5), and 136 (22 %) were powered to detect a large effect (SMD 0.8). Even for a very large effect of 2, only 538 (86 %) had >80 % power. Five hundred and seventeen studies (82 %) detected a statistically significant effect. Among the 110 studies that did not detect a statistically significant effect (so-called "negative" studies), none excluded a small effect of ±0.2, 22 (20 %) excluded a moderate effect of ±0.5, and only 47 (43 %) excluded a large effect of ±0.8 (see Fig. 4). Stated differently, 80 % of the "negative" studies were underpowered for moderate effects, and 57 % were underpowered even for large effects.

Studies comparing alternate simulation approaches (N = 297) reflected slightly greater power overall. Specifically, although only 1 study (0.3 %) had ≥80 % power to detect a small effect, 26 (9 %) had power to detect a moderate effect, and 79 (27 %) had power to detect a large effect. A smaller proportion (N = 169; 57 %) demonstrated a statistically significant effect. Of the 128 studies that did not detect a statistically significant effect, 4 (3 %) excluded a small effect, 47 (37 %) excluded a moderate effect, and 91 (71 %) excluded a large effect.

Discussion

This brief report, using a convenience sample of original studies identified in a systematic review of simulation-based education, illustrates two important points regarding sample size determinations in education research. First, the vast majority of studies in this sample were powered only to detect effects of large magnitude (SMD > 0.8), and some had power to detect only extremely large effects (>2 standard deviations). Second, and related to the first, most "negative" studies (those detecting no statistically significant difference) had very wide CIs, indicating the possibility of large and potentially important differences. In such studies, the failure to find a statistically significant effect does not confirm equivalence or non-superiority between the interventions under study.

It might seem, at face value, that studies showing a statistically significant difference have adequate power, and that those failing to show a difference might lack adequate power. However, power is ideally determined during study planning, and must target a specific effect size. As such, a study reporting a very large, statistically significant effect might nonetheless have been prospectively underpowered if a realistically estimated a priori effect size were somewhat smaller (for example, a study might find a statistically significant SMD of 2.2, but might have been underpowered for a still-large and important SMD of 1.0). Conversely, an adequately powered study might fail to detect a statistically significant difference if the true effect is very small and educationally unimportant. Thus, rather than reporting the actual SMDs from each study, we reported the adequacy of each study to detect SMDs of varying magnitude.

Limitations

This study is limited by all of the weaknesses noted in the reports from which these data derive (Cook et al. 2011a, 2013), including heterogeneity in interventions and outcomes and variable quality of the original studies.
Moreover, the results apply strictly only to simulation-based education, although the overall message that underpowered studies are common in education likely generalizes broadly (Michalczyk and Lewis 1980; Woolley 1983). Strengths include the use of actual data, multiple analyses, and illustrations to highlight an important problem in education research.


Integration with prior literature

Recent meta-analyses (Cook et al. 2008b, 2010, 2011a, 2012, 2013) have provided estimates of reasonable expected effect sizes (SMDs). However, the magnitude varies with the comparison (Cook 2012). For comparisons with no intervention, the anticipated effect is typically large (>0.8). By contrast, for comparisons with active intervention the anticipated effect is much smaller, although there is wide variation depending on the relative effectiveness of the intervention and the outcome measure's sensitivity to change. Although Cohen defined ranges of small, moderate, and large, when contemplating an "educationally important difference" researchers may find it useful to know that in our experience (Cook et al. 2011a, 2012, 2013) 1 standard deviation (i.e., 1 SMD unit) typically represents a 10–14 % difference in performance scores. Thus, a study powered to detect an SMD of 0.3 might be expected to detect a 3–4 % change in scores. There is little consensus regarding what level of effect should be considered "important." However, in our anecdotal experience most educators agree that a ≤2 % score difference (negligible effect) is indeed negligible and that a ≥6 % difference (moderate effect) warrants attention, whereas opinions vary regarding small effects (3–5 % difference). We believe that effects of small magnitude can have important educational implications if they align with theory-based predictions, can be economically implemented (e.g., low cost or many learners), and derive from studies with adequate power (Lineberry et al. 2013; Cook et al. 2014). A review of research in Web-based learning (Cook et al. 2011b) found that fewer than 10 % of studies reported a CI for the difference between interventions, suggesting that researchers fail to appreciate the importance of this reporting element.
The problems of inadequate sample size, incomplete reporting, and inappropriate interpretation are not limited to education research, as shown by reviews of clinical research (Chan and Altman 2005; Mills et al. 2005; Le Henanff et al. 2006; Charles et al. 2009; Vavken et al. 2009; Brody et al. 2013). We strongly encourage researchers to prospectively calculate and then report study power (Schulz et al. 2010; Piaggio et al. 2012). Editors and reviewers can help by encouraging explicit planning for study power and consistent use of confidence intervals.

How to increase power

As noted earlier, power is a function of the number of observations (the sample size), the magnitude of effect (the effect size), and the type I error probability. Each of these variables suggests a target for researchers who need higher power (Hansen and Collins 1994). First, as suggested in the above analyses, increasing the sample size is an obvious solution. This involves not only enrolling more participants, but also retaining everyone enrolled and minimizing missing data. Second, the effect size is a function of between-group differences and score variance. We expect between-group observed differences to increase as the between-intervention difference in effectiveness grows. Lower score variance will increase the effect size, and this can be achieved by increasing the homogeneity of the sample, using repeated measures on individuals (e.g., crossover and longitudinal studies), or decreasing the error of measurement. Finally, selecting a higher type I error probability will increase power; but "alpha = 0.05" (or lower) is well entrenched in current research tradition, and for purely pragmatic reasons we suggest that a higher level be employed sparingly and only with good justification.
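To show how steeply the required sample size grows as the anticipated effect shrinks, a rough planning sketch (our own illustration using the normal approximation; dedicated power software should be used for real studies):

```python
import math
from statistics import NormalDist

def n_per_group(smd, power=0.80, alpha=0.05):
    """Approximate participants needed per group for a two-group
    comparison to detect a given SMD (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_power = z.inv_cdf(power)           # quantile for the target power
    return math.ceil(2 * ((z_alpha + z_power) / smd) ** 2)
```

Detecting a large effect (SMD 0.8) with 80 % power needs about 25 participants per group, a moderate effect (0.5) about 63, and a small effect (0.2) nearly 400 per group, far beyond the median sample sizes observed here.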


Implications

Our findings indicate that the problem of inadequate power in health professions research remains much the same as it was 30 years ago (Michalczyk and Lewis 1980; Woolley 1983). To remedy this disappointing situation, we suggest two important actions for education researchers and readers. First, researchers planning a study need to prospectively and appropriately estimate sample size and power, usually targeting the smallest educationally important difference. This in turn requires a clear conceptualization of both the intervention and the comparison (and the salient differences), and a realistic idea of the anticipated effect of these differences. In this regard, it is important to note the critical nature of the comparator: studies making comparison with an active educational intervention (i.e., comparative effectiveness studies) will typically result in a much smaller effect (and hence require a larger sample size) than studies making comparison with no intervention. Since no-intervention-comparison studies are quite predictable (Cook 2012) and do little to advance the science of education (Cook et al. 2008a), sample sizes for research in simulation-based education (and perhaps medical education as a whole) will need to increase. A parallel issue is found in clinical research: studies of intervention versus placebo typically have larger effects and require smaller sample sizes than comparative effectiveness studies comparing two active interventions (Congressional Budget Office 2007), yet it is the comparative effectiveness studies that address the most important clinical questions (Sox et al. 2009). Second, in interpreting their studies, researchers need to frame results, especially those that fail to reach statistical significance, in terms of the largest or smallest effect that can reasonably be excluded. This is best done using the CI around the mean difference (rather than the CI surrounding the mean point estimates from each group).
For a "positive" study (i.e., a statistically significant difference was found), if the lower limit of the CI is smaller than the specified educationally important effect size, an unimportant difference cannot be excluded (see Fig. 1, example B2). Similarly, in a "negative" study, if the CI includes the educationally important effect size, the study is underpowered and inconclusive (Fig. 1, examples A1, A2, and B1). Conclusive results require a CI that does not include the specified educationally important effect (Fig. 1, examples A3 and B3).

References

Borenstein, M. (2009). Effect sizes for continuous data. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis (2nd ed., pp. 221–235). New York: Sage.
Brody, B. A., Ashton, C. M., Liu, D., Xiong, Y., Yao, X., & Wray, N. P. (2013). Are surgical trials with negative results being interpreted correctly? Journal of the American College of Surgeons, 216(1), 158–166.
Chan, A. W., & Altman, D. G. (2005). Epidemiology and reporting of randomised trials published in PubMed journals. Lancet, 365, 1159–1162.
Charles, P., Giraudeau, B., Dechartres, A., Baron, G., & Ravaud, P. (2009). Reporting of sample size calculation in randomised controlled trials: Review. British Medical Journal, 338, b1732.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Congressional Budget Office. (2007). Research on the comparative effectiveness of medical treatments: Issues and options for an expanded federal role. Congress of the United States.
Cook, D. A. (2012). If you teach them, they will learn: Why medical education needs comparative effectiveness research. Advances in Health Sciences Education, 17, 305–310.


Cook, D. A., Bordage, G., & Schmidt, H. G. (2008a). Description, justification, and clarification: A framework for classifying the purposes of research in medical education. Medical Education, 42, 128–133.
Cook, D. A., Brydges, R., Hamstra, S. J., Zendejas, B., Szostek, J. H., Wang, A. T., et al. (2012). Comparative effectiveness of technology-enhanced simulation versus other instructional methods: A systematic review and meta-analysis. Simulation in Healthcare, 7, 308–320.
Cook, D. A., Hamstra, S. J., Brydges, R., Zendejas, B., Szostek, J. H., Wang, A. T., et al. (2013). Comparative effectiveness of instructional design features in simulation-based education: Systematic review and meta-analysis. Medical Teacher, 35, e867–e898.
Cook, D. A., Hatala, R., Brydges, R., Zendejas, B., Szostek, J. H., Wang, A. T., et al. (2011a). Technology-enhanced simulation for health professions education: A systematic review and meta-analysis. The Journal of the American Medical Association, 306, 978–988.
Cook, D. A., Levinson, A. J., & Garside, S. (2011b). Method and reporting quality in health professions education research: A systematic review. Medical Education, 45, 227–238.
Cook, D. A., Levinson, A. J., Garside, S., Dupras, D. M., Erwin, P. J., & Montori, V. M. (2008b). Internet-based learning in the health professions: A meta-analysis. The Journal of the American Medical Association, 300, 1181–1196.
Cook, D. A., Levinson, A. J., Garside, S., Dupras, D. M., Erwin, P. J., & Montori, V. M. (2010). Instructional design variations in internet-based learning for health professions education: A systematic review and meta-analysis. Academic Medicine, 85, 909–922.
Cook, D. A., Thompson, W. G., & Thomas, K. G. (2014). Test-enhanced web-based learning: Optimizing the number of questions (a randomized crossover trial). Academic Medicine, 89, 169–175.
Curtin, F., Altman, D. G., & Elbourne, D. (2002). Meta-analysis combining parallel and cross-over clinical trials. I: Continuous outcomes. Statistics in Medicine, 21, 2131–2144.
Goodman, S. N., & Berlin, J. A. (1994). The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Annals of Internal Medicine, 121, 200–206.
Hansen, W. B., & Collins, L. M. (1994). Seven ways to increase power without increasing N. NIDA Research Monograph, 142, 184–195.
Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power. The American Statistician, 55, 19–24.
Hulley, S. B., Cummings, S. R., Browner, W. S., Grady, D., Hearst, N., & Newman, T. B. (2001). Designing clinical research: An epidemiologic approach (2nd ed.). Philadelphia: Lippincott Williams & Wilkins.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.
Issenberg, S. B., McGaghie, W. C., Petrusa, E. R., Lee Gordon, D., & Scalese, R. J. (2005). Features and uses of high-fidelity medical simulations that lead to effective learning: A BEME systematic review. Medical Teacher, 27, 10–28.
Le Henanff, A., Giraudeau, B., Baron, G., & Ravaud, P. (2006). Quality of reporting of noninferiority and equivalence randomized trials. The Journal of the American Medical Association, 295, 1147–1151.
Lineberry, M., Walwanis, M., & Reni, J. (2013). Comparative research on training simulators in emergency medicine: A methodological review. Simulation in Healthcare, 8, 253–261.
Michalczyk, A. E., & Lewis, L. A. (1980). Significance alone is not enough. Journal of Medical Education, 55, 834–838.
Mills, E. J., Wu, P., Gagnier, J., & Devereaux, P. J. (2005). The quality of randomized trial reporting in leading medical journals since the revised CONSORT statement. Contemporary Clinical Trials, 26, 480–487.
Moher, D., Liberati, A., Tetzlaff, J., & Altman, D. G. (2009). Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. Annals of Internal Medicine, 151, 264–269.
Morris, S. B., & DeShon, R. P. (2002). Combining effect size estimates in meta-analysis with repeated measures and independent-groups designs. Psychological Methods, 7, 105–125.
Piaggio, G., Elbourne, D. R., Pocock, S. J., Evans, S. W., Altman, D. G., for the CONSORT Group. (2012). Reporting of noninferiority and equivalence randomized trials: Extension of the CONSORT 2010 statement. The Journal of the American Medical Association, 308, 2594–2604.
Schulz, K. F., Altman, D. G., & Moher, D. (2010). CONSORT 2010 statement: Updated guidelines for reporting parallel group randomized trials. Annals of Internal Medicine, 152, 726–732.
Sox, H. C., Greenfield, S., & for the Institute of Medicine Committee on Comparative Effectiveness Research Prioritization. (2009). Initial national priorities for comparative effectiveness research: Report brief. Washington, DC: National Academies Press.
Vavken, P., Heinrich, K. M., Koppelhuber, C., Rois, S., & Dorotka, R. (2009). The use of confidence intervals in reporting orthopaedic research findings. Clinical Orthopaedics and Related Research, 467(12), 3334–3339.
Woolley, T. W. (1983). A comprehensive power-analytic investigation of research in medical education. Journal of Medical Education, 58, 710–715.
