Psychological Methods, 2015, Vol. 20, No. 3, 394–406
© 2015 American Psychological Association 1082-989X/15/$12.00 http://dx.doi.org/10.1037/met0000032

This document is copyrighted by the American Psychological Association or one of its allied publishers. This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.

Varying Coefficient Meta-Analysis Methods for Odds Ratios and Risk Ratios

Douglas G. Bonett
University of California, Santa Cruz

Robert M. Price, Jr.
East Tennessee State University

Odds ratios and risk ratios are useful measures of effect size in 2-group studies in which the response variable is dichotomous. Confidence interval methods are proposed for combining and comparing odds ratios and risk ratios in multistudy designs. Unlike the traditional fixed-effect meta-analysis methods, the proposed varying coefficient methods do not require effect-size homogeneity, and unlike the random-effects meta-analysis methods, the proposed varying coefficient methods do not assume that the effect sizes from the selected studies represent a random sample from a normally distributed superpopulation of effect sizes. The results of extensive simulation studies suggest that the proposed varying coefficient methods have excellent performance characteristics under realistic conditions and should provide useful alternatives to the currently used meta-analysis methods.

Keywords: 2 × 2 contingency table, fixed-effects, dichotomous variables, moderation effects

Supplemental materials: http://dx.doi.org/10.1037/met0000032.supp

The odds ratio is a symmetric measure of association (i.e., the columns of Table 1 can represent levels of the predictor variable or the response variable), whereas the risk difference and risk ratio are asymmetric measures of association (i.e., the columns of Table 1 are assumed to be the levels of the predictor variable). Consequently, the odds ratio is an appropriate measure of effect size in retrospective studies in which the columns of Table 1 represent the response variable and the rows of Table 1 represent the predictor variable. The risk difference and risk ratio are appropriate measures of effect size only in prospective designs in which the columns of Table 1 (groups) are the levels of the predictor variable and the rows of Table 1 are the levels of the response variable.

The odds ratio is a popular measure of effect size partly because inferential statistical methods for odds ratios have good small-sample properties. However, this advantage is no longer relevant given recently developed inferential methods for the risk ratio (Price & Bonett, 2008) and the risk difference (Agresti & Caffo, 2000) that have been shown to have excellent small-sample properties. Some researchers prefer the odds ratio because it can be used to approximate a tetrachoric correlation, a phi coefficient, or a point-biserial correlation using the transformation methods described by Bonett and Price (2005, 2007) and Bonett (2007).

Bonett and Price (2014) present statistical methods for comparing and combining risk differences from multiple studies. Inferential methods for comparing and combining odds ratios and risk ratios in multistudy designs will be presented here. Interval estimation for ω and θ in a single study is described first. We review three statistical models (constant coefficient [CC], varying coefficient [VC], and random coefficient [RC]) for an average effect size in a multistudy design. VC statistical methods for comparing odds ratios and risk ratios are presented next, followed by the results of simulation studies. We then provide a

When the response variable is dichotomous in a two-group study, each sampling unit (e.g., person) is classified into one of four possible outcomes, and the probabilities of being classified into these four outcomes can be summarized in a 2 × 2 contingency table, as shown in Table 1, where πij is the unknown probability of a sampling unit in group j exhibiting level i of the dichotomous (event vs. nonevent) response variable. Let π1 = π11/(π11 + π21) and π2 = π12/(π12 + π22). The risk ratio θ = π1/π2, the odds ratio ω = π11π22/(π12π21), and the risk difference Δ = π1 − π2 are three measures of effect size that are used in meta-analyses of 2-group studies with a dichotomous response (see, e.g., Fleiss & Berlin, 2009).

The risk ratio and risk difference are more intuitive and easier to interpret than the odds ratio. Researchers mistakenly interpret an odds ratio as if it were a risk ratio (Schmidt & Kohlmann, 2008), which is problematic because an odds ratio effect size can be substantially larger than a risk ratio effect size. The risk difference is popular because the inverse of the absolute risk difference, referred to as the number needed to treat (NNT), has a useful interpretation in treatment versus control designs: NNT is the number of people who need to be treated to prevent one person from having an adverse event. A criticism of the risk difference is that it tends to exhibit greater variability across studies than the odds ratio or risk ratio (Fleiss & Berlin, 2009).
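The definitions above can be illustrated with sample estimates from a single hypothetical 2 × 2 table (all counts below are invented for illustration):

```python
# Hypothetical 2-group study; cells follow Table 1 with row 1 = event,
# row 2 = nonevent, and columns = groups (counts invented for illustration).
f11, f21 = 15, 85   # group 1: events, nonevents (n1 = 100)
f12, f22 = 30, 70   # group 2: events, nonevents (n2 = 100)

p1 = f11 / (f11 + f21)   # sample analogue of pi_1
p2 = f12 / (f12 + f22)   # sample analogue of pi_2

risk_ratio = p1 / p2                     # sample risk ratio (theta)
odds_ratio = (f11 * f22) / (f12 * f21)   # sample odds ratio (omega)
risk_diff = p1 - p2                      # sample risk difference (Delta)
nnt = 1 / abs(risk_diff)                 # number needed to treat

print(risk_ratio, odds_ratio, risk_diff, nnt)
```

With these counts the odds ratio (about .41) is farther from 1 than the risk ratio (.50), illustrating why interpreting an odds ratio as if it were a risk ratio overstates the effect.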

This article was published Online First March 9, 2015.
Douglas G. Bonett, Department of Psychology, University of California, Santa Cruz; Robert M. Price, Jr., Department of Mathematics and Statistics, East Tennessee State University.
Correspondence concerning this article should be addressed to Douglas G. Bonett, Department of Psychology, University of California, 1156 High Street, Santa Cruz, CA 95064. E-mail: [email protected]


Table 1
2 × 2 Contingency Table of Cell Probabilities

Response    Group 1    Group 2
Event         π11        π12
Nonevent      π21        π22


detailed discussion of the strengths and weaknesses of the CC, VC, and RC approaches. The proposed VC methods are illustrated with three examples.

Estimation of Odds Ratios and Risk Ratios

Sample data for a two-group design with a dichotomous response variable may be summarized in a 2 × 2 table of observed frequency counts, as shown in Table 2, where nj is the sample size in group j. The sample frequency counts in Table 2 can be used to construct confidence intervals for θ and ω. The following 100(1 − α)% transformed Wald confidence interval for ω is widely used:

exp[ln(ω̂) ± zα/2 √var̂{ln(ω̂)}],  (1)

where

var̂{ln(ω̂)} = 1/(f11 + 1/2) + 1/(f12 + 1/2) + 1/(f21 + 1/2) + 1/(f22 + 1/2),  (2)

and

ln(ω̂) = ln[(f11 + 1/2)(f22 + 1/2) / {(f12 + 1/2)(f21 + 1/2)}],  (3)

and zα/2 is a two-sided critical z value (e.g., zα/2 = 1.96 for a 95% confidence interval). The addition of 1/2 to the cell frequencies in Equations 2 and 3 was suggested by Gart (1966), and it can be shown (see Agresti, 2002, p. 595) that the bias of ln(ω̂) is minimized by the addition of 1/2 to each cell frequency. The transformed Wald confidence interval for ω described by Borenstein, Hedges, Higgins, and Rothstein (2009, p. 37), which was proposed by Woolf (1955) and does not include the 1/2 additions of Equations 2 and 3, is undefined when any fij = 0.

The traditional transformed Wald interval for θ described by Agresti (2002, p. 73) and Borenstein et al. (2009, p. 35) has unacceptable performance characteristics and is not recommended for routine use (Koopman, 1984; Newcombe, 2013). One of the best available confidence interval methods for θ is the iterative method proposed by Koopman (1984). Price and Bonett (2008) developed a confidence interval for θ, based on a normal approximation to a Bayesian posterior distribution, that performs as well as the Koopman method. The Price–Bonett 100(1 − α)% confidence interval for θ may be expressed as



exp[ln(θ̂) ± zα/2 √var̂{ln(θ̂)}],  (4)

where

var̂{ln(θ̂)} = 1/{f11 + 1/4 + (f11 + 1/4)²/(n1 − f11 + 3/2)} + 1/{f12 + 1/4 + (f12 + 1/4)²/(n2 − f12 + 3/2)},  (5)

and

ln(θ̂) = ln[{(f11 + 1/4)/(n1 + 7/4)} / {(f12 + 1/4)/(n2 + 7/4)}].  (6)
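A minimal sketch of the two single-study intervals under our reading of Equations 1–6 (the function names and example counts are invented):

```python
import math
from statistics import NormalDist


def odds_ratio_ci(f11, f12, f21, f22, alpha=0.05):
    """Transformed Wald CI for omega with Gart's 1/2 additions (Equations 1-3);
    defined even when a cell frequency is zero."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    a, b, c, d = f11 + 0.5, f12 + 0.5, f21 + 0.5, f22 + 0.5
    log_or = math.log((a * d) / (b * c))            # Equation 3
    half = z * math.sqrt(1/a + 1/b + 1/c + 1/d)     # Equation 2
    return math.exp(log_or - half), math.exp(log_or + half)   # Equation 1


def risk_ratio_ci(f11, n1, f12, n2, alpha=0.05):
    """Price-Bonett CI for theta (Equations 4-6)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    log_rr = math.log(((f11 + 0.25) / (n1 + 1.75)) /
                      ((f12 + 0.25) / (n2 + 1.75)))  # Equation 6

    def term(f, n):   # one group's contribution to Equation 5
        return 1.0 / (f + 0.25 + (f + 0.25) ** 2 / (n - f + 1.5))

    half = z * math.sqrt(term(f11, n1) + term(f12, n2))
    return math.exp(log_rr - half), math.exp(log_rr + half)   # Equation 4


print(odds_ratio_ci(15, 30, 85, 70))     # hypothetical counts, n1 = n2 = 100
print(risk_ratio_ci(15, 100, 30, 100))
```

Both functions return (lower, upper) limits on the ratio scale; the odds ratio interval remains computable with a zero cell, which is the practical advantage of the 1/2 additions noted above.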

Now consider the possibility of two or more studies reporting comparable 2 × 2 tables of frequency counts. In general, a 2 × 2 table of frequency counts will be observed in each study, in which fijk is the number of sampling units (e.g., participants) assigned to group j who have a response of i = 1 (event) or i = 2 (nonevent) in study k (k = 1 to m). The sample sizes for the two groups in study k will be denoted as n1k and n2k. It is assumed that no two studies have any participants in common. To simplify notation, let ϑk = ln(θk) for all k or ϑk = ln(ωk) for all k. We denote the estimator of ϑk as ϑ̂k and the estimated variance of ϑ̂k as var̂(ϑ̂k), where var̂(ϑ̂k) is given by Equation 2 or 5 for the odds ratio or risk ratio, respectively.

Forest plots (see Borman & Grigg, 2009, pp. 505–510) are informative graphical and tabular displays of confidence interval results from each study in a multistudy design. If comparable 2 × 2 tables have been reported in two or more studies, Equations 1 and 4, rather than the traditional transformed Wald intervals (Borenstein et al., 2009, pp. 35 and 37), should be computed for each study, and these results could then be presented in a forest plot.

In a multistudy design with m ≥ 2 studies and one estimate of ϑk obtained from each study, it will be possible to combine and compare the m estimates. As will be explained below, the ϑk values can be compared by estimating linear contrasts of the ϑk values or by estimating slope coefficients in a linear statistical model of the ϑ̂k values. The odds ratios or risk ratios could vary across study populations because of differences in certain demographic features of the study populations (e.g., age, education level, geographic region) or because of differences in study characteristics (e.g., type of design, measurement procedure, type of treatment). Useful scientific information may be gained if the population odds ratios or risk ratios are found to be associated with certain demographic variables or study characteristics.

Table 2
2 × 2 Contingency Table of Observed Cell Frequencies

Response    Group 1    Group 2
Event         f11        f12
Nonevent      f21        f22
Total          n1         n2

If an analysis of linear contrasts or slope coefficients suggests that the ϑk values are not too dissimilar, it may be informative to estimate the mean of the ϑk values. Exponentiating the arithmetic mean of the ϑk values gives a geometric mean odds ratio or risk ratio. A geometric mean is a useful measure of centrality for ratio-scale measurements (see, e.g., Zar, 1999, p. 28). A confidence interval for a geometric mean odds ratio or risk ratio from a multistudy design can be considerably narrower, and hence more informative, than Equation 1 or 4 obtained in a single study. Furthermore, a confidence interval for a geometric mean odds ratio or risk ratio can have substantially greater external validity than the confidence interval for a single odds ratio or risk ratio.

Statistical Models for Multistudy Designs

The estimator ϑ̂k is a random variable, and hence a statistical model may be used to represent its large-sample expected value and random deviation from expectation. Three types of statistical models can be used to represent odds ratio or risk ratio estimators from multiple studies: the CC model, the VC model, and the RC model. The CC and RC models and inferential methods for odds ratios, risk ratios, and risk differences are described by Borenstein et al. (2009). A VC model and inferential methods for risk differences are described by Bonett and Price (2014). A VC model for odds ratios and risk ratios is described below.

The VC model for ϑ̂k can be expressed as

ϑ̂k = ϑ + γk + εk,  (7)

where Σ_{k=1}^{m} γk = 0 and the disturbances εk are assumed to be independent and (in large samples) normally distributed heteroscedastic random variables. The unknown parameter ϑ is an unweighted average of the m population effect sizes, and each γk is an unknown constant that represents heterogeneity effects due to unspecified moderator variables. The appropriate estimator of ϑ in the VC model is the following unweighted average:

ϑ̂ = m⁻¹ Σ_{k=1}^{m} ϑ̂k,  (8)

where ϑ̂k is given in Equation 3 or Equation 6 for the log odds ratio or log risk ratio, respectively. The geometric mean odds ratio or risk ratio for the m study populations is equal to exp(ϑ), and exponentiating Equation 8 gives an estimator of exp(ϑ).

The VC model (Equation 7) can be viewed as a highly specialized case of more general and commonly used VC statistical models. Fixed-effect analysis of variance (ANOVA) and analysis of covariance models, general linear models with fixed quantitative predictor variables, generalized linear models that include indicator variables, and multiple-group structural equation models are some examples of VC models. Overton (1998) gives an example of a VC model for correlation estimators, and an equivalent representation of Equation 7 for an arbitrary estimator is given by Cox (2006).

Let ϑk = ϑ + γk. The CC model, which can be written as ϑ̂k = ϑ + εk, is a special case of Equation 7 in which all γk values are assumed to equal zero and consequently all ϑk values are assumed to be equal. Assumed equality of the ϑk values in the CC model is referred to as the effect-size homogeneity assumption. The CC and VC models are both fixed-effects models because ϑ and γk are unknown constants, and stratified random sampling is typically assumed in both models, in which random samples of size njk are obtained from the m study populations. The appropriate estimator of ϑ in the CC model is the following inverse-variance weighted average:

ϑ̃ = Σ_{k=1}^{m} ŵk ϑ̂k / Σ_{k=1}^{m} ŵk,  (9)

where ŵk = 1/[var̂(ϑ̂k)] and var̂(ϑ̂k) is given in Equation 2 for odds ratios and Equation 5 for risk ratios. Exponentiating Equation 9 gives an estimator of exp(ϑ). Equation 9 is not a consistent estimator of ϑ if the effect-size homogeneity assumption has been violated and the weights in Equation 9 are not all equal. The large-sample bias of Equation 9 is equal to

ϑ − Σ_{k=1}^{m} E(ŵk)ϑk / Σ_{k=1}^{m} E(ŵk),  (10)

where E(ŵk) is the expected value of ŵk. The bias vanishes if all ϑk values are equal (i.e., the effect-size homogeneity assumption has been satisfied) or if all weights are equal. Note also that Equation 9 is a consistent estimator of ϑ′ = Σ_{k=1}^{m} E(ŵk)ϑk / Σ_{k=1}^{m} E(ŵk). With unequal sample sizes across studies and effect-size heterogeneity, ϑ′ is an improper population parameter because it is a function of the sample sizes. However, under effect-size homogeneity, ϑk = ϑ for all k, and ϑ′ then becomes a proper parameter because ϑ′ = Σ_{k=1}^{m} E(ŵk)ϑk / Σ_{k=1}^{m} E(ŵk) = ϑ Σ_{k=1}^{m} E(ŵk) / Σ_{k=1}^{m} E(ŵk) = ϑ.

The RC model, also referred to as a random-effects (RE) model, assumes that the m effect sizes are a random sample from a large definable superpopulation of effect sizes ϑ1, ϑ2, . . . , ϑM. We can write the RC model as ϑ̂k = ϑ* + γk + εk, where ϑ* is an unweighted mean of the M effect sizes and each γk is a random variable with a superpopulation mean of 0 and superpopulation standard deviation of τ. In the RC model, ϑ* and τ are the fundamental parameters, and confidence intervals for both ϑ* and τ are required to characterize the superpopulation normal distribution. The RC model also assumes that the ϑk values in the superpopulation have an approximate normal distribution (the normality assumption is needed only for interval estimation of ϑ* and τ). The RC model is appealing because it does not assume effect-size homogeneity and because ϑ* is an unweighted average of M effect sizes, whereas ϑ in the CC and VC models is an unweighted average of m < M effect sizes. The superpopulation mean is traditionally estimated using the following inverse-variance weighted average:

ϑ̃* = Σ_{k=1}^{m} ŵ*k ϑ̂k / Σ_{k=1}^{m} ŵ*k,  (11)

where ŵ*k = 1/[var̂(ϑ̂k) + τ̂²] and τ̂² is typically obtained using the estimation method of DerSimonian and Laird (1986). Exponentiating Equation 11 gives an estimator of exp(ϑ*). When τ is small or the sample sizes in each study are small, the DerSimonian–Laird estimate of τ² will frequently equal 0, and Equation 11 then reduces to Equation 9. Equation 11 is not a consistent estimator of ϑ* when the weights are correlated with the effect sizes (see Appendix). The large-sample bias of Equation 11 is equal to

ϑ* − [ϑ* + mρwϑ σw τ / Σ_{k=1}^{m} E(ŵ*k)],  (12)

where ρwϑ is the correlation between the true weights and the study population effect sizes in the superpopulation, and E(ŵ*k) is the large-sample expected value of ŵ*k. Note that the large-sample bias of Equation 11 vanishes if ρwϑ = 0, or if all weights in the superpopulation are equal (σw = 0), or if all effect sizes in the superpopulation are equal (τ = 0).
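To make the differences among Equations 8, 9, and 11 concrete, here is a small numeric sketch; the log effect sizes and variances are invented, and τ̂² uses the DerSimonian–Laird moment estimator:

```python
import math

# Hypothetical log odds ratios (Equation 3) and estimated variances
# (Equation 2) from m = 4 studies.
log_es = [0.40, 0.55, 0.10, 0.80]
var_es = [0.05, 0.10, 0.02, 0.20]
m = len(log_es)

# Equation 8 (VC): unweighted average of the log effect sizes.
vc_est = sum(log_es) / m

# Equation 9 (CC): inverse-variance weighted average.
w = [1 / v for v in var_es]
cc_est = sum(wk * y for wk, y in zip(w, log_es)) / sum(w)

# DerSimonian-Laird moment estimate of tau^2, truncated at zero.
q = sum(wk * (y - cc_est) ** 2 for wk, y in zip(w, log_es))
c = sum(w) - sum(wk ** 2 for wk in w) / sum(w)
tau2 = max(0.0, (q - (m - 1)) / c)

# Equation 11 (RC): weights add tau^2 to each within-study variance.
w_star = [1 / (v + tau2) for v in var_es]
rc_est = sum(wk * y for wk, y in zip(w_star, log_es)) / sum(w_star)

# Exponentiating any of these gives a geometric mean odds ratio estimate.
print(math.exp(vc_est), math.exp(cc_est), math.exp(rc_est))
```

Because the most precise study here also has the smallest effect, both weighted averages are pulled toward it; the RC estimate falls between the CC and VC estimates because τ̂² > 0 flattens the weights, and when τ̂² is truncated at 0 the RC average reduces to Equation 9.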


Inferential Methods for the Varying Coefficient Model

Inferential methods for the CC model and RC model are discussed in meta-analysis textbooks (see, e.g., Borenstein et al., 2009) for a variety of effect-size parameters. VC inferential methods for odds ratios and risk ratios are given here. The following large-sample 100(1 − α)% confidence interval for exp(ϑ) of the VC model is proposed:

exp[ϑ̂ ± zα/2 √var̂(ϑ̂)],  (13)

where ϑ̂ is given in Equation 8, var̂(ϑ̂) = m⁻² Σ_{k=1}^{m} var̂(ϑ̂k), ϑ̂k is given in Equation 3 or Equation 6, and var̂(ϑ̂k) is given in Equation 2 or Equation 5 for odds ratios or risk ratios, respectively.

Suppose the researcher believes that some characteristic of the m study populations is responsible for variation in the ϑ̂k values. For instance, we might suspect that the first two ϑ̂k values differ from the next three ϑ̂k values because the first two studies used samples of college students and the next three studies used samples of high school students. In this situation, a confidence interval for (ϑ1 + ϑ2)/2 − (ϑ3 + ϑ4 + ϑ5)/3 could provide useful information. In general, various research questions can be addressed by an examination of specific linear functions of the unknown population parameters ϑ1, ϑ2, . . . , ϑm, which are log-linear functions of odds ratios or risk ratios. A general log-linear function can be expressed as Σ_{k=1}^{m} ck ϑk, where each ck is a numerical value specified by the researcher. For instance, (ϑ1 + ϑ2)/2 − (ϑ3 + ϑ4 + ϑ5)/3 can be expressed as Σ_{k=1}^{m} ck ϑk, where c1 = 1/2, c2 = 1/2, c3 = −1/3, c4 = −1/3, and c5 = −1/3. The following large-sample 100(1 − α)% confidence interval for Σ_{k=1}^{m} ck ϑk is proposed:

Σ_{k=1}^{m} ck ϑ̂k ± zα/2 √(Σ_{k=1}^{m} ck² var̂(ϑ̂k)),  (14)

where ϑ̂k is given in Equation 3 or Equation 6 and var̂(ϑ̂k) is given in Equation 2 or Equation 5 for odds ratios or risk ratios, respectively. Exponentiating Equation 14 usually gives a more interpretable result.

In some multistudy designs, it might be easier to assess the effects of certain moderator factors using a log-linear statistical model than to specify the ck coefficients in a log-linear function. This is especially true when assessing the effects of quantitative factors and interaction effects involving quantitative factors. The m estimators ϑ̂1, ϑ̂2, . . . , ϑ̂m may be expressed as the following linear function of q − 1 specified predictor variables and m − q unspecified predictor variables:

ϑ̂ = Xβ + Zγ + ε,  (15)

where ϑ̂ is an observable m × 1 random vector with typical element ϑ̂k, X is a fixed m × q full-rank design matrix that codes known quantitative or qualitative characteristics of the m study populations, β is a q × 1 vector of unknown fixed parameters, and ε is an unobservable m × 1 vector of heteroscedastic random sampling errors with var(εk) = var(ϑ̂k). Usually, the first column of X will be an m × 1 vector of ones so that the first element of β will represent an intercept coefficient. We assume that there is variation in effect sizes that will not be completely explained by X. This unexplained heterogeneity could be explained by m − q unspecified predictor variables that are represented in Equation 15 by the columns of Z, which is a fixed m × (m − q) full-rank matrix. With the appropriate choice of X, the elements of β could represent interesting effects, such as main effects, interaction effects, simple main effects, slopes, or simple slopes (see, e.g., Pedhazur, 1997). An appropriate choice of dummy coding, effect coding, or orthogonal coding will give β the most useful interpretation. Dummy coding is useful when coding a single qualitative moderator variable; effect coding is useful when coding factorial effects; and orthogonal coding is useful when coding polynomial effects of a quantitative moderator variable. Interpretability of γ is not an issue (the elements of γ could be regarded as nuisance parameters), and we assume that Z is coded such that Z′X = 0. This condition can be satisfied by letting Z = [I − X(X′X)⁻¹X′]Z*, where Z* is a full-rank m × (m − q) matrix that codes unspecified predictor variables.

We propose the following ordinary least squares (OLS) estimator of β in Equation 15:

β̂ = (X′X)⁻¹X′ϑ̂,  (16)

with estimated covariance matrix

cov̂(β̂) = (X′X)⁻¹X′V̂X(X′X)⁻¹,  (17)

where V̂ is a diagonal matrix with var̂(ϑ̂k) in the kth diagonal element. The estimated variance of β̂t is the tth diagonal element of Equation 17, which will be denoted as var̂(β̂t). A large-sample 100(1 − α)% confidence interval for βt is

β̂t ± zα/2 √var̂(β̂t),  (18)

where α may be replaced with α/g to obtain g simultaneous Bonferroni confidence intervals for any g elements of β. An exponentiated slope coefficient (e^βt) describes the multiplicative change in exp(ϑ) associated with a 1-point increase in the tth predictor variable, controlling for all other predictor variables in the model. Exponentiating Equation 18 gives a confidence interval for this multiplicative change.

Let x0 denote a 1 × q vector containing values of the predictor variables represented by the corresponding columns of X in Equation 15, and let ϑ0 denote the predicted value of ϑ at x0. An estimator of ϑ0 is

ϑ̂0 = x′0 β̂,  (19)

and its estimated variance is

var̂(ϑ̂0) = x′0 cov̂(β̂) x0.  (20)
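Equations 16–20 can be sketched with NumPy; the moderator values and effect sizes below are invented:

```python
import numpy as np

# Hypothetical data: m = 6 studies, one quantitative moderator (mean age).
theta_hat = np.array([0.20, 0.35, 0.30, 0.55, 0.60, 0.75])  # log effect sizes
var_hat = np.array([0.04, 0.06, 0.03, 0.05, 0.08, 0.06])    # Equation 2 or 5
age = np.array([20.0, 25.0, 30.0, 35.0, 40.0, 45.0])

X = np.column_stack([np.ones_like(age), age])   # intercept + moderator

# Equation 16: OLS estimate of beta.
beta = np.linalg.solve(X.T @ X, X.T @ theta_hat)

# Equation 17: covariance matrix of beta-hat under heteroscedastic errors.
V = np.diag(var_hat)
XtX_inv = np.linalg.inv(X.T @ X)
cov_beta = XtX_inv @ X.T @ V @ X @ XtX_inv

# Equation 18: 95% CI for the slope; exponentiating gives the multiplicative
# change in the odds or risk ratio per 1-unit increase in the moderator.
z = 1.959964
se_slope = np.sqrt(cov_beta[1, 1])
slope_ci = (beta[1] - z * se_slope, beta[1] + z * se_slope)

# Equations 19 and 20: predicted log effect size and its variance at age 28.
x0 = np.array([1.0, 28.0])
theta0 = x0 @ beta
theta0_var = x0 @ cov_beta @ x0
```

Exponentiating `theta0` and the endpoints of `slope_ci` returns the results to the odds ratio or risk ratio scale, as described above.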

Exponentiating Equation 19 gives a predicted odds ratio or risk ratio value. A large-sample 100(1 − α)% confidence interval for ϑ0 is

ϑ̂0 ± zα/2 √var̂(ϑ̂0),  (21)

and exponentiating Equation 21 gives a confidence interval for a predicted odds ratio or risk ratio at predictor values equal to x0.

The meta-regression model (see, e.g., Hedges & Olkin, 1985, pp. 169–170) assumes γ = 0, and a weighted least squares (WLS) estimator of β is asymptotically optimal under the γ = 0 assumption. The WLS estimator of β is (X′WX)⁻¹X′Wϑ̂, where W is a diagonal matrix of weights, and it is more efficient than the proposed OLS estimator (Equation 16). However, it can be shown that the WLS estimator is not a consistent estimator of β if γ ≠ 0. The large-sample bias of the WLS estimator is

[(X′WX)⁻¹X′WZ* − (X′X)⁻¹X′Z*]γ,  (22)

which vanishes if γ = 0 or if the weights are all equal. The WLS bias has also been discussed by Overton (1998), who proposed OLS estimation for the case of Fisher-transformed correlations.

Note that if X = 1, where 1 is an m × 1 vector of ones, Equation 15 is an equivalent representation of Equation 7, and β is a 1 × 1 scalar that is equal to ϑ. With X = 1, the CC model is specified by assuming γ = 0. With X = 1, the OLS estimator of β is equal to ϑ̂ (Equation 8) and the WLS estimator of β is equal to ϑ̃ (Equation 9). Note that with X = 1, Z could be specified as an m × (m − 1) matrix of effect codes (see Pedhazur, 1997, p. 360) so that γ will then consist of m − 1 effect-size deviations from ϑ. Note also that if X is m × m, then Zγ does not appear in Equation 15, and it is easy to show that the WLS and OLS estimators of β are identical.
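As a concrete sketch of the proposed VC intervals, the geometric mean interval (Equation 13) and a contrast interval (Equation 14) might be computed as follows (all numbers invented):

```python
import math
from statistics import NormalDist

log_es = [0.40, 0.55, 0.10, 0.80, 0.20]   # hypothetical log risk ratios
var_es = [0.05, 0.10, 0.02, 0.20, 0.08]   # Equation 5 values
m = len(log_es)
z = NormalDist().inv_cdf(0.975)

# Equation 13: CI for the geometric mean risk ratio exp(vartheta).
mean = sum(log_es) / m                    # Equation 8
var_mean = sum(var_es) / m ** 2
gm_ci = (math.exp(mean - z * math.sqrt(var_mean)),
         math.exp(mean + z * math.sqrt(var_mean)))

# Equation 14: CI for the contrast (v1 + v2)/2 - (v3 + v4 + v5)/3,
# exponentiated into a ratio of geometric mean risk ratios.
c = [1/2, 1/2, -1/3, -1/3, -1/3]
est = sum(ck * y for ck, y in zip(c, log_es))
se = math.sqrt(sum(ck ** 2 * v for ck, v in zip(c, var_es)))
contrast_ci = (math.exp(est - z * se), math.exp(est + z * se))

print(gm_ci, contrast_ci)
```

A contrast interval that excludes 1 after exponentiation would suggest that the two sets of study populations (e.g., college vs. high school samples) have different geometric mean effect sizes.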

Simulation Studies

The proposed confidence interval methods are based on large-sample theory, and their exact small-sample properties cannot be determined analytically. The Monte Carlo method was used to assess the small-sample performance of the proposed confidence interval methods under a wide range of conditions. The GAUSS programming language was used in the simulation studies, and the GAUSS simulation programs are provided as online supplementary materials.

Simulation Results for the Geometric Mean in the Varying Coefficient Model

The sampling distribution of ϑ̂k is discrete, and confidence intervals based on functions of ϑ̂k can behave erratically with small changes in the sample sizes and cell probabilities. For this reason, it is necessary to examine the performance of the proposed confidence interval methods under a very large number of sample sizes and effect sizes. We examined the performance of Equation 13 in 4,500 different patterns of sample sizes and effect sizes for each of m = 5, m = 15, and m = 30. In each condition, 75,000 m × 2 × 2 tables of frequency counts were randomly generated, and the 95% coverage probability of Equation 13 was estimated from the 75,000 Monte Carlo trials. In this simulation study, the sample sizes per group within each study were equal but varied from 10 to 100 across the m studies; in each condition, the m sample sizes were randomly selected from a uniform distribution. The results have been summarized in Table 3 into 27 sets of 500 conditions that differ in terms of the range of π1 and π2 values across studies and the three values of m. The values of π1 and π2 within each study were randomly selected from a uniform distribution. In conditions in which π1 and π2 have a large range of possible values, the population odds ratios or risk ratios can have greater heterogeneity across studies. After cell probabilities and sample sizes were selected for a particular condition, these values were fixed for all 75,000 replications within that condition. The uniform distribution was used to obtain good coverage of effect sizes and sample sizes over the specified ranges. The average coverage probability across each set of 500 conditions and the smallest coverage probability within each set are reported in Table 3.

The typical format for reporting results of this type is to report the simulated coverage probability for each condition separately. For instance, the most extensive simulation reported in Hedges and Olkin (1985, pp. 176–179) has 96 conditions (with 2,000 replications per condition), and the results are reported in 96 rows of a multipage table. Table 3 summarizes the results for 13,500 conditions, with each condition based on 75,000 replications. The results in Table 3 show that the proposed VC confidence interval has good small-sample coverage properties, with a worst-case 95% coverage probability across all 13,500 conditions of .948 for odds ratios and .942 for risk ratios. However, Equation 13 is conservative when π1 and π2 are close to 0 or 1.

Additional simulations were performed but are not reported in Table 3. These additional simulations show that increasing the sample sizes results in average and worst-case coverage probabilities that are closer to .95, and that the results in Table 3 are virtually unchanged when the sample sizes within each study are unequal (ratios of n1k/n2k as large as 3 were examined). A similar pattern of results was obtained using 90% and 99% confidence levels.
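A toy version of this coverage simulation can convey the logic (5 studies, one condition, and far fewer trials than the article's 75,000; all settings are invented):

```python
import math
import random
from statistics import NormalDist

random.seed(1)
z = NormalDist().inv_cdf(0.975)

m, n = 5, 50                           # studies; per-group sample size
pi1 = [0.45, 0.50, 0.55, 0.48, 0.52]   # event probabilities, group 1
pi2 = [0.10, 0.12, 0.08, 0.11, 0.09]   # event probabilities, group 2

# True average log odds ratio (the parameter covered by Equation 13).
true_mean = sum(math.log((p1 / (1 - p1)) / (p2 / (1 - p2)))
                for p1, p2 in zip(pi1, pi2)) / m

trials, covered = 2000, 0
for _ in range(trials):
    logs, variances = [], []
    for p1, p2 in zip(pi1, pi2):
        f11 = sum(random.random() < p1 for _ in range(n))   # binomial draw
        f12 = sum(random.random() < p2 for _ in range(n))
        a, b = f11 + 0.5, f12 + 0.5
        c, d = (n - f11) + 0.5, (n - f12) + 0.5
        logs.append(math.log((a * d) / (b * c)))    # Equation 3
        variances.append(1/a + 1/b + 1/c + 1/d)     # Equation 2
    est = sum(logs) / m                             # Equation 8
    half = z * math.sqrt(sum(variances) / m ** 2)   # Equation 13
    covered += (est - half <= true_mean <= est + half)

print(covered / trials)   # should land near the nominal .95
```

Scaling this loop up to thousands of sample-size and probability patterns, each with 75,000 trials, yields summaries of the kind reported in Table 3.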

Table 3
Performance of 95% Varying Coefficient Confidence Intervals for Odds Ratios and Risk Ratios

                           Average coverage          Minimum coverage
π1        π2        m = 5   m = 15   m = 30    m = 5   m = 15   m = 30

Odds ratio
.02–.05   .02–.05    .990     .990     .990     .974     .977     .981
.05–.15   .05–.15    .972     .970     .969     .955     .955     .957
.45–.55   .45–.55    .952     .951     .950     .948     .949     .948
.45–.55   .05–.15    .959     .960     .960     .950     .952     .954
.85–.95   .05–.15    .958     .960     .961     .950     .953     .953
.05–.25   .05–.25    .967     .965     .964     .952     .953     .954
.40–.60   .40–.60    .952     .951     .951     .949     .947     .948
.40–.60   .05–.25    .958     .960     .958     .950     .950     .952
.05–.95   .05–.95    .960     .960     .959     .948     .950     .948

Risk ratio
.02–.05   .02–.05    .992     .990     .990     .977     .975     .980
.05–.15   .05–.15    .977     .972     .970     .953     .954     .955
.45–.55   .45–.55    .958     .956     .955     .951     .952     .952
.45–.55   .05–.15    .969     .970     .959     .952     .951     .938
.85–.95   .05–.15    .969     .972     .957     .953     .952     .937
.05–.25   .05–.25    .972     .969     .966     .952     .952     .952
.40–.60   .40–.60    .959     .956     .955     .951     .951     .952
.40–.60   .05–.25    .966     .967     .961     .952     .950     .943
.05–.95   .05–.95    .966     .967     .966     .950     .950     .942

Note. n1k = n2k = 10 to 100; π1 is the probability of an event in group 1 and π2 is the probability of an event in group 2; each row summarizes 500 conditions with 75,000 Monte Carlo trials per condition.

Simulation Results for the Geometric Mean in the Constant Coefficient Model

The CC model is still widely used in psychological meta-analysis studies (Schmidt, Oh, & Hayes, 2009), and the purpose of this simulation study is to demonstrate the poor performance of the traditional CC confidence interval for the geometric mean odds ratio or risk ratio under minor and difficult-to-detect violations of the effect-size homogeneity assumption. This study examined the performance of the traditional CC confidence interval for ϑ that uses the inverse-variance weighted estimator (Equation 9) with variance estimates and point estimators given in Equations 2 and 3 for the odds ratio and in Equations 5 and 6 for the risk ratio. The estimated coverage probabilities for the CC confidence intervals were obtained using the same procedure described above for the VC simulation study. However, only 9 of the 27 rows in Table 3 were examined; these 9 rows represent patterns of cell probabilities that constitute minor violations of the effect-size homogeneity assumption and would likely be undetected by the chi-square effect-size homogeneity test. For instance, the average power of the chi-square effect-size homogeneity test based on log odds ratios across all 1,500 conditions in Table 4 for m = 5 was about .08, and the maximum power across all 1,500 conditions was only .55. Power was slightly lower when the homogeneity test was computed using log risk ratios. The results of the simulation study are summarized in Table 4.

Table 4
Performance of 95% Constant Coefficient Confidence Intervals Under Minor Effect-Size Heterogeneity

                         Average coverage          Minimum coverage
π1        π2        Odds ratio   Risk ratio   Odds ratio   Risk ratio

m = 5
.05–.15   .05–.15      .967         .968         .870         .866
.45–.55   .45–.55      .951         .956         .901         .908
.45–.55   .05–.15      .892         .832         .575         .457

m = 15
.05–.15   .05–.15      .967         .968         .854         .846
.45–.55   .45–.55      .950         .956         .908         .912
.45–.55   .05–.15      .757         .587         .386         .178

m = 30
.05–.15   .05–.15      .966         .968         .844         .822
.45–.55   .45–.55      .949         .956         .903         .925
.45–.55   .05–.15      .555         .305         .135         .037

Note. n1k = n2k = 10 to 100; π1 is the probability of an event in group 1 and π2 is the probability of an event in group 2; each row summarizes 500 conditions with 75,000 Monte Carlo trials per condition.

The minimum coverage probabilities in Table 4 clearly illustrate the limitations of the CC

confidence interval because its true coverage probability can be far below the specified confidence level in situations in which the researcher incorrectly believes that the CC model assumptions have been satisfied. We also examined the performance of Mantel–Haenszel (MH) confidence intervals for the geometric mean odds ratios and risk ratios, which assume a CC model and are options in SAS PROC FREQ. We found that the MH methods performed worse than the inverse-variance weighted methods in most conditions. Previous simulation research (see Fleiss, Levin, & Paik, 2003, pp. 254 –255) suggests that the MH methods are preferred to inverse-variance weighted methods if m is large and the sample sizes are small, but these findings are restricted to the case where the key assumption of the CC model (i.e., identical effect sizes across the m studies) has been satisfied. Researchers will be tempted to use the CC estimation methods rather than the VC estimation methods because a CC confidence interval is usually narrower than a VC confidence interval. However, as the simulation results clearly illustrate, a CC confidence interval can have a coverage probability that is far below the claimed level of confidence. It is important to remember that a “nonsignificant” effect-size homogeneity test does not imply that the effect sizes are equal and the CC model can be justified (Borenstein et al., 2009, p. 84).
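The undercoverage pattern in Table 4 is easy to reproduce. The following Monte Carlo sketch is our illustration, not the authors' simulation code: it draws heterogeneous study populations with π1k = .5 and π2k uniform on .05–.25, uses nk = 50 per group, and pools +.5-adjusted log odds ratios with inverse-variance weights; the CC interval's coverage of the average log odds ratio falls well below the nominal .95.

```python
import numpy as np

# Monte Carlo sketch of CC (fixed-effect) coverage under minor heterogeneity.
rng = np.random.default_rng(1)
m, n, trials, z = 5, 50, 3000, 1.959964
cover = 0
for _ in range(trials):
    p1 = np.full(m, 0.50)
    p2 = rng.uniform(0.05, 0.25, size=m)       # heterogeneous study populations
    theta = np.log(p1 / (1 - p1)) - np.log(p2 / (1 - p2))  # true log odds ratios
    target = theta.mean()                      # average effect size in the m studies
    x1, x2 = rng.binomial(n, p1), rng.binomial(n, p2)
    # +.5-adjusted log odds ratios, variances, and inverse-variance weights
    lor = np.log((x1 + .5) * (n - x2 + .5) / ((n - x1 + .5) * (x2 + .5)))
    v = 1/(x1 + .5) + 1/(n - x1 + .5) + 1/(x2 + .5) + 1/(n - x2 + .5)
    w = 1 / v
    est = np.sum(w * lor) / np.sum(w)          # CC pooled estimate
    se = np.sqrt(1 / np.sum(w))
    cover += (est - z * se <= target <= est + z * se)
print("CC coverage:", cover / trials)          # well below the nominal .95
```

The 95% interval is anchored to the inverse-variance standard error, which ignores between-study variability, so the interval is systematically too narrow whenever the effect sizes are not identical.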

Simulation Results for Slope Coefficient in the Varying Coefficient Log-Linear Model

This simulation study compared the performance of OLS (Equation 18) and WLS confidence intervals for a slope coefficient in a VC log-linear model (where γ ≠ 0) for odds ratios or risk ratios with one predictor variable under various degrees of unspecified


heterogeneity. We compared the OLS confidence intervals and WLS confidence intervals for two different design matrices and m = 5. With the first design matrix, the ϑk values are assumed to vary linearly as a function of the predictor variable. With the second design matrix, the ϑk values are assumed to be equal for the first two studies and equal for the last three studies, so that β1 = (ϑ1 + ϑ2)/2 − (ϑ3 + ϑ4 + ϑ5)/3. With the first design matrix, model misspecification would result from any deviation from linearity. With the second design matrix, model misspecification would result from any inequality of ϑk values in the first two studies or any inequality of ϑk values in the last three studies.

Table 5 summarizes the confidence interval results for the two design matrices from 500 patterns of effect sizes and sample sizes. The minimum coverage probability estimates in Table 5 indicate that the coverage probability of the WLS confidence interval can be far below the specified confidence level when γ ≠ 0. The OLS method performs properly regardless of the degree of model misspecification. Although the traditional WLS estimator is more efficient than the proposed OLS estimator, the WLS estimator is inconsistent under model misspecification, and thus the traditional WLS confidence interval can perform much worse than the proposed OLS confidence interval. The OLS simulation results in Table 5 also apply to the confidence interval for ∑ckϑ̂k (Equation 14) and to ϑ̂0 (Equation 21). The OLS results apply to Equation 14 because β̂t can be expressed as ∑ckϑ̂k, where c1, c2, …, cm are the elements of the tth row of (X′X)⁻¹X′. The OLS results also apply to Equation 21 because ϑ̂0 is a linear function of β̂.

Researchers will be tempted to use the classic WLS methods rather than the proposed OLS methods because the WLS methods give confidence intervals that are narrower than the OLS confidence intervals in any given meta-analysis. However, the simulation results clearly show that the WLS-based confidence intervals have coverage probabilities that can be far below the claimed level of confidence.

Table 5
Comparison of 95% Ordinary Least Squares (OLS) and Weighted Least Squares (WLS) Confidence Intervals for Slope

                         Average coverage     Minimum coverage
π1         π2            OLS      WLS         OLS      WLS

Odds ratio
x2′ = [1 2 3 4 5]
.05–.15    .05–.15      .974     .969        .955     .868
.05–.25    .05–.25      .968     .948        .952     .455
.05–.45    .05–.45      .962     .896        .951     .013
x2′ = [1 1 0 0 0]
.05–.15    .05–.15      .973     .968        .955     .884
.05–.25    .05–.25      .967     .944        .953     .521
.05–.45    .05–.45      .962     .873        .950     .156

Risk ratio
x2′ = [1 2 3 4 5]
.05–.15    .05–.15      .978     .969        .955     .866
.05–.25    .05–.25      .972     .943        .952     .438
.05–.45    .05–.45      .967     .852        .951     .029
x2′ = [1 1 0 0 0]
.05–.15    .05–.15      .977     .969        .954     .874
.05–.25    .05–.25      .972     .935        .952     .444
.05–.45    .05–.45      .968     .836        .950     .006

Note. x2 is the second column of the design matrix; n1k = n2k = 10 to 100; π1 is the probability of an event in Condition 1 and π2 is the probability of an event in Condition 2; each row summarizes 500 conditions with 75,000 Monte Carlo trials per condition.

Model Choice Recommendations

The effect-size homogeneity assumption of the CC model is difficult to justify in most meta-analysis applications (Aguinis, Gottfredson, & Wright, 2011; Borenstein et al., 2009, p. 83; Hunter & Schmidt, 2004; National Research Council, 1992), and the popularity of CC meta-analysis methods (Schmidt et al., 2009) suggests that researchers and journal editors are simply not aware of the unacceptable performance characteristics of the CC methods when the effect-size homogeneity assumption has been violated. Borenstein et al. (2009) stated that the CC model should be used only if the m studies are "functionally identical" but admit that such cases are "relatively rare" (pp. 83–84). The popularity of the CC model is perhaps due to a fundamental misunderstanding of the widely used chi-square effect-size homogeneity test—many researchers incorrectly believe that a failure to reject the null hypothesis of effect-size homogeneity is evidence in support of effect-size homogeneity and that this type of evidence justifies the use of the CC model. Although there is no statistical procedure that can be used to show that all m effect sizes are equal, equivalence tests (see, e.g., Wellek, 2000) for all pairwise comparisons can be used to determine whether the differences in population effect sizes are small. However, equivalence tests to detect small differences usually require enormous within-study sample sizes. The simulation results in Table 4 illustrate how minor and difficult-to-detect violations of the effect-size homogeneity assumption have deleterious effects on the performance of the CC confidence interval methods. In contrast, the VC confidence intervals capture the true average effect size with probability that is close to the stated confidence level under a wide range of effect-size heterogeneity conditions.

As noted above, the classic inverse-variance weighted average estimator (Equation 8) is not a consistent estimator of ϑ but instead estimates the quantity ϑ′ = ∑E(ŵk)ϑk / ∑E(ŵk) (with sums over k = 1, …, m), which is not a proper population parameter unless the weights are equal or the effect sizes are equal. This same problem has been extensively discussed in the context of choosing between Type II and Type III ANOVA sums of squares in the analysis of unbalanced factorial designs. Note that a meta-analysis of m two-group studies can be represented as a 2 × m design. A Type II analysis in a 2 × m design assumes that the simple main effects of the first factor are equal across the m levels of the second factor (i.e., a zero two-way interaction effect), which is analogous to using a CC meta-analysis model in which the effect size is assumed to be identical across the m studies. A Type III analysis does not assume equal simple main effects across the m levels of the second factor and is analogous to using a VC meta-analysis model. Unless the two-way interaction effect is zero, the Type II estimates of the main effects will be functions of sample sizes (see, e.g., Maxwell & Delaney, 2004, p. 327). The Type II analysis is recommended only if the population two-way interaction effect is known to be zero (see, e.g., Maxwell & Delaney, 2004, p. 334), which is consistent with a recommendation to use the CC model only if the effect-size homogeneity




assumption can be satisfied. Although some statisticians recommend a Type II analysis because of potentially greater power in hypothesis testing applications (Nelder, 1976), we are unaware of any recommendation in favor of a Type II analysis for effect-size estimation because the Type II effect-size estimate would be an uninteresting function of sample sizes.

A preference for the VC model over the CC model also follows from the "never-pool," "sometimes-pool," and "always-pool" recommendations (see, e.g., Janky, 2003) regarding the pooling of the interaction sum of squares with the error sum of squares in factorial designs. In a 2 × m ANOVA, the always-pool approach assumes the two-way interaction is zero (i.e., equal simple main effects of the first factor across the levels of the second factor) and is analogous to always using the CC model. The never-pool approach does not assume a zero two-way interaction and is analogous to always using the VC model. The sometimes-pool approach is analogous to using a preliminary test of effect-size homogeneity and then using the CC model if the homogeneity test is nonsignificant. The problems with the sometimes-pool approach are now well known, and the always-pool approach is not considered to be a viable option (Janky, 2003). The never-pool approach has been the recommended approach for at least 50 years (see Scheffé, 1959, p. 126) and is consistent with our recommendation to use the VC model and avoid the CC model even if a preliminary test of effect-size homogeneity is nonsignificant.

Given the above arguments, the VC model will be preferred to the CC model in virtually every meta-analysis application. Researchers will then need to choose between the VC model and the RC model. Until recently, CC methods were the most frequently used methods in psychology (Schmidt et al., 2009), but RC methods are now gaining popularity (Cumming, 2012, p. 209).
However, most psychologists are turning to the RC model for the wrong reason—they are using the RC model simply to avoid the unrealistic effect-size homogeneity assumption of the CC model. If effect-size heterogeneity is the only concern, then the VC model is the recommended model. The main advantage of the RC model over the VC model, assuming the necessary assumptions of the RC model can be satisfied, is the potential of the RC model to provide statistical inference to a superpopulation of study populations (or effect sizes). Hunter and Schmidt (2004) motivated the use of the RC model by stating that

it is difficult (and perhaps impossible) to conceive of a situation in which a researcher would be interested only in the specified studies included in the meta-analysis and would not be interested in the broader task of estimation of the population effect sizes for the research domain as a whole. (pp. 396–397)

Raudenbush (2009) suggests that the RC model provides generalization to "a universe of possible studies—studies that realistically could have been conducted or might be conducted in the future" (p. 297). Proponents of the RC model emphasize the benefits of generalizing beyond the m study populations, but our concern is that the conditions that must be satisfied to achieve the desired generalizability have not been completely or clearly explained in the literature. For the RC results to statistically generalize beyond the m study populations, the researcher must provide convincing arguments or evidence that the effect sizes that are estimated in the m studies are a random sample from some definable superpopulation. In addition to the concerns raised by Bonett (2008, 2009, 2010) regarding the feasibility of the random sampling of studies assumption, Ferguson and Brannick (2012) provide additional examples of biased but commonly used study-selection strategies that would unequivocally violate the random sampling assumption. As noted by Overton (1998), the m effect sizes "most likely are not representative of the population" (p. 356). According to Schultz (2004), assuming the m effect sizes are a random sample from some definable superpopulation "is not feasible in practice and may represent a critical point in the application of RE models" (p. 41). Cox (2006) gives some examples in which an RC model might be justified but concludes that a "clear-cut justification and interpretation" of the superpopulation mean "is relatively rare" (p. 1078).

In our view, it is not realistic to show that the effect sizes for the m studies are a random sample from "the research domain as a whole" or "a universe of possible studies." It is far more realistic to argue that the effect sizes for the m studies are possibly a random sample from some smaller superpopulation. Suppose m = 10 similar studies were conducted at 10 different universities and the goal of the RC meta-analysis was to generalize the results to a superpopulation of 1,500 university study populations. The researcher might argue that the 10 studies were conducted at universities that appear to be representative of the 1,500 universities in terms of relevant university characteristics such as geographic location, ratio of public to private institutions, or admission selectivity. It is important to clearly define the superpopulation from which the m studies are believed to be a random sample because a correct and complete interpretation of the confidence intervals for ϑ* and τ depends on a clear specification of the superpopulation.
Even if an argument could be made in a particular meta-analysis that the effect sizes for the m studies are a random sample from some definable superpopulation, several other requirements need to be satisfied before an RC model should be used. The RC model requires a large number of studies (m) for two reasons. First, a confidence interval for τ might be uselessly wide unless m is large. In an RC model, the value of τ has important theoretical implications, and a wide confidence interval for τ defeats a primary objective of an RC meta-analysis. With a small number of studies (e.g., m ≤ 30), the lower confidence limit for τ can be close to zero and the upper limit can be large (Bonett & Price, 2014). In these cases, the lower limit suggests that the ϑk values are nearly identical in the superpopulation, while the upper limit supports a fundamentally different conclusion that there is substantial variability in the ϑk values. Moineddin, Matheson, and Glazier (2007) examined the performance of an RC logistic regression model, which is similar to an RC meta-analysis model for odds ratios, and recommended a minimum of m = 50.

A second reason for requiring large m is the need to assess the effect-size superpopulation normality assumption of the RC model. The traditional confidence intervals for τ are hypersensitive to small amounts of leptokurtosis (i.e., distributions having thicker tails or greater peakedness than a normal distribution). Kraemer and Bonett (2010) found that m = 150 studies are needed to detect, with power of .8, a degree of leptokurtosis that would seriously degrade the performance of the normal-theory confidence intervals for τ. Even with large m, the interpretation of a confidence interval for τ when analyzing odds ratios or risk ratios can be a problem because τ describes the standard deviation of log-transformed odds ratios or risk ratios, which will be a meaningless parameter to most


researchers. (This same problem occurs when using an RC model to analyze Fisher-transformed correlations.) The interpretability of τ also has implications for the application of Bayesian meta-analysis methods (see Borenstein et al., 2009, p. 318) because researchers will have difficulty specifying an informative prior distribution for τ. Some researchers report the I² measure of heterogeneity to avoid the issues of interpreting τ when effect sizes have been nonlinearly transformed (Higgins & Thompson, 2002). However, I² is equal to 1 − (m − 1)/Q, where Q is the chi-square effect-size homogeneity test statistic. I² is not an estimate of any population parameter, and its value is heavily influenced by the m within-study sample sizes. For any nonzero value of τ, I² (like Q) will be smaller with smaller within-study sample sizes and larger with larger within-study sample sizes.

As shown in Equation 12, the traditional inverse-variance weighted estimator for the RC model can be inconsistent if the weights are correlated with the effect sizes. With effect sizes that are assumed to vary across studies, and assuming each of the m studies was designed to have acceptable power, studies with smaller effect sizes will tend to have the larger sample sizes, which will result in a negative correlation between the weights and the effect sizes. Levine, Asada, and Carpenter (2009) found negative correlations between sample sizes and effect sizes in about 80% of the meta-analyses they examined. A correlation between effect sizes and sample sizes implies a correlation between effect sizes and weights because the weights are functions of the sample sizes. For log odds ratios and log risk ratios, the effect-size estimates can be highly correlated with the weights because of an intrinsic correlation between these effect-size estimates and their variances.
To illustrate how weights can be correlated with log odds ratios, consider m = 8 studies that all have 50 participants per group (to eliminate any correlation between sample size and effect size). Suppose π̂1 = .2 in all 8 studies and let π̂2 = [.04 .08 .12 .16 .20 .28 .36 .44] for the 8 studies. The estimated log odds ratios (Equation 3) are approximately [1.62 0.98 0.57 0.26 0 −0.43 −0.79 −1.11] and their estimated variances (Equation 2) are approximately [0.541 0.364 0.296 0.261 0.240 0.216 0.205 0.200]. The DerSimonian–Laird estimate of τ² for this example is .44, and the correlation between the estimated log odds ratios and weights is −.95. The correlation between estimated log risk ratios and weights is −.99 in this example. If π̂2 rather than π̂1 is set to .2 and π̂1 = [.04 .08 .12 .16 .20 .28 .36 .44], the correlations have the same value but are positive. This example shows that estimated log odds ratios and estimated log risk ratios can be highly correlated with weights even when the sample sizes are equal across studies. However, the magnitude of the large-sample bias of Equation 11 depends on more than just the correlation between weights and effect sizes. If the variability of the m weights is small (usually as the result of similar sample sizes across studies) and if τ² is small, the bias of Equation 11 can be small even with a large correlation between weights and effect sizes.

The possibility of a correlation between weights and effect sizes is a potentially serious problem with RC methods. Shuster, Hatton, Hendeles, and Winterstein (2010) carefully examined this problem and concluded that

Methods that try to weight the estimates inversely proportional to the variance have a number of undesirable properties, including bias, incorrect standard errors, inconsistency (including coverage of confidence intervals), and counterintuitive properties that the expectation of the estimator changes both with the number of studies sampled and with constant multiples of sample size across all studies. These adverse properties do not exist for the unweighted approach. (p. 1272)
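The numbers in this example can be verified directly. A minimal sketch follows; it assumes, as the stated values imply, the +.5-adjusted log odds ratio and variance estimators, the DerSimonian–Laird moment estimator of τ², and the reported −.95 correlation computed against the RC weights 1/(v̂k + τ̂²).

```python
import numpy as np

n = 50                                    # per-group sample size, all 8 studies
p1 = np.full(8, .20)
p2 = np.array([.04, .08, .12, .16, .20, .28, .36, .44])
f11, f12 = p1 * n, (1 - p1) * n           # events / non-events, group 1
f21, f22 = p2 * n, (1 - p2) * n           # events / non-events, group 2

# +.5-adjusted log odds ratios and variances
lor = np.log((f11 + .5) * (f22 + .5) / ((f12 + .5) * (f21 + .5)))
v = 1/(f11 + .5) + 1/(f12 + .5) + 1/(f21 + .5) + 1/(f22 + .5)
w = 1 / v                                 # fixed-effect weights

# DerSimonian–Laird moment estimate of tau^2 (Q uses the fixed-effect weights)
Q = np.sum(w * (lor - np.sum(w * lor) / np.sum(w))**2)
tau2 = max(0.0, (Q - (len(lor) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))

w_rc = 1 / (v + tau2)                     # RC weights
r = np.corrcoef(lor, w_rc)[0, 1]
print(round(float(tau2), 2), round(float(r), 2))   # 0.44 -0.95
```

Note that the correlation of the log odds ratios with the fixed-effect weights 1/v̂k is even stronger (about −.99), so the conclusion does not depend on which set of weights is used.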

Their recommendation to medical researchers that "past meta-analyses which have influenced public policy or clinical paradigms be reanalyzed by unweighted methods" (p. 1272) has far-reaching implications for meta-analyses that have been published in social and behavioral science journals. Note that all of the VC methods given here and in Bonett (2008, 2009, 2010) and Bonett and Price (2014) are based on unweighted (OLS) methods.

As noted previously, the primary benefit of an RC model is the potential to make a statistical inference to a superpopulation. However, the set of study populations to which a statistical inference can reasonably be made using an RC model might not be substantially greater than the set of study populations to which a convincing nonstatistical inference can be made using a VC model. Consider again the example of m = 10 published university studies in which ϑk is estimated in a study population of students who were eligible to participate in study k. In most university studies of undergraduate students, the study population in any single study is a subset of a much larger population that is of interest to the scientific community. The scientific value of study k is enhanced if a convincing argument can be made that the value of ϑk should be similar to the corresponding parameter value of other similar study populations. In our hypothetical example of m = 10 published studies, suppose the authors of Study 1 have convinced reviewers that their results could reasonably apply to all Midwest public universities, the authors of Study 2 have convinced reviewers that their results could reasonably apply to all highly selective universities, and so on. In this application, the extended set of study populations represented by the m = 10 studies could overlap considerably with the 1,500 study populations to which an RC analysis would apply, and there would then be little advantage in trying to justify the use of an RC model.
The VC model will be preferred to an RC model for most meta-analyses because (a) it may be difficult to provide convincing arguments or evidence that the m effect sizes are a random sample from some definable superpopulation; (b) a large number of studies is required to assess the superpopulation normality assumption; (c) a correlation between the weights and the estimates will bias the estimator of the superpopulation mean; (d) a fundamental parameter of the RC model, τ, is difficult to interpret with nonlinearly transformed effect sizes; (e) a large number of studies may be required to obtain a usefully narrow confidence interval for τ; and (f) RC results may not provide a level of inferential generality that is substantively greater than VC results.

Examples

Example 1

Bonett and Price (2014) compared and combined risk differences from five two-group experiments that compared aripiprazole, an antipsychotic drug, with a placebo. Some aripiprazole patients experience akathisia, a movement disorder characterized by unpleasant sensations of restlessness. The frequency counts are summarized in Table 6. Studies 1–3 sampled from study populations of adult schizophrenic patients, and Studies 4 and 5 sampled from study populations of adult bipolar disorder patients. Duration of treatment (in weeks) is included in Table 6 as a possible quantitative moderator variable.

Table 6
Sample Data From Five Studies of Akathisia Under Aripiprazole and Placebo Conditions

                           Akathisia             No akathisia                        95% CI
Study                  Aripiprazole  Placebo  Aripiprazole  Placebo  Weeks  Odds ratio    Risk ratio
Kane et al. (2002)          24          12        180           94      4   [0.50, 2.12]  [0.50, 2.18]
Potkin et al. (2003)        40           9        161           94      4   [1.18, 5.28]  [1.15, 4.41]
Marder et al. (2003)        93          28        839          387      5   [1.40, 2.34]  [0.98, 2.21]
Keck et al. (2003)          14           3        116          129      3   [1.40, 15.2]  [1.37, 14.5]
Keck et al. (2006)           5           1         72           82     26   [0.67, 26.1]  [0.66, 31.0]

Note. Akathisia is a movement disorder. Weeks = weeks of treatment with a placebo or the psychoactive drug aripiprazole; CI = confidence interval.

A log-linear model (Equation 15) was used to assess possible moderating effects of treatment duration and patient type on odds ratios and risk ratios. The design matrix includes a column of ones to code the intercept, a dummy variable (1 = schizophrenic, 0 = bipolar) to code patient type, and weeks of treatment as a quantitative predictor variable. The exponentiated slope estimates and confidence intervals were obtained using Equations 16–18 and are summarized in Table 7. The 95% confidence interval for the exponentiated slope for treatment duration has lower and upper limits that are close to 1.0, suggesting that treatment duration could have a trivial moderating effect. Although the 95% confidence interval for the exponentiated slope for patient type includes 1.0, the interval is too wide to rule out a possible moderating effect, and we prefer to describe this result as "inconclusive." Given an inconclusive effect of patient type, it could be informative to examine the predicted odds ratio or risk ratio for schizophrenic patients and for bipolar patients using Equations 19 and 21. At 4 weeks of treatment, the predicted risk ratio for schizophrenic patients is 1.51 (95% CI [1.07, 2.12]) and the predicted risk ratio for bipolar patients is 4.46 (95% CI [1.44, 13.82]). These results suggest that the effect of aripiprazole on akathisia could be substantially larger for bipolar patients than for schizophrenic patients, and further research is needed to obtain more accurate estimates within each of these two patient populations.
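For readers tracing the computations, a rough sketch of the OLS fit is below, using simple unadjusted log risk ratios (the paper's Equations 15–18 use adjusted estimates and also supply the standard errors, so these values differ slightly from Table 7):

```python
import numpy as np

# Table 6 counts
f11 = np.array([24, 40, 93, 14, 5])             # akathisia, aripiprazole
f21 = np.array([12, 9, 28, 3, 1])               # akathisia, placebo
n1 = f11 + np.array([180, 161, 839, 116, 72])   # aripiprazole group sizes
n2 = f21 + np.array([94, 94, 387, 129, 82])     # placebo group sizes

y = np.log((f11 / n1) / (f21 / n2))             # unadjusted log risk ratios
weeks = np.array([4, 4, 5, 3, 26])
schiz = np.array([1, 1, 1, 0, 0])               # 1 = schizophrenic, 0 = bipolar
X = np.column_stack([np.ones(5), weeks, schiz])

b, *_ = np.linalg.lstsq(X, y, rcond=None)       # OLS: (X'X)^-1 X'y
print(np.round(np.exp(b), 2))                   # [4.67 1.01 0.32]; Table 7: 4.45, 1.00, 0.33

# Predicted risk ratio for schizophrenic patients at 4 weeks
print(round(float(np.exp(b @ [1, 4, 1])), 2))   # 1.52; the adjusted estimate is 1.51
```

Because the design matrix contains the intercept, the slope for weeks, and the patient-type dummy, each predicted risk ratio is simply exp of a linear combination of the three OLS coefficients.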
Table 7
Log-Linear Model Results

                          Risk ratio                    Odds ratio
Effect               exp(β̂)   95% CI              exp(β̂)   95% CI
Intercept             4.45     [1.15, 17.26]       4.67     [1.19, 18.36]
Treatment duration    1.00     [0.91, 1.10]        1.00     [0.91, 1.09]
Patient type          0.33     [0.11, 1.08]        0.34     [0.10, 1.12]

Note. CI = confidence interval; exp(β̂) is an exponentiated ordinary least squares estimate.

We proceed with the assumption that the population odds ratios and risk ratios across the five studies are not highly dissimilar, so that the geometric mean odds ratio or risk ratio could be an interesting parameter to estimate. Applying Equation 13 gives a 95% confidence interval of [1.45, 3.88] for the geometric mean odds ratio and [1.42, 3.83] for the geometric mean risk ratio. These confidence intervals are narrower than any of the single-study confidence intervals reported in Table 6.

In this example, the effect-size homogeneity assumption of the CC model would be difficult to justify because the five study populations are distinctly different, and we should not expect equality of population odds ratios or risk ratios across the five study populations. It is a common but inappropriate practice to justify the use of a CC model based on a nonsignificant chi-square test of effect-size homogeneity. For the data in Table 6 we obtain Q(4) = 6.73, p = .151 for odds ratios and Q(4) = 6.85, p = .144 for risk ratios. These nonsignificant test results should not be interpreted as evidence of equal population odds ratios or risk ratios across the five study populations.

It is difficult to justify an RC model in this application because the effect sizes that were estimated in these five studies are not a random sample from some definable superpopulation of effect sizes—the five studies were deliberately selected on the basis of specific inclusion criteria regarding the type of study population, type of research design, type of drug treatment, and similarity in the akathisia assessment method. The fact that all five studies have at least one author in common casts additional doubt on the plausibility of the random sampling assumption required for the RC model. Furthermore, there is not enough information with m = 5 to assess the superpopulation normality assumption, and the confidence interval for τ might be uselessly wide.
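Both the homogeneity statistic and a geometric-mean interval can be checked from Table 6. The sketch below uses +.5-adjusted log odds ratios for Q and, as a simplification, unadjusted log risk ratios for the interval (the published [1.42, 3.83] uses the adjusted estimates of Equation 13, so the limits differ somewhat):

```python
import numpy as np

f11 = np.array([24, 40, 93, 14, 5])       # akathisia, aripiprazole (Table 6)
g1 = np.array([180, 161, 839, 116, 72])   # no akathisia, aripiprazole
f21 = np.array([12, 9, 28, 3, 1])         # akathisia, placebo
g2 = np.array([94, 94, 387, 129, 82])     # no akathisia, placebo
n1, n2 = f11 + g1, f21 + g2

# Chi-square homogeneity statistic for +.5-adjusted log odds ratios
lor = np.log((f11 + .5) * (g2 + .5) / ((g1 + .5) * (f21 + .5)))
w = 1 / (1/(f11 + .5) + 1/(g1 + .5) + 1/(f21 + .5) + 1/(g2 + .5))
Q = np.sum(w * (lor - np.sum(w * lor) / np.sum(w))**2)
print(round(float(Q), 2))                 # 6.73 on 4 df (p = .151, as reported)

# Unweighted (VC) geometric-mean risk ratio with unadjusted estimates
lrr = np.log((f11 / n1) / (f21 / n2))
se = np.sqrt(np.sum(1/f11 - 1/n1 + 1/f21 - 1/n2)) / 5
ci = np.exp(lrr.mean() + np.array([-1, 1]) * 1.959964 * se)
print(np.round(ci, 2))                    # roughly [1.44 4.18]; adjusted: [1.42, 3.83]
```

The unweighted average of the m log effect sizes, with standard error √(Σv̂k)/m, is what makes the VC interval free of the weight–effect-size correlation problem discussed earlier.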

Example 2

In a replication–extension study (see Bonett, 2012), a new study attempts to replicate and extend the results of a published study. If replication is achieved, the results of the new study are combined with the results of the published study to provide more precise interval estimates and greater external validity. The new study will also include additional conditions that extend the results of the published study in theoretically interesting directions. The replication–extension process is ongoing, with future replication–extension studies attempting to replicate and extend the results of previous replication–extension studies.

To illustrate the use of VC models in a hypothetical replication–extension study, consider the Loftus and Palmer (1974) study of memory errors in eyewitness testimony, in which college students were shown a 4-s video of a car crash and one week later were asked if they remembered seeing any broken glass (in fact, there was no broken glass in the video). Immediately after viewing the video, 50 participants were asked to estimate the speed of the cars "when


they smashed each other," and 50 other participants were asked to estimate the speed of the cars "when they hit each other." The results of the Loftus–Palmer study are given in Table 8. Using the data from the Loftus–Palmer study and computing Equation 4 gives a 95% confidence interval for the population risk ratio of [1.02, 4.92]. This interval has a lower limit that is very close to 1 (which would suggest there is virtually no effect of smash versus hit wording on memory error), and an attorney could use this result to cast doubt on any courtroom claim of eyewitness fallibility.

Given the imprecise estimate of the population risk ratio in the Loftus–Palmer study, a replication–extension study would provide valuable information. Suppose a researcher replicated the Loftus–Palmer study using one sample of 300 college students who were asked about the broken glass one week after viewing the video. The researcher also extended the Loftus–Palmer study by using a second sample of 300 college students who were asked about the broken glass 12 weeks after viewing the video. Hypothetical data for this replication–extension study are given in Table 8. The three samples in Table 8 are used to estimate three population risk ratios: θLP1, θNEW1, and θNEW12.

The first analysis involves a replication of the Loftus–Palmer results. We obtained a 95% confidence interval for θLP1/θNEW1 of [0.26, 1.75] by exponentiating Equation 14 with coefficients c1 = 1, c2 = −1, and c3 = 0. This confidence interval includes 1, which is evidence of replication. However, the confidence interval is wide, and we might consider this to be evidence of "weak" effect-size replication (see Bonett, 2009). Replications of psychological studies will often provide only weak effect-size replication evidence because psychological studies tend to be underpowered, which leads to wide replication confidence intervals.
Assuming replication, a confidence interval for the geometric mean of θLP1 and θNEW1 is obtained using coefficients of c1 = .5, c2 = .5, and c3 = 0. The 95% confidence interval is [1.69, 4.36] and suggests that the proportion of all members of the two study populations who would have a false memory for broken glass would be 1.69 to 4.36 times greater under smash than hit conditions. This result is substantially more precise than the Loftus–Palmer result. To assess the effect of 1-week versus 12-week recall, a confidence interval for the geometric mean of θLP1 and θNEW1 divided by θNEW12 is obtained using coefficients c1 = .5, c2 = .5, and c3 = −1. The 95% confidence interval for this effect is [1.06, 4.47], which suggests that the population risk ratio is greater in the 1-week recall condition than in the 12-week recall condition. Future studies could attempt to replicate the 1-week versus 12-week recall effect and further extend the study to examine other possible moderator variables, such as age of participant and crash scene viewing time. Each study could provide important replication evidence, more precise confidence intervals for effect sizes, and greater external validity.

Table 8
Hypothetical Data for a Replication and Extension of the Loftus–Palmer Study

Sample                                 f11    f12    f21    f22
Loftus–Palmer (1974): 1-week recall     16      7     34     43
New: 1-week recall                      50     15    100    135
New: 12-week recall                     25     20    125    130

Note. fij is defined in Table 2.

Pashler and Wagenmakers (2012) discuss the replicability crisis in psychology and call for actions to make psychological science more reliable and reputable. Replication–extension designs with VC data analysis methods are one way to achieve this goal.
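The contrasts above can be sketched as linear contrasts of independent log risk ratios. The sketch below uses a plain Wald variance as an approximation to Equation 14 (the paper's VC method uses a better small-sample variance estimator), so its limits differ modestly from the reported intervals:

```python
import math

def log_rr_and_var(f11, f12, f21, f22):
    """Estimated log risk ratio and its large-sample variance for one 2x2 table."""
    n1, n2 = f11 + f21, f12 + f22
    p1, p2 = f11 / n1, f12 / n2
    return math.log(p1 / p2), (1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2)

def contrast_ci(tables, c, z=1.959963984540054):
    """Exponentiated 95% CI for a linear contrast sum_k c_k * ln(theta_k)
    of independent risk ratios. A Wald approximation to the VC method of
    Equation 14, not the paper's exact formula."""
    ev = [log_rr_and_var(*t) for t in tables]
    est = sum(ck * e for ck, (e, _) in zip(c, ev))
    se = math.sqrt(sum(ck ** 2 * v for ck, (_, v) in zip(c, ev)))
    return math.exp(est - z * se), math.exp(est + z * se)

tables = [(16, 7, 34, 43),     # Loftus-Palmer (1974), 1-week recall
          (50, 15, 100, 135),  # new study, 1-week recall
          (25, 20, 125, 130)]  # new study, 12-week recall

ci_rep = contrast_ci(tables, (1, -1, 0))      # replication check; paper: [0.26, 1.75]
ci_gm = contrast_ci(tables, (.5, .5, 0))      # geometric mean; paper: [1.69, 4.36]
ci_recall = contrast_ci(tables, (.5, .5, -1)) # 1- vs. 12-week; paper: [1.06, 4.47]
print(ci_rep, ci_gm, ci_recall)
```

Because the contrast coefficients enter the variance as squares, combining two studies with c = (.5, .5, 0) roughly halves the contrast variance, which is why the geometric-mean interval is so much narrower than the single-study interval.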

Example 3

Alogna et al. (2014) conducted a multilab replication study of the "verbal overshadowing effect" reported by Schooler and Engstler-Schooler (1990), in which participants viewed a video of a simulated bank robbery and then attempted to pick the robber out of a photo line-up of suspects. Before viewing the line-up, participants in the experimental condition wrote a brief description of the robber, and participants in the control condition wrote a list of U.S. states and capitals. The proportions of participants (undergraduate students ages 18 to 25) who correctly identified the robber in the experimental condition and the control condition were reported for m = 31 different labs. Alogna et al. analyzed the 31 two-group risk differences using an RC model. In their multilab replication study, all 31 labs were required to obtain at least 50 participants per group, and the results from all 31 labs were included in the meta-analysis. It is reasonable to assume that the 31 labs are a random sample from a superpopulation of labs, and with a risk difference effect size, τ has a clear and important interpretation. Our concern is that Alogna et al. did not describe the superpopulation of labs to which the RC results would apply (which is the main reason for using an RC model instead of a VC model), the superpopulation normality assumption was not assessed, the correlation between effect sizes and weights was not assessed, and only a point estimate of τ was reported. Risk ratios are usually more meaningful than risk differences in situations in which there is substantial variability in the denominator proportion values across studies. Sample proportions in the experimental condition reported in Table 1 of Alogna et al. ranged from .23 to .73 across the 31 studies, and the risk ratio is arguably the preferred measure of effect size in this multilab replication study. Using the sample sizes and sample proportions reported in Table 1 of Alogna et al. (2014) and applying Equation 13, we obtain a 95% confidence interval for the geometric mean risk ratio of [1.01, 1.16]. This result indicates that the population proportion of a correct identification would be 1.01 to 1.16 times greater if all undergraduate students in the 31 study populations were asked to write the names of states and capitals instead of writing a description of the robber after viewing the video. The lower limit of the confidence interval suggests that the verbal overshadowing effect could be almost nonexistent, and the upper limit suggests that the effect is at best very small. Alogna et al. arrived at a similar conclusion.
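The geometric-mean idea behind Equation 13 can be sketched as follows. The lab counts here are hypothetical (the Alogna et al. Table 1 data are not reproduced in this article), and a generic Wald variance stands in for the paper's exact estimator:

```python
import math

def geometric_mean_rr_ci(studies, z=1.959963984540054):
    """Wald-style 95% CI for the geometric mean of m independent risk ratios.
    Each study is (events1, n1, events2, n2). A generic sketch of the VC
    idea behind Equation 13, not the paper's exact formula."""
    m = len(studies)
    log_rrs, variances = [], []
    for e1, n1, e2, n2 in studies:
        p1, p2 = e1 / n1, e2 / n2
        log_rrs.append(math.log(p1 / p2))
        variances.append((1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2))
    est = sum(log_rrs) / m              # average log risk ratio
    se = math.sqrt(sum(variances)) / m  # SE of the average
    return math.exp(est - z * se), math.exp(est + z * se)

# Hypothetical multilab counts: (correct IDs control, n, correct IDs experimental, n)
labs = [(40, 60, 30, 60), (35, 55, 33, 55), (45, 70, 38, 70)]
lo, hi = geometric_mean_rr_ci(labs)
print(round(lo, 3), round(hi, 3))
```

With these illustrative counts the interval hugs 1 at the lower end, the same pattern of a small, borderline effect that the [1.01, 1.16] result shows for the actual 31-lab data.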

Summary

Meta-analysis has become a standard data-analytic tool in psychology, and Cumming (2012) stated, "I think meta-analysis is so central to how science should be done that every introductory statistics course should include an encounter with it" (p. 181). The popularity of meta-analysis has grown, especially during the last decade, with CC or RC models used in virtually every published meta-analysis. We agree with Borenstein et al. (2009, pp. 83–84) that the CC model is rarely applicable and that statistical inference with the RC model is problematic if m is small, but we would go further and say that an RC model should be considered only if m is large enough to assess its assumptions and the researcher is able to provide a description of the superpopulation to which the RC results might apply. Claggett, Xie, and Tian (2012) take a more extreme view and claim that the RC model assumptions are "difficult, if not impossible to verify" (p. 1). The VC meta-analysis model (which can be defined for any type of parameter estimator) is a sensible and trustworthy alternative to the CC and RC meta-analysis models. In a VC model, the traditional weighted average and WLS estimators are not appropriate, and alternative inferential methods are required. VC inferential methods are now available for Pearson, Spearman, and partial correlations (Bonett, 2008), standardized and unstandardized mean differences for independent-samples and paired-samples designs (Bonett, 2009), Cronbach's alpha reliabilities (Bonett, 2010), and risk differences for independent-samples and paired-samples designs (Bonett & Price, 2014), in addition to the VC inferential methods for odds ratios and risk ratios presented here. All of the proposed VC inferential methods have been shown to have good small-sample performance characteristics. To assist researchers in applying the new VC meta-analysis methods for odds ratios and risk ratios, SAS PROC IML programs and R functions are provided as online supplementary materials.

References

Agresti, A. (2002). Categorical data analysis. New York, NY: Wiley. http://dx.doi.org/10.1002/0471249688
Agresti, A., & Caffo, B. (2000). Simple and effective confidence intervals for proportions and difference of proportions result from adding two successes and two failures. The American Statistician, 54, 280–288.
Aguinis, H., Gottfredson, R. K., & Wright, T. A. (2011). Best-practice recommendations for estimating interaction effects using meta-analysis. Journal of Organizational Behavior, 32, 1033–1043. http://dx.doi.org/10.1002/job.719
Alogna, V. K., Attaya, M. K., Aucoin, P., Bahnik, S., Birch, S., Birt, A. R., . . . Zwaan, R. A. (2014). Registered replication report: Schooler and Engstler-Schooler (1990). Perspectives on Psychological Science, 9, 556–578.
Bonett, D. G. (2007). Transforming odds ratios into correlations for meta-analytic research. American Psychologist, 62, 254–255. http://dx.doi.org/10.1037/0003-066X.62.3.254
Bonett, D. G. (2008). Meta-analytic interval estimation for bivariate correlations. Psychological Methods, 13, 173–181. http://dx.doi.org/10.1037/a0012868
Bonett, D. G. (2009). Meta-analytic confidence intervals for standardized and unstandardized mean differences. Psychological Methods, 14, 225–238. http://dx.doi.org/10.1037/a0016619
Bonett, D. G. (2010). Varying coefficient meta-analytic methods for alpha reliability. Psychological Methods, 15, 368–385. http://dx.doi.org/10.1037/a0020142
Bonett, D. G. (2012). Replication–extension studies. Current Directions in Psychological Science, 21, 409–412. http://dx.doi.org/10.1177/0963721412459512
Bonett, D. G., & Price, R. M. (2005). Inferential methods for the tetrachoric correlation coefficient. Journal of Educational and Behavioral Statistics, 30, 213–225. http://dx.doi.org/10.3102/10769986030002213
Bonett, D. G., & Price, R. M. (2007). Statistical inference for generalized Yule coefficients in 2 × 2 contingency tables. Sociological Methods & Research, 35, 429–446. http://dx.doi.org/10.1177/0049124106292358
Bonett, D. G., & Price, R. M. (2014). Meta-analysis methods for risk differences. British Journal of Mathematical and Statistical Psychology, 67, 371–387. http://dx.doi.org/10.1111/bmsp.12024


Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-analysis. New York, NY: Wiley. http://dx.doi.org/10.1002/9780470743386
Borman, G. D., & Grigg, J. A. (2009). Visual and narrative interpretation. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (2nd ed., pp. 497–519). New York, NY: Russell Sage Foundation.
Claggett, B., Xie, M., & Tian, L. (2012). Nonparametric inference for meta-analysis with fixed, unknown, study-specific parameters (Harvard University Biostatistics Working Paper 154). Cambridge, MA: Harvard University. Retrieved from http://biostats.bepress.com/cgi/viewcontent.cgi?article=1162&context=harvardbiostat
Cox, D. R. (2006). Combination of data. In S. Kotz, C. B. Read, N. Balakrishnan, & B. Vidakovic (Eds.), Encyclopedia of statistical sciences (2nd ed., pp. 1074–1081). Hoboken, NJ: Wiley. http://dx.doi.org/10.1002/0471667196.ess0377.pub2
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York, NY: Routledge.
DerSimonian, R., & Laird, N. (1986). Meta-analysis in clinical trials. Controlled Clinical Trials, 7, 177–188. http://dx.doi.org/10.1016/0197-2456(86)90046-2
Ferguson, C. J., & Brannick, M. T. (2012). Publication bias in psychological science: Prevalence, methods for identifying and controlling, and implications for the use of meta-analyses. Psychological Methods, 17, 120–128. http://dx.doi.org/10.1037/a0024445
Fleiss, J. L., & Berlin, J. A. (2009). Effect sizes for dichotomous data. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (2nd ed., pp. 237–253). New York, NY: Russell Sage Foundation.
Fleiss, J. L., Levin, B., & Paik, M. C. (2003). Statistical methods for rates and proportions (3rd ed.). New York, NY: Wiley. http://dx.doi.org/10.1002/0471445428
Gart, J. J. (1966). Alternate analysis of contingency tables. Journal of the Royal Statistical Society: Series B (Methodological), 28, 164–179.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. San Diego, CA: Academic Press.
Higgins, J. P., & Thompson, S. G. (2002). Quantifying heterogeneity in a meta-analysis. Statistics in Medicine, 21, 1539–1558. http://dx.doi.org/10.1002/sim.1186
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.
Janky, D. G. (2003). Sometimes pooling for analysis of variance hypothesis tests: A review and study of a split-plot model. The American Statistician, 54, 269–279.
*Kane, J. M., Carson, W. H., Saha, A. R., McQuade, R. D., Ingenito, G. G., Zimbroff, D. L., & Ali, M. W. (2002). Efficacy and safety of aripiprazole and haloperidol versus placebo in patients with schizophrenia and schizoaffective disorder. The Journal of Clinical Psychiatry, 63, 763–771. http://dx.doi.org/10.4088/JCP.v63n0903
*Keck, P. E., Jr., Calabrese, J. R., McQuade, R. D., Carson, W. H., Carlson, B. X., Rollin, L. M., . . . Aripiprazole Study Group. (2006). A randomized, double-blind, placebo-controlled 26-week trial of aripiprazole in recently manic patients with bipolar I disorder. The Journal of Clinical Psychiatry, 67, 626–637. http://dx.doi.org/10.4088/JCP.v67n0414
*Keck, P. E., Jr., Marcus, R., Tourkodimitris, S., Mirza, A., Liebeskind, A., Saha, A., . . . Aripiprazole Study Group. (2003). A placebo-controlled, double-blind study of the efficacy and safety of aripiprazole in patients with acute bipolar disorder. American Journal of Psychiatry, 160, 1651–1658. http://dx.doi.org/10.1176/appi.ajp.160.9.1651
Koopman, P. A. R. (1984). Confidence limits for the ratio of two binomial proportions. Biometrics, 40, 513–517. http://dx.doi.org/10.2307/2531405
Kraemer, K. A., & Bonett, D. G. (2010, August). Confidence intervals for the amount of heterogeneity in meta-analysis: Small-sample properties and robustness to non-normality. Paper presented at the Joint Statistical Meeting of the American Statistical Association, Vancouver, BC.
Levine, T. R., Asada, K. J., & Carpenter, C. (2009). Sample size and effect size are negatively correlated in meta-analysis: Evidence and implications of a publication bias against non-significant findings. Communication Monographs, 76, 286–302. http://dx.doi.org/10.1080/03637750903074685
*Loftus, E. F., & Palmer, J. C. (1974). Reconstruction of automobile destruction: An example of the interaction between language and memory. Journal of Verbal Learning and Verbal Behavior, 13, 585–589. http://dx.doi.org/10.1016/S0022-5371(74)80011-3
*Marder, S. R., McQuade, R. D., Stock, E., Kaplita, S., Marcus, R., Safferman, A. Z., . . . Iwamoto, T. (2003). Aripiprazole in the treatment of schizophrenia: Safety and tolerability in short-term, placebo-controlled trials. Schizophrenia Research, 61, 123–136. http://dx.doi.org/10.1016/S0920-9964(03)00050-1
Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing data: A model comparison approach (2nd ed.). Mahwah, NJ: Erlbaum.
Moineddin, R., Matheson, F. I., & Glazier, R. H. (2007). A simulation study of sample size for multilevel logistic regression models. BMC Medical Research Methodology, 7, 34. http://dx.doi.org/10.1186/1471-2288-7-34
National Research Council. (1992). Combining information: Statistical issues and opportunities for research. Washington, DC: National Academy Press.
Nelder, J. A. (1976). Letter to the editor. The American Statistician, 30, 103.
Newcombe, R. G. (2013). Confidence intervals for proportions and related measures of effect size. Boca Raton, FL: CRC Press.
Overton, R. C. (1998). A comparison of fixed-effects and mixed (random-effects) models for meta-analysis tests of moderator variable effects. Psychological Methods, 3, 354–379. http://dx.doi.org/10.1037/1082-989X.3.3.354
Pashler, H., & Wagenmakers, E.-J. (2012). Editors' introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7, 528–530.
Pedhazur, E. J. (1997). Multiple regression in behavioral research (3rd ed.). New York, NY: Harcourt Brace.



Potkin, S. G., Saha, A. R., Kujawa, M. J., Carson, W. H., Ali, M., Stock, E., . . . Marder, S. R. (2003). Aripiprazole, an antipsychotic with a novel mechanism of action, and risperidone vs placebo in patients with schizophrenia and schizoaffective disorder. Archives of General Psychiatry, 60, 681–690. http://dx.doi.org/10.1001/archpsyc.60.7.681
Price, R. M., & Bonett, D. G. (2008). Confidence intervals for a ratio of two independent binomial proportions. Statistics in Medicine, 27, 5497–5508. http://dx.doi.org/10.1002/sim.3376
Raudenbush, S. W. (2009). Analyzing effect sizes: Random-effects models. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (2nd ed., pp. 295–315). New York, NY: Russell Sage Foundation.
Scheffé, H. (1959). The analysis of variance. New York, NY: Wiley.
Schmidt, C. O., & Kohlmann, T. (2008). When to use the odds ratio or the relative risk? International Journal of Public Health, 53, 165–167. http://dx.doi.org/10.1007/s00038-008-7068-3
Schmidt, F. L., Oh, I. S., & Hayes, T. L. (2009). Fixed- versus random-effects models in meta-analysis: Model properties and an empirical comparison of differences in results. British Journal of Mathematical and Statistical Psychology, 62, 97–128. http://dx.doi.org/10.1348/000711007X255327
Schooler, J. W., & Engstler-Schooler, T. Y. (1990). Verbal overshadowing of visual memories: Some things are better left unsaid. Cognitive Psychology, 22, 36–71. http://dx.doi.org/10.1016/0010-0285(90)90003-M
Schultz, R. (2004). Meta-analysis: A comparison of approaches. Toronto, Ontario, Canada: Hogrefe & Huber.
Shuster, J. J., Hatton, R. C., Hendeles, L., & Winterstein, A. G. (2010). Reply to discussion of "Empirical vs. natural weighting in random effects meta-analysis." Statistics in Medicine, 29, 1272–1281. http://dx.doi.org/10.1002/sim.3842
Wellek, S. (2000). Testing statistical hypotheses of equivalence. Boca Raton, FL: Chapman & Hall.
Woolf, B. (1955). On estimating the relation between blood groups and disease. Annals of Human Genetics, 19, 251–253.
Zar, J. H. (1999). Biostatistical analysis (4th ed.). Upper Saddle River, NJ: Prentice Hall.

Appendix Large-Sample Expected Value of Inverse-Variance Weighted Estimator in Random Coefficient Models





The inverse-variance weighted estimator of the superpopulation mean $\vartheta^*$ in the RC model is

$$\tilde{\vartheta}^* = \sum_{k=1}^{m} \hat{w}_k^* \hat{\vartheta}_k \Big/ \sum_{k=1}^{m} \hat{w}_k^*,$$

where $\hat{w}_k^*$ is a consistent estimator of $1/[\mathrm{var}(\hat{\vartheta}_k) + \tau^2]$, $\hat{\vartheta}_k$ is a consistent estimator of $\vartheta_k$, $\hat{\tau}^2$ is a consistent estimator of $\tau^2$, and all estimators are independent across the $m$ studies. Assume a large sample size within each study so that $E(\hat{\vartheta}_k) \cong \vartheta_k$, and assume a random sample of effect sizes from a superpopulation with mean $\vartheta^*$ so that $E(\vartheta_k) = \vartheta^*$. It follows that $E(\hat{\vartheta}_k) \cong \vartheta^*$.

A first-order Taylor approximation to $E(\tilde{\vartheta}^*)$ is $E\big(\sum_{k=1}^{m} \hat{w}_k^* \hat{\vartheta}_k\big) \big/ E\big(\sum_{k=1}^{m} \hat{w}_k^*\big)$. Given $\mathrm{cov}(\hat{w}_k^*, \hat{\vartheta}_k) = E(\hat{w}_k^* \hat{\vartheta}_k) - E(\hat{w}_k^*)E(\hat{\vartheta}_k) \cong E(\hat{w}_k^* \hat{\vartheta}_k) - E(\hat{w}_k^*)\vartheta^*$, it follows that $E(\hat{w}_k^* \hat{\vartheta}_k) \cong \mathrm{cov}(\hat{w}_k^*, \hat{\vartheta}_k) + E(\hat{w}_k^*)\vartheta^*$. Assume $\mathrm{cov}(\hat{w}_k^*, \hat{\vartheta}_k) = \rho_{w\vartheta}\sigma_w\tau$ in large samples for all $k$, where $\rho_{w\vartheta}$ is the superpopulation correlation between the true weights and the study population effect sizes, $\sigma_w$ is the superpopulation standard deviation of the true weights, and $\tau$ is the superpopulation standard deviation of the study population effect sizes. It follows that

$$E(\tilde{\vartheta}^*) \cong \Big[m\rho_{w\vartheta}\sigma_w\tau + \vartheta^* E\Big(\sum_{k=1}^{m} \hat{w}_k^*\Big)\Big] \Big/ E\Big(\sum_{k=1}^{m} \hat{w}_k^*\Big) = \vartheta^* + m\rho_{w\vartheta}\sigma_w\tau \Big/ \sum_{k=1}^{m} E(\hat{w}_k^*),$$

which is equal to $\vartheta^*$ if $\rho_{w\vartheta} = 0$ (the weights are uncorrelated with the effect sizes), or $\sigma_w = 0$ (all weights are equal), or $\tau = 0$ (all effect sizes are equal). Note that $\tilde{\vartheta}^*$ has positive first-order bias if $\rho_{w\vartheta} > 0$ and negative first-order bias if $\rho_{w\vartheta} < 0$.
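The first-order bias derived above is easy to demonstrate by simulation. In the sketch below, the superpopulation values and the weight-generating rule are arbitrary illustrations, chosen only so that the weights are positively correlated with the effect sizes:

```python
import random

random.seed(42)
# Hypothetical superpopulation values, chosen only for illustration
vartheta_star = 0.5   # superpopulation mean effect size
tau = 0.3             # superpopulation SD of study effect sizes
m = 50_000            # many studies, so the first-order bias is clearly visible

num = den = 0.0
for _ in range(m):
    theta_k = random.gauss(vartheta_star, tau)
    # Make the weight positively correlated with the effect size
    # (rho_w_theta > 0), then truncate so all weights stay positive.
    w_k = max(0.05, 1.0 + 2.0 * (theta_k - vartheta_star) + random.gauss(0.0, 0.1))
    num += w_k * theta_k
    den += w_k

weighted_mean = num / den
print(weighted_mean)  # noticeably larger than 0.5: positive first-order bias
```

Reversing the sign of the correlation (e.g., replacing 2.0 with -2.0) drives the weighted mean below 0.5, matching the negative-bias case of the derivation.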

Received January 9, 2014
Revision received October 12, 2014
Accepted December 1, 2014
