pharmacoepidemiology and drug safety 2014; 23: 802–811 Published online 29 January 2014 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/pds.3574

ORIGINAL REPORT

Propensity score balance measures in pharmacoepidemiology: a simulation study† M. Sanni Ali1, Rolf H. H. Groenwold1,2, Wiebe R. Pestman3, Svetlana V. Belitser1, Kit C. B. Roes2, Arno W. Hoes2, Anthonius de Boer1 and Olaf H. Klungel1,2* 1

Division of Pharmacoepidemiology and Clinical Pharmacology, Utrecht Institute for Pharmaceutical Sciences, University of Utrecht, Utrecht, The Netherlands 2 Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands 3 Departamento de Matematica, Universidade Federal de Santa Catarina, Florianopolis, Santa Catarina, Brazil

ABSTRACT Background Conditional on the propensity score (PS), treated and untreated subjects have similar distribution of observed baseline characteristics when the PS model is appropriately specified. The performance of several PS balance measures in assessing the balance of covariates achieved by a specific PS model and selecting the optimal PS model was evaluated in simulation studies. However, these studies involved only normally distributed covariates. Comparisons in binary or mixed covariate distributions with rare outcomes, typical of pharmacoepidemiologic settings, are scarce. Methods Monte Carlo simulations were performed to examine the performance of different balance measures in terms of selecting an optimal PS model, thus reduction in bias. The balance of covariates between treatment groups was assessed using the absolute standardized difference, the Kolmogorov–Smirnov distance, the Lévy distance, and the overlapping coefficient. Spearman’s correlation coefficient (r) between each of these balance measures and bias were calculated. Results In large sample sizes (n ≥ 1000), all balance measures were similarly correlated with bias (r ranging between 0.50 and 0.68) irrespective of the treatment effect’s strength and frequency of the outcome. In smaller sample sizes with mixed binary and continuous covariate distributions, these correlations were low for all balance measures (r ranging between 0.11 and 0.43), except for the absolute standardized difference (r = 0.51). Conclusions The absolute standardized difference, which is an easy-to-calculate balance measure, displayed consistently better performance across different simulation scenarios. Therefore, it should be the balance measure of choice for measuring and reporting the amount of balance reached, and for selecting the final PS model. Copyright © 2014 John Wiley & Sons, Ltd. key words—balance measure; confounding; model selection; propensity score; pharmacoepidemiology Received 28 January 2013; Revised 19 December 2013; Accepted 20 December 2013

INTRODUCTION In observational studies, attributing causality remains a major challenge due to the nonrandom assignment of treatment leading to systematic differences in covariate distributions between treated and untreated subjects. This may bias the estimates of treatment effect unless adequate statistical adjustments are made. *Correspondence to: O. H. Klungel, Division of Pharmacoepidemiology and Clinical Pharmacology, Utrecht Institute for Pharmaceutical Sciences, University of Utrecht, PO Box 80082, 3508 TB Utrecht, The Netherlands. E-mail: O.H. [email protected] † The abstract has been presented in the 28th ICPE, 23–26 August 2012, Barcelona, and published in Pharmacoepidemiology and Drug Safety, 2012; 21: (Suppl. 3): 36–37 DOI: 10.1002 as “Evaluating propensity score balance measures in typical pharmacoepidemiologic settings”.

Copyright © 2014 John Wiley & Sons, Ltd.

Propensity score (PS) methods are a set of commonly used tools to correct for such differences.1 The PS is a balancing score: Conditional on the PS, prognostic patient characteristics tend to be balanced between treatment groups; that is, treatment received is independent of measured patient characteristics.1,2 A critical step in PS analysis, although not often well described or even ignored, is an iterative derivation of adequately specified PS model and assessing the balance reached on measured covariates between treatment groups.1,3–5 A check for appropriateness of such a PS model is the degree to which measured baseline characteristics are balanced between treated and untreated groups.3 Selection of variables based on their association with the outcome has been recommended for constructing an optimal PS model in terms of bias-

propensity score balance measures

variance trade-off.6–10 On the other hand, detailed knowledge of potential confounders and their association with both the treatment and the outcome of interest is usually incomplete in real practice.11 Hence, investigators may be forced to select variables for a PS model in data-driven ways.7 However, suboptimal procedures such as stepwise variable selection algorithms or c-statistics should not be chosen since such standard modelbuilding tools designed to create good predictive models of the exposure will not necessarily lead to “optimal” PS models and good balance.5–7 Others proposed balance measures to assess the balance of covariate distribution between treatment groups reached by a specific PS model and to select an optimal PS model (in terms of achieved balance) among a variety of possible models.12,13 Different balance measures including the standardized difference,3,4,14 the Kolmogorov–Smirnov distance (KSD),15,16 and the Lévy distance (L)16,17 have been evaluated in simulation studies. Mean-based measures such as the standardized difference were highly correlated with bias in effect estimate irrespective of sample size, whereas KSD and L were highly correlated with bias only in large sample sizes.12,13 The overlapping coefficient (OVL),12,18 another balance measure, showed lower correlation with bias in large sample sizes.12 However, these simulation studies considered only a limited range of scenarios that might not be generalizable to pharmacoepidemiologic studies, typically dealing with categorical patient characteristics and rare outcomes. Therefore, we conducted a simulation study to assess the performance of different balance measures for PS models in various research settings typical of pharmacoepidemiologic research. METHODS Propensity score balance measures Different PS balance measures have been proposed. In this study, we evaluated four balance measures: the standardized difference, OVL, KSD, and L. These balance measures were chosen because they are characteristics of the sample and are not part of hypothesis testing unlike other statistics such as the t-statistics.3,12,14 For a more detailed explanation of these balance measures, we refer to the literature.3,4,12,14,19 Here, we only describe the key characteristics of these measures. The absolute standardized difference (S) for binary confounders is the absolute difference in proportions of the confounder between treated and untreated subjects standardized to the variation in the confounding variable (i.e., the standard deviation). It has a minimum value of 0 (“perfect” balance) but no maximum value. For continuous confounders, S is the absolute Copyright © 2014 John Wiley & Sons, Ltd.

803

difference in the means of confounders in treated and untreated subjects standardized to the variation. The overlapping coefficient, also called proportion of similar response, is the amount of overlap in the density of confounder distributions for treated and untreated subjects. Overlapping coefficient can range between 0 (no overlap, i.e., “perfect” imbalance) and 1 (complete overlap, i.e., “perfect” balance). In case of continuous confounders, we used nonparametric kernel density to estimate OVL.12,20,21 For dichotomous confounders, we calculated the proportion of overlap.13 The Kolmogorov-Smirnov distance is defined as the maximum of all vertical distances between two cumulative distribution functions of a certain confounder for treated and untreated subjects expressed as relative frequencies.16 This distance can range between 0 (“perfect” balance) and 1 (complete imbalance). The Lévy distance, a variant of KSD, is defined intuitively as the side length of the largest of the squares that can be inscribed between the curves of the cumulative distribution functions of a certain confounder for treated and untreated subjects with sides parallel to the coordinate axes.12,16 Unlike KSD, L takes into account the horizontal distance; hence, it is not scale invariant meaning that it is sensitive to the unit of measurement of the confounder. Therefore, when different confounders are included in the estimation, one should standardize the confounders before these measures can be compared.12 L can range between 0 (“perfect” balance) and 1 (complete imbalance). Simulation We performed Monte Carlo simulations to assess the performance of balance measures to select a PS model with respect to the amount of bias in the effect estimate. In addition, we determined how inclusion of different covariates, their interactions, and higherorder terms in a PS model affects this bias. For the simulations, we used the framework of Austin and Belitser et al.12,19,22,23 Data generation We simulated three different scenarios. In each scenario, we generated nine covariates: four confounders (X1, X2, X4, and X5), that is, variables that are related to both treatment and outcome; two covariates (X7 and X8) that were related only to the treatment but not to the outcome; two covariates (X3 and X6) that were related only to the outcome but not to the treatment; and one covariate (X9) that was neither related to the treatment nor to the outcome (Figure 1). In Scenario 1, all the nine covariates (X1–X9) were binary. In Scenario 2, six of the Pharmacoepidemiology and Drug Safety, 2014; 23: 802–811 DOI: 10.1002/pds

804

m. s. ali et al.

Figure 1. Causal diagram showing variables related to both treatment and outcome, that is, confounders of the treatment-outcome pair (X1, X2, X4, and X5), variables related only to the treatment (X7 and X8), variables related to the outcome (X3 and X6), and variable neither related only to the treatment nor the outcome (X9)

covariates (X3 and X5–X9) were binary, and the other three (X1, X2, and X4) were continuous following a gamma distribution (scale parameter = 2 and shape parameter = 1) to capture skewedly distributed covariates in our simulation. In Scenario 3, six of the covariates (X3 and X5–X9) were binary, and the other three (X1, X2, and X4) were continuous following a standard normal distribution. To generate the binary covariates, we used a Bernoulli distribution with varying success rate: 0.30, 0.35, 0.40, 0.45, 0.50, and 0.60, which means that the prevalence of the binary covariates was on average 30%, 35%, 40%, 45%, 50%, and 60% in each of the simulated populations, respectively. Next, we randomly generated binary treatment status (t) for each of the n subjects (n = sample size) according to the following logistic model, including quadratic and interactions terms, logit ðpt Þ ¼ α0 þα1 x1 þ α2 x2 þ α3 x4 þ α4 x5 þ α5 x7 þα6 x8 þ α7 x2 x4 þ α8 x2 x7 þ α9 x7 x8 (1) 2 2 þα10 x4 x5 þ α11 x1 þ α12 x7 Binary outcome status was then generated for each of the n subjects conditional on treatment status (t), and six of the covariates (X1–X6), their interactions, and higherorder terms using the following Poisson model,  log py ¼ β0 þ γ*tþβ1 x1 þ β2 x2 þ β3 x3 þ β4 x4 (2) þβ5 x5 þ β6 x6 þ β7 x2 x4 þ β8 x3 x5 þβ9 x3 x6 þ β10 x4 x5 þ β11 x21 þ β12 x26 The dichotomous treatment and outcome were generated using a Bernoulli distribution with subject-specific probabilities of treatment assignment and outcome status, Copyright © 2014 John Wiley & Sons, Ltd.

π = pt and π = py, respectively. Strong and medium associations between covariates and treatment or outcome were induced by varying the values of the regression coefficients for the main terms, that is, α1–α6 and β1–β6 in the treatment (1) and outcome (2) models, respectively. When a variable was binary, the values of α1–α6 and β1–β6 were set to log(3) and log(1.5) to induce strong and medium associations between a covariate and the treatment or outcome, respectively. This means that strong and moderate associations between a given variable and either treatment or outcome was considered when the presence of that variable independently increased the risk of either treatment or outcome by a factor of 3 and 1.5, respectively. For continuous variables, the values of α1–α6 and β1–β6 were set to log(2) and log (1.35) to induce strong and medium associations between a covariate and the treatment or outcome, respectively. In this case, strong and moderate associations with treatment or outcome were defined as an increase in the risk of treatment and outcome by 2 and 1.35, respectively, per standard deviation increase in the covariate. No association was considered between a given covariate and treatment or outcome when the presence of a covariate did not have independent impact on the risk of treatment or outcome. In addition, the following values for the regression coefficients of interaction and square terms, that is, α7–α12 and β7–β12 were used: log(1.2) for α7, α10, β7, and β10; log(1.4) for α8, α11, β8, and β11; and log(1.6) for α9, α12, β9, and β12. Marginal treatment effects (RR) of 1 (γ = log1), 1.5 (γ = log1.5), and 2 (γ = log2) were considered, and sample sizes were varied for different simulations (n = 500, 1000, 2000, and 5000). The values of β0 and α0 were set such that approximately half of the subjects were treated and an event occurred in approximately 15%, 10%, or 5% of the untreated individuals. We used 500 replications in each scenario. Propensity score analysis In each simulated dataset, 40 different PS models containing different sets of covariates and higher-order and interaction terms were considered. Although it was possible to construct more than 40 combinations of covariates or terms (hence, PS models) using the nine covariates and their interaction or square terms, we chose only the 40 PS models in such a way that the PS models considered included models containing only confounders, confounders and treatment-related covariates, confounders and outcome-related covariates, all covariates, and other combination of covariates (Appendix 1 in the supporting information). For all the PS models, the PS was estimated using ordinary logistic regression. For each of Pharmacoepidemiology and Drug Safety, 2014; 23: 802–811 DOI: 10.1002/pds

805

propensity score balance measures

the 40 PS models, the effect of the treatment on the outcome was estimated using the estimated PS as a covariate in a Poisson regression model (providing a risk ratio). We included the PS as a covariate in the regression model, because it is one of the most commonly used approaches among the different PS methods.24 Bias was calculated as the difference between the true marginal effect (used in data generation process, that is, RR = 1.0, RR = 1.5, or RR = 2.0) and the effect estimate obtained from a Poisson regression model with the PS as a covariate. Data were then stratified into quintiles of the PS, and the four balance measures were calculated within each of the PS quintiles for each covariate separately. The balance measures were then pooled: first across the nine covariates within each stratum and then across the strata into a single (mean and median) measure. Because 40 different models were applied, in each simulated dataset, 40 values for each balance measure were estimated and 40 values for the bias were obtained. Next, Spearman’s correlation coefficient (r) was used to assess the association between the different balance measures and the bias in the 40 PS models and averaged over the total number of simulations (500) for a given sample size (n). Finally, the four different balance measures were used for PS model selection. Hence, for each balance measure, the PS model that gave the “best” balance in each simulation (i.e., the maximum of OVL and the minimum of KSD, S, and L) was selected. The performance of different balance measures in terms of bias reduction (i.e., selecting the optimal PS model; hence, the least biased treatment effect estimate) and the frequency of selection of a specific or related PS models were evaluated. All analyses were performed in R, version 2.15.0.25 R code for the simulation is available from the authors upon request.

RESULTS When all the covariates were binary (Scenario 1) and the prevalence of outcome was 10%, the three balance measures OVL, S, and L had equal Spearman’s correlation coefficient, and all balance measures showed a strong correlation with bias (e.g., for sample size of 2000 and RR = 2, r = 0.54, 0.54, 0.54, and 0.53 for OVL, KSD, L, and S, respectively) (Table 1). In addition, the correlations for the means of the balance measures with bias were higher than for the medians of the balance measures. In Scenario 2, when three of the confounders (X1, X2, and X4) were continuous and gamma distributed and the other covariates binary, S and OVL showed a strong correlation with bias irrespective of sample size. However, in this scenario, KSD and L were weakly correlated in the case of small sample sizes (e.g., for sample size of 500 and RR = 2, r = 0.43, 0.18, 0.11, and 0.51 for OVL, KSD, L, and S, respectively). Compared to the other balance measures, S was consistently strongly correlated with bias irrespective of covariate distributions. With a lower prevalence of outcome (e.g., 5% and 1%), these correlations were smaller in all scenarios. When the continuous covariates followed a standard normal distribution (Scenario 3), instead of a gamma distribution, the correlations were similar to Scenario 2 (data not shown). When balance measures were used to select the optimal PS model, estimates of the RR were similar for all balance measures in Scenario 1 (all covariates were binary). The variation in estimated RR decreased with increasing sample size (Table 2). In small sample sizes and in the case of mixed binary and continuous covariates (Scenarios 2 and 3), the PS model selected using the means of OVL and S yielded the least biased

Table 1. Average Spearman’s correlation between bias and mean and median of different balance measures for different sample sizes (n) (true treatment-outcome RR = 2.00) X1, X2 and X4 were gamma distributed and others binary covariates

All covariates were binary Balance measure

500

1000

2000

5000

500

1000

2000

5000

0.50 0.31

0.53 0.40

0.53 0.43

0.54 0.45

0.43 0.03

0.54 0.12

0.63 0.26

0.65 0.34

Kolmogorov–Smirnov distance Mean Median

0.49 0.30

0.53 0.39

0.53 0.43

0.54 0.44

0.18 0.17

0.43 0.03

0.59 0.20

0.64 0.34

Lévy distance Mean Median

0.49 0.30

0.53 0.39

0.53 0.43

0.54 0.44

0.11 0.01

0.32 0.13

0.52 0.29

0.57 0.37

Standardized mean difference Mean Median

0.45 0.33

0.49 0.36

0.50 0.41

0.52 0.43

0.51 0.26

0.60 0.36

0.66 0.47

0.68 0.50

Overlapping coefficient Mean Median

Sample sizes (n) = 500, 1000, 2000, and 5000 (each 500 replications).

Copyright © 2014 John Wiley & Sons, Ltd.

Pharmacoepidemiology and Drug Safety, 2014; 23: 802–811 DOI: 10.1002/pds

806

m. s. ali et al.

Table 2. Median (and interquartile range) of treatment–outcome risk ratio when using means and medians of different balance measures for PS model selection (true treatment-outcome RR = 2.00) X1, X2, and X4 were gamma distributed and others binary covariates

All covariates were binary Balance measures

n = 500

Overlapping coefficient Mean Median

n = 1000

n = 2000

n = 5000

n = 500

n = 1000

n = 2000

n = 5000

1.94 (0.73) 1.84 (0.86)

1.94 (0.54) 1.87 (0.56)

1.96 (0.35) 1.90 (0.41)

1.95 (0.20) 1.92 (0.22)

2.06 (1.29) 1.19 (1.04)

1.97 (0.73) 1.01 (0.85)

2.03 (0.54) 1.05 (0.92)

1.98 (0.35) 0.95 (0.90)

Kolmogorov–Smirnov distance Mean Median

1.93 (0.74) 1.81 (0.84)

1.94 (0.54) 1.87 (0.56)

1.96 (0.35) 1.90 (0.42)

1.95 (0.20) 1.92 (0.23)

1.86 (1.33) 0.95 (0.68)

1.92 (0.70) 0.91 (0.55)

2.03 (0.54) 0.97 (0.82)

1.98 (0.34) 0.94 (0.83)

Levy distance Mean Median

1.93 (0.74) 1.81 (0.84)

1.94 (0.54) 1.87 (0.56)

1.96 (0.35) 1.90 (0.42)

1.95 (0.20) 1.92 (0.23)

1.82 (1.40) 1.12 (0.95)

1.90 (0.72) 1.01 (0.83)

2.03 (0.54) 1.07 (0.90)

1.98 (0.34) 1.02 (0.99)

Standardized mean difference Mean Median

1.94 (0.76) 1.76 (0.89)

1.94 (0.54) 1.86 (0.58)

1.96 (0.34) 1.89 (0.39)

1.95 (0.20) 1.92 (0.22)

2.07 (1.28) 1.77 (1.36)

1.98 (0.74) 1.74 (0.89)

2.03 (0.54) 1.87 (0.66)

1.98 (0.34) 1.90 (0.53)

Sample sizes (n) = 500, 1000, 2000, and 5000 (each 500 replications). PS, propensity score.

estimates. However, treatment effect estimates using the mean and medians of the three balance measures (OVL, KSD, and L) were more divergent compared to S. Table 3 shows the frequency (as percentage) of selection of related PS models (the number of times PS models were chosen) by different balance measures in 500 replications for different sample sizes. The PS models that were most often selected were the ones containing variables either related to the treatment and/or the outcome. PS models containing many of the covariates were most often selected, which is expected since balance measures were estimated over all nine covariates irrespective of their relation with the treatment/outcome. Given that all confounders were included in the PS model, PS models that include confounders and outcome-related variables were less often selected than PS models including confounders and only treatment-related variables. The frequency of selection of the “true” confounder model (the PS model containing only variables related to both the treatment and the outcome) was low compared with the full PS model (i.e., the PS model containing variables related to either the treatment or the outcome). Omission of confounders from the PS model resulted in the most biased estimates. Including variables related only to the treatment but not to the outcome (X7 and X8) amplified the bias when observed confounders were omitted from the PS model (i.e., in the presence of unmeasured confounding) (PS models 37 and 39 in Appendix 1; Table 4). PS models containing confounders and only outcome-related covariates provided the least biased effect estimates among the PS models considered with reduced variation. This was the case irrespective of sample size particularly when all Copyright © 2014 John Wiley & Sons, Ltd.

covariates were binary (Scenario 1). The true PS model (the PS model used to generate treatment status, i.e., PS Model 1) and related PS models (Models 2–7, Appendix 1 in the supporting information) yielded slightly biased estimates with increased variation (Table 4). Including interaction and higher-order terms reduced the bias of the effect estimates. Furthermore, including variable that was related neither to the treatment nor to the outcome did not affect the effect estimate as well as the balance of confounders. When the treatment effect was set to RR = 1 or RR = 1.5, results were similar in all scenarios (results for RR = 1.5 and RR = 2 with mixed gamma and binary distributions are included in the supporting information). DISCUSSION This simulation study extends the findings from our previous study12 on measures of balance in scenarios with continuous normal distributed covariates and frequent outcomes to scenarios that include binary covariates, non-normal distributed continuous covariates and rare outcomes. Such scenarios are typical of pharmacoepidemiologic studies. We further confirmed the usefulness of balance measures as tools not only for measuring and reporting the amount of balance reached by a given PS model but also for selecting an optimal PS model in terms of bias and variance of the treatment effect. Absolute standardized difference displayed better performance in terms of bias reduction and variance among the different balance measures compared in a wide range of sample sizes and covariate distributions. The performance of KSD and the L was dependent on sample size and covariate patterns. Pharmacoepidemiology and Drug Safety, 2014; 23: 802–811 DOI: 10.1002/pds

Copyright © 2014 John Wiley & Sons, Ltd.

28.4 18.6

31.2 18.8

31.2 18.8

35.8 15.4

Overlapping coefficient Mean Median

Kolmogorov–Smirnov distance Mean Median

Lévy distance Mean Median

Standardized mean difference Mean Median 30.8 18.4

27.6 18.0

27.6 17.8

26.8 18.0

1000

25.4 24.0

20.8 23.6

20.8 23.6

20.6 23.4

2000

22.0 24.4

14.8 23.4

14.8 24.4

14.8 23.2

5000

0.0 7.6

0.0 9.4

0.0 9.4

0.6 8.8

500

0.2 8.0

0 9.0

0 9.0

0 8.8

1000

0 5.6

0 5.8

0 5.8

0 5.8

2000

Sample size (n)

0 1.6

0 1.4

0 1.4

0 1.4

5000

Selection of PS Models 8–10†

53.4 18.6

60.2 22.0

60.2 22.0

62.2 24.6

500

63.0 31.0

65.8 37.6

65.8 37.8

66.4 38.4

1000

67.2 40.8

72.4 43.8

72.4 43.8

72.6 44.0

2000

Sample size (n)

70.0 51.2

79.6 53.4

79.6 53.4

79.6 53.2

5000

Selection of PS Models 11, 12, and 14–18‡

Total number of replications = 500 (Percentages were calculated from the 500 possible selections in each sample size, n). PS, propensity score. *True PS (Model 1) and related PS models, PS models containing confounders and treatment-related covariates (Models 1–7). † PS models including confounders and covariates related only to the outcome (Models 8–10). ‡ PS models containing covariates related either to the treatment or the outcome (Models 11–18). § The rest of PS models (Models 19-40). (For specific PS models, Refer Appendix 1 in the supporting information).

500

Sample size (n)

Selection of PS Models 1–7*

10.8 58.4

8.6 49.8

8.6 49.8

8.8 48.0

6.0 42.6

6.6 35.4

6.6 35.4

6.8 34.8

1000

7.4 29.6

6.8 26.8

6.8 26.8

6.8 26.8

2000

Sample size (n)

8.0 22.8

5.6 21.8

5.6 20.8

5.6 19.2

5000

Selection of the rest PS models§

500

Selection frequency of related PS models in (percentage) by various balance measures (all covariates were binary and true treatment–outcome RR = 2.00)

Balance measures

Table 3.

propensity score balance measures 807

Pharmacoepidemiology and Drug Safety, 2014; 23: 802–811 DOI: 10.1002/pds

808

m. s. ali et al.

Table 4. Median (and interquartile range) of treatment–outcome risk ratio when using the 40 different PS models (all covariates were binary, true treatment– outcome RR = 2.00, 500 replication for each sample size, n) Variables in the model

Model no.

n = 500

n = 1000

n = 2000

n = 5000

Confounders and treatment-related variables

1 2 3 4 5 6 7

1.93 (0.75) 1.94 (0.76) 1.93 (0.75) 1.93 (0.75) 1.93 (0.75) 1.93 (0.76) 1.93 (0.74)

1.95 (0.53) 1.95 (0.54) 1.95 (0.53) 1.94 (0.53) 1.94 (0.53) 1.94 (0.53) 1.96 (0.54)

1.95 (0.38) 1.95 (0.38) 1.95 (0.38) 1.94 (0.37) 1.94 (0.37) 1.95 (0.37) 1.95 (0.37)

1.95 (0.19) 1.96 (0.20) 1.95 (0.19) 1.95 (0.20) 1.95 (0.20) 1.95 (0.21) 1.95 (0.10)

Confounders and outcome-related variables

8 9 10

2.01 (0.72) 2.01 (0.71) 2.01 (0.70)

1.97 (0.48) 1.97 (0.48) 1.97 (0.50)

2.00 (0.29) 2.00 (0.29) 2.00 (0.36)

2.00 (0.20) 2.00 (0.20) 2.00 (0.19)

Treatment and/or outcome-related variables

11 12 13 14 15 16 17 18

1.97 (0.74) 1.97 (0.70) 1.93 (0.73) 1.98 (0.74) 1.97 (0.74) 1.97 (0.74) 1.94 (0.73) 1.97 (0.74)

1.94 (0.52) 1.95 (0.52) 1.94 (0.52) 1.95 (0.52) 1.94 (0.53) 1.94 (0.52) 1.94 (0.51) 1.94 (0.53)

1.97 (0.36) 1.97 (0.36) 1.94 (0.37) 1.96 (0.37) 1.96 (0.33) 1.97 (0.37) 1.94 (0.20) 1.96 (0.20)

1.95 (0.21) 1.95 (0.21) 1.95 (0.21) 1.95 (0.20) 1.95 (0.21) 1.95 (0.21) 1.95 (0.21) 1.95 (0.21)

Only confounding variables

19 20 21 22 23 24

1.99 (0.73) 2.00 (0.73) 1.99 (0.73) 1.99 (0.74) 1.99 (0.74) 1.98 (0.72)

1.98 (0.51) 1.99 (0.52) 1.98 (0.51) 1.98 (0.50) 1.98 (0.50) 1.97 (0.51)

2.00 (0.36) 1.99 (0.35) 2.00 (0.18) 1.99 (0.36) 1.99 (0.34) 1.99 (0.35)

2.00 (0.20) 2.00 (0.20) 2.00 (0.20) 1.99 (0.20) 1.99 (0.20) 1.99 (0.20)

Three/more of the confounders excluded from the model

25 26 27 28 29 30 31 32

1.15 (0.46) 1.10 (0.46) 1.10 (0.46) 1.12 (0.43) 1.11 (0.43) 1.11 (0.43) 1.07 (0.41) 0.99 (0.40)

1.15 (0.30) 1.10 (0.31) 1.10 (0.31) 1.12 (0.28) 1.12 (0.28) 1.11 (0.29) 1.08 (0.27) 0.98 (0.26)

1.15 (0.35) 1.10 (0.35) 1.10 (0.35) 1.12 (0.22) 1.12 (0.31) 1.08 (0.35) 1.12 (0.18) 0.99 (0.18)

1.15 (0.14) 1.10 (0.13) 1.10 (0.13) 1.12 (0.13) 1.12 (0.13) 1.12 (0.13) 1.08 (0.12) 0.99 (0.11)

The rest of the PS models*

33 34 35 36 37 38 39 40

1.94 (0.76) 1.94 (0.76) 1.70 (0.66) 1.76 (0.65) 1.71 (0.63) 1.81 (0.67) 1.64 (0.63) 1.12 (0.46)

1.94 (0.54) 1.94 (0.54) 1.73 (0.44) 1.77 (0.46) 1.71 (0.42) 1.78 (0. 46) 1.64 (0.43) 1.11 (0. 31)

1.95 (0.35) 1.95 (0.18) 1.71 (0.19) 1.78 (0.30) 1.71 (0.35) 1.78 (0.22) 1.64 (0.34) 1.12 (0.35)

1.96 (0.20) 1.96 (0.20) 1.71 (0.18) 1.78 (0.19) 1.70 (0.19) 1.79 (0.16) 1.63 (0.17) 1.13 (0.14)

*Propensity score (PS) Models 33–40. For specific PS models, Refer Appendix 1 in the supporting information.

Including variables unrelated to the treatment but related to the outcome in the PS model reduced the variation of the effect estimate, whereas including variables related only to the treatment in the PS model increased the variation of the effect estimate. The overlapping coefficient, the KolmogorovSmirnov, and the Lévy distances gave similar results with respect to bias reduction when all the covariates followed a binomial distribution and were comparable with S irrespective of sample size and the strength of the treatment effect. With continuous or mixed binary and continuous covariate distributions, the performance of OVL, KSD, and L was poor in small sample sizes, which is in line with our previous study,12 whereas S Copyright © 2014 John Wiley & Sons, Ltd.

performed consistently better in all scenarios. When calculating the parametric S, the difference in means of the continuous covariate between treated groups within strata of the PS was standardized to the pooled variation, which might explain why the mean and median of S (hence, the bias when using means and median of S to select PS models) were relatively close to each other, compared to the means and medians of OVL, KSD, and L. Indeed, OVL, KSD, and L are nonparametric, and we noted in our simulations that with mixed binary and continuous covariates, the values of the three balance measures were more divergent than those of S. Some extreme imbalances in covariate distribution (as outliers) that might exist in some of the PS strata (e.g., in the Pharmacoepidemiology and Drug Safety, 2014; 23: 802–811 DOI: 10.1002/pds

propensity score balance measures

first or fifth quintiles of the PS) could be captured by the mean rather than the median, which might explain why using the means of the different balance measures resulted in less biased estimates than their medians. Although our primary aim was to evaluate balance measures in PS model selection in terms of bias, we have also looked at variable selection for inclusion in PS model based on their association with the treatment and/or the outcome. Including variables related only to the outcome but not to the treatment in the PS model reduced the variance of the effect estimate, which is consistent with previous studies.6,8,9,12,26,27 In addition, including variables related only to the treatment but not to the outcome (instrumental variables, IVs) or variables strongly related to the treatment but only weakly related to the outcome (near-IVs) amplified the bias, which is in line with the findings of Myers et al.8 and Pearl.9,27 However, the bias amplification in our simulation was mild particularly when covariates were mixed binary and continuous. Possible explanations could be that, first, the presence of extreme unobserved confounding and a very strong instrument is a condition for observable increase in bias.8,28 However, in our simulation, we did not consider extreme unobserved confounding and the treatment was binary, further constraining the extent to which an IV or a near-IV determines treatment status. No such constraints exist with a continuous exposure, and the IV association with exposure could be larger.8,28 Second, our settings incorporated a large number of covariates, higher-order and interaction terms in the PS model. Pearl9 has pointed out that inclusion of IVs or near-IVs in case of multiple covariates will always amplify confounding bias (if such exists) in linear models but bias in nonlinear systems (such as the one used in our simulations) could be either amplified or attenuated. On the other hand, in the absence of unobserved confounder, in our study, consistently least biased treatment effect estimates were obtained using PS models including confounders and variables related only to the outcome compared to PS models that include confounders and variables related only to treatment (instrumental variables) only when all the covariates were binary. When balance measures were used to select a PS model, the PS models that were most often selected included covariates related only to the outcome in addition to confounders and treatment-related covariates. Although the choice of functional forms and possible interactions among covariates for the PS model may not seem straightforward, the use of balance measures in combination with prior clinical knowledge about the causal network that links exposure, outcome, and potential confounders is worth consideration. When sample sizes are relatively small, using balance measures for model Copyright © 2014 John Wiley & Sons, Ltd.

809

selection may result in a large number of calculations per covariate and stratum of the PS and may provide unreliable estimates in small sample size.12 However, when matching on the PS is used, the number of calculations will be greatly reduced (involving only balance of covariate and/or interaction terms across matched groups), and hence, the reliability of estimates could be improved. The current study captures a wide range of scenarios in relation to covariate distributions, inclusion of higher-order and/or interaction terms, and lower incidence of outcome allowing for the generalizability of study results to empirical, pharmacoepidemiologic studies. Possible limitations of the current study could be that, first, we gave equal weights to all irrespective of their association with treatment and outcome in calculating the balance measures. Weighted balance measures based on covariate’s association with the outcome have been discussed by previously,12 but the choice of the weights is not straightforward and, therefore, not included in our simulation. Second, we based our balance calculations on all covariates but not on interaction or higher-order terms to simplify computations and interpretations although Rubin suggested to check balance on covariates as well as their interactions terms included in the PS model, when relevant.29 On the otherhand, several researchers advocate including outcome-related covariates in the PS model to reduce bias and improve precision of treatment effect estimate;9,10,26,27 hence, balance check only on outcome related covariates seems important. Third, we calculated the balance measures per covariate within each PS strata before pooling into summary measures, while applying the covariate adjustment using the PS in regression model (and not stratification on the PS) to estimate treatment effect. Although pooling across strata might lead to an underestimation of the amount of residual imbalance, its impact on the treatment effect estimate does not seem to be significant. This can be seen in our simulation by looking at the bias that resulted when we use summary balance measures to select PS models. Standardized differences on the logit of the PS as well as variance ratios using residuals after regressing each covariate on the logit of the PS have been suggested elsewhere,29,30 but we did not include this approach in our simulation. Last but not least, covariate adjustment using PS in the regression model that we used in estimating treatment effect rely on the assumption of linear relationship between the estimated PS and the outcome.29 On the otherhand, noncollapsibility,31,32 a phenomenon when estimating the treatment–outcome association using odds ratio (OR) or hazard ratio (HR) by including covariates in the adjustment model that the conditional OR or HR does not equal the marginal Pharmacoepidemiology and Drug Safety, 2014; 23: 802–811 DOI: 10.1002/pds

810

m. s. ali et al.

OR or HR in the presence of non-null treatment effect, even in the absence of confounding and effect modification, is not an issue in our study since a Poisson model was employed for both data generation and treatment effect estimation. The use of machine learning techniques such as neural networks and classification and regression trees33 was proposed for PS estimation, yet applications of these methods in PS analysis of pharmacoepidemiologic data is very limited. However, the methods were shown to have better performance in terms of bias reduction and more consistent 95%CI coverage than logistic regression approaches, particularly under conditions of both nonadditivity and nonlinearity.34,35 Hence, evaluation of the different balance measures when using logistic regression and machine learning techniques for PS estimation is a possible next step. In summary, we recommend that emphasis should be given to correct specification of the PS model and suggest absolute standardized difference as a balance measure of choice for measuring, reporting, and selecting the final PS model. Not only is the absolute standardized difference easy to estimate, independent of sample size, and scale invariant, but it is also a very familiar and reasonably wellunderstood tool compared to other balance measures. CONFLICT OF INTEREST Olaf Klungel received unrestricted funding for pharmacoepidemiologic research from the Dutch private–public funded Top Institute Pharma. KEY POINTS

• •

• •

Checking balance of covariates between treatment groups is a critical step in PS analysis. Balance measures for PS analysis should be characteristic of a specific sample and should not be influenced by sample size. Examples include absolute standardized difference, overlapping coeficcient, Kolmogorov-Smirnov distance, and Lévy diatance. In simulations, Absolute standardized difference showed strong correlation with bias irrespective of covariate patterns and sample size. Balance measures are helpful not only for quantifying and reporting the amount of balance reached by a PS model but also for selecting the optimal PS model.

ETHICS STATEMENT The authors state that no ethical approval was needed. Copyright © 2014 John Wiley & Sons, Ltd.

ACKNOWLEDGEMENTS The research leading to these results was conducted as part of the Pharmacoepidemiological Research on Outcomes of Therapeutics by a European Consortium (PROTECT Consortium) that is a public–private partnership coordinated by the European Medicines Agency. The PROTECT project has received support from the Innovative Medicine Initiative Joint Undertaking (www.imi.europa.eu) under grant agreement no. 115004, resources of which are composed of financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and EFPIA companies’ in kind contribution. In the context of the IMI Joint Undertaking (IMI JU), the Department of Pharmacoepidemiology, Utrecht University, also received a direct financial contribution from Pfizer. The views expressed are those of the authors only and not of their respective institution or company. REFERENCES 1. Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. J Am Stat Assoc 1984; 79: 516–524. 2. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70(1): 41–55. 3. Austin PC. Goodness-of-fit diagnostics for the propensity score model when estimating treatment effects using covariate adjustment with the propensity score. Pharmacoepidemiol Drug Saf 2008; 17(12): 1202–1217. 4. Austin PC. Assessing balance in measured baseline covariates when using manyto-one matching on the propensity score. Pharmacoepidemiol Drug Saf 2008; 17(12): 1218–1225. 5. Weitzen S, Lapane KL, Toledano AY, et al. Principles for modelling propensity scores in medical research: a systematic literature review. Pharmacoepidemiol Drug Saf 2004; 13(12): 841–853. 6. Rubin DB, Thomas N. Matching using estimated propensity scores: relating theory to practice. Biometrics 1996; 52(1): 249–264. 7. Brookhart MA, Schneeweiss S, Rothman KJ, et al. Variable selection for propensity score models. Am J Epidemiol 2006; 163(12): 1149–1156. 8. Bhattacharya J, Vogt WB. Do instrumental variables belong in propensity scores? NBER Technical Working Paper No. 343. The National Bureau of Economic Research: Cambridge, MA, 2007. 9. Myers JA, Rassen JA, Gagne JJ, et al. Effects of adjusting for instrumental variables on bias and precision of effect estimates. Am J Epidemiol 2011; 174(11): 1213–1222. 10. Pearl J. Invited commentary: understanding bias amplification. Am J Epidemiol 2011; 174(11): 1223–1227. 11. Pearl J. Causal diagrams for empirical research. Biometrika 1995; 82(4): 669–688. 12. Belitser SV, Martens EP, Pestman WR, et al. Measuring balance and model selection in propensity score methods. Pharmacoepidemiol Drug Saf 2011; 20(11): 1115–1129. 13. Groenwold RHH, Vries F, Boer A, et al. Balance measures for propensity score methods: a clinical example on beta-agonist use and the risk of myocardial infarction. Pharmacoepidemiol Drug Saf 2011; 20(11): 1130–1137. 14. Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med 2009; 28(25): 3083–3107. 15. Stephens MA. Use of the Kolmogorov-Smirnov, Cramér-von Mises and related statistics without extensive tables. J Royal Stat Soc Series B (Methodological) 1970; 32(1): 115–122. 16. Pestman WR. Mathematical Statistics (2nd edn). Walter de Gruyter: Berlin, 2009. 17. Lévy P. Théorie de l’addition des variables aléatoires. Gauthier-Villars: Paris, 1937. 18. Stine RA, Heyse JF. Non-parametric estimates of overlap. Stat Med 2001; 20(2): 215–236.

Pharmacoepidemiology and Drug Safety, 2014; 23: 802–811 DOI: 10.1002/pds

propensity score balance measures 19. Austin PC, Grootendorst P, Anderson GM. A comparison of the ability of different propensity score models to balance measured variables between treated and untreated subjects: a Monte Carlo study. Stat Med 2007; 26(4): 734–753. 20. Silverman BW. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC: London, 1986. 21. Wand MP, Jones MC. Kernel Smoothing. Chapman & Hall/CRC: London, 1995. 22. Austin PC. The performance of different propensity score methods for estimating marginal odds ratios. Stat Med 2007; 26(16): 3078–3094. 23. Austin PC, Grootendorst P, Normand SLT, et al. Conditioning on the propensity score can result in biased estimation of common measures of treatment effect: a Monte Carlo study. Stat Med 2007; 26(4): 754–768. 24. Shah BR, Laupacis A, Hux JE, et al. Propensity score methods gave similar results to traditional regression modeling in observational studies: a systematic review. J Clin Epidemiol 2005; 58(6): 550–559. 25. R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing: Vienna, Austria, 2012. ISBN 3-900051-07-0, URL http://www.R-project.org/ 26. Patrick AR, Schneeweiss S, Brookhart MA, et al. The implications of propensity score variable selection strategies in pharmacoepidemiology: an empirical illustration. Pharmacoepidemiol Drug Saf 2011; 20(6): 551–559. 27. Pearl J. On a class of bias-amplifying variables that endanger effect estimates. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI 2010), Grünwald P, Spirtes P (eds). Association for Uncertainty in Artificial Intelligence: Corvallis, OR, 2010; 425–432. 28. Brookhart MA, Stürmer T, Glynn RJ, et al. Confounding control in healthcare database research: challenges and potential approaches. Med Care 2010; 48(6 suppl): S114–S120.

Copyright © 2014 John Wiley & Sons, Ltd.

811

29. Rubin DB. Using propensity scores to help design observational studies: application to the tobacco litigation. Health Serv Outcomes Res Methodol 2001; 2(3): 169–188. 30. Rubin DB. The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Stat Med 2007; 26(1): 20–36. 31. Greenland S, Robins JM, Pearl J. Confounding and collapsibility in causal inference. Stat Sci 1999; 14(1): 29–46. 32. Gail MH, Wieand S, Piantadosi S. Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika 1984; 71(3): 431–444. 33. Westreich D, Lessler J, Funk MJ. Propensity score estimation: neural networks, support vector machines, decision trees (CART), and metaclassifiers as alternatives to logistic regression. J Clin Epidemiol 2010; 63(8): 826–833. 34. Lee BK, Lessler J, Stuart EA. Weight trimming and propensity score weighting. PLoS One 2011; 6(3): E18174. 35. Setoguchi S, Schneeweiss S, Brookhart MA, et al. Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiol Drug Saf 2008; 17(6): 546–555.

SUPPORTING INFORMATION Additional supporting information may be found in the online version of this article at the publisher’s web site.

Pharmacoepidemiology and Drug Safety, 2014; 23: 802–811 DOI: 10.1002/pds

Propensity score balance measures in pharmacoepidemiology: a simulation study.

Conditional on the propensity score (PS), treated and untreated subjects have similar distribution of observed baseline characteristics when the PS mo...
177KB Sizes 2 Downloads 0 Views