Article

Causal effect estimation strategies in a longitudinal study with complex time-varying confounders: A tutorial

Statistical Methods in Medical Research 0(0) 1–19. © The Author(s) 2014. Reprints and permissions: sagepub.co.uk/journalsPermissions.nav. DOI: 10.1177/0962280214545529. smm.sagepub.com

Bart JA Mertens,1 S Datta,2 R Brand1 and W Peul3

Abstract

The Dutch Sciatica Trial represents a longitudinal study with complex time-varying confounders as patients with poorer health conditions (e.g. more severe pain) are more likely to opt for surgery, which, in turn, may affect future outcomes (pain severity). A straightforward classical as-treated comparison at the end point would lead to biased estimation of the surgery effect. We present several strategies of causal treatment effect estimation that might be applicable for analyzing such data. These include an inverse probability of treatment weighted regression analysis, a marginal weighted analysis, an unweighted regression analysis, and several propensity score-based approaches. In addition, we demonstrate how to evaluate these approaches in a thorough simulation study where we generate various realistic complex confounding patterns akin to the sciatica study.

Keywords

sciatica, causal, inverse probability weighting, propensity

1 Introduction

The past 20 years have seen tremendous methodological development for so-called causal effect estimation in observational studies. Among these methods, propensity scores and the closely related inverse probability weighting via marginal structural models (MSMs) have gained widespread popularity. Several texts present example analyses or detail philosophical and algorithmic aspects.1–3 A full appreciation of the statistical properties of propensity- or weighting-based approaches has, however, yet to emerge. Of particular concern are recent publications suggesting that propensity scores may inherit the familiar overfitting problems in high dimensions.2,4

1 Department of Medical Statistics, Leiden University Medical Center, Leiden, The Netherlands
2 Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY 40292, USA
3 Department of Neurosurgery, Leiden University Medical Center, Leiden, The Netherlands

Corresponding author: Bart JA Mertens, Department of Medical Statistics, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, The Netherlands. Email: [email protected]



Papers considering the data-analytic implications of these potential problems for the implementation of propensity score-based effect estimation are lacking. Until recently, the traditional wisdom has been to fit a "large" propensity model to calibrate a balancing score and then to condition subsequent analyses on this score as if it were known a priori.

In this tutorial, we consider the problem of estimating the causal effect of treatment in a longitudinal study where patients can opt to receive treatment (e.g. surgery) not at baseline but at any time during the course of the study. This decision is likely to be affected by their health condition (e.g. pain indices) at the previous time point, which creates a complex time-varying confounder. We illustrate a number of strategies for estimating the true (i.e. causal) treatment effect. The primary methodology illustrated here is an adaptation of the inverse probability weighted MSMs used to assess treatment effect in an AIDS cohort study by Hernán et al.1 In addition, we demonstrate three other complementary/competing strategies, namely a marginal weighted analysis (t-test), a propensity score-based regression analysis, and a propensity score-based stratification approach. Instead of describing these methods in abstract terms, we illustrate them on a complex temporal dataset arising from the Dutch Sciatica Trial.5,6 Sciatica is a term used to denote a collection of pain symptoms, which may have several causes related to compression or irritation of the spinal nerve roots leading into the sciatic nerve. In this tutorial, we use data from the control group only, as these data present the challenge of isolating the true effect of surgery in the presence of time-varying confounders. An extensive and realistic simulation study that mimics such longitudinal designs and evaluates the effect estimates obtained from these methods is included as supplementary material.

The rest of the article is organized as follows. Section 2 gives a general overview of the Dutch Sciatica Trial data, the findings reported in the original publications, and the connection with the causal inference literature. Section 3.1 provides a general description of the data and some preliminary exploratory analysis. The methodology of inverse probability of treatment weighting-based analysis to estimate the causal effect of treatment is introduced and illustrated using the "wait-and-see" arm of the trial data in Section 3.2. We provide a fairly detailed coverage of this primary methodology, including (a) treatment probability prediction modeling, (b) the construction of the weights, (c) the inspection/exploration of the weights, and (d) a weighted regression analysis. The marginal weighted analysis and the propensity score-based analyses are explained in Sections 3.2.3 and 3.3, respectively. The supplementary files demonstrate how to evaluate the treatment effect estimates obtained from these methods in a simulation experiment. Our simulation study mimics the "wait-and-see" arm of the Dutch trial, where the future treatment decision is likely to be affected by past and present outcome values. The article ends with a detailed discussion, mostly focusing on these methodologies in the context of the Dutch Sciatica Trial data.

2 The Leiden Sciatica Trial data and causal inference

The Dutch Sciatica Trial5 was set up to improve the options for informed treatment choice for patients suffering from sciatica. It randomly assigned 283 patients to either early surgery (141 patients) or a conservative wait-and-see policy (control group, 142 patients). Patients in the wait-and-see group were given the option to have surgery at a later date after inclusion. The follow-up period consists of two years, with one visit at baseline and nine subsequent scheduled monitoring visits (at 2, 4, 8, 12, 26, 38, 52, 78, and 104 weeks), which we denote with sequence numbers 0 to 9 such that the first corresponds to randomization (the baseline observation) and visit 9 to the final observation. Surgery was offered by study design to patients with persistent symptoms six months after randomization.


The trial recorded four primary outcome measures at each follow-up visit. These consist of the Roland disability score, a Likert self-assessment score, and two visual analogue scales (VAS1 and VAS2), measuring leg and back pain, respectively. The Roland disability score is defined on a discrete grid ranging from 0 to 23, with increasing values denoting deteriorating conditions. The Likert score is a patient self-assessment, and in that sense perhaps less objective, defined on a coarser grid of integer values ranging from 1 to 7, with larger values denoting poorer conditions. The primary analysis in the original paper by Peul et al.5 consists of an intention-to-treat comparison between the two randomization groups using repeated measures analysis of variance for the above four outcomes separately across the first year of follow-up (up to the seventh follow-up visit). Results were presented as estimated differences with 95% confidence intervals in their second table. The trial was powered to detect at least a three-point difference on the Roland score with 90% power at a two-tailed significance level of 0.05.

The sciatica trial is designed to evaluate a treatment policy, of which surgery is a component, rather than the surgery itself. Nevertheless, questions such as "should a specific patient receive the surgery" or "does application of surgery affect outcome" are relevant to medical practice. These are, however, subtly distinct questions concerned with the evaluation of the surgery effect itself, as opposed to the comparison between an "early" and a conservative approach to the management of the patient's condition considered in the sciatica trial. This article applies causal effect estimation methods to the assessment of the surgery effect in the sciatica trial.

In an influential collection of papers, Hernán et al.1,7–9 describe the application of MSMs to assess treatment effect in the Multicenter AIDS Cohort Study (MACS). The MACS study formally has the same confounding structure on treatment assignment as our sciatica trial within the wait-and-see group. Patients may be assigned treatment (zidovudine and prophylaxis) at any point in time during follow-up, based on past CD4 counts and Pneumocystis carinii pneumonia history. The treatment objective is of course to influence the future development of CD4 count and the probability of pneumonia recurrence, which in turn affects life expectancy. Likewise, in the sciatica trial, past levels of pain (e.g. as measured by the Roland and Likert scores) are likely predictors of the decision to take surgery within the wait-and-see group. On the other hand, past treatment history (such as the decision to take surgery) may affect future pain levels, or is at least intended to do so. Such complex confounding, where predictors affect both a time-dependent, non-randomized treatment and the outcome, while the treatment intervention may itself affect future outcomes, renders standard methods of analysis, such as regression adjustment or stratification, invalid. An alternative view of the sciatica trial control group is as a sequentially randomized trial with treatment probabilities that depend on past covariate and treatment history. In such trials, the MSMs demonstrated in the above-referenced papers by Hernán et al. can give consistent estimates of treatment effect, provided there is no unmeasured confounding of the treatment decision.
Therefore, although MSMs cannot shed light on the issue of the optimal timing of treatment, we decided to carry out a secondary analysis of the control group data using MSMs in the first instance, both as a verification of the methodology in a new but relevant data-analytic application and for the interest of the treatment estimates themselves. MSMs cannot be applied to the early-surgery arm of the sciatica trial because they are based on re-weighting and hence require the so-called positivity assumption, which states that the treatment probability should be strictly between 0 and 1. The latter assumption is clearly invalid by design for the early-surgery group. We return to this and related considerations on the exclusion of the early-surgery data from the analysis of the surgery effect in the discussion. We next describe some basic exploratory data analysis for the wait-and-see trial arm and then introduce the application of MSMs, before a subsequent confirmatory analysis using propensity scores.


To simplify the presentation, we consider re-weighted estimation of the surgery effect within the sciatica trial control group on outcomes at the ninth visit (two-year follow-up).

3 Weighted re-analysis of the sciatica data

3.1 Data and preliminary exploratory analysis

About 40% of the patients (56) in the wait-and-see group opted for surgery at some point during the follow-up period. Roughly one quarter of these (15) had treatment within two months of randomization and the remainder by approximately one year at most, four patients excepted (at 1.14, 1.25, 1.26, and 1.38 years). Figure 1 displays two distinct Kaplan–Meier estimates of surgery-free survival, versus scheduled visit numbers (top right) and versus actual visit times (bottom right). The left graph displays box plots of the actual timings (in years) for each of the nine follow-up visits after baseline, as well as a box plot of the times of treatment for those patients taking treatment (indicated by "T" on the bottom axis). For the Kaplan–Meier calculations, failure is defined as surgery (treatment) and censoring as absence of surgery up to the ninth visit. There is substantial variation in the actual visit timings, though visit times are almost completely separated between the scheduled visits, and the variability of visit times increases with follow-up. The Kaplan–Meier plot on actual failure timings indicates a steady rate of treatments within the first half year, starting immediately after baseline randomization and gently leveling off after that period. Re-calculating the Kaplan–Meier estimate based on scheduled visit numbers reveals a jump at the fifth visit.

Figure 1. The left plot shows box plots of actual times of follow-up visits relative to baseline versus visit numbers (1–9) and a box plot of treatment times (T). The two right-side plots are Kaplan–Meier survival estimates based on visit numbers (top right) and actual visit times (bottom right), respectively.
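As an illustration, the Kaplan–Meier curves of Figure 1 could be computed along the following lines. This is a minimal sketch, not the authors' code; the file name and column names (event_time, had_surgery) are assumptions about a per-patient summary of the wait-and-see arm.

```python
import pandas as pd
from lifelines import KaplanMeierFitter

# Hypothetical per-patient summary of the wait-and-see arm:
#   event_time  - years from randomization to surgery, or to the ninth visit if no surgery
#   had_surgery - 1 if the patient opted for surgery during follow-up, 0 if censored
patients = pd.read_csv("waitandsee_patients.csv")

kmf = KaplanMeierFitter()
# "Failure" is surgery; patients without surgery are censored at the ninth visit.
kmf.fit(durations=patients["event_time"],
        event_observed=patients["had_surgery"],
        label="surgery-free")

print(kmf.survival_function_.head())  # estimated probability of remaining surgery-free
kmf.plot_survival_function()          # analogous to the bottom-right panel of Figure 1
```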


This jump could partly be due to the increased accrual time for treatments caused by the lengthening of the interval between visits at this point, and could also be due to the study protocol actively offering treatment at this stage if it had not yet been received. Figure 2 shows (jittered) plots of the VAS1, VAS2, Roland, and Likert scores at scheduled visits prior to surgery (or end of follow-up if no surgery), distinguishing patients eventually receiving surgery (blue circles) from those never opting for treatment (red crosses). Besides the clear preference for treatment within the first year among those having treatment, the plots reveal distinctly higher Roland and Likert scores for treated individuals within the first half year as compared with individuals who never receive treatment. The same behaviour applies to the VAS1 and VAS2 scores. Another feature is that the level of the observed scores does not seem to reduce with time for individuals who never get treated, though it must be kept in mind that we have removed the longitudinal within-patient profiles from the data to improve the readability of the graphs. The plots demonstrate the confounding problem discussed in Section 2: patients in poor condition (high levels of pain) may preferentially opt for treatment, while the past state of individuals is likely to be predictive of the final outcome subsequent to treatment. In the remainder of this article, we restrict attention to the Roland and Likert scores as the two key outcomes of interest to summarize a patient's state. We will, however, use the VAS scores for treatment weight calibration in subsequent analyses.

Figure 2. Plots of VAS1, VAS2, Roland, and Likert scores prior to surgery (or end follow-up). Observations from individuals who eventually get treated during the observation period are plotted blue, all other observations are plotted red.


In subsequent text, we refer to "treatment group(s)" as the patient grouping defined by the treatment received by the end of the trial. In the same way, we may refer to an individual as "treated," indicating that he or she belongs to the group of patients who eventually opted for treatment, without precise reference to the timing of treatment itself.

3.2 Inverse probability treatment weighted analysis

MSMs are estimated in two stages. At the first stage, a treatment prediction model is fit, which is used to calculate the patient-specific probability of receiving treatment (surgery in this case) at any point in time while the patient has not yet received the treatment. At the second stage, the inverses of these treatment probabilities are used as weights in the calibration of the substantive model, such as a regression model for the outcome of interest, which is used for the assessment of the treatment effect. For the fitting of the treatment prediction model we may only use the observed data up to and including the time of surgery, or the end of follow-up, whichever comes sooner. All data subsequent to the surgery are discarded.

Let k ∈ {1, ..., 9} denote the visit numbers in follow-up and k = 0 the baseline. We write L(i, t_{ik}) for the set of all observed scores (VAS1, VAS2, Roland, and Likert) for the i-th patient at the k-th visit, with corresponding visit time t_{ik} after baseline. A patient who remains free of surgery until the end of follow-up contributes the full set of measurements {L(i, t_{i1}), ..., L(i, t_{i9})} across all visits to the fitting of the treatment prediction model. If the i-th patient receives surgery at time t_{is} and there are k − 1 scheduled visits prior to surgery for this patient, then the patient contributes the set {L(i, t_{i1}), ..., L(i, t_{i,k−1}), L(i, t_{is})} to the data used for fitting the weights model. Should surgery take place prior to the next scheduled visit, then we define the visit number corresponding to the final measurement to be k(t_{is}) = k. In addition, we have for each i-th patient the baseline set of covariates V(i), which consists of the above-mentioned four baseline scores L(i, 0), besides age, gender, height, and weight. In analogy to the above, we define an additional set M(i, t_{ik}) of time-dependent measures on visits k > 0, defined as the changes in the VAS1, VAS2, Roland, and Likert scores from baseline at each time point as well as the changes in these scores from each previous time point (lag values). Finally, we define the treatment (surgery) indicators A(i, t_i) for any patient such that A(i, t_i) = 1 for any time point from surgery onwards (t_i ≥ t_{is}) and zero otherwise or when surgery never took place for that patient. Note that A(i, 0) = 0 by definition and that treatment can only be given once.

3.2.1 Treatment probability prediction model for inverse probability of treatment weighting (IPTW)

Calibration of a treatment probability model is formally a survival analysis problem on the time until treatment and thus Cox regression could be used. Hernán et al.1,7–9 propose pooled logistic regression models for the estimation of the treatment probabilities. While the main arguments seem to be the simplicity of the approach and ease of implementation, we refer readers to the above-cited papers for a full discussion of the pros and cons involved. To remain consistent with these publications, we apply the same methodology to the sciatica data in this article. Let P_{i,t} be the treatment probability for the i-th patient at time t, conditional on the patient not yet having received treatment at that point. We then formulate the model as

\[
\log\left(\frac{P_{i,t}}{1 - P_{i,t}}\right) = \alpha + h(t) + [L(i,t)\,b_L] + [M(i,t)\,b_M] + [V(i)\,b_V] \tag{1}
\]


The term [L(i,t) b_L] is shorthand for the linear combination of the scores in the predictor set L(i, t) with regression coefficients in the vector b_L, and similarly for the baseline component [V(i) b_V] and for [M(i,t) b_M]. The function h(t) is a function of time since randomization only, defined as

\[
h(t) = \gamma t + \theta\, I(k(t) = 5)
\]

where I(·) denotes the indicator function, taking the value 1 when its argument is true and 0 otherwise, and k(t) denotes the visit interval number corresponding to time t. The last term represents a probability spike at the fifth visit, which accounts for a sudden increase in surgeries carried out as a consequence of the trial actively offering surgery to patients who had not yet received it at that stage and did not show recovery of function or relief of pain. We carried out a confirmatory analysis of our choice of the time component by starting with the more general model

\[
h(t) = \gamma t + \sum_{k=1}^{9} \theta_k\, I(k(t) = k)
\]

and then incrementally merging time points and their coefficients θ_k, from the last visit backward or from the first visit onward. By checking deviances we found that all time points could indeed be merged within the model, except for the fifth visit. While this approach is simple, it is essential to enforce parsimony in the calibration of the weights as much as possible, because inverse probability weighted effect estimation can be highly sensitive to excess variation of the weights (see subsequent results and discussion).

3.2.2 Inverse probability treatment weights

Once the treatment probabilities P_{i,t} have been calibrated at the observed visit and treatment times, the estimated weight W(i) for the i-th person is defined as

\[
W(i) = \frac{W_{\mathrm{num}}(i)}{W_{\mathrm{denom}}(i)} \tag{2}
\]

with the denominator defined as

\[
W_{\mathrm{denom}}(i) = P_{i,t_{is}} \prod_{l=1}^{k-1} \left(1 - P_{i,t_{il}}\right) \tag{3}
\]

if the individual had treatment at time t_{is}, and otherwise as

\[
W_{\mathrm{denom}}(i) = \prod_{l=1}^{9} \left(1 - P_{i,t_{il}}\right) \tag{4}
\]

for those people never taking treatment up to the end of the trial. The numerators W_num(i) are stabilizing weights calculated in the same manner from logistic models complementary to the one described above. They have an identical definition, except that all time-dependent predictors are removed from the model (keeping only the eight baseline predictors defined above and the time component).


An inverse probability of treatment weighting approach then applies these weights to the likelihood as in

\[
\prod_{i=1}^{n} [\mathrm{lik}_i]^{W(i)}
\]

where lik_i is the likelihood contribution of the i-th patient (see also Hernán et al.,1 p. 442).
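For concreteness, the two pooled logistic models and the stabilized weights of equations (2) to (4) could be computed along the following lines. This is only a sketch, not the authors' code: the long-format data frame pv, its file name, and its column names are assumptions (one row per patient and visit up to and including the visit at which surgery occurs, with treated_now equal to 1 on that final row).

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical long-format person-visit data (rows after surgery already discarded).
pv = pd.read_csv("person_visit.csv").sort_values(["id", "visit"])

binom = sm.families.Binomial()

# Denominator model: time component, baseline covariates, current scores,
# changes from baseline and lagged changes, as in model (1).
denom_fit = smf.glm(
    "treated_now ~ time + I(visit == 5) + age + gender + weight + height"
    " + vas1_base + vas2_base + roland_base + likert_base"
    " + vas1 + vas2 + roland + likert"
    " + vas1_chg + vas2_chg + roland_chg + likert_chg"
    " + vas1_lag + vas2_lag + roland_lag + likert_lag",
    data=pv, family=binom).fit()

# Numerator (stabilizing) model: identical, but with all time-dependent predictors removed.
num_fit = smf.glm(
    "treated_now ~ time + I(visit == 5) + age + gender + weight + height"
    " + vas1_base + vas2_base + roland_base + likert_base",
    data=pv, family=binom).fit()

pv["p_denom"] = denom_fit.predict(pv)
pv["p_num"] = num_fit.predict(pv)

def stabilized_weight(g):
    """Equations (2)-(4): product of (1 - P) over surgery-free visits, times P at surgery."""
    treated = g["treated_now"].iloc[-1] == 1
    def history_product(p):
        contrib = 1.0 - p
        if treated:
            contrib.iloc[-1] = p.iloc[-1]   # the last row is the surgery visit
        return contrib.prod()
    return history_product(g["p_num"]) / history_product(g["p_denom"])

weights = pv.groupby("id").apply(stabilized_weight)   # one stabilized weight per patient
```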

3.2.3 Analysis using inverse probability of treatment weighting

Data exploration. A first step in any practical data analysis is to get a basic understanding of the strength of any potential confounding of past pain level scores and baseline covariates between the time-dependent decision to treat on the one hand and the eventual outcome of interest on the other. We would also like an indication of which pain scores and baseline measures might be the strongest confounders. For this purpose, Table 1 displays simple univariate summaries for the baseline predictors (age, gender, weight, height, VAS1, VAS2, Roland, and Likert; first eight rows) and a set of time-dependent variables (last eight rows). The latter are defined by taking each last measured score prior to treatment, or at the end of follow-up (whichever comes sooner), as a reference and then calculating either the changes in the VAS1, VAS2, Roland, and Likert scores from baseline (rows 9–12) or the lagged changes from the immediately preceding time point (rows 13–16).

Table 1. Tabulation of predictors used in the treatment probability calibration model (1), distinguishing between baseline variables and variables subsequent to baseline.

Predictor        Treatment        Roland           Likert

Baseline
Age              (0.68, 0.38)     (0.02, 0.11)     (0.01, 0.03)
Gender           (0.03, 0.02)     (1.82, 1.05)     (0.49, 0.27)
Weight           (0.59, 1.03)     (0.02, 0.07)     (0.00, 0.02)
Height           (0.60, 0.48)     (0.07, 0.07)     (0.00, 0.03)
VAS1 base        (1.11, 3.63)     (0.01, 0.04)     (0.01, 0.00)
VAS2 base        (0.23, 2.86)     (0.00, 0.04)     (0.01, 0.01)
Roland base      (0.28, 0.77)     (0.07, 0.21)     (0.05, 0.03)
Likert base      (0.05, 0.16)     (0.52, 0.74)     (0.28, 0.05)

Time-dependent
VAS1 change      (6.77, 21.78)    (0.01, 0.06)     (0.01, 0.02)
VAS2 change      (7.94, 21.86)    (0.00, 0.02)     (0.00, 0.01)
Roland change    (1.87, 5.40)     (0.27, 0.46)     (0.05, 0.10)
Likert change    (0.30, 1.05)     (0.49, 1.29)     (0.39, 0.55)
VAS1 lags        (7.31, 1.73)     (0.06, 0.01)     (0.01, 0.01)
VAS2 lags        (0.76, 9.19)     (0.00, 0.02)     (0.00, 0.01)
Roland lags      (0.94, 1.02)     (0.05, 0.29)     (0.05, 0.04)
Likert lags      (0.45, 0.08)     (0.51, 0.70)     (0.18, 0.48)

See the text for the precise definition of the predictor variables. The "Treatment" column gives 95% confidence intervals for mean differences between individuals eventually treated and those left untreated by the end of the trial, for each respective predictor. The columns "Roland" and "Likert" give 95% confidence intervals for the regression coefficient of each of these two scores in a univariate regression on each predictor individually. Bold printed results denote confidence intervals excluding 0.


To assess the association of these measures with treatment assignment, the third column shows 95% confidence interval estimates for the mean differences between treatment groups for each of these predictors. Likewise, to evaluate confounding of the baseline and pre-treatment pain level scores with the Roland and Likert outcome scores at end of follow-up, the fourth and fifth columns give 95% confidence estimates for the regression coefficients of either the Roland or the Likert score at the end of the trial in a univariate linear regression on each predictor separately. Note that the predictor data evaluated in this table do not use information recorded after the treatment decision to assess the impact on either the treatment decision itself or on the Roland and Likert outcome scores. Bold printed results display confidence intervals excluding zero and are indicative of strong association between the associated predictor and the outcome of interest.

In interpreting the results, we should note that there is evidence of the confounding scenarios described before whenever there is association between the treatment decision and a predictor (column 3) as well as between the outcome and that predictor (columns 4 and/or 5). We find evidence of such confounding for the changes from baseline across all pain scores (rows 9–12). Baseline measurements themselves seem particularly associated with the decision to treat. The lagged changes immediately preceding the treatment decision are most associated with outcome. These results confirm the patterns observed in Figure 2 and justify the use of complex confounding adjustment methods such as IPTW. Indeed, simple baseline confounding could in principle be dealt with using stratification or regression covariance adjustment only. Application of IPTW in MSMs can resolve complex confounding where past predictors affect the future treatment decision, which may in turn affect future outcome. It must be borne in mind that these change scores are highly correlated with the baseline score. Badly affected patients (having higher scores) may also tend to improve more rapidly than others, even without treatment. Hence, for this reason also, the data must be interpreted with caution and require special methodology.
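The entries of Table 1 amount to simple two-sample comparisons (the Treatment column) and univariate regressions (the Roland and Likert columns). A hedged sketch of the computation, assuming a hypothetical per-patient data frame dat with an eventual-treatment indicator treated, the end-of-study scores roland_end and likert_end, and one column per predictor:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

dat = pd.read_csv("per_patient_summary.csv")   # hypothetical per-patient summary

def mean_diff_ci(x, treated, alpha=0.05):
    """Normal-approximation 95% CI for the mean difference (treated minus untreated)."""
    a, b = x[treated == 1], x[treated == 0]
    diff = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = stats.norm.ppf(1 - alpha / 2)
    return diff - z * se, diff + z * se

def univariate_slope_ci(y, x):
    """95% CI for the slope of the outcome y regressed on a single predictor x."""
    fit = sm.OLS(y, sm.add_constant(x), missing="drop").fit()
    return tuple(fit.conf_int().iloc[1])       # second row: the slope

for pred in ["age", "gender", "weight", "height", "roland_base", "likert_base"]:
    print(pred,
          mean_diff_ci(dat[pred], dat["treated"]),
          univariate_slope_ci(dat["roland_end"], dat[pred]),
          univariate_slope_ci(dat["likert_end"], dat[pred]))
```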

Weights exploration. Figure 3 shows box plots of the calculated adjusted weights W(i) versus visit number. The calculated weights referred to in equations (2), (3), and (4) are based on the full data and are shown in the box plot for visit number 9. The box plots shown at previous visit numbers m < 9 are recalculations of the weights, based on the same treatment probability model (1), but restricting the products within equations (3) and (4) to the first m visits only. The picture displays the expected increase in variability as the weights deviate ever more from 1 as the trial proceeds. This incremental increase happens because of the absence of 'statistical exogeneity' (Hernán et al.,1 pp. 441–442), due to the dependence of the treatment assignment process on the past measured performance score history, causing the weights to become more variable the longer the assignment process is observed. The weight distribution stabilizes quickly after the fifth visit, because most treatment assignments have occurred by that time and hence no further re-weighting is required.

Interpretation and analysis of the weights is cumbersome and not straightforward. Figure 4 shows plots of log(W(i)) versus the baseline Roland scores as well as versus the changes in Roland scores from baseline (left two plots). Both plots are effectively mirror images, which is due to the high negative correlation between the baseline and change scores referred to previously (right plot). Re-weighting is strongly, but not necessarily uniquely, associated with the baseline scores (and thus also with large changes from baseline), which is again as could be expected on intuitive grounds, as these patients will tend to opt for treatment more quickly than less affected patients.


Figure 3. Box plots for treatment weights versus visit numbers.

Figure 4. Logarithms of treatment weights versus Roland baseline scores (left) and Roland changes from baseline (middle). The right-side plot shows changes from baseline versus baseline measurements for Roland scores.


Another way to put this is that the widening of the box plots observed in Figure 3 can in large measure be explained by baseline differences between individuals. Note also from the right-side panel of Figure 4 how larger baseline Roland values are associated with substantive improvements from baseline, although these are predominantly associated with the treated group. It would be of interest to explore the weights further at this stage, possibly through some form of stratification; however, from a mathematical and conceptual point of view they are not the appropriate data summary for such an analysis. For this reason, we introduce an alternative propensity-based approach further on and defer such analysis to that point.

Conditional weighted analysis. In analogy to Peul et al.,6 we base our initial analysis on the covariance regression

\[
L(i, t_{i9}) = \alpha + \beta\, T(i) + [V(i)\,d] + \varepsilon
\]

which regresses L(i, t_{i9}), the outcome score of interest at the ninth visit, on the baseline variables besides the treatment indicator T(i) = A(i, t_{i9}), with ε an error term. Tables 2 and 3 show the median of the bootstrap effect estimates as well as a 95% confidence interval from both an IPTW analysis (first three columns) and a classical unweighted analysis (next three columns, adjusted for baseline variables only) of the above covariance regression model for the Roland and Likert scores, respectively, based on 10,000 bootstrap repetitions to account for the variability of the weight estimation. The top row in each table ("all preds") gives results when including all predictors from Table 1, as described previously in this section.

The weights model may leave unknown confounders unaccounted for. At the same time, the model may include too many predictors, particularly as these are highly correlated (such as the baseline and subsequent change-from-baseline measures), and may thus inflate the variance of the weight prediction, which is undesirable. Some balance must be struck between the need to enforce parsimony of the weight estimation as much as possible and capturing the required confounding information in the weight prediction.
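The IPTW column of Tables 2 and 3 corresponds to a weighted least squares fit of the regression above with the stabilized weights, bootstrapped over patients. The sketch below is a simplified illustration with hypothetical column names; note that in the article the treatment probability models, and hence the weights, are re-estimated within every bootstrap replication, whereas here they are held fixed for brevity.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

dat = pd.read_csv("per_patient_summary.csv")   # hypothetical: one row per patient with
                                               # roland_end, treated, baseline covariates
                                               # and the stabilized weight column 'sw'

def iptw_effect(df):
    """Weighted covariance regression of the end-of-study Roland score."""
    fit = smf.wls("roland_end ~ treated + age + gender + weight + height"
                  " + vas1_base + vas2_base + roland_base + likert_base",
                  data=df, weights=df["sw"]).fit()
    return fit.params["treated"]

rng = np.random.default_rng(1)
boot = [iptw_effect(dat.sample(len(dat), replace=True, random_state=rng))
        for _ in range(10000)]                 # 10,000 replications as in Tables 2 and 3

print(np.percentile(boot, [2.5, 50, 97.5]))    # bootstrap median and 95% limits
```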

Table 2. Median of bootstrap estimates and 95% confidence intervals based on 10,000 bootstrap replications for the treatment effect on Roland scores at end of study, based on application of IPTW weighted analysis (columns 1–3) as well as unweighted analysis (classic, columns 4–6) of a covariance adjusted regression of Roland score at end of study on treatment indicator, adjusted for baseline covariables.

                           IPTW                     Classic                  Marginal weighted
Model                      2.5%   med    97.5%      2.5%   med    97.5%      2.5%   med    97.5%     Dev

all preds                  4.21   2.10   0.07       2.72   1.47   0.14       2.94   1.39   0.42      315.50
-gender, weight, height    4.74   2.40   0.01       2.42   1.25   0.02       3.44   1.66   0.20      318.14
-age                       5.07   2.49   0.06       2.51   1.32   0.07       3.59   1.69   0.23      318.35
-VAS1/2 lags               5.07   2.52   0.03       2.46   1.32   0.09       3.61   1.74   0.21      323.35
-Roland (chg/lags)         4.77   2.51   0.16       2.48   1.31   0.09       3.42   1.72   0.08      325.32
-Likert (chg/lags)         4.23   2.17   0.19       2.50   1.31   0.11       3.06   1.61   0.12      340.40
-VAS1/2 chg                2.47   1.32   0.09       2.47   1.32   0.09       1.83   0.84   0.14      374.98

A marginal weighted estimate is also presented (columns 7–9). The analysis is repeated in the rows below the first by removing predictors from the treatment probability prediction model (Model column). Deviances of each treatment probability prediction model are shown in the last column.


Table 3. Median of bootstrap estimates and 95% confidence intervals based on 10,000 bootstrap replications for the treatment effect on Likert scores at end of study, based on application of IPTW weighted analysis (columns 1–3) as well as unweighted analysis (classic, columns 4–6) of a covariance adjusted regression of Likert score at end of study on treatment indicator, adjusted for baseline covariables.

                           IPTW                     Classic                  Marginal weighted
Model                      2.5%   med    97.5%      2.5%   med    97.5%      2.5%   med    97.5%     Dev

all preds                  0.72   0.21   0.24       0.62   0.28   0.05       0.74   0.24   0.23      315.50
-gender, weight, height    0.69   0.20   0.25       0.55   0.24   0.06       0.70   0.23   0.22      318.14
-age                       0.68   0.19   0.28       0.57   0.26   0.05       0.70   0.22   0.24      318.35
-VAS1/2 lags               0.70   0.23   0.23       0.57   0.26   0.05       0.68   0.26   0.16      323.35
-Roland (chg/lags)         0.75   0.29   0.16       0.58   0.26   0.05       0.72   0.33   0.07      325.32
-Likert (chg/lags)         0.73   0.29   0.15       0.57   0.26   0.05       0.74   0.36   0.06      340.40
-VAS1/2 chg                0.58   0.26   0.05       0.58   0.26   0.05       0.54   0.25   0.03      374.98

A marginal weighted estimate is also presented (columns 7–9). The analysis is repeated in the rows below the first by removing predictors from the treatment probability prediction model (Model column). Deviances of each treatment probability prediction model are shown in the last column.

We address the problem by successively reducing the weight prediction model, removing predictors and re-calculating the weighted estimates, using the results from Table 1 as a guide and removing the change scores last. The results of this procedure are shown in the rows below the first, where predictors are removed from the model as indicated in the Model column. For consistency, we always remove the same predictors from the covariance regression models (both IPTW and classic). The last column in both tables shows the deviance of the logistic regression weight model (1) for the treatment probability, which gives an indication of the impact of removing predictors on the calibrated weights relative to the previous model. Inevitably, the analysis implemented in these tables is somewhat ad hoc in the choice of the order in which predictors are removed (see also the discussion below). We decided to always keep the baseline VAS1, VAS2, Likert, and Roland scores in the model, both because stabilization of the weights calculation in model (2) is required and because we would expect some association with outcome for these scores, also due to the correlation with subsequent score changes. This leaves a standard stratification for these baseline variables as the final model (last row of both tables). Loosely following the results from Table 1, we remove gender, weight, height, and age at randomization first, followed by the VAS1 and VAS2 lags. The Roland and Likert changes and lagged values are removed next, followed by the changes in VAS1 and VAS2 levels from baseline, because there is a strong suggestion of confounding for the latter based on the Table 1 results. Note that for the "classic" approach, the last four rows of the tables correspond to the same model and hence we would expect them to give equal results, since we are only removing time-dependent variables from the weight prediction model, which are not included in the baseline adjustment. The changes in the second decimal place for these results are due to bootstrap variation.
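The stepwise reduction can be organized as a loop over nested predictor sets for the pooled logistic weight model, recording the deviance at each step (the Dev column of Tables 2 and 3). A sketch, reusing the hypothetical person-visit data frame pv and column names introduced earlier; the grouping of variables under the Model-column labels is an assumption.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

pv = pd.read_csv("person_visit.csv")   # hypothetical long-format person-visit data

# The final model keeps only the time component and the four baseline pain scores.
base = ("treated_now ~ time + I(visit == 5)"
        " + vas1_base + vas2_base + roland_base + likert_base")

# Predictor groups in the order in which they are dropped, change scores last.
drop_order = [
    "gender + weight + height",
    "age",
    "vas1_lag + vas2_lag",
    "roland + roland_chg + roland_lag",
    "likert + likert_chg + likert_lag",
    "vas1 + vas1_chg + vas2 + vas2_chg",
]

for n_removed in range(len(drop_order) + 1):
    formula = " + ".join([base] + drop_order[n_removed:])
    fit = smf.glm(formula, data=pv, family=sm.families.Binomial()).fit()
    print(n_removed, round(fit.deviance, 2))   # compare successive increments (chi-square)
```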

Marginal weighted analysis. For comparison, we may consider a marginal analysis, using the new weights

\[
W_m(i) = \frac{1 / W_{\mathrm{denom}}(i)}{\sum_{j=1}^{n} 1 / W_{\mathrm{denom}}(j)} \tag{5}
\]

where the W_denom(i) are defined as before and use the same baseline and time-dependent covariates, and the sum is over all individuals. We can now define a weighted mean difference in an outcome measure X (such as L(t_9) in this paper) as

\[
\Delta = \sum_{i=1}^{n} W_m(i)\, X(i)\, I(T(i) = 0) \;-\; \sum_{i=1}^{n} W_m(i)\, X(i)\, I(T(i) = 1) \tag{6}
\]

A (bootstrapped) standardized version of Δ can be considered an analogue of a t-statistic.
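Equations (5) and (6) translate directly into a weighted difference of group totals. A minimal sketch, assuming numpy arrays w_denom (the unstabilized denominator weights), x (the outcome, e.g. the Roland score at the ninth visit), and treated (the end-of-trial treatment indicator), all of length n:

```python
import numpy as np

def marginal_weighted_difference(w_denom, x, treated):
    """Weighted mean difference Delta of equations (5) and (6)."""
    w_m = (1.0 / w_denom) / np.sum(1.0 / w_denom)        # equation (5)
    return (np.sum(w_m * x * (treated == 0))              # equation (6)
            - np.sum(w_m * x * (treated == 1)))

def bootstrap_sd(w_denom, x, treated, reps=10000, seed=1):
    """Bootstrap standard deviation used to standardize Delta (t-statistic analogue)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    stats = []
    for _ in range(reps):
        idx = rng.integers(0, n, n)                      # resample patients with replacement
        stats.append(marginal_weighted_difference(w_denom[idx], x[idx], treated[idx]))
    return np.std(stats, ddof=1)
```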

Roland scores. For the Roland score, Table 2 suggests that the effect of treatment may be underestimated by the classical approach and that the results at the top of the table are likely affected by overfitting due to excessive variability of the weights. When predictors are removed from the treatment prediction model, the confidence intervals for the IPTW approach eventually exclude zero for the more parsimonious treatment prediction models, but the absolute level of the effects is small, ranging between 2.17 and 2.52, which are modest differences on the Roland scale. Note how the deviances of the treatment probability prediction model increase by small and nonsignificant increments on a chi-square comparison, except for the penultimate removal of the Likert change and lag values. This is partly expected, as the scores correlate and thus act as substitute predictors for each other. For the last row, the IPTW and classic results coincide, as only the baseline scores for VAS1, VAS2, Likert, and Roland are left in the weight prediction model, removing all other baseline and all time-dependent predictors, and thus W(i) = 1 as the numerator and denominator become identical. Results from the marginal weighted approach give somewhat smaller estimates, but still significant in rows 5 and 6.

Likert scores. There do not seem to be substantive differences between the IPTW and classic estimates for the Likert score in Table 3, or between the different weight calibration models for this score. This is perhaps remarkable, as the Likert score is a self-reported evaluation measure and we might expect greater potential for complex confounding patterns for this outcome.

3.3 Secondary exploration and comparison with a propensity-based approach

Williamson et al.2 describe four different propensity score analysis methods. We investigate the application of both a propensity regression and a stratification approach by means of confirmatory analysis, and also to allow more in-depth post hoc exploratory data analysis, which is more difficult within the IPTW approach.

3.3.1 Propensity scores

We define the propensity score P(i) as the probability of treatment for the i-th patient, based on model (1) with all predictors added, but now retaining only the last calculated probability of treatment from the model: P_{i,t_{is}} if the person has treatment at time t_{is}, and otherwise P_{i,t_{i9}} for individuals untreated at the end of the trial (last visit = 9). The corresponding estimates are simply obtained from the fit of the full pooled logistic model discussed before for the MSM approach, hence no new calculations are required. Adjusted estimates of treatment effect are obtained by adding the propensity score and the baseline variables as covariables to a simple linear regression of the final observed outcome score at the ninth visit on the treatment indicator.


Table 4. Median of bootstrap treatment effects and 95% confidence intervals from propensity adjusted covariance regressions for Roland scores on treatment indicator, based on 10,000 bootstrap replications.

                                         Treatment effect
Model                                    2.5%   med    97.5%

Treatment + propensity + baseline        3.17   1.67   0.12
Treatment + propensity                   2.53   1.26   0.13
Treatment                                1.81   0.83   0.12

The final model for the propensity-based analysis is thus

\[
L(i, t_{i9}) = \alpha + \beta\, T(i) + \gamma\, P(i) + [V(i)\,d] + \varepsilon \tag{7}
\]

where P(i) is the propensity score treated as a continuous covariate and T(i) the treatment indicator for treatment prior to the last visit.

3.3.2 Propensity-based analysis and exploration

The first row in Table 4 gives bootstrap-based estimates and confidence bounds for the treatment effect parameter β from application of the above propensity adjusted model, using 10,000 bootstrap replications to account for the calibration of the propensity scores. The estimated expected effect of 1.67 is slightly lower than that found for the weighted analyses in Table 2 (2.52 to 2.10), but the confidence interval is also narrower, thus confirming the previous results from a qualitative point of view. If we remove the baseline covariables from the model, the effect reduces further to 1.26, and similarly to 0.83 when only treatment is left in the model (second and third rows of the table). The last result equates to a simple two-sample t-test.

Table 5 shows the data stratified by percentiles of the calibrated propensity scores. The first three columns give the percentile class definitions and the numbers of treated (t) and untreated (nt) individuals within each of those intervals, as observed by the end of the trial. The next four columns show the mean Roland scores within these percentile classes for both groups at baseline and at outcome (ninth visit, end of the trial). The last two columns show the sum of, and the difference between, the mean outcomes of treated and untreated individuals at the ninth visit. As we observed before from Figure 2, smaller propensities are associated with smaller baseline scores and vice versa, as it is individuals in poorer condition at baseline who will tend to seek treatment. This is also reflected in the larger numbers of untreated individuals in the lower propensity ranges, which reverses as the treatment propensity increases (14 untreated versus 4 treated in the lowest 15%, versus 3 untreated and 15 treated within the highest 15%). Interestingly, mean Roland scores at baseline are systematically larger for treated individuals even within each propensity percentile interval, even though there is a steady increase for both treated and untreated groups with increasing propensity. This may be due either to the coarseness of the percentile class definition or to unknown residual confounding. For the outcome data, we can see that mean Roland scores still increase with propensity, but to a lesser extent. This does not, however, happen for the treated group, for which mean scores remain roughly constant, or at least within an interval of approximately 1–3, with no apparent trend.
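Model (7) and a percentile stratification in the spirit of Table 5 could be computed as follows. This is a hedged sketch: the data frame dat and its columns are the same assumptions as before, with ps holding the propensity score taken from the pooled logistic fit, and the cut points used for the strata are illustrative rather than those of the article.

```python
import pandas as pd
import statsmodels.formula.api as smf

dat = pd.read_csv("per_patient_summary.csv")   # hypothetical per-patient summary with 'ps'

# Propensity-adjusted covariance regression, model (7).
fit = smf.ols("roland_end ~ treated + ps + age + gender + weight + height"
              " + vas1_base + vas2_base + roland_base + likert_base",
              data=dat).fit()
print(fit.params["treated"], fit.conf_int().loc["treated"].values)

# Stratification on propensity percentiles (compare Table 5); illustrative cut points.
dat["stratum"] = pd.qcut(dat["ps"], q=[0, 0.15, 0.30, 0.55, 0.75, 0.85, 1.0])
summary = (dat.groupby(["stratum", "treated"], observed=True)
              .agg(n=("roland_end", "size"),
                   baseline_mean=("roland_base", "mean"),
                   outcome_mean=("roland_end", "mean")))
print(summary)
```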


Table 5. Roland scores data stratified for percentiles of propensity scores.

                Number       Baseline            Outcome
Pctl            nt    t      nt       t          nt      t        Sum     Diff

<15             14    4      8.11     10.75      1.00    2.75     3.75    1.75
–                8    4      11.63    17.33      1.63    2.00     3.63    0.38
–               23    8      14.03    16.55      1.26    1.63     2.89    0.36
–               22    9      15.86    15.86      4.29    3.33     7.63    0.96
75–85            4    8      16.30    18.13      9.78    2.77     12.56   7.01
>85              3   15      15.33    18.00      9.00    1.07     10.07   7.93

Weighted average                                                          1.74
Weighted reduction from baseline for treated                              13.67
Weighted reduction from baseline for untreated                            9.75

The first column shows the percentile intervals of the propensity score. Subsequent columns show the numbers of patients and mean Roland scores within each percentile interval and for treated (t) and untreated (nt) individuals by end of study. The last column shows the difference between mean Roland outcomes for treated and untreated individuals.

Note how mean Roland outcome scores roughly double for the last two percentile intervals (75–85 and >85) in comparison with the responses for smaller propensities. It is these increases for the higher propensities within the untreated group, in comparison with the roughly stable mean Roland scores for the treated individuals, which cause the large effects of 7.01 and 7.93 shown in the last column of treatment differences for these propensity ranges. Calculating a weighted average of the observed differences, which takes the relative widths of the propensity intervals into account, gives an estimated average effect of 1.74 (bottom of the table), which is again of similar magnitude to that observed from the previous propensity-based analysis (model (7) and Table 4) and the weighted analyses (Table 2). Most of this effect is, however, due to the relative increase in scores for the untreated group at the highest propensities, which does not occur for the treated individuals. The sequence of differences within the last column has a monotone ordering of the effect, which is strongly suggestive of a trend in treatment effect with propensity and hence of inhomogeneity of the expected treatment effect across patients, due to initial condition and thus inclination to be treated. Although caution should be applied in interpreting the small sample sizes, treatment may even be harmful at lower propensity, as the lowest 15% shows an increase in mean Roland score of 1.75 for treatment, in contrast to the overall mean decrease of 1.74. Furthermore, it can be observed that scores decrease with time irrespective of whether treatment is applied. Hence, if these estimates are reliable indications of the expected effects (which is not guaranteed at all, due to the substantive assumptions underlying the models applied), then caution should be applied in advising surgery.

As the treatment effect may depend on the treatment propensity, it is of interest to expand the first propensity-based model with a treatment by propensity interaction. Table 6 shows bootstrap results from this exercise. There is a strong treatment-by-propensity interaction, ranging from 99.9 up to 8.8 with a median effect of 29.2, indicating evidence that the treatment effect varies depending on the propensity level. The effect is also large in substantive terms, since the range of observed Roland scores at baseline within the control group is restricted from 0 up to 30. Obviously, this effect would seldom be fully realized for any individual, as it depends on the person's treatment propensity level; Table 6 therefore also shows some propensity percentiles.


Table 6. Median of bootstrap effects and 95% confidence intervals from a treatment by propensity interaction model, based on 10,000 bootstrap replications.

Model: Treatment (t) + propensity (p) + (t×p) + baseline

Effect     2.5%    med     97.5%
t×p        99.85   29.23   8.80
t          1.10    0.99    3.25
p          7.20    27.52   93.86

25% propensity = 0.03        29 × 0.03 = 0.87
Median propensity = 0.065    29 × 0.065 = 2.01
75% propensity = 0.20        29 × 0.20 = 5.8

Shown are the interaction effect (t×p), the propensity effect (p), and the treatment main effect (t).

Figure 5. Roland outcome scores versus propensity (left plot) as well as differences between baseline and outcome measurements of the Roland score versus propensity (right plot). Individuals treated by end of the trial are plotted red and otherwise blue.

An individual at the median propensity level has a propensity of roughly 0.065, which implies an estimated median treatment effect of approximately 0.90. The 2.5 and 97.5 percentiles for the treatment effect at this propensity level are (7.59, 2.68) (bootstrap-based). If the interaction model were an adequate representation of the variation in treatment effect (which may be doubtful), then the impact could be much larger for individuals with higher propensities to have treatment. At the 75th percentile, for example, propensities are roughly 20%, implying a median treatment effect of 4.85 with 2.5 and 97.5 bootstrap percentile bounds of approximately (21.07, 1.49).
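The interaction model and the implied treatment effect at a given propensity level can be sketched in the same regression framework; the helper below simply combines the fitted main effect and interaction coefficients, whatever their estimated values, rather than the rounded figures quoted in Table 6. Column names are the same assumptions as in the earlier sketches.

```python
import pandas as pd
import statsmodels.formula.api as smf

dat = pd.read_csv("per_patient_summary.csv")   # hypothetical per-patient summary with 'ps'

# Treatment-by-propensity interaction model: 'treated * ps' expands to both
# main effects plus their interaction, adjusted for the baseline covariates.
ifit = smf.ols("roland_end ~ treated * ps + age + gender + weight + height"
               " + vas1_base + vas2_base + roland_base + likert_base",
               data=dat).fit()

def effect_at(p):
    """Estimated treatment effect for a patient with propensity p."""
    return ifit.params["treated"] + ifit.params["treated:ps"] * p

for p in (0.03, 0.065, 0.20):   # the 25th, 50th and 75th propensity percentiles (Table 6)
    print(p, round(effect_at(p), 2))
```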


Figure 5 shows the Roland scores at the last visit (outcome) as well as the differences between these scores and the baseline values of the same patient, both plotted against the estimated treatment propensity. It is clear how both the final Roland scores and the differences from baseline separate between the treated and untreated individuals as the propensity increases.

4 Discussion

We presented a weighted re-analysis of patient data from the "wait-and-see" arm of the Dutch Sciatica Trial5 for the estimation of the true, or causal, treatment (surgery) effect. We examined the application of statistical methodology for causal inference, specifically how it can be used for a secondary data-analytic evaluation of a clinical trial. Since patients with a poorer health condition (e.g. more severe pain) are more likely to opt for surgery, which in turn may affect future outcomes (pain severity), a straightforward classical "as-treated" comparison at the end point would lead to biased estimation of the surgery effect. We therefore chose the inverse probability weighting methodology developed by Robins and colleagues,1,3 which is precisely geared toward situations like this, with a more exploratory propensity-based approach as confirmatory analysis.

Our article has argued, by means of an analysis of the Dutch Sciatica Trial data, that the weights may benefit from data-analytic exploration. As opposed to the seemingly established, but possibly naive, practice of fitting a single "large" propensity model, we pursued a conservative approach to the calibration of the weights, starting from an overfit model for the weights and gradually reducing the model size in order to assess the impact of uncertainty in weight estimation on the treatment effect assessment. As remarked in Section 3.2.3, the model selection approach for the calibration of the weights is ad hoc and somewhat problematic. However, we believe that this is at present a general issue in the literature on MSMs, which is subject to debate. Until recently, the advice has been to fit large models and not worry about overfitting, particularly in the epidemiology-oriented literature. This is because not including a potential confounder (covariable) in the model for a propensity score is considered to be a more serious problem than falsely including additional covariates in this model. From a statistical perspective, this would also seem to be justified in terms of asymptotic variance. However, this viewpoint has recently been criticized by authors such as Stephen Senn.4 We are not aware of a general solution to the model selection problem for the weights, either with MSM methodology or with propensity scores generally. In this article, we have numerically illustrated this issue directly by demonstrating the impact of the model uncertainty at the level of the treatment predictor. It must be noted, however, that presenting information on the impact of model choice on the calibrated treatment effect in reports would itself be an improvement on current practice.

Our analysis results suggest that surgery may have moderately beneficial effects, as measured by the Roland score at the ninth visit (two-year follow-up), even after adjustment for baseline covariates such as age, gender, height, and the baseline pain scores. A similar effect was absent, or at least much less pronounced, in both the re-weighted and the classical analyses of the Likert scores. We also undertook a further post hoc re-analysis by stratifying (grouping) the sample by an individual's estimated propensity toward eventually opting for surgery. Among other things, this indicates that patients with high propensities are likely to receive most benefit from surgery; on the other hand, surgery may have a slightly detrimental effect on patients in the low-propensity group.
Unfortunately, the propensity score cannot be used as a diagnostic tool at baseline since it is based on the entire longitudinal profile of pain scores, which are not available at baseline. Among the baseline covariables, it is most strongly correlated with the measured Roland score at baseline, although it is unclear if the latter score alone can be used as a reliable diagnostic marker for individuals who would receive significantly greater benefit from surgery.


It may be of use to develop prediction models for the expected reduction in pain at the endpoint, given the baseline covariates and the initiation of treatment (surgery) at baseline. One could attempt to build such a model by assuming that the effect of surgery at baseline and at a later stage is the same, and then employing the inverse probability of surgery weights developed in this article for each individual in fitting a logistic regression model. However, such an assumption is not realistic, and a weighted analysis is then less than desirable. Perhaps a future observational study could be designed to address this issue. Even better would be if greater care could be applied in clarifying the relevant research question, both in the planning of new studies and in the discussion of existing trial results, such as those of the sciatica trial. It is indeed unlikely that we will be able to perfectly screen out, at baseline, those patients who will eventually need surgery, and some form of wait-and-see period may remain necessary in any effective clinical treatment policy. This points to the need for designs and models which can update a patient's expectation based on observations accruing in the "wait-and-see" period (see also further below).

In order to correct for the bias of traditional approaches, we need to ensure that the model for the probability of seeking treatment is correct. In practice, this means that the model needs to be flexible and include all relevant covariates. On the other hand, the estimation of the weights needs to be achieved with a decent asymptotic rate of convergence, and thus overly complex (e.g. with too many parameters) and/or nonparametric models should be avoided. The estimation of the weights itself is an interesting and important aspect of the methodology. Although it is somewhat counter-intuitive, estimated weights tend to produce better results than known weights (e.g. when the only variables affecting the treatment decision are the baseline covariates). We have verified this in a realistic simulation study (see the supplementary materials, available at http://smm.sagepub.com/), which also supported the rest of the methodology. Similar results have been found in other inverse weighted estimation problems, such as survival analysis with dependent censoring caused by time-dependent covariates.10,11 The simulation studies clearly demonstrated that the method results in nearly unbiased estimation of the causal treatment effect, whereas the classical analysis leads to a biased estimate of the same.

An issue closely related to the above discussion is that the study itself should be designed to measure or record all aspects relevant to the treatment decision. This is almost certainly not the case for the analysis presented here, if only because we are considering a secondary analysis of a randomized trial which was not set up for the purposes of this analysis. Perhaps greater effort should be made at the design stage of studies, even for randomized trials such as these, to record information that accurately describes the treatment decision process, even if this is not the primary focus of the study. The positivity assumption, which underlies almost all "causal" analysis approaches, precludes inclusion of the treatment group data in the analysis presented in this article.
Within the present study's design, the timing of treatment, at either baseline or subsequently, is near-perfectly confounded with the randomized assignment, as patients in the early-surgery group take surgery close to baseline with near-complete compliance, while treatments tend to occur at later time points in the "control" group. This difference is also embedded in the wait-and-see policy, as conservative treatment explicitly consists of an initial "observation period" during which surgeons tend to refrain from more invasive intervention. Hence, the processes leading to the decision to treat, as well as the treatment responses themselves, are certainly (very) different between the two scenarios. The models required to describe them should reflect this, or we should at least be cautious in applying results across treatment arms, from the wait-and-see group to the early-surgery group. There are also related problems in extending the weighted approach discussed in this article. One could, for example, naively attempt to include the data from the other arm of the trial, using a weight of unity for such individuals in a combined analysis of the surgery effect. However, this runs counter to the experience that we should have estimated rather than known weights in the analysis, to account for the different treatment propensities.


Furthermore, treatment propensities can hardly be taken to equal one, even in the treatment arm, and the propensity model calibrated on the control arm is not applicable to the surgery group. We attempted a combined analysis using weights fixed at one for the treatment group, which gave slightly smaller treatment estimates for the Roland score than those reported in this article, but the interpretation of such results is unclear because of the above problems.

Since we investigated regression models at a fixed endpoint (the ninth evaluation), we were unable to capture the dynamic or time-varying effect of treatment. It is not entirely clear how to pose a model that captures the time of treatment initiation, since some individuals in this group may not opt for surgery at all. On the other hand, one could possibly pose a dynamic model for the two subdistributions of the pain scores at the current evaluation (one for those opting for surgery and the other for those still undergoing conservative treatment), conditional on not receiving treatment up to the prior evaluation. We have not pursued the details of such a model in this article. Given the difficulties encountered in applying the weighted approach attempted in this article for a data-analytic investigation, an alternative and more model-based statistical approach would, however, be highly desirable for such secondary data analyses.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

References

1. Hernán MA, Brumback B and Robins JM. Marginal structural models to estimate the joint causal effect of nonrandomized treatments. J Am Stat Assoc 2001; 96: 440–448.
2. Williamson E, Morley R, Lucas A, et al. Propensity scores: from naïve enthusiasm to intuitive understanding. Stat Methods Med Res 2011; 21: 273–293.
3. Fitzmaurice G, Davidian M, Verbeke G, et al. Longitudinal data analysis. Boca Raton, FL: Chapman and Hall, 2009, chapter 23.
4. Senn S, Graf E and Caputo A. Stratification for the propensity score with linear regression techniques to assess the effect of treatment or exposure. Stat Med 2007; 26: 5529–5544.
5. Peul WC, van Houwelingen HC, van den Hout WB, et al. Surgery versus prolonged conservative treatment for sciatica. New Engl J Med 2007; 356: 2245–2256.
6. Peul WC, van den Hout WB, Brand R, et al. Prolonged conservative care versus early surgery in patients with sciatica caused by lumbar disc herniation: two year results of a randomised controlled trial. Br Med J 2008; 336: 1355.
7. Hernán MA, Brumback BA and Robins JM. Estimating the causal effect of zidovudine on CD4 count with a marginal structural model for repeated measures. Stat Med 2002; 21: 1689–1709.
8. Hernán MA, Brumback BA and Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 2000; 11: 561–570.
9. Robins JM, Hernán MA and Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000; 11: 550–560.
10. Robins JM and Rotnitzky A. Recovery of information and adjustment for dependent censoring using surrogate markers. In: Jewell N, Dietz K and Farewell V (eds) AIDS epidemiology: methodological issues. Boston: Birkhäuser, 1992, pp. 297–331.
11. Satten GA, Datta S and Robins JM. An estimator for the survival function when data are subject to dependent censoring. Stat Prob Lett 2001; 54: 397–403.

