Original Article

Estimating Excess Hazard Ratios and Net Survival When Covariate Data Are Missing: Strategies for Multiple Imputation

Milena Falcaro,a Ula Nur,b Bernard Rachet,b and James R. Carpenterb,c

Background: Net survival is the survival probability we would observe if the disease under study were the only cause of death. When estimated from routinely collected population-based cancer registry data, this indicator is a key metric for cancer control. Unfortunately, such data typically contain a non-negligible proportion of missing values on important prognostic factors (eg, tumor stage).

Methods: We carried out an empirical study to compare the performance of complete records analysis and several multiple imputation strategies when net survival is estimated via a flexible parametric proportional hazards model that includes stage, a partially observed categorical covariate. Starting from fully observed cancer registry data, we induced missingness on stage under three scenarios. For each of these scenarios, we simulated 100 incomplete datasets and evaluated the performance of the different strategies.

Results: Ordinal logistic models are not suitable for the imputation of tumor stage. Complete records analysis may lead to grossly misleading estimates of net survival, even when the missing data mechanism is conditionally independent of survival time given the covariates and the bias in the excess hazard ratio estimates is negligible.

Conclusions: As key covariates are unlikely to be missing completely at random, studies estimating net survival should not use complete records. When the missingness can be inferred from available data, appropriate multiple imputation should be performed. In the context of flexible parametric proportional hazards models with a partially observed stage covariate, a multinomial logistic imputation model for stage should be used and should include the Nelson-Aalen cumulative hazard estimate and the event indicator.

(Epidemiology 2015;26: 421–428)

Submitted 19 September 2013; accepted 5 February 2015.
From the aUniversity College London, London, United Kingdom; bLondon School of Hygiene and Tropical Medicine, London, United Kingdom; and cMRC Clinical Trials Unit, London, United Kingdom.
Supported by Cancer Research UK (C1336-A11700) and by the Medical Research Council (G0900724 and G0900701 to JRC).
Correspondence: Milena Falcaro, Department of Primary Care and Population Health, UCL Medical School, Rowland Hill Street, London NW3 2PF, UK. E-mail: [email protected].
Copyright © 2015 Wolters Kluwer Health, Inc. All rights reserved. ISSN: 1044-3983/15/2603-0421 DOI: 10.1097/EDE.0000000000000283

Epidemiology  •  Volume 26, Number 3, May 2015

Net survival, the survival probability we would observe if the disease under study were the only cause of death, is one of the major indicators for cancer control policy when estimated using routinely collected population-based data. Unfortunately, such data typically contain a non-negligible proportion of missing values on important prognostic factors, such as categorical tumor stage. For example, it may happen that older patients with a very poor prognosis are less likely to receive thorough staging investigation. Restricting the analysis to the subset of complete records results in non-trivial loss of information and, as we show in this article, is typically misleading when data are not missing completely at random (MCAR). Because missingness in cancer registry data usually depends on prognostic covariates and survival, for reliable inference we therefore need to make appropriate use of the information in the partially observed records.1 Multiple imputation is potentially a practically convenient method to do this.2,3

However, use of multiple imputation in this context requires appropriate handling of certain key issues. The flexible parametric modeling approach proposed by Royston and Parmar4 is increasingly being used in cancer research. Multiple imputation consistent with this class of models is non-trivial3 and, to the best of our knowledge, no study has yet been published to address the complexities in the cancer survival setting. Using an empirical study derived from cancer registry data, we evaluate the performance of complete records analysis and several approaches to imputation in this context and we give recommendations for the handling of missing covariate data in this setting.

We give a brief introduction to net survival and flexible parametric survival modeling. Next, we describe the cancer registry data from which our empirical study is derived, alongside the various scenarios and models under investigation. We follow with our results and then conclude with a discussion and some practical recommendations.

ESTIMATING NET SURVIVAL VIA FLEXIBLE PARAMETRIC MODELING

Cancer patients, especially elderly patients, may die of causes other than the malignancy under study. However, one of our primary interests is to estimate the survival probability that we would observe if the cancer in question were the only reason a patient could die. This is termed the net survival and is a key metric for health policy makers, as it is independent of differences in non-cancer mortality and therefore allows fair comparisons of cancer survival between populations defined geographically, temporally, or according to other covariables.

Estimating net survival requires handling competing mortality risks. Two approaches have been proposed for this: cause-specific survival and relative survival.5,6 With cause-specific survival only deaths due to the cancer under study are considered as events, whereas deaths from all other causes are treated as censored observations. This approach therefore relies on an accurate recording of the cause of death. Unfortunately, this information is often unavailable or unreliable, especially in routine population-based data. To address this problem, the relative survival approach estimates the excess hazard of death experienced by the cancer patients as compared with a general population.

In a relative survival analysis, the total hazard is typically assumed to be the sum of two components: the cancer-related or excess hazard and the background or expected hazard, where the latter is usually retrieved from life tables of a general population with sociodemographic characteristics comparable with the cancer patients. We therefore use cancer registry data to estimate the total hazard and life tables of the general population to estimate the expected hazard of the cancer patients. The difference between them is the excess hazard and its accumulation over time defines the cumulative excess hazard, which for patient $i$ ($i = 1, \ldots, n$) and time $t$ we denote by $\Lambda_E(t \mid x_i)$, where $x_i$ is a vector of covariates. At time $t$ the $i$th patient-specific net survival is $S_{E_i}(t) = e^{-\Lambda_E(t \mid x_i)}$ and the (overall) net survival is obtained as the average of the individual net survival curves, ie,

$$S_E(t) = \frac{1}{n} \sum_{i=1}^{n} S_{E_i}(t).$$

Because this is an average over the population, if data are not MCAR a complete records analysis takes the summation over a non-random subset of the records, and the resulting estimate is hence biased.

Until recently, under both the cause-specific and the relative survival approaches, net survival was traditionally estimated by assuming the two mortality processes (ie, death due to cancer and death due to other causes) to be independent. Perme et al7 pointed out that this assumption is very often violated and that informative censoring is most likely to arise. A way to overcome this problem is to use the non-parametric method proposed by Pohar Perme et al,7 which adjusts for informative censoring via inverse probability weights. An alternative approach consists of using a parametric or semiparametric survival model where the non-random censoring is accounted for by including in the model the covariates that simultaneously influence the two mortality processes (see Danieli et al8). The two processes are thus assumed to be independent after conditioning on appropriate covariates. The assumptions made on the censoring mechanism are, however, essentially the same in these two methods.
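The averaging step that defines overall net survival can be illustrated with a minimal Python sketch. The cumulative excess hazards and the missingness pattern below are invented purely for illustration (they are not registry data), and `net_survival` is a hypothetical helper name. The sketch only shows that averaging over a non-random subset of records changes the result.

```python
import numpy as np

def net_survival(cum_excess_hazard):
    """Average of the individual net survival values exp(-Lambda_E(t | x_i)) at one time t."""
    lam = np.asarray(cum_excess_hazard, dtype=float)
    return float(np.mean(np.exp(-lam)))

# Hypothetical cumulative excess hazards at a fixed time t for six patients.
lam_all = np.array([0.05, 0.10, 0.20, 0.80, 1.50, 2.50])

# Assumption for illustration: stage is observed only for the three
# patients with the best prognosis, so the complete records are not MCAR.
observed = np.array([True, True, True, False, False, False])

s_all = net_survival(lam_all)                 # average over everyone
s_complete = net_survival(lam_all[observed])  # average over complete records only
print(round(s_all, 3), round(s_complete, 3))
```

In this toy example the complete-records average is far more optimistic than the population average, even though each individual curve is computed correctly.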


Because we use routine population-based cancer data for which the cause of death is not reliably known, our research questions are naturally addressed via relative survival analysis. In this context, the flexible parametric proportional hazards modeling approach proposed by Royston and Parmar4 and extended to relative survival by Nelson et al9 has several advantages that justify its additional complexities. In particular, it allows for the estimation of (possibly time-dependent) covariate effects and yields a smooth estimate of the excess hazard function.

We will model the effect of covariates on the log cumulative excess hazard scale:

$$\ln(\Lambda_E(t \mid x)) = \ln(\Lambda_0(t)) + \beta' x, \quad (1)$$

where $\Lambda_0(t)$ is the baseline cumulative excess hazard function and $\beta$ denotes a vector of regression coefficients. If $\ln(\Lambda_0(t))$ is modeled as a linear function of $\ln(t)$, this gives the Weibull model. Following Royston and Parmar4 and Nelson et al9, we approximate $\ln(\Lambda_0(t))$ with a restricted cubic spline function of $\ln(t)$, that is, a set of piecewise polynomials of order three that satisfy certain conditions and that join together at specific times, chosen by the analyst, termed "knots." The resulting curve is restricted to be linear beyond the boundary knots (ie, the lowest and highest knots) and to be continuous with continuous first and second derivatives. Restricted cubic splines with $r + 1$ degrees of freedom ($r$ internal and two boundary knots) can be fitted as a linear function of $r + 1$ derived variables.4,10,11

We note that model (1) assumes proportional cumulative excess hazards, which in turn imply proportionality of the excess hazards. This means that if, for example, model (1) contains only a binary covariate $x$ then the ratio $\Lambda_E(t \mid x = 1) / \Lambda_E(t \mid x = 0)$ does not depend on $t$ and, because $\Lambda_E(t \mid x)$ is merely an accumulation of excess hazard over time, the ratio of the excess hazards for $x = 1$ vs. $x = 0$ must also remain constant over time. However, the proportional cumulative excess hazards assumption can readily be relaxed by including suitable interactions between the effect of time and other covariates. Details of the estimation can be found, for example, in Nelson et al.9
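The "derived variables" of a restricted cubic spline can be sketched as follows. This is a minimal Python illustration of the basis in the parameterization described by Royston and Parmar, not the code used in the study. The function name `rcs_basis` and the example knots are assumptions, and fitting software may additionally rescale or orthogonalize these columns.

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted cubic spline basis (Royston-Parmar parameterization).

    knots: sorted values; the first and last entries are the boundary knots.
    Returns an (n, r + 1) matrix: x itself plus one column per internal knot.
    """
    x = np.asarray(x, dtype=float)
    k = np.asarray(knots, dtype=float)
    k_min, k_max = k[0], k[-1]

    def cube_plus(u):
        # Truncated cubic: u^3 where u > 0, and 0 otherwise.
        return np.clip(u, 0.0, None) ** 3

    cols = [x]
    for k_j in k[1:-1]:
        lam_j = (k_max - k_j) / (k_max - k_min)
        cols.append(cube_plus(x - k_j)
                    - lam_j * cube_plus(x - k_min)
                    - (1.0 - lam_j) * cube_plus(x - k_max))
    return np.column_stack(cols)

# Hypothetical knots on the ln(t) scale: two boundary and three internal knots.
example_knots = [-2.0, -1.0, 0.0, 1.0, 2.0]
basis = rcs_basis(np.linspace(-3.0, 3.0, 7), example_knots)
print(basis.shape)
```

Each non-linear column is zero below the lowest knot and linear above the highest knot, which is what restricts the fitted curve to be linear in the tails.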

EMPIRICAL STUDY

Study Settings

To illustrate the potential bias that could arise in relative survival analysis involving partially observed covariates and to evaluate the performance of several multiple imputation strategies to address this, we used a subset of data from the National Cancer Registry of England and Wales. Specifically, data were retrieved from four cancer registries (Northern and Yorkshire, East Anglia, Oxford, and West Midlands) with similar levels of data completeness. We extracted the records of male patients who were diagnosed with colorectal cancer between 1998 and 2006 with follow-up to the end of 2009 and who had complete data on age at diagnosis (agediag), deprivation quintile (dep), tumor stage at diagnosis (stage), survival time (T), and event indicator (D). This gave a total of 44,461 complete records.

In data of this kind, stage at diagnosis is a major prognostic factor but is typically a variable with a high proportion of missing values. As a categorical variable, it is also the most awkward to impute. This is therefore the focus of our study.

Let i index the patients, Ri be a missing-data indicator (ie, Ri = 1 if stage is missing and 0 otherwise) and Zi represent the values of the remaining variables in the dataset. Recall we defined Ti as the survival time and Di as the event indicator. We set values of stage to missing according to the following rules:

• MCAR: $\operatorname{logit}[\Pr(R_i = 1 \mid Z_i)] = \delta_0$

• Covariate-dependent missing at random: $\operatorname{logit}[\Pr(R_i = 1 \mid Z_i)] = \alpha_0 + \alpha_1 \, \mathrm{agediag}_i$

• Missing at random (MAR): $\operatorname{logit}[\Pr(R_i = 1 \mid Z_i)] = \gamma_0 + \gamma_1 \, \mathrm{agediag}_i + \gamma_2 T_i + \gamma_3 D_i$

where agediag was standardized and centered at 60, $\alpha_1 = \gamma_1 = 1.5$, $\gamma_2 = -2$ and $\gamma_3 = -0.7$, whereas $\delta_0$, $\alpha_0$, and $\gamma_0$ were chosen so that 30% of stage values were missing overall. Note that by "covariate-dependent missing at random" we mean that the missingness mechanism does not depend on the survival time or event indicator, given the covariates. Starting from the 44,461 fully observed records, we independently created 100 partially observed datasets under each of these three mechanisms. Table 1 reports descriptive statistics for the original complete data and the expected values for the subset of remaining complete records under the three missingness mechanisms.
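The three mechanisms above can be mimicked with a short Python sketch. The data are synthetic stand-ins, not the registry records, and the bisection routine `calibrate_intercept` is only one assumed way of choosing the intercept so that about 30% of values are missing. The article does not state how $\delta_0$, $\alpha_0$, and $\gamma_0$ were obtained.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def calibrate_intercept(linear_part, target=0.30, lo=-50.0, hi=50.0, iters=100):
    """Bisection for the intercept that gives a mean missingness probability of `target`."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expit(mid + linear_part).mean() < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(2015)
n = 5000
# Synthetic stand-ins for the registry variables (assumptions, not study data).
age_std = rng.normal(0.0, 1.0, n)    # standardized age at diagnosis
time = rng.exponential(3.0, n)       # survival time in years
event = rng.integers(0, 2, n)        # event indicator

# Linear predictors without intercept, using the slopes quoted in the text.
linear = {
    "MCAR": np.zeros(n),
    "CD-MAR": 1.5 * age_std,
    "MAR": 1.5 * age_std - 2.0 * time - 0.7 * event,
}

probs, missing = {}, {}
for name, lp in linear.items():
    intercept = calibrate_intercept(lp)
    probs[name] = expit(intercept + lp)
    # R_i = 1 (stage set to missing) with probability probs[name][i].
    missing[name] = rng.random(n) < probs[name]
    print(name, round(float(missing[name].mean()), 3))
```

Repeating the final Bernoulli draw with fresh random numbers would give the independently generated incomplete datasets described in the text.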

Substantive Model

As described above, we used a flexible parametric model for the log cumulative excess hazard:

$$\ln(\Lambda_E(t \mid x_i)) = s_1(\ln(t); \gamma_1, k_1) + \beta' x_i + s_2(\mathrm{agediag}_i; \gamma_2, k_2),$$

where $s_1()$ and $s_2()$ are restricted cubic spline functions, $\gamma_1$, $\beta$ and $\gamma_2$ denote vectors of regression coefficients, whereas $k_1$ and $k_2$ are two sets of knots. In more detail, $s_1(\ln(t); \gamma_1, k_1)$ is a restricted cubic spline function of $\ln(t)$ with five degrees of freedom (two boundary and four internal knots) to approximate the log baseline cumulative excess hazard. The internal knots were located at the 20th, 40th, 60th, and 80th centiles of the distribution of the uncensored log survival times. The covariates $x_i$ were indicators for cancer stage at diagnosis and deprivation groups. In addition, we included another restricted cubic spline function, $s_2(\mathrm{agediag}_i; \gamma_2, k_2)$, with three degrees of freedom for age at diagnosis (internal knots were placed at the 33rd and 67th centiles of the distribution). The boundary knots for $s_1()$ and $s_2()$ were set at the minimum and maximum values of, respectively, the uncensored log survival times and agediag. The background general population mortality rates were retrieved from regional life tables stratified by sex, age, deprivation quintile, and calendar year.

Our primary focus was on the estimation of log excess hazard ratios for stage and stage-specific net survival at 1, 5, and 10 years post-diagnosis. Analysis of the 44,461 fully observed records gave us reference parameter values for evaluating the performance of the various strategies.
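The knot placement described above can be sketched in Python as follows. The data are synthetic and `spline_knots` is a hypothetical helper. `np.percentile` uses linear interpolation by default, which may differ slightly from the centile definition used by other software.

```python
import numpy as np

def spline_knots(values, internal_centiles):
    """Boundary knots at the minimum and maximum, internal knots at the given centiles."""
    v = np.asarray(values, dtype=float)
    internal = np.percentile(v, internal_centiles)
    return np.concatenate(([v.min()], internal, [v.max()]))

rng = np.random.default_rng(0)
# Synthetic stand-ins for survival time, event indicator, and age (assumptions).
time = rng.exponential(3.0, 1000) + 0.01
event = rng.integers(0, 2, 1000)
agediag = rng.normal(70.0, 10.0, 1000)

# Knots for s1: centiles of the uncensored log survival times.
k1 = spline_knots(np.log(time[event == 1]), [20, 40, 60, 80])
# Knots for s2: centiles of age at diagnosis.
k2 = spline_knots(agediag, [33, 67])
print(k1.round(2), k2.round(1))
```

Four internal knots give the five degrees of freedom used for the baseline, and two internal knots give the three degrees of freedom used for age.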

Imputation Strategies

Under each of the missing data scenarios, we investigated the performance of a complete records analysis and four approaches to multiple imputation of the missing stage variable, which are summarized in Table 2. As the table indicates, these varied according to (1) whether the categorical variable stage was imputed using ordinal or multinomial logistic regression and (2) whether the survival time and its logarithm or the Nelson-Aalen estimate of the cumulative hazard was used in the imputation model. All the imputation models also included the event indicator, a restricted cubic spline function for age at diagnosis (with the same knots as in the substantive model) and indicator variables for deprivation.

We explored the performance of the Nelson-Aalen estimate of the cumulative hazard as this was recommended in the different context of Cox proportional hazards modeling by White and Royston.12 The Nelson-Aalen estimate is calculated at the observed time of event or censoring and can easily be derived in standard statistical software. In Stata, for instance, before imputation we can stset the data and use the "sts gen H=na" command to create a new variable H containing the Nelson-Aalen estimate for each patient. This new variable can then be included in the imputation model.

Following the discussion in Carpenter and Kenward,3(p. 55) we used 30 imputations. Net survival $S_E(t)$ was estimated from each imputed dataset for t = 1, 5, and 10, but transformed to approximate normality using $\log\{-\log[S_E(t)]\}$ before applying Rubin's rules to obtain a summary estimate and confidence interval.3,13 These were then back transformed for interpretation.

As in the article by White and Royston,12 the standard errors of the parameter estimates from our substantive model were not adjusted for the uncertainty around the cumulative hazard estimation. Although this may lead to confidence intervals that are too narrow, it is very unlikely that the adjustment would make much difference to our inferential conclusions.
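Two computational steps from this section can be sketched in Python: the Nelson-Aalen estimate evaluated at each patient's observed time, and the pooling of net survival estimates on the complementary log-log scale. Both functions are illustrative assumptions rather than the authors' code. In particular, the delta-method standard error on the transformed scale and the normal quantile (instead of the t reference distribution of Rubin's rules) are simplifications made here.

```python
import numpy as np

def nelson_aalen(time, event):
    """Nelson-Aalen cumulative hazard evaluated at each subject's own observed time."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    uniq, inv = np.unique(time, return_inverse=True)
    deaths = np.bincount(inv, weights=event, minlength=len(uniq))
    n_at_time = np.bincount(inv, minlength=len(uniq))
    # Number at risk just before each distinct time: subjects with time >= that time.
    at_risk = n_at_time[::-1].cumsum()[::-1]
    return np.cumsum(deaths / at_risk)[inv]

def pool_net_survival(surv, se, z=1.959964):
    """Pool per-imputation net survival estimates on the log(-log) scale.

    surv, se: estimates and standard errors from each imputed dataset.
    Returns (pooled estimate, lower limit, upper limit) on the survival scale.
    """
    surv = np.asarray(surv, dtype=float)
    se = np.asarray(se, dtype=float)
    m = len(surv)
    g = np.log(-np.log(surv))                     # transformed estimates
    se_g = se / (surv * np.abs(np.log(surv)))     # delta-method SE (assumption)
    q_bar = g.mean()
    within = np.mean(se_g ** 2)
    between = g.var(ddof=1)
    total = within + (1.0 + 1.0 / m) * between    # Rubin's total variance
    half = z * np.sqrt(total)
    back = lambda u: float(np.exp(-np.exp(u)))    # inverse of log(-log(.))
    # A larger value on the transformed scale means lower survival, so limits swap.
    return back(q_bar), back(q_bar + half), back(q_bar - half)

H = nelson_aalen([1, 2, 2, 3], [1, 1, 0, 1])
est, lower, upper = pool_net_survival([0.80, 0.82, 0.79], [0.02, 0.02, 0.02])
print(H, round(est, 3), round(lower, 3), round(upper, 3))
```

The vector returned by `nelson_aalen` plays the role of the variable H created by the Stata command mentioned above and would enter the imputation model as a covariate together with the event indicator.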

Measures of Performance

The results from the 100 replications were then compared with those obtained from the analysis of the original 44,461 fully observed records. Comparisons were made in terms of bias (average, maximum, and percentage relative bias) and root mean square error (rMSE).14 All the analyses were performed using Stata 12.15
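As a rough illustration of these measures, the Python sketch below computes them for a vector of replicate estimates against a reference value. The reading of "maximum bias" as the signed error of largest magnitude is an assumption based on the table captions, and the numbers are invented.

```python
import numpy as np

def performance(estimates, reference):
    """Bias summaries and rMSE of replicate estimates relative to a reference value."""
    err = np.asarray(estimates, dtype=float) - reference
    ave = float(err.mean())
    return {
        "ave_bias": ave,
        "pct_bias": 100.0 * ave / reference,
        # Signed error with the largest absolute value (assumed definition).
        "max_bias": float(err[np.argmax(np.abs(err))]),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
    }

# Hypothetical estimates from three replications, reference value 1.0.
result = performance([0.9, 1.0, 1.3], 1.0)
print(result)
```

Because the reference is the estimate from the fully observed data rather than a known truth, these measures describe the distortion caused by missingness and its handling, not the total error of the substantive model.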


TABLE 1.  Descriptive Statistics for the Original Fully Observed Data and the Expected Values for the Subset of Remaining Complete Records Under the Three Missingness Mechanisms

| | Fully Observed Data | MCAR | CD-MAR | MAR |
|---|---|---|---|---|
| Complete records; number (%) | 44,461 (100) | 31,128 (70) | 30,783 (69.2) | 30,932 (69.6) |
| Percent with events | 62.3 | 62.3 | 58.5 | 46.3 |
| Survival time (years); median | 3.6 | 3.6 | 4.3 | 7.8 |
| Age at diagnosis (years); mean | 69.7 | 69.7 | 66.3 | 67.2 |
| Stage (%) | | | | |
| 1 | 14.0 | 13.8 | 14.1 | 17.5 |
| 2 | 31.9 | 32.0 | 30.7 | 37.3 |
| 3 | 29.9 | 29.9 | 30.4 | 31.8 |
| 4 | 24.2 | 24.3 | 24.8 | 13.4 |
| Deprivation quintile (%) | | | | |
| 1 (least deprived) | 21.0 | 21.0 | 21.4 | 22.3 |
| 2 | 21.3 | 21.3 | 21.3 | 21.9 |
| 3 | 19.5 | 19.6 | 19.3 | 19.5 |
| 4 | 20.0 | 19.9 | 19.6 | 19.4 |
| 5 (most deprived) | 18.2 | 18.2 | 18.3 | 16.9 |

The MCAR, CD-MAR, and MAR columns give the empirical expected distribution in the remaining complete records. CD-MAR indicates covariate-dependent missing at random.

RESULTS

Table 3 shows, over the 100 replications, the three measures of bias and the rMSE for the log excess hazard ratios for each cancer stage (with stage 1 as the reference). When data are MCAR, we see that imputing stage with ordinal logistic regression results in markedly higher bias and rMSE. This persists with covariate-dependent MAR. Moving to MAR, both the complete records analysis and the multiple imputation using ordinal logistic regression are severely biased. Furthermore, the multiple imputation based on multinomial logistic regression with survival time and log survival time as covariates also performs poorly. Only multiple imputation using a multinomial logistic regression with the Nelson-Aalen estimate of the cumulative hazard and the event indicator as covariates gives a practically negligible bias and small rMSE.

The Figure and Table 4 show the results for 1-year net survival for the three missingness mechanisms using complete records analysis and the four multiple imputation methods. Overall, once again imputation models with a multinomial logistic regression and the Nelson-Aalen estimate of the cumulative hazard perform best. However, although the complete records analysis estimates of the log excess hazard ratios for stage are unbiased under covariate-dependent MAR, the corresponding complete records analysis estimates of stage-specific net survival are markedly biased. This occurs because, as pointed out above, net survival is an average of individual net survival curves and here the marginalization is carried out using data that are not MCAR. Results for net survival at 5 and 10 years are similar and thus not presented here.

TABLE 2.  Multiple Imputation Strategies for Imputing the Missing Stage Covariate

| Multiple Imputation Strategy | Functional Form | How Survival Is Modeled in the Imputation |
|---|---|---|
| MI_ologit_surv | Ordinal logistic | Survival time and log survival time |
| MI_ologit_na | Ordinal logistic | Nelson-Aalen estimate of cumulative hazard |
| MI_mlogit_surv | Multinomial logistic | Survival time and log survival time |
| MI_mlogit_na | Multinomial logistic | Nelson-Aalen estimate of cumulative hazard |

All imputation models also included the event indicator and the predictors from the substantive model.

TABLE 3.  Average Bias (Ave. Bias), Percentage Relative Bias (%Bias), Maximum Positive or Negative Bias (Max Bias), and rMSE over the 100 Replications for the Estimates of the Log Excess Hazard Ratios for Stage (Reference Category = Stage 1)

| Scenario | Method | Stage 2 vs. 1 [0.780]: Ave. Bias | Stage 2 vs. 1: %Bias | Stage 2 vs. 1: Max Bias | Stage 2 vs. 1: rMSE | Stage 3 vs. 1 [1.651]: Ave. Bias | Stage 3 vs. 1: %Bias | Stage 3 vs. 1: Max Bias | Stage 3 vs. 1: rMSE | Stage 4 vs. 1 [3.228]: Ave. Bias | Stage 4 vs. 1: %Bias | Stage 4 vs. 1: Max Bias | Stage 4 vs. 1: rMSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MCAR | CRA | 0.00 | 0.0 | 0.12 | 0.04 | 0.00 | -0.1 | 0.10 | 0.04 | 0.00 | 0.0 | 0.10 | 0.04 |
| MCAR | MI_mlogit_na | -0.04 | -5.3 | -0.14 | 0.06 | -0.05 | -3.3 | -0.15 | 0.07 | -0.06 | -1.8 | -0.14 | 0.07 |
| MCAR | MI_ologit_na | 0.25 | 32.4 | 0.35 | 0.26 | 0.35 | 21.4 | 0.45 | 0.35 | 0.06 | 2.0 | 0.16 | 0.07 |
| MCAR | MI_mlogit_surv | -0.03 | -4.3 | -0.14 | 0.05 | -0.06 | -3.8 | -0.15 | 0.07 | -0.07 | -2.1 | -0.15 | 0.08 |
| MCAR | MI_ologit_surv | 0.23 | 29.4 | 0.34 | 0.23 | 0.31 | 19.0 | 0.42 | 0.32 | 0.03 | 0.9 | 0.13 | 0.05 |
| CD-MAR | CRA | 0.03 | 4.4 | 0.13 | 0.05 | 0.05 | 3.2 | 0.14 | 0.06 | 0.09 | 2.9 | 0.20 | 0.10 |
| CD-MAR | MI_mlogit_na | 0.02 | 2.4 | 0.12 | 0.04 | 0.01 | 0.7 | 0.09 | 0.04 | 0.00 | 0.1 | 0.09 | 0.03 |
| CD-MAR | MI_ologit_na | 0.41 | 52.2 | 0.50 | 0.41 | 0.55 | 33.1 | 0.63 | 0.55 | 0.25 | 7.8 | 0.34 | 0.25 |
| CD-MAR | MI_mlogit_surv | 0.03 | 3.9 | 0.14 | 0.05 | -0.02 | -1.0 | -0.11 | 0.04 | -0.04 | -1.1 | -0.12 | 0.05 |
| CD-MAR | MI_ologit_surv | 0.39 | 50.0 | 0.48 | 0.39 | 0.52 | 31.3 | 0.60 | 0.52 | 0.25 | 7.9 | 0.34 | 0.26 |
| MAR | CRA | 0.39 | 50.1 | 0.51 | 0.39 | 0.62 | 37.5 | 0.73 | 0.62 | 0.76 | 23.6 | 0.88 | 0.76 |
| MAR | MI_mlogit_na | 0.03 | 3.3 | 0.16 | 0.05 | 0.09 | 5.4 | 0.21 | 0.10 | 0.11 | 3.3 | 0.25 | 0.11 |
| MAR | MI_ologit_na | 0.34 | 43.6 | 0.44 | 0.34 | 0.69 | 42.0 | 0.80 | 0.69 | 0.26 | 8.0 | 0.38 | 0.26 |
| MAR | MI_mlogit_surv | 0.30 | 38.3 | 0.74 | 0.33 | 0.49 | 29.5 | 0.88 | 0.51 | 0.54 | 16.7 | 0.94 | 0.56 |
| MAR | MI_ologit_surv | 0.60 | 76.5 | 0.77 | 0.60 | 1.16 | 70.0 | 1.34 | 1.16 | 1.29 | 40.0 | 1.52 | 1.29 |

The figures reported within square brackets in the column head of the table are the reference parameter values obtained from the analysis of the original fully observed data. Methods: CRA (complete records analysis) and MI (multiple imputation). The MI strategies (ie, MI_mlogit_surv, MI_mlogit_na, MI_ologit_surv, and MI_ologit_na) are briefly described in Table 2. CD-MAR indicates covariate-dependent missing at random.

FIGURE.  Stage-specific net survival at 1 year. Circles and horizontal lines represent, respectively, the average and the range of the estimates across the 100 replications. The vertical lines correspond to the reference estimates obtained from the fully observed data. CD-MAR indicates covariate-dependent missing at random.

DISCUSSION

Specifying an imputation strategy for a flexible parametric model in the context of net survival is challenging. This is principally because the substantive model is defined on the log cumulative excess hazard scale, and it is not clear what is the most appropriate way to incorporate the survival time into the imputation process. Including the outcome variable in the imputation model may seem counterintuitive, but it has been shown to be essential for preserving the relationships among variables in the imputed datasets.3 In addition, there is an issue as to which model is the most appropriate for imputing a categorical variable such as stage. An ordinal logistic regression would represent a substantial computational simplification but is valid only under a proportional odds assumption.

We investigated these questions empirically, starting with a fully observed dataset extracted from population-based cancer registry data. Our results clearly demonstrated that an ordinal model for missing stage is not suitable for imputing data of this type, and that a multinomial model should be used. Moreover, we found that calculating the Nelson-Aalen estimate of the cumulative hazard and including this, together with the event indicator, in the imputation model gave much better results overall than using survival time, log survival time, and the event indicator as covariates.

Numerous simulations and empirical studies have been carried out to evaluate the performance of multiple imputation, especially as compared with complete records analysis.16-18 The latter is commonly reported in the literature as being valid only under MCAR. White and Carlin17 pointed out that when the missing values occur in the covariates complete records analysis may be valid even if the missingness process is not MCAR. In particular, they showed that the bias is negligible when the missing data mechanism is independent of the outcome given the covariates. Their findings apply to regression coefficients but not to statistics derived from them by marginalizing over the data, such as stage-specific net survival estimates. This explains our finding that complete records analysis estimates of net survival are severely biased, even when the missingness mechanism depends only on covariates and, conditionally on these, is independent of the survival outcome.

The simulations in this article had stage as the partially observed variable, which was made MCAR, covariate-dependent MAR, and MAR. We focused on stage because this is a strong predictor of cancer survival and typically contains a non-negligible amount of missing values; as a categorical variable it is also the most difficult to impute. The results we obtained can therefore be applied (eg, as part of a fully conditional specification imputation process) to the more common situation where there is a relatively small number of missing values in other covariates.

The best-performing multiple imputation strategy in our study used multinomial regression imputation for stage and included the Nelson-Aalen estimate of cumulative hazard, alongside the event indicator. In the different setting of Cox proportional hazards models, this strategy was recommended by White and Royston.12 Its theoretical justification in our excess hazards setting relies on the true distribution of stage given survival and the other covariates being well approximated by including the Nelson-Aalen estimate of the cumulative hazard and the event indicator in the imputation model. In our context, this relies on (1) assuming proportional total hazards and (2) an approximation that will tend to break down when we have the combination of a large coefficient estimate for a covariate that itself has high variability. Methodological work is ongoing to establish how extreme violations of (1) and (2) need to be to introduce substantial bias. However, we saw no sign of the approximation breaking down in the cancer registry examples we have considered.

In our study, we imputed stage using either an ordinal logistic or a multinomial model. Alternative specifications, such as continuation-ratio or stereotype logistic models, could also be considered, but they may not be readily available in standard statistical software. Nevertheless, our simulation results, based on real data, provide no evidence that this additional complexity is warranted.

The choice of selection mechanism was motivated by the fact that in the data we routinely analyze stage has a higher chance of being missing for patients who are too frail (eg, old and with very poor prognosis) to undergo thorough staging investigation. In this case, the probability of stage being missing is linked to survival, so that, in practice, missing data are at least MAR with dependence on the response in the substantive model. Although some of the scenarios in this article may seem extreme and in some studies the association between missingness and outcome may not be as strong, our principal aim here is to show that our preferred approach performs well even in extreme situations, so that we can be confident in its behavior in less drastic circumstances. Furthermore, our results highlight the practically important but hitherto unremarked point that estimates of net survival based on complete records may be severely biased even when the bias in the excess hazard ratio estimates is negligible.

After multiple imputation, it is important to evaluate the robustness of the inferential conclusions to changes of the imputation model and to departures from MAR. Appropriate sensitivity analysis along the lines proposed by Carpenter and Kenward3(Ch. 10) is recommended. For example, we could modify the parameter estimates obtained after each imputation to reflect our uncertainty about MAR and we could then investigate the impact of these changes on our inferences.

In this article, we have considered predictors whose effect was assumed to be constant over time. Work is in progress to explore how to impute appropriately for models with time-dependent effects.

In conclusion, studies estimating excess hazard ratios and net survival where the cancer registry data have a nontrivial proportion of missing observations should not use complete records analysis. When the missingness can be inferred from available data, multiple imputation should be used instead. In the context of flexible parametric proportional hazards models with a partially observed stage covariate, we recommend using a multinomial logistic imputation model for stage that includes the Nelson-Aalen cumulative hazard estimate and the event indicator.

TABLE 4.  Average Bias (Ave. Bias), Percentage Relative Bias (%Bias), Maximum Positive or Negative Bias (Max Bias), and rMSE over the 100 Replications for the Estimates of 1-Year Stage-specific Net Survival (%)

| Scenario | Method | Stage 1 [96.14]: Ave. Bias | Stage 1: %Bias | Stage 1: Max Bias | Stage 1: rMSE | Stage 2 [91.45]: Ave. Bias | Stage 2: %Bias | Stage 2: Max Bias | Stage 2: rMSE | Stage 3 [81.58]: Ave. Bias | Stage 3: %Bias | Stage 3: Max Bias | Stage 3: rMSE | Stage 4 [38.13]: Ave. Bias | Stage 4: %Bias | Stage 4: Max Bias | Stage 4: rMSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MCAR | CRA | -0.01 | 0.0 | 0.37 | 0.14 | -0.02 | 0.0 | -0.37 | 0.14 | -0.02 | 0.0 | 0.53 | 0.20 | -0.02 | -0.1 | 0.72 | 0.28 |
| MCAR | MI_mlogit_na | -0.22 | -0.2 | -0.55 | 0.26 | -0.13 | -0.1 | -0.42 | 0.18 | -0.04 | 0.0 | -0.42 | 0.17 | -0.02 | -0.1 | -0.63 | 0.20 |
| MCAR | MI_ologit_na | 0.60 | 0.6 | 0.89 | 0.61 | -0.55 | -0.6 | -0.78 | 0.56 | -3.20 | -3.9 | -3.54 | 3.20 | 4.08 | 10.7 | 4.43 | 4.08 |
| MCAR | MI_mlogit_surv | -0.21 | -0.2 | -0.56 | 0.26 | -0.16 | -0.2 | -0.47 | 0.20 | 0.12 | 0.1 | 0.56 | 0.20 | 0.38 | 1.0 | 0.76 | 0.43 |
| MCAR | MI_ologit_surv | 0.48 | 0.5 | 0.81 | 0.49 | -0.67 | -0.7 | -0.90 | 0.68 | -3.14 | -3.8 | -3.51 | 3.14 | 3.95 | 10.4 | 4.30 | 3.96 |
| CD-MAR | CRA | 0.72 | 0.7 | 1.00 | 0.72 | 1.41 | 1.5 | 1.72 | 1.41 | 2.48 | 3.0 | 3.03 | 2.49 | 4.11 | 10.8 | 4.73 | 4.12 |
| CD-MAR | MI_mlogit_na | 0.01 | 0.0 | -0.32 | 0.13 | -0.13 | -0.1 | -0.53 | 0.20 | -0.14 | -0.2 | -0.62 | 0.21 | 0.05 | 0.1 | 0.57 | 0.19 |
| CD-MAR | MI_ologit_na | 1.14 | 1.2 | 1.35 | 1.14 | -0.18 | -0.2 | -0.64 | 0.22 | -3.48 | -4.3 | -3.93 | 3.48 | 4.00 | 10.5 | 4.31 | 4.00 |
| CD-MAR | MI_mlogit_surv | -0.05 | 0.0 | -0.40 | 0.15 | -0.33 | -0.4 | -0.73 | 0.36 | 0.08 | 0.1 | 0.53 | 0.18 | 0.80 | 2.1 | 1.30 | 0.82 |
| CD-MAR | MI_ologit_surv | 1.09 | 1.1 | 1.29 | 1.09 | -0.17 | -0.2 | -0.54 | 0.21 | -3.18 | -3.9 | -3.64 | 3.19 | 3.24 | 8.5 | 3.66 | 3.24 |
| MAR | CRA | 3.44 | 3.6 | 3.48 | 3.44 | 7.24 | 7.9 | 7.29 | 7.24 | 14.18 | 17.4 | 14.32 | 14.18 | 37.97 | 99.6 | 38.79 | 37.97 |
| MAR | MI_mlogit_na | 0.18 | 0.2 | 0.69 | 0.23 | 0.10 | 0.1 | 0.53 | 0.20 | -0.60 | -0.7 | -1.26 | 0.66 | -0.86 | -2.2 | -1.22 | 0.87 |
| MAR | MI_ologit_na | 1.06 | 1.1 | 1.38 | 1.06 | -0.12 | -0.1 | -0.42 | 0.16 | -7.70 | -9.4 | -7.96 | 7.70 | 3.30 | 8.7 | 3.66 | 3.30 |
| MAR | MI_mlogit_surv | 1.35 | 1.4 | 2.21 | 1.40 | 0.99 | 1.1 | 2.17 | 1.10 | -0.85 | -1.0 | -2.10 | 0.99 | -2.67 | -7.0 | -3.62 | 2.70 |
| MAR | MI_ologit_surv | 2.63 | 2.7 | 2.88 | 2.63 | 3.63 | 4.0 | 4.00 | 3.64 | -0.20 | -0.2 | -1.11 | 0.39 | -6.15 | -16.1 | -6.68 | 6.16 |

The figures reported between square brackets in the column head of the table are the net survival (%) reference values obtained from the analysis of the original fully observed data. CD-MAR indicates covariate-dependent missing at random.

ACKNOWLEDGMENTS

We are grateful to the reviewers for their helpful and constructive comments.

REFERENCES

1. Little RJA, Rubin DB. Statistical Analysis with Missing Data. 2nd ed. New York, NY: John Wiley & Sons; 2002.
2. White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Stat Med. 2011;30:377–399.
3. Carpenter JR, Kenward M. Multiple Imputation and Its Application. Chichester: John Wiley & Sons; 2013.
4. Royston P, Parmar MK. Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. Stat Med. 2002;21:2175–2197.
5. Estève J, Benhamou E, Raymond L. Statistical methods in cancer research, volume IV: descriptive epidemiology. IARC Scientific Publications 128. Lyon: International Agency for Research on Cancer; 1994.
6. Sarfati D, Blakely T, Pearce N. Measuring cancer survival in populations: relative survival vs cancer-specific survival. Int J Epidemiol. 2010;39:598–610.
7. Perme MP, Stare J, Estève J. On estimation in relative survival. Biometrics. 2012;68:113–120.
8. Danieli C, Remontet L, Bossard N, Roche L, Belot A. Estimating net survival: the importance of allowing for informative censoring. Stat Med. 2012;31:775–786.
9. Nelson CP, Lambert PC, Squire IB, Jones DR. Flexible parametric models for relative survival, with application in coronary heart disease. Stat Med. 2007;26:5486–5498.
10. Harrell FE. Regression Modeling Strategies with Applications to Linear Models, Logistic Regression, and Survival Analysis. New York, NY: Springer; 2001.
11. Durrleman S, Simon R. Flexible regression models with cubic splines. Stat Med. 1989;8:551–561.
12. White IR, Royston P. Imputing missing covariate values for the Cox model. Stat Med. 2009;28:1982–1998.
13. Marshall A, Altman DG, Holder RL, Royston P. Combining estimates of interest in prognostic modelling studies after multiple imputation: current practice and guidelines. BMC Med Res Methodol. 2009;9:57.
14. Burton A, Altman DG, Royston P, Holder RL. The design of simulation studies in medical statistics. Stat Med. 2006;25:4279–4292.
15. StataCorp. Stata Statistical Software: Release 12. College Station, TX: StataCorp LP; 2011.
16. van der Heijden GJ, Donders AR, Stijnen T, Moons KG. Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J Clin Epidemiol. 2006;59:1102–1109.
17. White IR, Carlin JB. Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. Stat Med. 2010;29:2920–2931.
18. Marshall A, Altman DG, Royston P, Holder RL. Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study. BMC Med Res Methodol. 2010;10:7.
