Article

Imputing a randomly censored covariate in a linear regression model

Statistical Methods in Medical Research 0(0) 1–10 © The Author(s) 2015 Reprints and permissions: sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/0962280215586011 smm.sagepub.com

Emmanuel Sampene1 and Folefac D Atem2

Abstract

Researchers are often faced with the problem of randomly censored covariates. The simplest and most straightforward approaches for dealing with such data are to remove variables with censored observations or to delete cases. The former leads to model misspecification, while the latter leads to overestimation of the standard error due to limited power. In this paper, we propose two approaches, a single and a multiple imputation framework, for handling a randomly censored covariate. Both of these frameworks are based on the conditional expectation of a Kaplan-Meier estimator or Cox proportional hazard model. The multiple imputation approach is very similar to non-parametric bootstrapping, in that it does not involve simulated data. In the present study, we used simulations to compare the performance of these two approaches to that of a deletion approach. These methods are applied to study the association between amyloid protein in offspring and maternal onset of dementia.

Keywords

censored covariates, complete case, Cox proportional hazard, limit of detection, Kaplan-Meier estimator, multiple imputation, single imputation

1 Introduction

Researchers are commonly presented with the problem of censored covariates. Clayton1 was one of the first to assess censoring in bivariate models. He evaluated correlates of age of cardiac arrest in parents and their offspring, with a focus on confounding. In his study, Clayton assumed that the observed association between the offspring's age of cardiac arrest and the parent's age of cardiac arrest emerged because they shared a common ancestor. Therefore, if the parent's age of cardiac arrest predicted the offspring's age of cardiac arrest, then the experience of the offspring should similarly predict the parent's. He modeled this association using a bivariate model, which produced a valid test of the association between the dependent variable and the independent variable, both subject to censoring. In the present study, we will build upon Clayton's work by testing the association between the dependent variable and a random independent variable subject to censoring.

1 Food and Drug Administration, Center for Drug Evaluation and Research (CDER), Silver Spring, MD, USA
2 Department of Biostatistics, University of Texas Health Science Center at Houston, Houston, TX, USA

Corresponding author: Folefac D. Atem, University of Texas, 1200 Pressler Street, Houston, TX P.O. Box 2 United States. Email: [email protected]

Downloaded from smm.sagepub.com at CAMBRIDGE UNIV LIBRARY on September 1, 2015

XML Template (2015) [14.5.2015–12:25pm] [1–10] //blrnas3.glyph.com/cenpro/ApplicationFiles/Journals/SAGE/3B2/SMMJ/Vol00000/150030/APPFile/SG-SMMJ150030.3d (SMM) [PREPRINTER stage]


Prior to 1992, the standard practice in most statistical packages was to discard observations with censored or missing data at the case level.2 This is known as the complete case or deletion approach,2 and it is still the most commonly used method for handling censored cases in epidemiological research. The approach is simple and straightforward to implement. Valid inferences are produced when censoring is independent of the dependent variable, and in some cases of non-ignorable censoring.2 However, complete deletion of observations with any censored or missing data can waste information and may reduce power when the number of predictors with missing information is large. Furthermore, inferences from analyses using deletion approaches apply only to cases with data for all relevant variables, so results may not apply to the entire population of interest. Hence, although the complete case approach may be used as a baseline method for comparison, it is not recommended for inference.2

The substitution methods involve replacing censored values by a multiple of the observations (i.e. multiplying the time of censoring by a constant such as $\sqrt{2}$, 2, etc.) or assuming that the censored values are true events. However, many authors have concluded that such approaches are statistically inappropriate for data with a censored covariate.3 For example, Helsel concluded that this approach leads to highly biased estimates and has no theoretical basis.4 Nevertheless, some researchers still use it because of its ease and simplicity. Other researchers, such as Richardson and Ciampi,5 Schisterman and colleagues,6 Wang and Feng,7 Nie et al.,8 Rigobon and Stoker9,10 and May et al.,11 have developed methods shown to work with type I censoring, fixed censoring or limit of detection. In this paper, however, we investigate random censoring.

We developed an improved single imputation approach, which improves upon the Richardson and Ciampi approach5 in that it is non-parametric. The approach of Richardson and Ciampi5 was implemented for type I censoring, fixed censoring or limit of detection, and assumed normality for censored observations. Randomly censored data are often heavily skewed, so assuming normality may not be appropriate. Furthermore, as described by Lynn,3 the approach by Richardson and Ciampi5 may underestimate the variance of the regression coefficients. This is because it is computationally similar to an expectation maximization (EM) algorithm, but without the second expectation, that is, the second moment of the predictor conditional on the predictor being greater than the time of censoring, $E(X^2 \mid X > C)$. This variance underestimation is corrected in our multiple imputation approach, in which the variance is calculated as the sum of the within- and between-imputation variances. This multiple imputation approach differs from that of Rubin12 in that it does not involve simulations designed to handle missing data. These approaches should be appropriate for use in cases of a right- or left-censored covariate, but in this study we limit ourselves to a right-censored covariate for consistency.

The organization of this paper is as follows. Section 2 includes notation definitions and a description of the most commonly used approaches for handling censored data. In Section 3, our new approaches are presented, and Section 4 includes a simulation and data analysis example of our method. Conclusions are provided in Section 5.

2 Notation and model specification

Let us consider the linear model:

$$Y = \theta_0 + \theta_1 X + \theta_2 Z + \varepsilon, \qquad (1)$$

where the covariate of interest X is right censored by C. Different censoring scenarios can be specified by the censoring variable C. The censoring is fixed censoring or type I censoring


(limit of detection) if the censoring variable C is a constant $c_k$ for all observations, i.e. $C = c_k$. Otherwise, if $C \neq c_k$, the censoring is said to be random. Most studies have focused on the former, but in this paper we focus on the latter, which is more general. Other covariates not subject to right censoring are denoted by Z. Without loss of generality, we consider Z to be a $1 \times 1$ vector here. The response variable is Y, and the random error is $\varepsilon$. The regression coefficients $\theta_0$, $\theta_1$ and $\theta_2$ are the intercept, the effect of X and the effect of Z, respectively. We assume that $\varepsilon$ is not correlated with X, C and Z and, for simplicity, that it follows a normal distribution with mean 0 and variance $\sigma^2$. Also, we assume that C is either independent of Z or dependent on Z, and that $0 < P(X > C) < 1$. The sample we observe is $\{Y_i, T_i, Z_i, D_i\}$, $i = 1, \ldots, n$, where n is the number of subjects in the sample and $T_i = \min(X_i, C_i)$ is the observed value of X for the ith subject. The censoring indicator is $D_i = 0$ if $X_i \le C_i$ and $D_i = 1$ if $X_i > C_i$.

Our primary goal is to estimate $\theta_1$. Little and Rubin2 provide different scenarios in which the complete case approach produces unbiased estimates for $\theta_1$. Tu and Greenwood13 looked at partially observed covariates and came to similar conclusions as Little and Rubin.2 The underlying conclusion is that once the information missing on X is independent of the response, the complete case approach will produce unbiased results. The proof is as follows. Consider the regression model (1). If the information missing is independent of Y, then $f_D(D \mid Y, X, Z) = f_D(D \mid X, Z)$. Therefore

$$
\begin{aligned}
E(Y \mid X, Z, D = 0)
&= \frac{\int y\, f_Y(y \mid x, z)\, f_D(0 \mid x, z, y)\, dy}{\int f_Y(y \mid x, z)\, f_D(0 \mid x, z, y)\, dy} \\
&= \frac{\int y\, f_Y(y \mid x, z)\, f_D(0 \mid x, z)\, dy}{\int f_Y(y \mid x, z)\, f_D(0 \mid x, z)\, dy} \\
&= \frac{\int y\, f_Y(y \mid x, z)\, dy}{\int f_Y(y \mid x, z)\, dy}.
\end{aligned}
$$

Hence

$$E(Y \mid X, Z, D = 0) = E(Y \mid X, Z) = \theta_0 + \theta_1 X + \theta_2 Z.$$

This simple example shows that, when the missingness in the independent variable of interest is not associated with the dependent variable, or when the two members of a pair have no common influence, the complete case approach produces unbiased estimates. Nonetheless, the complete case approach does not always produce unbiased estimates. It produces biased estimates when the missingness in the covariate subject to censoring is non-ignorable.

The substitution approach simply replaces censored values by a constant multiple. For example, we multiply the censored values by a constant such as 1, $\sqrt{2}$, 2, etc. This model is:

$$Y_j = \theta_0 + \theta_1 X_j^{1 - D_j} S_j^{D_j} + \varepsilon,$$

where S is either C, $\sqrt{2}C$ or 2C, etc. However, this approach has no mathematical basis for the substitution.
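To make the observed-data notation and the substitution rule concrete, the following is a minimal sketch in Python. The function names `observe` and `substitute` and the example values are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math


def observe(x, c):
    """Return (t, d) with t = min(x, c) and d = 1 if x > c (censored), else 0."""
    return (min(x, c), 1 if x > c else 0)


def substitute(t, d, multiplier=math.sqrt(2)):
    """Substitution approach: keep observed values, scale censored ones."""
    return multiplier * t if d == 1 else t


# Hypothetical (x, c) pairs; in real data x is unknown whenever x > c.
pairs = [(1.0, 3.0), (5.0, 2.0), (4.0, 4.0)]
observed = [observe(x, c) for x, c in pairs]
filled = [substitute(t, d, multiplier=2.0) for t, d in observed]
```

In this toy example only the second subject is censored, so only its censoring time is multiplied by the chosen constant. The choice of constant is arbitrary, which is the weakness of the substitution approach noted above.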

3 Proposed approaches We propose two new nonparametric approaches that apply to the setting of random censoring. The ﬁrst approach is based on a Kaplan-Meier or Cox model estimator of the distribution of the censored covariate and modiﬁes that of Richardson and Ciampi,5 which assumes a parametric distribution. The second approach is a fully valid multiple imputation, again based on the Kaplan-Meier or Cox model estimator for the censored covariate.


3.1 Single imputation

Our first approach is a modification of the method for handling missing data briefly described by Little12 and proposed for handling a covariate subject to a limit of detection by Richardson and Ciampi.5 We provide a non-parametric version of the approach proposed by Richardson and Ciampi.5 In the case of simple linear regression with no additional covariates Z, for a right-censored covariate, we impute the conditional expectation and approximate it using the trapezoidal rule:

$$
\begin{aligned}
E(X_j \mid X_j > C_j)
&= \left[\int_{c_j}^{\infty} f(t)\, dt\right]^{-1} \int_{C_j}^{\tau} S(u)\, du + C_j
 = S^{-1}(C_j) \int_{C_j}^{\tau} S(u)\, du + C_j \\
&\approx \frac{\sum_{i=1}^{n} I\{T_{(i)} > C_j\}\,[S\{T_{(i)}\} + S\{T_{(i+1)}\}]\,\{T_{(i+1)} - T_{(i)}\}}{2 S(C_j)} + C_j
\end{aligned}
$$

where $S(\cdot)$ is the survival function of X, $\tau$ is the upper limit of the support of X and the ordered observed values of the covariate are $T_{(1)} < T_{(2)} < T_{(3)} < \ldots < T_{(n)}$ (i.e. $T = \min(X, C)$). We estimate $S(\cdot)$ with the Kaplan-Meier estimator of X, $\hat{S}(\cdot)$, and linearly interpolate $\hat{S}$ to approximate the values of S at censored observations, so that the survival estimates for censored T are read from the interpolated survival curve. For improved approximation of the integral of the Kaplan-Meier estimate, we treat the largest observation as an event even if it is censored.14

For the case of additional covariates Z and the censored covariate X, we use a Cox model-based estimator of the adjusted survivor distribution in our approximation of the conditional expectation, $E(X \mid X > C, Z)$, for imputation (see the appendix for SAS code). Specifically, we assume that $h(x \mid z) = h_0(x)\exp(\beta z)$, where $h(x \mid z)$ is the hazard function for X given Z = z evaluated at x, $h_0(x)$ is the baseline hazard function for X at Z = 0 and $S_0(\cdot)$ is the baseline survivor function for X, and approximate $E(X_j \mid X_j > C_j, Z_j)$ as

$$
\frac{\sum_{i=1}^{n} I\{T_{(i)} > C_j\}\,\left[S_0\{T_{(i)}\}^{\exp(\beta Z_j)} + S_0\{T_{(i+1)}\}^{\exp(\beta Z_j)}\right]\{T_{(i+1)} - T_{(i)}\}}{2\, S_0(C_j)^{\exp(\beta Z_j)}} + C_j
$$

In practice, we estimate the baseline survivor function using the method of Breslow.15 An alternative approach is based on a Cox model that conditions on Y, as well as X. This is advisable if there is high dependency between Y and X given the observed portion of X.12 This approach requires no dependency between X and C conditional on X.
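The single imputation of this section can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation (their SAS code is in the appendix). The function names are hypothetical, observed times are assumed to be strictly positive, the survival curve is interpolated linearly through the point (0, 1) and the Kaplan-Meier values at the event times, and the trapezoids start at the censoring time itself. The optional argument `risk` stands for $\exp(\beta Z_j)$ and is applied to a user-supplied baseline survivor curve; estimating that baseline by Breslow's method is not shown.

```python
def kaplan_meier(times, censored):
    """Kaplan-Meier estimate of S(t) = P(X > t).

    times: observed T_i = min(X_i, C_i); censored: D_i (1 = censored).
    Returns the distinct event times and S just after each of them.
    The largest observation is treated as an event, as in the text.
    """
    data = sorted(zip(times, censored))
    t_max = data[-1][0]
    knots, surv = [], []
    s, at_risk, i = 1.0, len(data), 0
    while i < len(data):
        t = data[i][0]
        events = removed = 0
        while i < len(data) and data[i][0] == t:
            if data[i][1] == 0 or t == t_max:
                events += 1
            removed += 1
            i += 1
        if events:
            s *= 1.0 - events / at_risk
            knots.append(t)
            surv.append(s)
        at_risk -= removed
    return knots, surv


def interp_survival(u, knots, surv):
    """Linear interpolation of S through (0, 1) and the (knot, surv) points."""
    xs = [0.0] + knots
    ys = [1.0] + surv
    if u >= xs[-1]:
        return ys[-1]
    for k in range(1, len(xs)):
        if u <= xs[k]:
            w = (u - xs[k - 1]) / (xs[k] - xs[k - 1])
            return ys[k - 1] + w * (ys[k] - ys[k - 1])


def impute_conditional_mean(c, times, knots, surv, risk=1.0):
    """Approximate E(X | X > c) = c + int_c^tau S(u) du / S(c) by trapezoids.

    risk = exp(beta * z) raises the curve to a power, which gives the
    Cox-type version when (knots, surv) hold a baseline survivor function.
    """
    grid = [c] + sorted({t for t in times if t > c})
    s = [interp_survival(u, knots, surv) ** risk for u in grid]
    if s[0] <= 0.0 or len(grid) == 1:
        return c  # nothing observed beyond c, so c itself is returned
    area = sum((s[i] + s[i + 1]) * (grid[i + 1] - grid[i]) / 2.0
               for i in range(len(grid) - 1))
    return c + area / s[0]


# Toy data: the second subject is censored at time 2.
times = [1.0, 2.0, 3.0, 4.0]
censored = [0, 1, 0, 0]
knots, surv = kaplan_meier(times, censored)
imputed = impute_conditional_mean(2.0, times, knots, surv)
```

With `risk=1.0` the function corresponds to the Kaplan-Meier case. Because the curve is interpolated linearly, the trapezoidal sum is exact for the interpolated curve in that case and only approximate once the curve is raised to a power.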

3.2 Multiple imputation

The single imputation approach tends to underestimate the variance,3 as it does not account for $E(X_j^2 \mid X_j > C_j, Z_j)$. We consider a multiple imputation that preserves the association between the censored and uncensored observations by conditioning on the uncensored variables. In order not to underestimate the variance, our multiple imputation considers both the within-imputation variance and the between-imputation variance. Multiple imputation was originally developed for the purpose of accounting for variability in imputed estimates of missing data.16 We propose a multiple imputation approach based on an imputer model and an analyst model. Our imputer model is based on a Kaplan-Meier or Cox


proportional hazard model, and the analyst model is a simple regression model. The only problem of inconsistency arises when the imputer model makes more assumptions than the analyst model.17 This process results in valid statistical inferences that properly reflect the uncertainty due to censored values. Our multiple imputation algorithm proceeds in the following steps:

(1) Sample with replacement from the original data.
(2) Sort the sample data by $X_j$.
(3) Similar to single imputation, impute the right-censored covariate with the conditional expectation $E(X_j \mid X_j > C_j)$ or $E(X_j \mid X_j > C_j, Z_j)$.
(4) Fit a linear regression model of Y on the completed data (X, Z) and estimate the parameters of the model, $(\hat{\theta}_0^m, \hat{\theta}_1^m, \hat{\theta}_2^m)$, where the superscript m labels the estimates from the mth imputation.
(5) Repeat Steps 1-4 M times.
(6) Obtain multiple imputation estimates and variances (sum of within-imputation variance and between-imputation variance), for example, $\hat{\theta}_1 = \sum_{m=1}^{M} \hat{\theta}_1^m / M$ and

$$
\mathrm{Var}(\hat{\theta}_1) = \sum_{m=1}^{M} \mathrm{Var}(\hat{\theta}_1^m)/M + (1 + 1/M) \sum_{m=1}^{M} (\hat{\theta}_1^m - \hat{\theta}_1)^2/(M - 1)
$$

where $\mathrm{Var}(\hat{\theta}_1^m)$ is the model-based variance from fitting the linear regression to the mth imputed data set.18 This approach differs from those implemented in SAS and R, where multiple imputation approaches are based on missing data. In this case we have available information that will be lost if deleted or treated as missing data.
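The resampling loop and the combining rule in Step 6 can be sketched in Python as follows. The names `pool`, `multiple_imputation` and `impute_and_fit` are hypothetical. The imputation and regression of Steps 2 to 4 are left to a caller-supplied function that returns an estimate and its model-based variance, so this sketch covers only Steps 1, 5 and 6.

```python
import random
from statistics import mean, variance


def pool(estimates, variances):
    """Combine M estimates: average within-imputation variance plus
    (1 + 1/M) times the between-imputation variance."""
    m = len(estimates)
    within = mean(variances)
    between = variance(estimates)  # sample variance, divisor M - 1
    return mean(estimates), within + (1.0 + 1.0 / m) * between


def multiple_imputation(rows, impute_and_fit, m=20, seed=0):
    """Resample rows with replacement M times, let the caller-supplied
    impute_and_fit(sample) return (estimate, variance), then pool."""
    rng = random.Random(seed)
    results = []
    for _ in range(m):
        sample = rng.choices(rows, k=len(rows))
        results.append(impute_and_fit(sample))
    estimates, variances = zip(*results)
    return pool(estimates, variances)


# Worked check of the combining rule with three hypothetical estimates.
estimate, total_variance = pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

For the three hypothetical estimates, the within-imputation part is 0.5 and the between-imputation part is 1, so the total variance is 0.5 + (1 + 1/3) × 1.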

4 Simulations

We conducted simulations to evaluate and compare the performance of the standard analysis when there is no censoring, complete case analysis, single imputation and multiple imputation. We simulated data for two sample sizes, n = 150 and n = 250, resulting in approximately 60% and 82% power, respectively, for cases with no censoring. It is worth noting that, with censored observations, power is most often defined in terms of event rate. In this case we considered no censoring to represent a 100% event rate. We assumed the true linear regression model to be given by model (1), with $(\theta_0, \theta_1, \theta_2) = (1, 0.5, 0.25)$. We generated $X \sim \mathrm{Weibull}(3/4, 1/4)$, $C \sim \mathrm{Weibull}(1, q)$, $\varepsilon \sim N(0, 1)$ and $Z \sim N(6, 0.25)$, with q = 2 to obtain light censoring (20%) and q = 0.35 to obtain heavy censoring (40%). Unlike the limit of detection case, where both X and C are simulated from a normal distribution, we simulated both X and C from a Weibull distribution, since randomly censored data are most often heavily skewed. We set M, the number of imputations, to 20. We can still obtain a good estimate of $\theta_1$ using M < 20, since multiple imputation relies on different imputed values to handle censored data, and the rules for combining the M completed data sets account for Monte Carlo error.

From Tables 1 and 2, we see that under light censoring, all three methods (complete case, single imputation and multiple imputation) produce similar bias, but there is a drop in power in the complete case approach compared with the single imputation and multiple imputation. As censoring increases, more subjects are deleted in the complete case approach. This leads to an increase in MSE and a considerable reduction in power compared with the single imputation and multiple imputation approaches. Nonetheless, the MSE for


Table 1. Estimating $\theta_1$: light independent censoring, i.e. censoring is independent of Y; 1000 replicates.

Method               Bias    SE      MSE     Power  Coverage probability
n = 150
No censoring         0.0117  0.2229  0.0498  0.60   0.97
Complete case        0.0129  0.3117  0.0976  0.37   0.97
Single imputation    0.0119  0.2339  0.0549  0.48   0.97
Multiple imputation  0.0118  0.2710  0.0736  0.47   0.96
n = 250
No censoring         0.0026  0.1687  2.9E-4  0.8    0.97
Complete case        0.0042  0.2338  0.0547  0.8    0.97
Single imputation    0.0028  0.1757  0.0309  0.72   0.97
Multiple imputation  0.0028  0.1994  0.0398  0.71   0.96

Table 2. Estimating $\theta_1$: heavy independent censoring, i.e. censoring is independent of Y; 1000 replicates.

Method               Bias    SE      MSE     Power  Coverage probability
n = 150
No censoring         0.0117  0.2229  0.0498  0.60   0.97
Complete case        0.0403  0.7757  0.6033  0.11   0.95
Single imputation    0.0414  0.3852  0.1501  0.29   0.96
Multiple imputation  0.0132  0.3957  0.1568  0.28   0.96
n = 250
No censoring         0.0026  0.1687  2.9E-4  0.82   0.97
Complete case        0.0395  0.5782  0.3359  0.17   0.96
Single imputation    0.0410  0.2791  0.0796  0.46   0.96
Multiple imputation  0.0040  0.2795  0.0781  0.46   0.95

the single imputation is slightly lower than the MSE for multiple imputation. Although both approaches lead to unbiased estimates, the single imputation slightly underestimates the variance, since the imputed value reflects neither the sampling variability about the actual value nor the additional uncertainty in fitting the model.
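The data-generating step of the simulation described in this section might be sketched in Python as follows. Two points are assumptions, because the text does not state them: Weibull(a, b) is read as shape a and scale b, and N(6, 0.25) is read as variance 0.25 (standard deviation 0.5). The function name `simulate` and the row layout are hypothetical, and the censoring proportion produced depends on the parameterization assumed.

```python
import random


def simulate(n, q, seed=0, theta=(1.0, 0.5, 0.25)):
    """Generate one data set {Y_i, T_i, Z_i, D_i} under model (1).

    Assumptions (not stated in the text): Weibull(a, b) means shape a and
    scale b, and N(6, 0.25) has variance 0.25 (standard deviation 0.5).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # random.weibullvariate takes (scale, shape) in that order.
        x = rng.weibullvariate(0.25, 0.75)
        c = rng.weibullvariate(q, 1.0)
        z = rng.gauss(6.0, 0.5)
        y = theta[0] + theta[1] * x + theta[2] * z + rng.gauss(0.0, 1.0)
        rows.append({"y": y, "t": min(x, c), "z": z, "d": 1 if x > c else 0})
    return rows


data = simulate(n=150, q=2.0)
censored_share = sum(r["d"] for r in data) / len(data)
```

Each row holds only what would be observed in practice: the response, the observed time T, the fully observed covariate Z and the censoring indicator D. Repeating the call with different seeds would give the replicates over which bias, MSE, power and coverage are averaged.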

5 Case study: association between beta-amyloid in offspring and maternal history of dementia

The data for this study were collected at Massachusetts General Hospital and Brigham and Women's Hospital to study the association between beta-amyloid deposition, measured through in vivo imaging with Pittsburgh Compound B (PiB), and maternal age of onset of dementia.19–21 More than 300 participants enrolled in this study, but after data cleaning the sample was reduced to 141 participants. Using complete case analysis, single imputation and multiple imputation, we fit a linear regression model to the continuous outcome measure of beta-amyloid as a function of maternal age of onset of dementia. We adjusted for offspring age, gender, clinical dementia rating (CDR) and education. We dichotomized all our confounding variables as follows: age at 81 years (less than or equal to 81 versus older), CDR as zero or half, gender as male or female, and the


Table 3. Application to study of beta-amyloid and maternal age of onset of dementia: estimation of $\theta_1$.

Method               Estimate  SE      p Value  % Deleted
Complete case        0.0099    0.0049  0.0514   70.2%
Single imputation    0.0079    0.0018  0.0001
Multiple imputation  0.0102    0.0021  0.0022

education variable as less than or greater than 16 years of schooling. Table 3 shows the results from the analysis. These results are consistent with those observed in the simulations. There is a large drop in power in the complete case approach, which overestimates the standard error because of the deletion of observations. The single imputation has the smallest standard error, as expected, and is therefore the most powerful. However, this approach slightly underestimates the variance.

6 Conclusion

We have provided a novel approach for dealing with randomly censored covariates. We are aware of other parametric approaches to evaluate covariates subjected to type I censoring,3,4 but decided to limit ourselves to non-parametric approaches. We compared our approaches to the complete case approach. Our multiple imputation approach corrects for the reduction in variance of the single imputation by introducing between-imputation variance estimation. As a result, it bears a close resemblance to the Expectation-Maximization (EM) algorithm and other computational methods for computing maximum-likelihood estimates based on the observed data alone. May et al.11 describe an approach based on maximum likelihood methods for covariate data subject to limit of detection. This method summarizes a likelihood function that has been averaged over a predictive distribution for the covariate subject to limit of detection. Our approach performs this same type of averaging by Monte Carlo rather than numerical methods. Unlike other multiple imputation approaches for missing data, this approach does not invent or create data. It is a model-based imputation approach. Differences in imputed values are based on slight differences in Kaplan-Meier estimates resulting from sample variability. Our approach of multiple imputation, based on a Kaplan-Meier or Cox model, uses information similar to EM or other well-accepted likelihood-based methods. This averages over a predictive distribution for C by numerical techniques rather than by Monte Carlo averaging.

Acknowledgement

The data for this study were made available to the last author while on a T32NS048005 training grant. Views expressed in this paper do not necessarily represent the official positions of the US FDA.

References

1. Clayton DG. A model for association in bivariate life tables and its application in epidemiology studies of familial tendency in chronic disease incidence. Biometrika 1978; 65: 141–151.
2. Little RJA and Rubin DB. Statistical analysis with missing data: Wiley Series in Probability and Statistics, 2002.
3. Lynn HS. Maximum likelihood inference for left-censored HIV RNA data. Stat Med 2001; 20: 33–45.
4. Helsel DR. Nondetects and data analysis. Statistics for censored environmental data. Hoboken, NJ: Wiley-Interscience, 2005.
5. Richardson DB and Ciampi A. Effects of exposure measurement error when an exposure variable is constrained by a lower limit. Am J Epidemiol 2003; 157: 355–363.


6. Schisterman EF, Vexler A, Whitcomb BW, et al. The limitations due to exposure detection limits for regression models. Am J Epidemiol 2006; 163: 374–383.
7. Wang H and Feng X. Multiple imputation for M-regression with censored covariates. J Am Stat Assoc 2010; 107: 194–204.
8. Nie L, Chu H, Liu C, et al. Linear regression with an independent variable subject to a detection limit. Epidemiology 2010; 21: S17–S24.
9. Rigobon R and Stoker TM. Bias from censored regressors. J Business Econ Stat 2009; 27(3): 340–353.
10. Rigobon R and Stoker TM. Estimation with censored regressors: basic issues. Int Econ Rev 2007; 48: 1441–1467.
11. May RC, Ibrahim JG and Chu H. Maximum likelihood estimation in generalized linear models with multiple covariates subject to detection limits. Stat Med 2011; 30: 2551–2561.
12. Little RJA. Regression with missing X's: a review. J Am Stat Assoc 1992; 87(420): 1227–1237.
13. Tu YK and Greenwood DC. Modern methods for epidemiology. Springer Science and Business Media, 2012.
14. Datta S. Estimating the mean life time using right censored data. Stat Methodol 2005; 2: 65–69.
15. Breslow NE. Contribution to the discussion of the paper by DR Cox. J Royal Stat Soc, Series B 1972; 34(2): 216–217.
16. Rubin DB. Multiple imputation for nonresponse in surveys. New York: John Wiley, 1987.
17. Schafer JL. Analysis of incomplete multivariate data. CRC Press, 1997.
18. Schafer JL. Multiple imputation: a primer. Stat Methods Med Res 1999; 8(1): 3–15.
19. Atem F, Qian J, Maye JE, et al. Linear regression with a randomly censored covariate: application to an Alzheimer's Study. Technical report.
20. Atem F, Qian J, Maye JE, et al. Multiple imputation of a randomly censored covariate improves logistic regression analysis. Technical report.
21. Maye JE, Gidicsin C, Pepin LE, et al. Maternal dementia age of onset in relation to amyloid burden in nondemented offspring. In: 6th Annual human amyloid imaging meeting, Miami, January, 2012.

Appendix 1

SAS Code

/* Read the Kaplan-Meier output and compute, for each ordered time, */
/* the drop in survival (SS) and the width of the interval (TT).    */
proc iml;
  use kaplan;
  read all var {time survival} into DM;
  close;
  survival = DM[,2];
  time = DM[,1];
  n = nrow(DM);
  segment = J(nrow(DM),1,1);
  SS = J(nrow(survival),1,1);
  do i = 1 to n;
    if i = n then SS[i] = 0;
    else if i < n then SS[i] = survival[i]-survival[i+1];
  end;
  TT = J(nrow(time),1,1);
  do i = 1 to n;
    if i = n then TT[i] = 0;
    else if i < n then TT[i] = time[i+1]-time[i];
  end;
  create kaplann var {time survival SS TT};
  append;
quit;


/* Area contribution of each interval: survival*TT plus half of SS*TT. */
data kaplan_sum; set kaplann; sime = survival*TT+(SS*TT)/2; drop time; run;

data one; set MItest; set kaplan_sum; drop SS TT C1 C2 obs; run;
data one; set one; ord = _N_; run;

/* Replace n with the number of observations before running. */
%let num = n;

/* Transpose the areas and the censoring indicators into one wide row each. */
proc transpose data = one out = s1; ID ord; var Sime; run;
data s1; set s1; rename _1-_&num = s1-s&num; run;
proc transpose data = one out = s2; ID ord; var censored; run;
data s2; set s2; rename _1-_&num = c1-c&num; run;
data ttwo; merge s1 s2; run;

/* For a censored observation i, add up the areas of all later intervals. */
data cal1; set ttwo;
  array s(&num) s1-s&num;
  array c(&num) c1-c&num;
  array v(&num) v1-v&num;
  do i = 1 to &num-1;
    v[i] = 0;
    if c[i] = 0 then v[i] = s[i];
    if c[i] = 1 then do j = i+1 to &num;
      v[i] = s[j]+v[i];
    end;
  end;
  v[&num] = s[&num];
  keep v1-v&num;
run;

/* Back to one row per observation. */
data te; set cal1;
  array v(&num) v1-v&num;
  do i = 1 to &num;
    simes = v[i]; output;
  end;
  keep simes;
run;
data te; set te; ord = _N_; run;


data final; merge one te; by ord; drop ord; run;

/* Imputed value k: summed area divided by survival, plus the censoring time. */
data KMCompp; set final;
  if censored = 1 then k = (simes)/(survival) + time;
  else if censored = 0 then k = time;
run;
