Accepted 27 February 2014

Published online 22 March 2014 in Wiley Online Library

(wileyonlinelibrary.com) DOI: 10.1002/sim.6152

Calibration plots for risk prediction models in the presence of competing risks

Thomas A. Gerds,a*† Per K. Andersena and Michael W. Kattanb

A predicted risk of 17% can be called reliable if it can be expected that the event will occur to about 17 of 100 patients who all received a predicted risk of 17%. Statistical models can predict the absolute risk of an event, such as cardiovascular death, in the presence of competing risks, such as death due to other causes. For personalized medicine and patient counseling, it is necessary to check that the model is calibrated in the sense that it provides reliable predictions for all subjects. There are three often encountered practical problems when the aim is to display or test whether a risk prediction model is well calibrated. The first is lack of independent validation data, the second is right censoring, and the third is that, when the risk scale is continuous, the estimation problem is as difficult as density estimation. To deal with all three problems, we propose to estimate calibration curves for competing risks models based on jackknife pseudo-values that are combined with a nearest neighborhood smoother and a cross-validation approach. Copyright © 2014 John Wiley & Sons, Ltd.

Keywords: calibration plots; competing risks; kernel smoothing; pseudo-values; risk models

1. Introduction

In order to be useful for personalized medicine and patient counseling, a statistical model that predicts the absolute risk of an event should be calibrated. A calibration plot displays how well observed and predicted event status agree on the absolute probability scale. In this article, we show how calibration plots can be obtained and summarized in survival analysis with competing risks and right-censored data.

To define calibration, first pick a time origin at which it is of interest to predict the future status of a patient. Until time t after the time origin, three things can happen:

1. The event has occurred.
2. A competing event has occurred.
3. The patient is alive and event free.

A risk prediction model uses patient characteristics and other measurements that are available at the time origin to predict the probability that the event will occur within the next t years. The model is said to be calibrated if the expected percentage of the subpopulation who receive a predicted risk of p is p, for all predicted probabilities p. The calibration plot, also known as the reliability diagram, is a method to visually inspect calibration; it shows predicted against expected probabilities. If some patients are lost to follow-up before time t, their event time is right censored, and what happened between the time they were lost to follow-up and time t remains unknown. To consistently estimate the expected probabilities, and hence the calibration plot, one has to deal with the right-censored data. Thus, the status of cases that are lost to follow-up before the prediction horizon needs to be imputed, to account for the possibility that they would indeed have had the event if followed up to the prediction horizon. Our

a Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark
b Department of Quantitative Health Sciences, Cleveland Clinic, Cleveland, OH, U.S.A.
*Correspondence to: Thomas A. Gerds, Department of Biostatistics, University of Copenhagen, Denmark.
†E-mail: [email protected]

Statist. Med. 2014, 33 3191–3203

approach for this imputation is to replace the status indicator at time t with a jackknife pseudo-value [1]. Importantly, in order for this to work, we have to impute the status for all patients, not just for those censored before time t.

2. Calibration curves based on pseudo-values

For patient i, let $T_i$ be the time from the time origin until one of $m$ mutually exclusive events occurs (competing risks) and $D_i$ the type of the event. In all that follows, $D_i = 1$ is the event of interest, and the remaining $m - 1$ events are competing risks. Observed are the right-censored time $\tilde T_i = \min(T_i, C_i)$ and the right-censored event type $\tilde D_i = \Delta_i D_i$, where $\Delta_i = 1$ if $T_i \le C_i$ and $\Delta_i = 0$ otherwise. Here, $C_i$ denotes the time of censoring. For any future time point t, a risk prediction model r responds to patient-specific covariates $Z_i \in \mathbb{R}^d$ with a probability of the event: $r(t \mid Z_i)$. It may happen that the model predicts the same probability for two or more different patients. However, if at least one of the covariates is continuous, then typically the predicted risks for patients with different covariate values are different.

To define the calibration curve formally, we introduce for patient i the event status indicator variable $Y_i(t) = 1\{T_i \le t, D_i = 1\}$, that is, $Y_i(t) = 1$ if event 1 occurred to subject i by time t, and $Y_i(t) = 0$ otherwise. For each probability $p \in [0, 1]$, we also define $G_r(t; p) = \{z \in \mathbb{R}^d : r(t \mid z) = p\}$, the set of patient characteristics for which the model predicts that the event occurs with probability p by time t. The calibration curve of the model at time t is defined as the graph of the mapping

$$p \mapsto C(p, t, r) = E_{Y,Z}\{Y(t) \mid Z \in G_r(t; p)\} = E_{Y,Z}\{Y(t) \mid r(t \mid Z) = p\}. \qquad (1)$$

To obtain the graph, we need to estimate the expectation on the right-hand side of (1). Three often encountered practical problems arise:

Continuity: the size of the sets $G_r(t; p)$ may be small, and it may happen that a set includes only a single patient.
Right censoring: if patient i is not followed up until time t, the status $Y_i(t)$ is unknown.
Generalizability: we would like to know if the model will be reliable for new patients, not just in the data set that was used to specify and estimate the model.

For the remainder of this section, we hide the generalizability issue by assuming an independent external validation data set, which includes right-censored event times and patient characteristics from n patients. In the next section, we also discuss internal validation, where the same data set is used for fitting the model and for estimating the calibration curve.

To overcome the right-censoring problem, we first construct a pseudo-value for all patients. For patient i, the pseudo-value is given by

$$\tilde Y_i(t) = n \hat F_n(t) - (n - 1) \hat F_n^{(-i)}(t).$$

Here, $\hat F_n$ is the Aalen–Johansen estimate [2] of the cumulative incidence function for event 1, $F(t) = E(Y_i(t))$, based on all patients, and $\hat F_n^{(-i)}$ is based on the subset where the data of the ith patient are removed. See [3] for a comprehensive introduction to the pseudo-value approach for censored survival analysis in medical statistics.

Suppose first that the model is discrete in the sense that it assigns all patients to one of L different risk groups, $G_r(t; p_1), \ldots, G_r(t; p_L)$. Risk group models occur naturally when all predictor variables are class variables and can also be obtained by discretizing the continuous predictor variables [4]. This makes most sense when there are different medical or healthcare actions corresponding to different risk groups. The calibration curve of a risk group model consists only of L points, which can be estimated by averaging the pseudo-values of the patients whose predicted probability is in this group:

$$\hat C(p_l, t, r) = \frac{1}{n_l} \sum_{i=1}^{n} \tilde Y_i(t)\, I\{Z_i \in G_r(t; p_l)\}, \qquad (2)$$

where $n_l$ denotes the number of patients in risk group l.
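To make the construction concrete, the pseudo-value computation can be sketched in a few lines. The sketch below (in Python, for illustration only; the function names are ours) uses a minimal Aalen–Johansen estimate of the cumulative incidence of event 1, assumes no tied event times, and codes the status as 0 = censored, 1 = event of interest, 2 = competing event:

```python
import numpy as np

def aalen_johansen_cif1(time, status, t):
    """Cumulative incidence of cause 1 at time t (no tied event times).
    status: 0 = censored, 1 = event of interest, 2 = competing event."""
    time = np.asarray(time, dtype=float)
    status = np.asarray(status)
    order = np.argsort(time)
    time, status = time[order], status[order]
    n = len(time)
    surv = 1.0   # left-continuous all-cause survival S(t-)
    cif = 0.0
    for k in range(n):
        if time[k] > t:
            break
        at_risk = n - k
        if status[k] == 1:
            cif += surv / at_risk           # increment of the cause-1 incidence
        if status[k] in (1, 2):
            surv *= 1.0 - 1.0 / at_risk     # all-cause survival update
    return cif

def pseudo_values(time, status, t):
    """Jackknife pseudo-values: Y~_i(t) = n*F^(t) - (n-1)*F^(-i)(t)."""
    n = len(time)
    full = aalen_johansen_cif1(time, status, t)
    return np.array([
        n * full - (n - 1) * aalen_johansen_cif1(np.delete(np.asarray(time, float), i),
                                                 np.delete(np.asarray(status), i), t)
        for i in range(n)
    ])
```

A useful sanity check: in uncensored data the Aalen–Johansen estimate is the empirical proportion of cause-1 events, and the jackknife pseudo-values then reduce exactly to the observed event indicators $Y_i(t)$.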


Now suppose the model is continuous in the sense that it predicts a probability of exactly p with probability zero. One way to overcome this problem is to group the predicted risks into K different groups. For no good reason, K = 10 is a popular choice, often used in connection with the Hosmer–Lemeshow test [5]. The disadvantages of using a fixed number of groups or a fixed group size were illustrated in [6], where the authors propose nonparametric smoothing as an alternative.


A simple smoother is a moving average, where at each probability p the calibration of the model is estimated by averaging the pseudo-values of the patients who received a predicted probability within a prespecified interval around p. We here consider a nearest neighborhood smoother [7–9] to obtain a smooth estimate of the calibration curve of a continuous model r:

$$\hat C_{a_n}(p, t, r) = \frac{1}{n} \sum_{i=1}^{n} \tilde Y_i(t)\, K_{a_n}\big(p, r(t \mid Z_i)\big). \qquad (3)$$

Here, $K_{a_n}(p, q) = I\{|H_n(p) - H_n(q)| \le a_n\}/a_n$ is a rectangular kernel function, and $H_n(p) = \frac{1}{n}\sum_{i=1}^{n} I\{r(t \mid Z_i) \le p\}$ denotes the empirical distribution function of the predicted risks. The bandwidth $a_n$ defines the size of the neighborhoods and needs to be chosen for each prediction model separately (Section 2.1). If $a_n$ converges toward 0, then $\hat C_{a_n}(p, t, r)$ converges to $C(p, t, r)$ in uncensored data. In right-censored data, if the censoring mechanism is independent of the event time and the covariates, the results of Graw et al. [10] imply that the estimates in (2) and (3) consistently estimate the calibration curve. If the censoring mechanism depends on the covariates, the pseudo-values can be constructed based on a working Cox model for the censoring probabilities to avoid bias [11].

As an alternative to the pseudo-value approach, (a smoothed version of) the Aalen–Johansen estimate can be applied directly to estimate $C(p, t, r)$. Similar results are expected because of the asymptotic equality of the average of the pseudo-values and the Aalen–Johansen estimate [10]. An advantage of the pseudo-value approach is that the pseudo-values can be shown together with the calibration graph, indicating the amount of censoring and outliers.

2.1. Bandwidth selection

When the predicted risk is a continuous variable, the proposed kernel estimator (3) depends on the bandwidth $a_n$. A large bandwidth will lead to a very smooth plot, and a small bandwidth will lead to a wiggly calibration plot. A good choice of the bandwidth should be a trade-off between the bias and the variance of the estimate [12, Chapter 24]. There is extensive literature on nonparametric kernel smoothing and statistical curve estimation. Silverman [13] discusses both subjective approaches for choosing the bandwidth and automatic methods that choose the bandwidth data adaptively, for example, based on an approximation of the integrated squared error.
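The nearest neighborhood smoother can be sketched as follows (an illustrative Python sketch with names of our choosing; for simplicity it averages the pseudo-values within each quantile-scale neighborhood, which targets the same conditional expectation as the rectangular-kernel formula up to the kernel's normalizing constant):

```python
import numpy as np

def smoothed_calibration(pseudo, predicted, grid, bandwidth):
    """Nearest-neighborhood calibration estimate: at each p in `grid`,
    average the pseudo-values of patients whose predicted risk lies in a
    neighborhood of p on the quantile (empirical distribution) scale."""
    predicted = np.asarray(predicted, dtype=float)
    pseudo = np.asarray(pseudo, dtype=float)
    n = len(predicted)
    sorted_pred = np.sort(predicted)
    # empirical distribution function H_n of the predicted risks
    H = lambda q: np.searchsorted(sorted_pred, q, side="right") / n
    Hp = H(predicted)
    out = []
    for p in grid:
        in_nbhd = np.abs(Hp - H(p)) <= bandwidth
        out.append(pseudo[in_nbhd].mean() if in_nbhd.any() else np.nan)
    return np.array(out)
```

Because the neighborhoods are formed on the quantile scale, each one contains roughly a fraction 2·bandwidth of the patients, regardless of how the predicted risks are distributed.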
In the present context, the following general comments can be made (see also Figure 1 for further illustration):


- The bandwidth is a value between 0 and 1, which is chosen based on the predicted risks only and does not depend on the outcome (pseudo-values). The bandwidth controls the maximal deviation of the predicted risks of the patients who are joined into a neighborhood around the predicted risk at which we aim to assess the reliability of the model.
- A bandwidth greater than or equal to 1 will lead to a degenerate calibration plot, which contains only a single point. This point compares the average of the predicted risks to the average outcome (Figure 1, panel A) and has been termed 'calibration in the large' in the literature.
- In the other extreme case, the bandwidth approaches zero and can become so small that each 'neighborhood' contains only a single patient's risk prediction (Figure 1, panel B).
- A meaningful subjective approach is to choose the bandwidth equal to the minimal distance on the risk scale that reflects a clinically meaningful difference in predicted risk. This could, for example, lead to a bandwidth equal to 0.1 (Figure 1, panel D). Another meaningful subjective approach, which however may not be consistent with the previous approach for a given data set, would be to require a minimum number of patients in each neighborhood.
- It is possible to try several different bandwidths and to choose the one that leads to the most appealing results. However, this has two disadvantages. First, the results can hardly be reproduced by a different person, and second, it may be misleading to compare risk prediction models based on calibration plots selected in this way.
- Automatic selection algorithms can be applied to identify the optimal bandwidth for estimating the density function of the predicted risks. In Figure 1, panel C, and in the remaining figures in this article, unless stated otherwise, we have used an approach based on an approximation of the integrated squared error suggested by Sheather and Jones [14].
It is implemented in the R library [15] accompanying the monograph [16].

Figure 1. Illustration of the effect of different bandwidths applied to the risk predictions of the same cause-specific Cox regression model, fitted on 100 simulated observations. The calibration plots were calculated based on 100 independent observations, and all panels show that the model is not perfectly calibrated.

The bandwidth needs to be chosen separately for each risk prediction model. It is therefore not recommended to compare two prediction models by relying solely on their calibration plots. Instead, models should be compared by their respective (cross-validated) Brier score and C-index; notably, neither measure depends on the bandwidth.

3. Internal validation


Internal validation is what one can do in the absence of external validation data. It works by repeatedly splitting and cross-validating the one and only data set. From now on, we call the model built on all data the final model. The apparent calibration plot uses the predictions of the final model in its own learning data set. For data-intensive modeling strategies, the apparent performance typically overestimates the prediction accuracy, and the apparent calibration plot is systematically closer to the diagonal than the expected calibration plot of the final model in new patients.

Many different but similar versions of cross-validation have been proposed, and the terminology can be quite confusing. We therefore define the proposed procedure explicitly. Suppose we have competing risks data from n patients. These are randomly split B times into learning sets $L_1, \ldots, L_B$ and corresponding validation sets $V_1, \ldots, V_B$, such that $L_b$ and $V_b$ are nonoverlapping for all b. The special case where $B \le n$ and the validation sets are also nonoverlapping is called B-fold cross-validation, and leave-one-out cross-validation when B = n. A variant is to draw the learning sets from the data either of size n with replacement (regular bootstrap) or of size m < n without replacement (subsampling bootstrap). For the regular bootstrap, the validation sets include on average 36.8% of the data. The procedure that then averages the validation sample performance is called bootstrap cross-validation or leave-one-out bootstrap [17, 18].
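The regular bootstrap splitting scheme can be sketched as follows (an illustrative Python sketch; the function name is ours). The out-of-bag validation fraction of roughly 36.8%, i.e., about $1 - 1/e$, can be checked empirically:

```python
import numpy as np

def bootstrap_splits(n, B, rng):
    """Regular bootstrap: each learning set L_b is drawn with replacement
    from the n patients; the validation set V_b is everything not drawn."""
    splits = []
    for _ in range(B):
        learn = rng.integers(0, n, size=n)           # indices, with replacement
        valid = np.setdiff1d(np.arange(n), learn)    # out-of-bag indices
        splits.append((learn, valid))
    return splits

rng = np.random.default_rng(1)
splits = bootstrap_splits(200, 500, rng)
oob_fraction = np.mean([len(valid) / 200 for _, valid in splits])
# oob_fraction is close to 1 - 1/e, i.e., about 0.368
```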


To define the cross-validated calibration curve, we let R denote the strategy that, when applied to a data set, performs all model specification steps and parameter estimations needed to eventually return a prediction model. Specifically, we apply R separately to each learning data set. This yields a series of prediction models $R(L_b) = \hat R_b$, $b = 1, \ldots, B$. Predicted event probabilities $\hat R_b(t \mid Z_i)$ are then obtained for all patients $i \in V_b$. We now describe our proposed approach to combine cross-validation with pseudo-values. The idea is simply to collect the predictions and pseudo-values from the B validation sets into a new data set that is then used to calculate the leave-one-out bootstrap calibration curve at p:

$$\hat C_{\tilde a_n, B}(p, t, R) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m_i} \sum_{b: i \in V_b} \tilde Y_i(t)\, K_{\tilde a_n}\!\big(p, \hat R_b(t \mid Z_i)\big),$$

where $m_i$ is the number of validation sets that include patient i. Here, it is practical to choose $\tilde a_n$ as the optimal bandwidth for smoothing the predictions of the final model for the full data set [14]. Note also that we compute the pseudo-values only once, based on all data. Alternatively, one could replace $\tilde Y_i$ by pseudo-values $\tilde Y_{i,b}$, which are calculated using the information in the bth validation set. A different approach would be to calculate one calibration curve for each split and then somehow average across the calibration curves. However, an average of smoothed functions seems not very attractive.
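The pooling of out-of-bag predictions per patient can be sketched as follows (an illustrative Python sketch with hypothetical names; for simplicity the neighborhood is formed on the probability scale rather than the quantile scale, and the estimate is normalized as a weighted average of pseudo-values instead of using the exact kernel normalization):

```python
import numpy as np

def cv_calibration_point(p, pseudo, oob_predictions, bandwidth):
    """Leave-one-out bootstrap calibration estimate at probability p.

    pseudo          : pseudo-values, computed once on the full data
    oob_predictions : for each patient i, the list of predictions from
                      the splits b with i in V_b (may be empty)
    """
    num = den = 0.0
    for y, preds in zip(pseudo, oob_predictions):
        preds = np.asarray(preds, dtype=float)
        if preds.size == 0:
            continue  # patient i never appeared in a validation set
        # (1/m_i) * sum over splits b of I{|R_b(t|Z_i) - p| <= bandwidth}
        w = np.mean(np.abs(preds - p) <= bandwidth)
        num += y * w
        den += w
    return num / den if den > 0 else float("nan")
```

Each patient contributes with total weight at most one, however many validation sets it falls into, which mirrors the $1/m_i$ factor in the formula above.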

4. Summarizing predictive accuracy

The accuracy of a risk prediction model can be decomposed into calibration and discrimination. The discrimination ability of a competing risks model can be assessed by a time-truncated version of the concordance index [19, 20] or by a version of the time-dependent area under the ROC curve [21]. The calibration of a competing risks model can be visualized by the calibration plots described earlier. The closer the calibration curve of the model is to the diagonal, the better. For the predicted probability at Z, the bias of the model $\hat R_n$ is given by $E_Y(Y(t) \mid \hat R_n(t \mid Z)) - \hat R_n(t \mid Z)$, and a useful summary of the calibration curve is given by the squared bias:

$$\mathrm{Bias}^2 = E_Z\Big(\big\{E_Y(Y(t) \mid \hat R_n(t \mid Z)) - \hat R_n(t \mid Z)\big\}^2\Big).$$

Further, an estimate of the squared bias equals the average squared distance between the observed calibration curve and the diagonal. The population level Brier score for event 1 at time t is defined as the expected squared distance between the observed status at time t and the predicted probability. It can be decomposed into the squared bias term and a variance term:

$$BS(t, \hat R_n) = E_{Y,Z}\big\{Y(t) - \hat R_n(t \mid Z)\big\}^2 = \mathrm{Bias}^2 + \mathrm{Var},$$

where $\mathrm{Var} = E_{Y,Z}\big\{Y(t) - E_Y(Y(t) \mid \hat R_n(t \mid Z))\big\}^2$. The Brier score thus measures discrimination and calibration at the same time. To illustrate this, consider first a constant model that predicts an event risk of 50% for all subjects. Like any constant model, this model has zero discrimination ability, that is, AUC(t) = C-index = 50%, and its Brier score is 25%. Among the constant models, there is one perfectly calibrated model. It predicts for every patient the true population level event prevalence at time t (in practice, the prevalence at time t can be estimated by the Aalen–Johansen estimate). It is easy to see that the Brier score of the perfectly calibrated constant model is smaller than that of the 50% risk model (with equality only in the case where the event prevalence equals 50%).
This shows that the Brier score can distinguish between two models that have equal discrimination ability. Now consider a model that distinguishes between men and women and predicts the true event prevalence among men for men and the true event prevalence among women for women. This stratified model is also perfectly calibrated, but it has positive discrimination ability and hence a lower Brier score than the perfectly calibrated constant model (with equality only in the case where the prevalence is the same for women and men).
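These comparisons of constant models can be verified with a few lines of arithmetic. The sketch below assumes, purely for illustration, an overall event prevalence of 30%, split as 20% and 40% in two equally sized strata:

```python
def brier_constant(pred, prevalence):
    """Population Brier score when every subject receives the prediction
    `pred` and the true event probability at time t is `prevalence`."""
    return prevalence * (1 - pred) ** 2 + (1 - prevalence) * pred ** 2

pi = 0.30                                  # overall event prevalence at time t
bs_half = brier_constant(0.5, pi)          # 50% model: always 0.25
bs_prev = brier_constant(pi, pi)           # prevalence model: pi*(1-pi) = 0.21

# perfectly calibrated model stratified by sex (prevalences 0.2 and 0.4)
pi_m, pi_f = 0.2, 0.4
bs_strat = 0.5 * brier_constant(pi_m, pi_m) + 0.5 * brier_constant(pi_f, pi_f)
# 0.20 < 0.21 < 0.25: better calibration and better discrimination
# each lower the Brier score
```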

5. Calibrating the eye


Reading a calibration plot is a similar adventure to reading a normal probability plot. The points of the plot will lie approximately on the diagonal line if the risk prediction model is well calibrated. In order


to calibrate the eye to recognize plots that indicate miscalibration, it is helpful to inspect several calibration plots. We therefore present the results of a simulation study in which a series of calibration plots is obtained by fitting the data-generating model in a series of simulated training data sets and estimating the calibration plots in independent validation data (the same validation set for all training sets). We compare the results to those of a corresponding series where calibration plots are obtained for misspecified models. In a given application, it may then be useful to imitate this procedure, for example, by fully specifying a parametric model for the joint distribution of the predictors. Then, by drawing observations from a given risk prediction model, one can simulate a series of calibration plots in the situation where this model is perfectly calibrated.

5.1. Simulation study

Right-censored event time data $(\tilde T = \min(T_1, T_2, C),\ \Delta = I\{\min(T_1, T_2) \le C\},\ D = I\{T_1 \le T_2\})$ were generated under Cox–Weibull models [22] for two competing events, a right-censored time, and predictor values $X_1$ and $X_2$ drawn independently from the standard normal distribution:

Event 1: $\lambda_1(t \mid X) = \lambda_{01} \exp(\beta_{11} X_1 + \beta_{12} X_2 + \beta_{13} X_1^2)\, \gamma_1 t^{\gamma_1 - 1}$
Event 2: $\lambda_2(t \mid X) = \lambda_{02}\, \gamma_2 t^{\gamma_2 - 1}$
Censored: $\lambda_0(t \mid X) = \lambda_{00}\, \gamma_0 t^{\gamma_0 - 1}$.

We present simulation results where the data are generated with shape parameters $\gamma_1 = \gamma_2 = \gamma_0 = 2$, fixed baseline hazard scales $\lambda_{02} = \lambda_{00} = 0.01$ for event 2 and the censoring time, respectively, and regression coefficients $\beta_{11} = 1$, $\beta_{12} = 2$, $\beta_{13} = 1$. By also setting $\lambda_{01} = 0.01$, a single validation data set is generated with 1000 random observations. In each step of a simulation loop, three training data sets are generated, each containing 500 observations, based on the following scale values: $\lambda_{01} = 0.01$ (same event prevalence as in the validation data set), $\lambda_{01} = 0.02$ (higher event prevalence than in the validation data set), and $\lambda_{01} = 0.005$ (lower event prevalence than in the validation data set).
In each training data set, cause-specific Cox regression models are fitted based on the data-generating linear predictor $X_1 + X_2 + X_1^2$, and three risk prediction models are obtained, which we label 'Data-generating model', 'Too high training prevalence', and 'Too low training prevalence' for the models fitted in the training data generated with the scale parameters 0.01, 0.02, and 0.005, respectively. For the purpose of illustration, we predict risk and estimate calibration at t = 7. Here, the marginal risk of cause 1 is 30.8% for $\lambda_{01} = 0.01$, 40% for $\lambda_{01} = 0.02$, and 22.5% for $\lambda_{01} = 0.005$. In the training data sets that were generated with the same scale parameter as the validation set, $\lambda_{01} = 0.01$, we also fitted three misspecified models: 'Misspecified functional form' drops the quadratic term $X_1^2$ from the linear predictor, 'Omitted covariates' drops variable $X_2$ from the linear predictor, and 'Added noise variables' adds to the linear predictor four normally distributed noise variables, which have no effect on the right-censored event time distribution. From Figure 2 and Table I, the following conclusions can be drawn:
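The data-generating mechanism above can be sketched by inverting the Weibull cumulative hazards; independent latent times are drawn for the two causes and the censoring time. An illustrative Python sketch (the function name is ours; parameter values are those listed in the text):

```python
import numpy as np

def rweibull_ph(scale, shape, lp, rng):
    """Draw latent times with hazard scale*exp(lp)*shape*t^(shape-1) by
    inverting the cumulative hazard scale*exp(lp)*t^shape at -log(U)."""
    u = rng.uniform(size=len(lp))
    return (-np.log(u) / (scale * np.exp(lp))) ** (1.0 / shape)

rng = np.random.default_rng(7)
n = 500
X1, X2 = rng.standard_normal(n), rng.standard_normal(n)
lp1 = 1.0 * X1 + 2.0 * X2 + 1.0 * X1 ** 2     # linear predictor for event 1
zero = np.zeros(n)

t1 = rweibull_ph(0.01, 2.0, lp1, rng)          # event 1 (training scale 0.01)
t2 = rweibull_ph(0.01, 2.0, zero, rng)         # event 2
c = rweibull_ph(0.01, 2.0, zero, rng)          # censoring time

event_time = np.minimum(t1, t2)
time = np.minimum(event_time, c)               # right-censored time
status = np.where(c < event_time, 0,           # 0 censored, 1 cause 1, 2 cause 2
                  np.where(t1 <= t2, 1, 2))
```

The higher- and lower-prevalence training sets are obtained by replacing the event-1 scale 0.01 with 0.02 or 0.005.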


- The calibration curves clearly indicate when a model is grossly miscalibrated in the sense that all risk predictions are either systematically too high (panel D, Figure 2) or systematically too low (panel C, Figure 2). The C-index is not at all affected, but the Brier score clearly shows the miscalibration of the 'Too low training prevalence' and 'Too high training prevalence' models (Table I).
- The 'Omitted covariates' model is well calibrated according to the calibration plots (panel F, Figure 2). The discrimination ability is considerably reduced for this model (average C-index: 63.5%) compared to that of the other models (86.4%) (Table I).
- Comparison of panel A (data-generating model) with panel B (added noise variables) or panel E (misspecified functional form) shows that it will hardly be possible to detect these types of misspecification by inspection of calibration plots, as the calibration curves are close to the diagonal, except perhaps for a small deviation seen in panel E in the area of large predicted risks. Notably, the C-index also cannot distinguish these misspecified models from the data-generating model, and only the Brier score shows the problems (Table I).
- It can be suspected that the main reason why the misspecified models in panels B, E, and F are reasonably well calibrated is the nature of the maximum likelihood method, which finds the parameter estimates based on the logarithmic scoring rule and thus cares for calibrated predicted risks.


Figure 2. Simulation study: simulated calibration curves for predictions of cause 1 events based on pseudo-values obtained in the simulation settings described in Section 5.1. For all models, the results of the first 100 simulated training sets are shown. The same validation set is used for all curves.

6. Illustration


PBC3 was a multicenter randomized clinical trial conducted in six European hospitals [23]. Between January 1, 1983 and January 1, 1987, 349 patients with the liver disease primary biliary cirrhosis were randomized to treatment with either cyclosporin A or placebo. The purpose of the trial was to study the effect of treatment on the survival time. However, during the course of the trial, the use of liver

transplantations for patients with this disease increased considerably, and transplantation will therefore be regarded as a competing cause of failure of medical treatment. Patients were followed up from randomization until treatment failure, drop out, or January 1, 1989; 61 patients died, another 29 were transplanted, and 4 patients were lost to follow-up before January 1, 1989. At entry, a number of clinical, biochemical, and histological variables, including serum bilirubin, sex, age, and standard liver function measures such as aspartate transaminase and alkaline phosphatase, were recorded. Data are available from the home page of the textbook [24].

Table I. Simulation study: the average Brier score and average C-index across 1000 simulations for the same models and simulation settings as in Figure 2.

Model                          Brier   C-index
Data-generating model          10.9    86.4
Misspecified functional form   14.4    81.9
Omitted covariates             19.7    63.5
Added noise variables          11.1    86.3
Too low training prevalence    12.7    86.4
Too high training prevalence   12.2    86.3

6.1. External validation

The PBC3 study is a multicenter study that combines data from patients treated in the following six clinical centers: Hvidovre, Denmark (n = 23); London, England (n = 150); Copenhagen, Denmark (n = 46); Barcelona, Spain (n = 79); Munich, Germany (n = 23); and Lyon, France (n = 28). All analyses are for the sole purpose of illustrating the pseudo-value-based calibration plots for competing risks. To achieve an external validation setting, we use the London data only to build four different models that all predict the probability of dying within 1000 days after randomization, handling a liver transplant as a competing risk. The first two models are based on Fine–Gray regression analyses [25], where in the first model the variables treatment, age, bilirubin, aspartate transaminase, and alkaline phosphatase enter in linear form into the linear predictor, whereas for the second model the variables bilirubin, aspartate transaminase, and alkaline phosphatase are log-transformed. Note that the Fine–Gray model directly specifies the covariate-specific cumulative incidence at 1000 days. It, however, requires a model for the censoring distribution, and the analysis was performed under the working hypothesis that the censoring is independent of the covariates and the event times.
Models 3 and 4 are both obtained by combining two cause-specific Cox regression models, one for death and one for liver transplant, into a model for the cumulative incidence at 1000 days. The values of bilirubin, aspartate transaminase, and alkaline phosphatase enter model 3 in raw form and model 4 after log-transformation. Using the same strategies, four further models were obtained, which predict the risk of liver transplantation handling death as a competing risk.

Table II shows summaries of prediction accuracy for the four models. The results show that log-transformation of the variables bilirubin, aspartate transaminase, and alkaline phosphatase improves the prediction accuracy (Brier score) and also the discrimination ability (C-index). A log-transformation also has a positive effect on the calibration of the models (Figure 3). The figure also indicates that the Fine–Gray models for predicting the risk of liver transplantation are not well calibrated and that none of the models are well calibrated for predicted probabilities above 50%. However, few predicted probabilities are above 75%. The log-transformations are seen to improve the results. It is worth noting that the Brier score of the models without log-transformations is larger than the Brier score of the reference model. This confirms the miscalibration of these models. However, the concordance index, being a rank statistic, does not assess calibration. In this example, the concordance index indicates that the two miscalibrated models are better than the null model.


6.1.1. Comparison with grouped predictions. A standard approach to calibration curves is to group the predicted risk and then to compute the observed risk of the event by applying a nonparametric estimate (e.g., Kaplan–Meier for survival and Aalen–Johansen for competing risks) locally in each risk group. Our nonparametric estimate is the average of the pseudo-values in the risk groups. Figure 4 compares Copyright © 2014 John Wiley & Sons, Ltd.

Statist. Med. 2014, 33 3191–3203

T. A. GERDS, P. K. ANDERSEN AND M. W. KATTAN

Table II. PBC3 trial: external validation results. External data performance (1000 days) C -index Brier score C -index AJ FGR FGR (log) CSC CSC (log)

Reference Model 1 Model 2 Model 3 Model 4

Risk of liver transplantation 50.00 3.18 84.98 3.23 92.78 2.49 61.31 3.34 90.29 2.82

Brier score

Risk of death 50.00 12.01 81.63 12.67 86.27 8.67 82.00 11.17 85.58 8.86

C -index was truncated at 1000 days and Brier scores evaluated at 1000 days for prediction models built in the data of one center (London). The calibration curves are estimated using the data of the other centers (Hvidovre, Copenhagen, Barcelona, Munich, and Lyon). AJ, Aalen–Johansen estimate; FGR, Fine–Gray regression; CSC, cause-specific Cox regression.

Figure 3. PBC3 trial: pseudo-values and smoothed calibration curves for prediction models built in the data of one center (London) and estimated using the data of the other centers (Hvidovre, Copenhagen, Barcelona, Munich, and Lyon). The variables bilirubin, aspartate transaminase, and alkaline phosphatase are log-transformed in the lower panel but not in the upper panel. All lines are obtained with a nearest neighborhood smoother.

Copyright © 2014 John Wiley & Sons, Ltd.

Statist. Med. 2014, 33 3191–3203

3199

the calibration curves obtained by nonoverlapping groups based on 5, 10, and 20 groups obtained by risk quantiles to the smooth calibration curves obtained by the nearest neighborhood method. The dotted lines are estimated by the Aalen–Johansen estimate and the solid lines by average pseudovalues. The curves show roughly identical patterns although that based on 20 groups appears very ragged

T. A. GERDS, P. K. ANDERSEN AND M. W. KATTAN

Figure 4. PBC3 trial: comparison of smoothed calibration curve with differently coarse classifications of the risk predicted by the cause-specific Cox regression model for death within 1000 days.

3200

Figure 5. PBC3 trial: smoothed calibration curves for the risk of death predicted by the cause-specific Cox regression model evaluated at different time points.

Copyright © 2014 John Wiley & Sons, Ltd.

Statist. Med. 2014, 33 3191–3203

T. A. GERDS, P. K. ANDERSEN AND M. W. KATTAN

Figure 6. PBC3 trial: smoothed calibration curves obtained via internal validation of the cause-specific Cox regression model for predicting the risk of death within 1000 days.

Table III. PBC3 trial: internal validation results based on 1000 bootstrap cross-validation steps.

Apparent performance (1000 days)
                        Risk of liver transplantation    Risk of death
                        C-index    Brier score           C-index    Brier score
Reference (AJ)          50         5.88                  50         13.28
Model 1 (FGR)           80.21      4.56                  76.14      10.87
Model 2 (FGR, log)      85.5       4.45                  80.51      10.14
Model 3 (CSC)           81.91      4.56                  76.65      10.77
Model 4 (CSC, log)      86.4       4.51                  80.39      10.29

Bootstrap cross-validation (B = 1000; 1000 days)
Reference (AJ)          50         5.87                  50         13.35
Model 1 (FGR)           77.91      5.09                  74.21      11.67
Model 2 (FGR, log)      83.77      4.85                  78.58      10.87
Model 3 (CSC)           78.45      4.93                  74.55      11.47
Model 4 (CSC, log)      85         4.92                  78.38      10.99

The C-index was truncated at 1000 days and the Brier scores were evaluated at 1000 days. AJ, Aalen–Johansen estimate; FGR, Fine–Gray regression; CSC, cause-specific Cox regression.

None of the curves based on separate Aalen–Johansen estimates suggest any problems with the calibration for large predicted probabilities.

6.1.2. Effect of time point. The pseudo-values can take values below zero and above one, as explained in [3]. Their values depend on the censoring percentage. In the beginning, where few patients are right censored, almost all of the pseudo-values are very close to one or zero. As time proceeds, the amount of censoring increases and the range of the predicted risks gets larger (Figure 5).

6.2. Internal validation


To illustrate our internal validation approach, we applied 1000 bootstrap cross-validation steps to the full PBC3 data set and compared the resulting calibration curve to the apparent calibration curve. Figure 6 shows the results for the prediction model for the risk of death obtained by combining cause-specific Cox regressions for death and liver transplantation, where in both models the variables bilirubin, aspartate transaminase, and alkaline phosphatase entered log-transformed. In each bootstrap sample, the parameters of the cause-specific Cox regression models were estimated, and no variable selection was performed. The apparent calibration curve is obtained by using the full data set both for fitting the model and for estimating the calibration curve. The two bootstrap procedures are seen to provide quite similar results and differ somewhat from the apparent calibration curve (Figure 6). Table III summarizes the results. We obtain a quantification of the overoptimism associated with the apparent calibration curve. Further, the improvement obtained by log-transforming the biochemical variables is clearly seen.
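The bootstrap cross-validation scheme used here, refitting the model on each bootstrap sample and evaluating it on the subjects left out of that sample, can be sketched generically. This is a minimal Python sketch with hypothetical `fit`, `predict`, and `metric` callables, not the authors' implementation.

```python
import numpy as np

def bootstrap_crossval(fit, predict, X, y, metric, B=100, seed=None):
    """Bootstrap cross-validation: refit on each bootstrap sample and
    score predictions on the out-of-bag subjects; average over B steps."""
    rng = np.random.default_rng(seed)
    n, scores = len(y), []
    for _ in range(B):
        idx = rng.integers(0, n, n)            # draw n subjects with replacement
        oob = np.setdiff1d(np.arange(n), idx)  # subjects not in the sample
        if oob.size == 0:
            continue                           # skip degenerate samples
        model = fit(X[idx], y[idx])            # no variable selection here
        scores.append(metric(y[oob], predict(model, X[oob])))
    return float(np.mean(scores))
```

Any performance measure can be plugged in as `metric`, for example a Brier-type mean squared error or a concordance index.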

7. Discussion

There are numerous published statistical models that predict the probability of an event within a given time frame (e.g., the probability of cardiovascular death within 10 years). It can be difficult to choose between alternative risk models on the basis of an accuracy measure that reflects only discrimination and not calibration. Calibration, or how closely predicted probabilities and observed proportions match, has generally been restricted to graphical assessment rather than quantification.

The main difficulty with assessing the calibration of a prognostic model lies in the fact that, almost always, the test data used for assessing calibration involve censoring. Because of censoring, some standard procedures for assessing calibration may lead to conclusions that are not consistent with the data. Competing risks constitute another hurdle, both for the estimation of calibration and for its interpretation.

The estimation of a calibration plot poses similar problems to density estimation. For the nearest neighbor method, we need to select a bandwidth, and for the grouped risk version of the calibration plot, we need to define the groups. Some studies have based a graphical assessment of a prognostic model on a grouping of the predicted event probabilities; the observed outcome is then usually reflected by the subgroup Kaplan–Meier estimate. In the presence of competing risks, Kattan et al. [26] used the subgroup Aalen–Johansen estimate to estimate the observed risk of the event of interest. An informal description of this approach is given in [27]. However, this calculation requires an arbitrary grouping of patients (e.g., deciles of predicted risk), and this grouping can affect the visual interpretation of calibration. Similarly, by selecting different bandwidths, the user can affect the result of the nearest neighbor approach.
To achieve somewhat unbiased results, we recommend grouping risks only if there are different medical actions associated with the risk groups. For the nearest neighbor method, we recommend using an optimal bandwidth [14].

Many measures of discrimination are in use that do not (and are not intended to) reflect calibration. For example, the concordance index and the time-dependent AUC are useful for specific discrimination assessments. Because they do not reflect calibration, they generally require supplementation with a calibration plot to determine (and not just visualize) calibration accuracy. The Brier score summarizes both calibration and discrimination at the same time. It is also quite interpretable and intuitive: the square root of the Brier score (the root mean squared error) is the expected distance between the observation and the prediction on the probability scale. The relation of the Brier score of a prediction model to that of a reference model that ignores the covariates can further be used to construct a time-dependent R2 measure [28].

We have applied cross-validation for internal validation of the prediction models, where the learning sets were either bootstrap samples or subsamples of the data set. These procedures commonly underestimate the performance of the final model because, during cross-validation, the models are built on less information. Several estimators have been proposed to reduce this bias by a linear combination of the optimistic apparent performance and the pessimistic cross-validation performance, for example, bootstrap validation [29], the bias-corrected bootstrap [30], and the .632+ bootstrap [17]. While these approaches may work for grouped predictions for most modeling strategies, it is not clear how to combine them with smoothing.

An important limitation of our study is that we are assuming equal misclassification costs in the prognostic model setting.
In addition, we are not addressing the use of a prognostic model for a specific decision.
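For uncensored binary event status, the Brier score and the R2-type measure discussed above take a particularly simple form, shown in the sketch below; with right-censored data, the event status must instead be handled via, e.g., pseudo-values or inverse probability of censoring weights, which this illustration does not cover. The function names are illustrative, not from any published package.

```python
import numpy as np

def brier(status, pred):
    """Brier score: mean squared distance between event status and prediction.
    Its square root is the expected distance on the probability scale."""
    status, pred = np.asarray(status, float), np.asarray(pred, float)
    return float(np.mean((status - pred) ** 2))

def r2_measure(status, pred):
    """R2-type measure: 1 - Brier(model) / Brier(reference), where the
    reference model ignores covariates and predicts the overall event rate."""
    ref = np.full(len(status), np.mean(status))
    return 1.0 - brier(status, pred) / brier(status, ref)
```

A model that predicts nothing beyond the overall event rate scores 0 on this R2 scale, and a perfect prediction of the observed status scores 1.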


References

1. Andersen PK, Klein JP, Rosthøj S. Generalised linear models for correlated pseudo-observations, with applications to multi-state models. Biometrika 2003; 90(1):15–27.


2. Aalen O, Johansen S. An empirical transition matrix for non-homogeneous Markov chains based on censored observations. Scandinavian Journal of Statistics 1978; 5:141–150.
3. Andersen PK, Perme MP. Pseudo-observations in survival analysis. Statistical Methods in Medical Research 2010; 19:71–99.
4. Breiman L, Friedman J, Olshen R, Stone C. Classification and Regression Trees. Wadsworth International Group: Belmont, California, 1984.
5. Hosmer DW, Lemeshow S. Applied Logistic Regression. Wiley: New York, 1989.
6. Le Cessie S, Van Houwelingen JC. A goodness-of-fit test for binary regression models, based on smoothing methods. Biometrics 1991; 47:1267–1282.
7. Yang SS. Linear functions of concomitants of order statistics with application to nonparametric estimation of a regression function. Journal of the American Statistical Association 1981; 76:658–662.
8. Dabrowska DM. Uniform consistency of the kernel conditional Kaplan-Meier estimate. Annals of Statistics 1989; 17:1157–1167.
9. Akritas MG. Nearest neighbor estimation of a bivariate distribution under random censoring. The Annals of Statistics 1994; 22:1299–1327.
10. Graw F, Gerds TA, Schumacher M. On pseudo-values for regression analysis in competing risks models. Lifetime Data Analysis 2009; 15(2):241–255.
11. Binder N, Gerds TA, Andersen P. Pseudo-observations for competing risks with covariate dependent censoring. Lifetime Data Analysis 2013; Epub ahead of print.
12. van der Vaart A. Asymptotic Statistics. Cambridge University Press: Cambridge, 1998.
13. Silverman B. Density Estimation for Statistics and Data Analysis. Chapman & Hall: London, 1986.
14. Sheather SJ, Jones MC. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B 1991; 53(3):683–690.
15. Wand M. KernSmooth: functions for kernel smoothing for Wand & Jones (1995), 2013. http://CRAN.R-project.org/package=KernSmooth, R package version 2.23-10.
16. Wand MP, Jones MC. Kernel Smoothing. Chapman & Hall: London, 1995.
17. Efron B, Tibshirani R. Improvements on cross-validation: the .632+ bootstrap method. Journal of the American Statistical Association 1997; 92(438):548–560.
18. Mogensen UB, Ishwaran H, Gerds TA. Evaluating random forests for survival analysis using prediction error curves. Journal of Statistical Software 2012; 50:1–23.
19. Gerds TA, Kattan MW, Schumacher M, Yu C. Estimating a time-dependent concordance index for survival prediction models with covariate dependent censoring. Statistics in Medicine 2013; 32(13):2173–2184.
20. Wolbers M, Koller MT, Witteman JCM, Gerds T. Concordance for prognostic models with competing risks. Technical Report 3, University of Copenhagen, Department of Biostatistics, 2013.
21. Saha P, Heagerty PJ. Time-dependent predictive accuracy in the presence of competing risks. Biometrics 2010; 66:999–1011.
22. Bender R, Augustin T, Blettner M. Generating survival times to simulate Cox proportional hazards models. Statistics in Medicine 2006; 25:1978–9.
23. Lombard M, Portmann B, Neuberger J, Williams R, Tygstrup N, Ranek L, Ring-Larsen H, Rodes J, Navasa M, Trepo C, Pape G, Schou G, Badsberg JH, Andersen PK. Cyclosporin A treatment in primary biliary cirrhosis: results of a long-term placebo controlled trial. Gastroenterology 1993; 104(2):519–526.
24. Andersen PK, Skovgaard LT. Regression with Linear Predictors. Springer Verlag: New York, 2010.
25. Fine J, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. Journal of the American Statistical Association 1999; 94(446):496–509.
26. Kattan MW, Giri D, Panageas KS, Hummer A, Cranor M, Van Zee KJ, Hudis CA, Norton L, Borgen PI, Tan LK. A tool for predicting breast carcinoma mortality in women who do not receive adjuvant therapy. Cancer 2004; 101(11):2509–2515.
27. Wolbers M, Koller MT, Witteman JC, Steyerberg EW. Prognostic models with competing risks: methods and application to coronary risk prediction. Epidemiology 2009; 20:555–561.
28. Gerds TA, Cai T, Schumacher M. The performance of risk prediction models. Biometrical Journal 2008; 50(4):457–479.
29. Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer: New York, 2009.
30. Harrell FE. Regression Modeling Strategies. Springer Verlag: New York, 2001.
