Letter to the Editor

Published online in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/sim.5928

Interpretation of concordance measures for clustered data

David van Klaveren,*† Ewout W. Steyerberg and Yvonne Vergouwe

Mauguen et al. [1] extended two censoring-robust estimators of the concordance probability to frailty models: the estimators proposed by Uno et al. [2] and by Gönen and Heller [3] (the ‘Uno’ and ‘GH’ estimators, respectively). The authors followed the suggestion of Van Oirbeek and Lesaffre [4] to derive separate concordance probability estimates for patients within the same cluster and for patients in different clusters and to pool them into an overall estimate. Although the proposed techniques add to the assessment of prognostic model performance in clustered survival data, we would like to discuss three issues related to their interpretation and practical use.

First, the model-based GH estimator does not use observed survival times directly, in contrast to Harrell’s c-index [5] and the Uno estimator. Instead, the effect of observed survival times is mediated through the regression coefficients. As a consequence, the concordance probability in a new population is estimated under the assumption that the regression coefficients are correct. The GH estimator should therefore be interpreted with care when applied to new populations. The authors applied the GH estimator in clusters of a validation population, using the regression coefficients of the development population. The resulting GH estimates are similar to benchmark estimates suggested before [6] and differ from the concordance probability estimates in the development population only because of differences in patient heterogeneity (case-mix).

We undertook a small simulation study with different external validation settings to illustrate the interpretation of the GH estimator (Table I). When both the case-mix distribution (X) and the coefficient (β) were equal to those of the development population (validation 1), the concordance probability estimates were similar to those in the development setting, apart from small differences due to sensitivity to censoring. When we lowered case-mix heterogeneity (validation 2), all concordance measures decreased similarly. When we also lowered the coefficient (validation 3), the GH and the Benchmark estimates remained almost the same while the c-index and the Uno estimate decreased further, empirically supporting the aforementioned reasoning.

Second, the authors recommended using cluster-specific (conditional) predictions for validation of a prognostic model. They suggested using the validation data to estimate frailties for new clusters. However, using validation data to derive predictions does not correspond to a direct external validation of a prognostic model’s performance in new settings. It might better be labeled a form of internal validation [7]. We recommend using population (marginal) predictions for external validation and limiting the use of cluster-specific predictions to temporal validation, with frailties estimated on development data and validated on more recent data from the same clusters.

Third, the authors did not give guidance on when to use within-cluster, between-cluster, or overall concordance measures. We propose using the within-cluster concordance probability in clinical practice, as decisions on interventions are commonly taken within centers (clusters). A valuable prognostic model should be able to separate patients within the same center into those with good outcome and those with poor outcome. In contrast, we consider the overall concordance measure appropriate when decisions are taken at the population level, where between-center heterogeneity can be used to guide decision making.

Department of Public Health, Erasmus MC, Rotterdam, The Netherlands
*Correspondence to: David van Klaveren, Department of Public Health, Erasmus MC, Dr. Molewaterplein 50, 3015 GE Rotterdam, The Netherlands.
† E-mail: [email protected]

Copyright © 2013 John Wiley & Sons, Ltd.

Statist. Med. 2014, 33 714–716

Table I. Simulation study results; means (empirical standard errors) of concordance probability estimates in a development setting and three validation settings.

| | Development: X ~ Unif[0,1], β = 3 | Validation 1: X ~ Unif[0,1], β = 3 | Validation 2: X ~ Unif[0.1,0.9], β = 3 | Validation 3: X ~ Unif[0.1,0.9], β = 2 |
|---|---|---|---|---|
| Censoring (%) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| β̂ | 3.015 (0.221) | | | |
| c-index | 0.710 (0.013) | 0.710 (0.013) | 0.677 (0.014) | 0.625 (0.015) |
| Uno | 0.710 (0.013) | 0.710 (0.013) | 0.677 (0.014) | 0.625 (0.015) |
| GH | 0.710 (0.012) | 0.710 (0.011) | 0.678 (0.011) | 0.678 (0.011) |
| Benchmark | | 0.711 (0.017) | 0.678 (0.018) | 0.678 (0.018) |
| Censoring (%) | 50.4 (2.4) | 50.7 (2.6) | 50.6 (2.5) | 64.2 (2.5) |
| β̂ | 3.026 (0.295) | | | |
| c-index | 0.721 (0.019) | 0.720 (0.018) | 0.683 (0.020) | 0.628 (0.025) |
| Uno | 0.717 (0.018) | 0.716 (0.017) | 0.681 (0.019) | 0.629 (0.025) |
| GH | 0.710 (0.015) | 0.711 (0.015) | 0.678 (0.014) | 0.678 (0.014) |
| Benchmark | | 0.717 (0.021) | 0.682 (0.021) | 0.683 (0.021) |

For each setting, 1000 replications of 400 patient profiles X were drawn from a uniform distribution (column headings). Survival times were generated by multiplying exp(−Xβ) by independent draws from the exponential distribution (β in column headings). Right-censoring times were drawn from a uniform distribution with support [0, c], where c was chosen to target 0% and 50% censoring. Concordance probability estimates were based on predictions Xβ̂, with β̂ estimated in the development data. The time-dependent Uno estimator was calculated at τ = 0.9c. To obtain the Benchmark estimate, we calculated the predicted survival function for each patient in X based on the model fit in the development data. The predicted survival functions were used to sample 400 survival times. The Benchmark estimate was then calculated as the c-index in this new sample.
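The simulation design described in the table note can be approximated with a short sketch. It is a simplified illustration under stated assumptions, not the code used for Table I: the function names are hypothetical, the exponent is taken as negative so that a larger Xβ implies shorter survival, the exponential draws have unit rate, and a fixed value `beta_dev = 3.0` stands in for the coefficient estimated in the development data (no Cox model is fitted). The Uno estimator is omitted. The GH-type function uses only the linear predictor, which is why it cannot react to a changed coefficient in the validation data.

```python
import numpy as np

def simulate(n, lo, hi, beta, rng, c_max=None):
    """Draw X ~ Unif[lo, hi] and exponential survival times; optional uniform censoring."""
    x = rng.uniform(lo, hi, n)
    # Assumed sign convention: a larger X * beta gives a shorter survival time.
    t = np.exp(-x * beta) * rng.exponential(1.0, n)
    if c_max is None:                       # no censoring
        return x, t, np.ones(n, dtype=bool)
    cens = rng.uniform(0.0, c_max, n)       # censoring times on [0, c_max]
    return x, np.minimum(t, cens), t <= cens

def harrell_c(lp, time, event):
    """Harrell-type c-index: usable pairs are those whose shorter time is an observed event."""
    lp, time = np.asarray(lp, float), np.asarray(time, float)
    event = np.asarray(event, dtype=bool)
    usable = (time[:, None] < time[None, :]) & event[:, None]
    concordant = usable & (lp[:, None] > lp[None, :])
    tied = usable & (lp[:, None] == lp[None, :])
    return (concordant.sum() + 0.5 * tied.sum()) / usable.sum()

def gh_estimate(lp):
    """Gonen-Heller-type estimate: depends on the linear predictor only, never on the times."""
    lp = np.asarray(lp, float)
    d = np.abs(lp[:, None] - lp[None, :])[np.triu_indices(len(lp), k=1)]
    # Pairs with identical linear predictors contribute zero.
    return np.where(d > 0, 1.0 / (1.0 + np.exp(-d)), 0.0).mean()

def benchmark_c(lp, rng):
    """c-index on times simulated from the model itself (unit baseline hazard assumed)."""
    lp = np.asarray(lp, float)
    t = np.exp(-lp) * rng.exponential(1.0, len(lp))
    return harrell_c(lp, t, np.ones(len(lp), dtype=bool))

rng = np.random.default_rng(2014)
beta_dev = 3.0  # assumption: stand-in for the development estimate of the coefficient
settings = {
    "validation 1": (0.0, 1.0, 3.0),
    "validation 2": (0.1, 0.9, 3.0),
    "validation 3": (0.1, 0.9, 2.0),
}
for name, (lo, hi, beta_true) in settings.items():
    x, time, event = simulate(400, lo, hi, beta_true, rng)
    lp = x * beta_dev                       # predictions use the development coefficient
    print(name,
          "c-index", round(harrell_c(lp, time, event), 3),
          "GH", round(gh_estimate(lp), 3),
          "benchmark", round(benchmark_c(lp, rng), 3))
```

In this sketch only the c-index sees the true coefficient of the validation setting, through the simulated times; the GH-type and benchmark values are functions of X and `beta_dev` alone.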

Following the same line of reasoning, we recommend using the within-cluster concordance probability when patient data are clustered in clinical trials. Our rationale is that between-trial heterogeneity is not exploitable in clinical practice.

We dispute the authors’ conclusion in a head and neck cancer case study that external validation in a US population confirmed the performance of a prognostic model developed in a European population. Regardless of the GH overall concordance probability estimate based on cluster-specific predictions (0.625), we consider the Uno within-cluster probability estimates the most appropriate indicators of the discriminative ability of the proposed prognostic model. These estimates were significantly lower in the US validation population (mean 0.488) than in the European development population (mean 0.615). Furthermore, these estimates were similar for the frailty model and the Cox model, but varied substantially across clusters, both in the European and in the US population. The difference between the GH estimates of the within-cluster concordance (0.570) and the between-cluster concordance (0.612) reflected substantially stronger heterogeneity between patients from different clusters than between patients within the same cluster.

In conclusion, for external validation in clinical practice, we recommend using nonparametric within-cluster concordance probability estimates (c-index or Uno), without using cluster-specific (conditional) predictions. The use of GH estimates is valuable for benchmark purposes. Between-cluster concordance probability estimates may be useful when between-cluster heterogeneity in case-mix is exploitable for guidance of decision making.
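The distinction between within-cluster, between-cluster, and overall concordance can be made concrete by restricting the pairs that enter a c-index. The sketch below is a minimal illustration with a hypothetical function name; it uses plain Harrell-type pair counting and does not implement the Uno weighting or the pooling approach of Van Oirbeek and Lesaffre [4].

```python
import numpy as np

def clustered_c(lp, time, event, cluster):
    """Harrell-type concordance over within-cluster, between-cluster, and all usable pairs."""
    lp, time = np.asarray(lp, float), np.asarray(time, float)
    event, cluster = np.asarray(event, dtype=bool), np.asarray(cluster)
    # A pair (i, j) is usable when i has the shorter time and i's event was observed.
    usable = (time[:, None] < time[None, :]) & event[:, None]
    same = cluster[:, None] == cluster[None, :]

    def c_for(mask):
        u = usable & mask
        if u.sum() == 0:
            return float("nan")             # no usable pairs of this type
        concordant = (u & (lp[:, None] > lp[None, :])).sum()
        tied = (u & (lp[:, None] == lp[None, :])).sum()
        return float((concordant + 0.5 * tied) / u.sum())

    return {
        "within": c_for(same),
        "between": c_for(~same),
        "overall": c_for(np.ones_like(same)),
    }

# Toy data (made up for illustration): two clusters of two patients, all events observed.
result = clustered_c(
    lp=[2.0, 1.0, 0.5, 0.8],
    time=[1.0, 2.0, 3.0, 4.0],
    event=[True, True, True, True],
    cluster=[0, 0, 1, 1],
)
print(result)
```

In the toy data the predictions order patients correctly across clusters but only in one of the two within-cluster pairs, so the overall value is pulled up by between-cluster pairs, which is the situation the letter cautions about.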

Acknowledgement

This work was supported by the Netherlands Organisation for Scientific Research (grant 917.11.383).

References

1. Mauguen A, Collette S, Pignon J-P, Rondeau V. Concordance measures in shared frailty models: application to clustered data in cancer prognosis. Statistics in Medicine 2013. DOI: 10.1002/sim.5852.
2. Uno H, Cai T, Pencina MJ, D’Agostino RB, Wei LJ. On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Statistics in Medicine 2011; 30:1105–1117.
3. Gönen M, Heller G. Concordance probability and discriminatory power in proportional hazards regression. Biometrika 2005; 92:965–970.
4. Van Oirbeek R, Lesaffre E. An application of Harrell’s C-index to PH frailty models. Statistics in Medicine 2010; 29:3160–3171.
5. Harrell FE, Jr., Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA 1982; 247:2543–2546.
6. Vergouwe Y, Moons KG, Steyerberg EW. External validity of risk models: use of benchmark values to disentangle a case-mix effect from incorrect coefficients. American Journal of Epidemiology 2010; 172:971–980.
7. Altman DG, Royston P. What do we mean by validating a prognostic model? Statistics in Medicine 2000; 19:453–473.
