Accepted 27 January 2014

Published online 18 February 2014 in Wiley Online Library

(wileyonlinelibrary.com) DOI: 10.1002/sim.6114

Assessing calibration of multinomial risk prediction models Kirsten Van Hoorde,a,b Yvonne Vergouwe,c Dirk Timmerman,d,e Sabine Van Huffel,a,b Ewout W. Steyerbergc and Ben Van Calsterd * † Calibration, that is, whether observed outcomes agree with predicted risks, is important when evaluating risk prediction models. For dichotomous outcomes, several tools exist to assess different aspects of model calibration, such as calibration-in-the-large, logistic recalibration, and (non-)parametric calibration plots. We aim to extend these tools to prediction models for polytomous outcomes. We focus on models developed using multinomial logistic regression (MLR): outcome Y with k categories is predicted using k 1 equations comparing each category i (i D 2; : : :; k) with reference category 1 using a set of predictors, resulting in k 1 linear predictors. We propose a multinomial logistic recalibration framework that involves an MLR fit where Y is predicted using the k 1 linear predictors from the prediction model. A non-parametric alternative may use vector splines for the effects of the linear predictors. The parametric and non-parametric frameworks can be used to generate multinomial calibration plots. Further, the parametric framework can be used for the estimation and statistical testing of calibration intercepts and slopes. Two illustrative case studies are presented, one on the diagnosis of malignancy of ovarian tumors and one on residual mass diagnosis in testicular cancer patients treated with cisplatin-based chemotherapy. The risk prediction models were developed on data from 2037 and 544 patients and externally validated on 1107 and 550 patients, respectively. We conclude that calibration tools can be extended to polytomous outcomes. The polytomous calibration plots are particularly informative through the visual summary of the calibration performance. Copyright © 2014 John Wiley & Sons, Ltd. Keywords:

risk prediction model; polytomous diagnosis; calibration; calibration plot

1. Introduction Risk prediction models for diagnostic or prognostic outcomes are useful tools for clinical decision support. Most commonly, a dichotomous outcome is considered. Especially in diagnostic problems, however, a differential diagnosis often includes more levels than categorization of subjects as ‘diseased’ versus ‘non-diseased’. Two important aspects on which to evaluate prediction models are discrimination (how well can the outcome categories be separated) and calibration (how reliable are the predicted risks). For dichotomous outcomes, there are many tools to assess discrimination and calibration [1, 2]. Discrimination is commonly assessed using the c-statistic (or area under the ROC curve). Several aspects of calibration can be evaluated that differ in their level of generality. One option is to consider an overall measure such as calibration-in-the-large, which compares the observed event rate of disease in a dataset with the expected event rate based on the model. A more detailed measure is the Hosmer–Lemeshow (H-L) test a KU

Copyright © 2014 John Wiley & Sons, Ltd.

Statist. Med. 2014, 33 2585–2596

2585

Leuven Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing, and Data Analytics, Leuven, Belgium b KU Leuven iMinds Future Health Department, Leuven, Belgium c Department of Public Health, Erasmus MC, Rotterdam, The Netherlands d KU Leuven Department of Development & Regeneration, Leuven, Belgium e Department of Obstetrics & Gynecology, University Hospitals Leuven, Leuven, Belgium *Correspondence to: Ben Van Calster, Department of Development and Regeneration, KU Leuven, Herestraat 49 Box 7003, 3000 Leuven, Belgium. † E-mail: [email protected]

K. VAN HOORDE ET AL.

statistic. This goodness-of-fit test assesses whether or not the observed event rates match the expected event rates in subgroups (usually based on deciles of predicted risk) of the patient population. The test has several drawbacks such as arbitrariness of subgroup construction, low power, and instability [3–5]. An elegant and better interpretable approach is the logistic recalibration framework that involves a calibration intercept and calibration slope [6, 7]. Calibration can be visualized in a calibration plot to show the relationship between observed proportion and predicted risk [8]. A parametric approach using the logistic recalibration framework can be considered as well as more flexible non-parametric approaches, for example, using loess or spline smoothers [8–10]. Extensions of the c-statistic and the ROC curve have been suggested for nominal and ordinal outcomes [11–16]. But polytomous calibration tools have not yet been suggested, perhaps with the exception of suggestions for a multinomial H-L variant [17–19]. The aim of this work is to extend calibration statistics and calibration plots to prediction models for nominal polytomous outcomes. We focus on models developed using multinomial logistic regression (MLR), which is the typical approach for predicting such outcomes in medical statistics and epidemiology. First, we describe calibration-in-the-large and the logistic recalibration framework for multinomial prediction models. Then, we suggest methods to construct parametric and non-parametric multinomial calibration plots from the recalibration results. Finally, we describe how the recalibration framework can be used for the estimation and statistical testing of calibration intercepts and slopes. These tools are illustrated on two case studies: one on ovarian tumor diagnosis and one on residual mass diagnosis in patients treated with cisplatin-based chemotherapy for testicular cancer.

2. Methods In this section, we discuss existing calibration tools for dichotomous outcomes [1, 8] and their extension to prediction models for nominal outcomes. We do not consider the H-L test statistic because of the reported drawbacks [3, 4, 20]. 2.1. Dichotomous calibration statistics and plots 2.1.1. Calibration measures. Calibration-in-the-large is an overall measure that gives the difference between the observed event rate of disease and the expected event rate obtained as the average predicted probability of disease. This refers to a population-based level of calibration that does not guarantee calibrated risks at the individual patient level, that is, conditional on the covariate patterns. The calibration slope is a second important aspect of calibration, assessing the amount of overfitting of the predicted risks. Both measures can be estimated and tested in the logistic recalibration framework [6]. The outcome is regressed on the linear predictor lp of the prediction model through the logistic regression model log ŒP .Y D 1/=P .Y D 0/ D a C b lp, where b is the calibration slope. Overfitting of the model coefficients results in b < 1. The intercept for a calibration slope of 1, that is, ajb D 1, indicates a systematic difference in observed outcome versus predicted outcome, that is, calibration-in-the-large. When risks are perfectly calibrated, the calibration intercept is 0, and the calibration slope is 1. The logistic recalibration framework allows for a joint likelihood ratio chi-squared test of the calibration intercept and slope (df D 2) [7]. The test statistic equals twice the difference in log likelihood between the model without and the model with intercept and slope adaptation. If significant, separate tests for the calibration intercept in case of a calibration slope of one (i.e., ajb D 1/ and calibration slope b can be considered [7].

2586

2.1.2. Dichotomous calibration plots. Dichotomous calibration plots visualize the relation between predicted probability (x-axis) and observed proportion of patients with the outcome (y-axis). This relation can be estimated using the logistic recalibration framework, hence assuming linearity on the level of the linear predictor. The results of the recalibrated model can be used as observed proportions in a parametric calibration plot. This plot visualizes whether predicted probabilities are overestimated or underestimated, too extreme (overfitted) or underfitted, or a combination (Supporting information Figure S1). We may show the probabilities of both outcome categories, P (outcome D 0) and P (outcome D 1), which makes an introduction to a potential polytomous plot where we consider all k outcome categories instead of k 1 categories. A more general non-parametric approach considers the model a0 C b 0 s.lp/ with s.:/ estimated by, for example, a loess or spline smoother [21]. Copyright © 2014 John Wiley & Sons, Ltd.

Statist. Med. 2014, 33 2585–2596

K. VAN HOORDE ET AL.

2.2. Extensions to prediction models for nominal outcomes To extend calibration measures and plots to prediction models for nominal outcomes, we focus on the MLR model of outcome Y with k categories, also known as the baseline category logit model [22]. The MLR model consists of k 1 submodels: 8 h i n P P .Y D2/ ˆ ˆ .ˇm;2 Xm / D lp2;1 log D ˇ0 ;2 C ˆ P .Y D1/ ˆ ˆ mD1 ˆ ˆ h i n ˆ P < .Y D3/ .ˇm; 3 Xm / D lp3;1 log P C D ˇ 0 ;3 P .Y D1/ (1) mD1 ˆ ˆ ˆ ::: ˆ ˆ h i n ˆ P ˆ P .Y Dk/ ˆ :log ˇm; k Xm D lp D ˇ0 ;k C P .Y D1/

k;1

mD1

with n the number of considered predictors (m D 1; : : :; n). The choice of reference category (i.e., i D 1) is important for the interpretation of the regression coefficients ˇm;i but does not affect the predicted probabilities. We denote the linear predictor for category i versus category 1 as lpi;1 . It is straightforward to investigate calibration-in-the-large for MLR models by assessing the differences of the observed and expected (i.e., average predicted probability) event rate for each outcome category. Perhaps the most pragmatic and straightforward approach to assess calibration beyond calibration-in-the-large is to use dichotomous calibration tools to independently investigate each of the k categories. In this way, the calibration of the probability of every category i (i D 1; : : :; k) would be assessed relative to the probability of ‘not i’: P .Y D i/ pOi log D ai +bi log 1 P .Y D i/ 1 pOi with pOi as the estimated probability of category i obtained from the MLR model. This method is flawed as, contrary to the recalibration framework proposed in the next section, this pragmatic approach results in imperfect calibration for the development data. In addition, the result of the recalibration for each category does not sum to one for each patient, such that normalization would be required. Therefore, we searched for an alternative approach, which is unfortunately more complicated. We first describe the methodology of the proposed approach and further clarify using a worked-out example with k D 3. 2.2.1. Nominal recalibration framework. We propose a nominal logistic recalibration framework where the outcome is regressed on the linear predictors derived from the prediction model through an MLR fit with the following k 1 submodels: 8 h i k P ˆ P .Y D2/ ˆ b2;j lpj;1 C D a log ˆ 2 P .Y D1/ ˆ ˆ j D2 ˆ ˆ ˆ h i k ˆ P < .Y D3/ b3;j lpj;1 log P D a3 C P .Y D1/ (2) j D2 ˆ ˆ ˆ : : : ˆ ˆ ˆ i h k ˆ P ˆ .Y Dk/ ˆ bk;j lpj;1 D ak C :log P P .Y D1/ j D2

j D2;j ¤k

Copyright © 2014 John Wiley & Sons, Ltd.

Statist. Med. 2014, 33 2585–2596

2587

with lpj;1 the linear predictor for category j versus 1 of the MLR prediction model (Equation (1)). Note that the reference category in the recalibration fit does not have to be the same as the reference category in Equation (1). We can split up the lpj;1 into what we call ‘corresponding’ linear predictors when lpj;1 D lpi;1 and ‘non-corresponding’ linear predictors when lpj;1 ¤ lpi;1 : 8 i h k P ˆ P .Y D2/ ˆ b2;j lpj;1 ˆ log P .Y D1/ D a2 C b2;2 lp2;1 C ˆ ˆ ˆ j D2;j ¤2 ˆ ˆ ˆ h i ˆ k P < .Y D3/ b3;j lpj;1 log P D a3 C b3;3 lp3;1 C P .Y D1/ (3) j D2;j ¤3 ˆ ˆ ˆ ::: ˆ ˆ ˆ i h ˆ k ˆ ˆlog P .Y Dk/ D a C b lp C P b lp ˆ : k k;k k;j k;1 j;1 P .Y D1/

K. VAN HOORDE ET AL.

In case of perfect calibration, the intercepts ai are zero, the corresponding slopes bi;i are one, and the non-corresponding slopes bi;j Ij ¤i are zero. 2.2.2. Nominal recalibration framework: worked-out example. Consider a prediction model with k D 3 and two predictors X1 and X2 . The prediction model is written as 8 h i P .Y D2/ ˆ ˆ