Commentary

Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/sim.6212

Calibration of models is not sufficient to justify NRI

Thomas A. Gerds*† and Jørgen Hilden

We thank Leening, Steyerberg, van Calster and Pencina for their comments on our note [1], in which we show that fundamental concerns and mathematical analysis lead us to reject IDI and NRI. We are surprised that, despite our results, the commentators continue to promote the use of NRI and IDI. Their main point, it seems, is this: when calibration is kept in check, the chief drawbacks of NRI and IDI are neutralized. They proceed to write a basic, but far from elementary, mini-textbook on the calibration concept, citing important key papers. We could have done the same, as is natural in view of a common root: of the commentators, the first three represent the Erasmus University branch of a long-standing Rotterdam–Copenhagen collaboration, which began in 1976 around Dik Habbema and Hilden. Notably, it had two items on its main agenda: the importance of calibration (trustworthiness) [2, 3] and the desirability of Proper Scoring [4] in the development of prediction models. Bernardo & Smith [5] state this more broadly, on page 71: it would be reasonable that, in a scientific inference context, one should require a score function to be proper.

Our elaboration in [1] shows that NRI and IDI are not proper. Now, the commentators seem to hope that NRI and IDI are proper when applied in a calibrated context. This does not seem to be the case, at least not for the NRI. Even if the two models being compared are well calibrated, NRI can be false positive. In Table 7 of reference 18 of the commentary, Pepe & Janes show that the discretized NRI can be false positive in a situation where both models have the same prediction performance, including calibration. If that example is not convincing (after all, so many things can happen with discretization), consider the standard continuous NRI. Here we can refer to Table 3 of our note [1], where situations with spuriously positive NRI values are constructed.
One may imagine a series of further fake markers that correspond to ε becoming arbitrarily small. For example, when ε = 0.000001, the corresponding constructed model does not change the original risk prediction substantially. Hence, if the original model is calibrated, then so is the artificial model. Yet, NRI has the same spuriously positive value as was seen with the other choices of ε. This implies that it is not safe to apply NRI, not even in calibrated contexts. It is not so clear whether IDI will still have problems when both models are calibrated.

Anyhow, the commentary bypasses a decisive practical aspect of their proposal: in practice it is not so easy to judge whether a model is sufficiently well calibrated. Calibration is not a single number, at least not if calibration in a strong sense is desired. Assessment of calibration is essentially as hard as estimation of a density. Even the plot that checks crude calibration is as hard to judge as a normal probability plot, and a non-significant Hosmer–Lemeshow statistic, or the application of shrinkage, may not be sufficient to guarantee that IDI becomes a practically proper measure of added predictive ability.

Let us turn to statistical properties. It has been shown [6] that adding a random noise predictor to a correctly specified logistic regression model results in highly variable and spuriously positive NRI. These results also clarify another issue raised in Section 3.1 of the commentary: even though miscalibration due to overfitting is not expected to be pronounced at adequate sample sizes, the NRI provides systematically misleading figures. Other recent research is similarly critical [7, 8].

The commentators spend some time on our initial small example, complaining that even Proper Scoring Rules seem to produce paradoxes. The purported paradox is that the Brier score does not always agree with the AUC (area under the ROC curve).
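As an aside, the noise-predictor result of [6] cited above is easy to reproduce in outline. The following is a minimal sketch under assumptions of our own choosing (a single standard-normal covariate, a pure-noise candidate marker, in-sample evaluation, and a small Newton–Raphson fitter), not the exact simulation design of [6]:

```python
import numpy as np

def fit_logistic(X, y, iters=15):
    """Maximum-likelihood logistic regression via Newton-Raphson;
    returns the in-sample predicted risks."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - p))
    return 1.0 / (1.0 + np.exp(-X @ beta))

def continuous_nri(p_old, p_new, y):
    """Category-free NRI: net upward movement among events plus
    net downward movement among non-events."""
    up, ev, ne = p_new > p_old, y == 1, y == 0
    return (up[ev].mean() - (~up)[ev].mean()) + ((~up)[ne].mean() - up[ne].mean())

rng = np.random.default_rng(1)
nris = []
for _ in range(200):
    n = 200
    x = rng.normal(size=n)                               # true predictor
    y = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(float)
    z = rng.normal(size=n)                               # pure-noise "new marker"
    p_old = fit_logistic(np.column_stack([np.ones(n), x]), y)
    p_new = fit_logistic(np.column_stack([np.ones(n), x, z]), y)
    nris.append(continuous_nri(p_old, p_new, y))

print(f"average in-sample NRI earned by pure noise: {np.mean(nris):+.3f}")
```

Since the old model is correctly specified, the noise marker carries no information, and a sensible measure of added predictive ability should center at zero; in line with [6], the in-sample continuous NRI nevertheless comes out positive on average.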
This is not surprising: the Brier score can see the difference between a bad model, which ignores the covariates and predicts 50% for everyone (Brier = 25%), and an ugly model, which predicts a random number between 0 and 1 (Brier = 33%), whereas under the AUC all of these models receive the lowest possible value. Brier scores above 25% can only arise in case of poor calibration, as illustrated here as well as by the score of 31% calculated in their example. However, we do not share the opinion of the commentators that this could lead to confusion, because, as they point out themselves, the situation can be cured by focusing on well-calibrated models; any model that performs worse than the null model, which ignores all covariates, is ruled out a priori.

In our note [1] we produced inflated IDI and inflated NRI by artificially shifting high risks to 1 and low risks to 0 (extreme overconfidence). The commentators add that one can produce the same effect simply by artificially doubling all risks, and that no "simulations and mathematical development are required". But this surprising insight is not entirely correct: doubling all risks leaves NRI = 0; it is only the IDI that comes out false positive. And can one really verify that without mathematics? We believe it does require some 10 lines of algebra.

The commentary concludes that "Like most summary statistics, NRI and IDI should not be interpreted on their own, but combined with other metrics, including calibration tests and decision analysis." We would argue that the burden to show proof of concept is theirs: can one devise a combination of metrics, with NRI or IDI as a non-redundant component, such that its application can be shown not to invite false recommendation of new markers?

Finally, the commentators ask whether we need to mistrust the conclusions in the 1000 papers citing NRI and IDI. The answer is yes: we do not mistrust the many authors, but we do mistrust their conclusions. Someone should summon the authors and urge them to check that their results do not hinge on a positive NRI or IDI.

*Correspondence to: Thomas A. Gerds, Department of Biostatistics, University of Copenhagen, Denmark.
†E-mail: [email protected]
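The numerical claims in this passage — the Brier scores of the 50%-for-everyone model and of the random-number model, and the effect of doubling all risks on NRI and IDI — can be checked in a few lines. A minimal sketch under our own assumptions (a calibrated model whose risks are drawn uniformly below one half, so that doubling stays within [0, 1]; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# A calibrated "old" model: risks drawn in (0.05, 0.45) and outcomes
# drawn from exactly those risks, so predicted risk equals true risk.
p_old = rng.uniform(0.05, 0.45, n)
y = (rng.random(n) < p_old).astype(float)

# Claim 1: predicting 50% for everyone gives Brier = 25% exactly,
# while predicting a random number in (0, 1) gives about 33%.
brier_null = np.mean((0.5 - y) ** 2)
brier_rand = np.mean((rng.random(n) - y) ** 2)

# Claim 2: doubling all risks moves every prediction upward, so the
# continuous NRI is exactly zero, while the IDI (the change in the
# discrimination slope) comes out spuriously positive.
p_new = 2 * p_old
ev, ne = (y == 1), (y == 0)
up = p_new > p_old                       # True for everyone
nri = (up[ev].mean() - (~up)[ev].mean()) + ((~up)[ne].mean() - up[ne].mean())
idi = ((p_new[ev].mean() - p_new[ne].mean())
       - (p_old[ev].mean() - p_old[ne].mean()))

print(f"Brier(50% model) = {brier_null:.3f}, Brier(random model) = {brier_rand:.3f}")
print(f"continuous NRI after doubling risks = {nri:+.3f}, IDI = {idi:+.3f}")
```

The NRI cancels because every subject, event or not, is reclassified upward; the IDI inherits the (positive) discrimination slope of the old model and is therefore false positive, exactly as the algebra shows.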

References

1. Hilden J, Gerds TA. A note on the evaluation of novel biomarkers: do not rely on integrated discrimination improvement and net reclassification index. Statistics in Medicine 2013. DOI: 10.1002/sim.5804.
2. Hilden J, Habbema JDF, Bjerregaard B. The measurement of performance in probabilistic diagnosis. II. Trustworthiness of the exact values of diagnostic probabilities. Methods of Information in Medicine 1978; 17:227–237.
3. Habbema JDF, Hilden J, Bjerregaard B. The measurement of performance in probabilistic diagnosis. V. General recommendations. Methods of Information in Medicine 1981; 20:97–100.
4. Hilden J, Habbema JDF, Bjerregaard B. The measurement of performance in probabilistic diagnosis. III. Methods based on continuous functions of the diagnostic probabilities. Methods of Information in Medicine 1978; 17:238–246.
5. Bernardo JM, Smith AFM. Bayesian Theory. John Wiley & Sons: Chichester, 2009.
6. Pepe M, Fang J, Feng Z, Gerds T, Hilden J. The net reclassification index (NRI): a misleading measure of prediction improvement with miscalibrated or overfit models. Technical Report, Bepress, 2014.
7. Kerr KF, Wang Z, Janes H, McClelland R, Psaty BM, Pepe MS. Net reclassification indices for evaluating risk prediction instruments: a critical review. Epidemiology 2014; 25:114–121.
8. Hilden J. Commentary: On NRI, IDI, and "good-looking" statistics with nothing underneath. Epidemiology 2014; 25(2):265–267.

Copyright © 2014 John Wiley & Sons, Ltd.

Statist. Med. 2014, 33:3419–3420
