Intensive Care Med DOI 10.1007/s00134-014-3227-6

José Labarère · Renaud Bertrand · Michael J. Fine

Received: 4 November 2013 / Accepted: 21 January 2014
© Springer-Verlag Berlin Heidelberg and ESICM 2014

STATISTICS FOR INTENSIVISTS

How to derive and validate clinical prediction models for use in intensive care medicine

J. Labarère (✉)
UQEM, Pavillon Taillefer, CHU BP217, Grenoble 38043, Grenoble Cedex 9, France
e-mail: [email protected]; [email protected]
Tel.: +33-4-76768767
Fax: +33-4-76768831

J. Labarère
Quality of Care Unit, University Hospital, Grenoble 38043, France

J. Labarère
TIMC UMR 5525 CNRS, Université Joseph Fourier–Grenoble 1, Grenoble, France

R. Bertrand
Emergency Department, Cochin and Hôtel-Dieu Hospitals, Assistance Publique-Hôpitaux de Paris (AP-HP), Paris, France

R. Bertrand
Faculté de Médecine Paris Descartes, Paris, France

M. J. Fine
Veterans Affairs Center for Health Equity and Research Promotion, VA Pittsburgh Healthcare System, Pittsburgh, PA, USA

M. J. Fine
Division of General Internal Medicine, University of Pittsburgh Medical Center, Pittsburgh, PA, USA

Abstract

Background: Clinical prediction models are formal combinations of historical, physical examination and laboratory or radiographic test data elements designed to accurately estimate the probability that a specific illness is present (diagnostic model), will respond to a form of treatment (therapeutic model) or will have a well-defined outcome (prognostic model) in an individual patient. They are derived and validated using empirical data and are used to assist physicians in their clinical decision-making that requires a quantitative assessment of diagnostic, therapeutic or prognostic probabilities at the bedside.

Purpose: To provide intensivists with a comprehensive overview of the empirical development and testing phases that a clinical prediction model must satisfy before its implementation into clinical practice.

Results: The development of a clinical prediction model encompasses three consecutive phases, namely derivation, (external) validation and impact analysis. The derivation phase consists of building a multivariable model, estimating its apparent predictive performance in terms of both calibration and discrimination, and assessing the potential for statistical over-fitting using internal validation techniques (i.e. split-sampling, cross-validation or bootstrapping). External validation consists of testing the predictive performance of a model by assessing its calibration and discrimination in different but plausibly related patients. Impact analysis involves comparative research [i.e. (cluster) randomized trials] to determine whether clinical use of a prediction model affects physician practices, patient outcomes or the cost of healthcare delivery.

Conclusions: This narrative review introduces a checklist of 19 items designed to help intensivists develop and transparently report valid clinical prediction models.

Keywords: Clinical prediction models · Clinical decision rule · Prognosis · Severity of illness index · Intensive care

Introduction

Clinical prediction models, also referred to as clinical prediction scores or rules, are mathematical tools derived from original research and primarily intended to assist physicians in their clinical decision-making at the bedside [1–3]. They typically combine three or more predictors, including patient demographics, history and physical examination findings, and/or basic laboratory and radiographic test results, to accurately estimate the probability that a certain illness is present (diagnostic model), will respond to a form of treatment (therapeutic model) or will have a well-defined clinical outcome (prognostic model) in an individual patient [1]. Examples of generic prognostic models that are widely used within the field of adult intensive care medicine include the Acute Physiology and Chronic Health Evaluation (APACHE) II, APACHE III and APACHE IV, the Simplified Acute Physiology Score (SAPS) II and SAPS 3, and the Mortality Probability Model III [4].

A subtle but important difference exists between clinical prediction models and decision rules [5]. Clinical prediction models are assistive and generate probabilities without recommending decisions: the interpretation of predicted probabilities is left to the discretion of clinicians [5–7]. Although debatable, it is assumed that accurate predictions will improve clinical decisions [5]. In contrast, clinical decision rules are directive and suggest a specific course of action depending on the predicted probability values generated by a model [5–7].

The establishment of a clinical prediction model encompasses three consecutive research phases [3, 8], namely derivation [9], external validation [10] and impact analysis [6] (Fig. 1). Methodological standards for the derivation and validation of clinical prediction models were originally described by Wasson et al. [11] and have more recently been updated [1, 12–14]. Although clinical prediction models have been derived for various conditions and applications in intensive care medicine, recommendations for their development in this setting are currently lacking.

The aim of this paper is to provide intensivists with a methodological framework for conducting the derivation, validation and impact analysis of clinical prediction models. The reader may find a more thorough discussion of the many issues to consider when developing a clinical prediction model in Steyerberg's textbook [7]. Empirical examples were taken from the literature on clinical prediction modelling of early admission to the intensive care unit (ICU) for patients with community-acquired pneumonia (CAP) [15, 16] ("Appendix 1"). Although this paper focuses on clinical prediction models for pneumonia prognosis, most of the issues addressed apply equally to diagnostic prediction models and to models focused on other clinical conditions.

Derivation of a clinical prediction model

The first phase in the development of a clinical prediction model involves deriving a multivariable model, estimating its predictive performance characteristics and quantifying the potential for over-fitting and optimism in performance using internal validation techniques (Fig. 1; [13]). It requires collecting appropriate data, defining the outcome of interest, selecting relevant predictors from a larger set of candidate predictors and combining the selected predictors into a prediction score.

Study design

The appropriate source of data for developing a clinical prediction model depends on whether the model is intended to predict prognostic or diagnostic probabilities. Data for developing prognostic prediction models come from cohorts of patients who are followed over time for the outcome of interest to occur [2, 7, 13]. In contrast, data for developing diagnostic prediction models usually come from cross-sectional studies, which relate candidate predictors to a reference standard for the target condition [17]. Ideally, prospective data collection using a case report form specifically designed for model development provides optimal documentation of candidate predictors and the outcome of interest [2]. Because randomized controlled trials are a special case of prospective cohort studies, they may be suitable for developing prognostic prediction models [2]. However, prediction models derived from trial data may have limited generalizability because of more stringent eligibility criteria [7]. Whether it is preferable to derive a clinical prediction model using data from a trial's control group only (rather than combining data from the intervention and control groups) continues to be debated [7]; including the randomized treatment variable in the multivariable analyses used to derive the model may help overcome this problem [2]. Although databases assembled for other purposes often provide suboptimal data quality for developing clinical prediction models [1, 14], they may also have some advantages, particularly if they are large in size and rich in clinical data elements or if the duration of follow-up is long. Only unselected consecutive patients or random samples will produce unbiased risk estimates and ensure representativeness of the target population for whom the clinical prediction model is intended [18, 19]. Convenience samples of patients and case–control designs are not suitable for developing prediction models because they do not allow for estimating absolute risks [2].

Fig. 1 Consecutive research phases for the establishment of a clinical prediction model. I Derivation and internal validation, II External validation, III Impact analysis

Outcomes of interest

The outcome predicted by the model should be clinically relevant and clearly defined in terms of timing and methods of ascertainment [11]. "Hard" outcomes, such as all-cause mortality, are preferred, although statistical power considerations may motivate the use of composite or even surrogate outcomes [7]. Of note, ICU mortality may be a misleading outcome because the patient's death may occur after transfer to a non-ICU ward. To minimize the potential for assessment bias, the outcome should be measured without knowledge of the candidate predictor variables. Such blinding is critical when the outcome assessment requires observer interpretation, whereas ascertainment of more objective outcomes is less prone to bias [1].

Candidate predictors

Candidate predictors are the independent variables that are considered for inclusion in the multivariable analysis of the outcome of interest. Theoretically, all baseline variables believed to be associated with the outcome may be considered as candidate predictors [13], and omission of important candidate predictors may lead to model misspecification. In practice, investigators tend to rely on variables that are routinely available and not too costly to obtain [7]. Importantly, candidate predictors should include only variables that will be available at the time when the model is intended to be used [2]. Previous studies have shown that clinical intuition may not be suitable for identifying candidate predictors [19]. A better approach is to combine a systematic literature review of prognostic factors associated with the outcome of interest with the opinions of field experts [17].


To ensure generalizability, candidate predictors should be clearly defined and reliably measured [2]. Definitions should be in line with daily practice to allow routine use of the prediction model [7]. To avoid ascertainment bias, the individuals who record the candidate predictors should be blinded to the outcome, as in prospective cohort studies [1]. Inter-rater reliability can be quantified with kappa statistics or intra-class correlation coefficients, where appropriate [20].

Continuous predictors are often converted into categorical variables using clinically meaningful thresholds that are used in routine medical practice and easily remembered by physicians [21, 22]. Yet the simplicity of categorizing a continuous predictor comes at the cost of lost information [23], and this loss is greatest when the predictor is dichotomized [23]. Indeed, dichotomizing a continuous predictor assumes a constant risk up to the threshold and then a different constant risk for all values beyond the threshold ("Appendix 2") [18]. In the absence of a prespecified threshold, it is common to select an "optimal" cutoff point that minimizes the P value relating the predictor to the outcome; this may cause considerable inflation of the type I error probability and should be avoided [23, 24]. If graphical inspection is used to specify the functional form of the relationship or optimal cutpoints for continuous predictors, the resulting prognostic model will be optimal for the development data set but may not validate well in new subjects [25]. In practice, a linear function is often acceptable for modelling continuous predictors, and various authors have advised against categorization [26]. Because the data may require relaxing the linearity assumption [9], it is recommended to check for possible nonlinearity using fractional polynomials [26]. Simple transformations of linear terms, as well as more complex functions such as multivariable fractional polynomials or splines, can be useful for nonlinear modelling of a continuous predictor [7, 27].
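To make the alternatives to categorization concrete, the sketch below compares a dichotomized predictor, a linear term and a flexible spline transformation within a logistic model on simulated data; the variable names, simulated relationship and spline degrees of freedom are illustrative assumptions, not part of the studies cited above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated derivation cohort (illustrative only): a continuous candidate
# predictor and a binary outcome with a mildly nonlinear logit.
rng = np.random.default_rng(42)
x = rng.normal(100, 20, 3000)
true_logit = -2.0 + 0.015 * (x - 100) + 0.0004 * (x - 100) ** 2
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))
df = pd.DataFrame({"outcome": y, "x": x})

# Three ways of handling the continuous predictor in a logistic model:
m_dichotomized = smf.logit("outcome ~ I(x >= 100)", data=df).fit(disp=0)
m_linear = smf.logit("outcome ~ x", data=df).fit(disp=0)
m_spline = smf.logit("outcome ~ bs(x, df=4)", data=df).fit(disp=0)  # patsy B-spline basis

# Lower AIC indicates a better fit penalized for complexity; a spline (or
# fractional-polynomial) term typically recovers the nonlinearity that
# dichotomization discards.
for name, m in [("dichotomized", m_dichotomized), ("linear", m_linear), ("spline", m_spline)]:
    print(f"{name:>13}: AIC = {m.aic:.1f}")
```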

Missing values

Despite investigator efforts to ensure data quality, missing values occur in almost all medical studies and may involve predictor as well as outcome variables [28]. Different mechanisms for missing data coexist ("Appendix 3"), and dealing with missing values poses important challenges in the development of clinical prediction models [7]. To ensure transparency, the completeness of data should be reported for each variable separately and overall for observations [18].

Standard statistical packages perform case-wise deletion, which discards all observations with any missing value [27]. When the number of incomplete observations is large, this leads to a decreased sample size with a major loss of statistical power [29]. Unless data elements are missing completely at random, case-wise deletion may also lead to biased parameter estimates [27]. Alternate approaches include creating an indicator variable for missing predictor values [30], replacing missing values by the overall mean of the variable [7] and assuming that unknown predictor values are normal [21, 22]. The limitations of these simple approaches support the use of multiple imputation ("Appendix 4"), which can be applied equally to predictor and outcome variables [31, 32]. Multiple imputation is a simulation-based statistical technique whereby missing values are replaced by plausible values predicted from the individual's available data [7, 33]. Imputation methods assume that data are missing at random, although this assumption is generally unverifiable.
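The following sketch illustrates one way to approximate multiple imputation in Python, running scikit-learn's IterativeImputer several times with posterior sampling and pooling the resulting coefficient estimates; the simulated data, variable names and simplified pooling (point estimates only, without Rubin's between-imputation variance) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

# Illustrative data: two correlated predictors, one with 30 % of values missing.
rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 0.8 * x1 + 0.5 * x2))))
x2[rng.random(n) < 0.3] = np.nan
X = pd.DataFrame({"x1": x1, "x2": x2})

# Create several completed data sets by drawing imputations from the posterior
# predictive distribution, fit the model on each, and pool the estimates.
coefs = []
for m in range(10):
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    X_imp = imputer.fit_transform(X)
    model = LogisticRegression(max_iter=1000).fit(X_imp, y)
    coefs.append(model.coef_[0])

pooled = np.mean(coefs, axis=0)  # pooled point estimates (variance pooling omitted here)
print(dict(zip(X.columns, pooled.round(3))))
```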
Sample size

The sample size required for developing a clinical prediction model cannot be estimated straightforwardly because of the multivariable nature of modelling [2]. In fact, the effective sample size is determined by the number of outcome events and is therefore much smaller than indicated by the overall number of patients [3, 17]. Indeed, the potential for over-fitting is real when there are too few outcome events relative to the number of predictors included in the multivariable analysis [34]. Over-fitting means that the model fits the development data set too closely and is likely to perform less well in new but comparable individuals [7]. As a rule of thumb, supported in part by simulation studies, there should be a minimum of ten outcome events per candidate predictor considered for inclusion in a multivariable logistic regression model [35, 36]. This number is the number of candidate predictors screened for association with the outcome event rather than the number of predictors selected in the final model [27], and it includes all degrees of freedom (including nonlinear terms for continuous predictors) for the candidate predictors. Interestingly, more recent studies suggest that the data structure (i.e. regression coefficient size and correlations between predictors) influences the performance of prespecified logistic models beyond the number of events per predictor, and logistic regression modelling may pose substantial problems even if this number exceeds 10 [37].

Statistical methods

Common statistical methods used to derive clinical prediction models include multivariable linear, logistic and Cox regression, and other approaches such as classification tree analysis [11]. The continuous, binary or survival nature of the outcome determines the appropriate class of multivariable regression model [27]. Classification tree analysis is a nonparametric method that constructs a binary tree by repeatedly partitioning a data set into two descendant subsets, using the candidate predictor that best discriminates between individuals with and without the outcome of interest [38, 39]. Classification tree analysis is well suited for capturing complex interactions between predictors and for deriving prediction rules with high sensitivity [1, 14], such as the Ottawa ankle rule [8]. Importantly, multivariable regression model assumptions need to be assessed, including the additivity assumption, which can be evaluated with a limited number of interaction terms that are a priori plausible [27].
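As a concrete illustration of these two families of methods, the sketch below fits a multivariable logistic regression and a classification tree to the same simulated binary outcome; the data, predictors and tree settings are hypothetical choices, not taken from any published model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

# Simulated cohort with three candidate predictors and a binary outcome.
rng = np.random.default_rng(7)
n = 2000
X = rng.normal(size=(n, 3))
logit = -2.0 + 1.2 * X[:, 0] - 0.8 * X[:, 1] + 0.5 * X[:, 0] * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Multivariable logistic regression: additive effects on the log-odds scale.
logistic = LogisticRegression(max_iter=1000).fit(X, y)
print("logistic coefficients:", logistic.coef_[0].round(2))

# Classification tree: recursive binary partitioning, which can reflect the
# interaction between the first two predictors without it being prespecified.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2", "x3"]))
```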

Selecting predictors

The main reason for limiting the number of predictors comprising a model is that parsimonious prediction models are easier to use and to interpret in routine practice [6]. Moreover, some candidate predictors may have very small or implausible independent effects. Finally, the problem of over-fitting is real when the number of candidate predictors does not match the size of the derivation study sample.

Ideally, predictors are selected before modelling, without investigating their association with the outcome of interest in the derivation sample [7]. When subject matter knowledge is not available, data-driven methods are required to select candidate predictors [26]. Candidate predictors with large numbers of missing values or skewed distributions in the development sample can be omitted while still remaining blinded to their relationship with the outcome [7]. Related candidate predictors may be combined on the basis of subject knowledge, prespecified scores or multivariate statistical techniques (e.g. cluster analysis, principal component analysis) [27]. This should be distinguished from examining the predictor–outcome relationship, either informally (inspection of graphs or cross-classification tables) or formally (univariable selection based on P values) [27]. Although commonly reported, testing the univariable association between each candidate predictor and the outcome of interest at a nominal prespecified significance level is not recommended as a criterion for inclusion in a clinical prediction model. Indeed, this approach may incorrectly reject potentially important predictors in the presence of confounding factors [40] and may introduce multiple comparison problems leading to optimistic models [35].

Selecting predictors while modelling is based on backward elimination or forward selection procedures, which are implemented in most statistical packages [41]. In a backward elimination approach, the least significant candidate predictors are sequentially removed from the full non-parsimonious multivariable model, using a nominal prespecified significance level for exclusion [27]. Conversely, in a forward selection approach, the model is sequentially built up from the most significant candidate predictors [13, 27]. Backward and forward approaches may be combined with stepwise selection, a method that allows dropping or adding predictors at each step; in a backward approach, stepwise selection involves removing candidate predictors from the full non-parsimonious multivariable model and then potentially adding back predictors if they later appear to be significant. Backward elimination is preferred to forward selection because the former assesses all candidate predictors simultaneously and performs better when predictors are highly correlated with each other [13, 27]. The choice of significance level has a major effect on the number of predictors selected; use of lower significance levels generates models with fewer predictors, with the trade-off of potentially missing important predictors [9]. Although objective and reproducible, automated selection procedures tend to yield biased model coefficient estimates and optimistic model performance as a result of over-fitting [7, 9, 27].

It may be reasonable to force a nonsignificant predictor into a multivariable model when its prognostic value is supported by prior research [7]. Indeed, nonsignificance does not constitute evidence for the absence of a predictor effect, especially in studies with a limited sample size [7]. Omitting an important, well-known predictor would result in biased estimated regression coefficients and predictions. Conversely, simulation studies suggest that inclusion of random covariates may have a limited impact on model performance [7].
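A minimal backward-elimination loop of the kind described above can be written directly with statsmodels; the simulated data, variable names and significance threshold below are illustrative, and in practice any such automated selection should be followed by internal validation to quantify the resulting optimism.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated derivation data: five candidate predictors, two of them truly null.
rng = np.random.default_rng(3)
n = 1500
X = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"x{i}" for i in range(1, 6)])
lp = -1.5 + 0.9 * X["x1"] + 0.6 * X["x2"] + 0.4 * X["x3"]
y = rng.binomial(1, 1 / (1 + np.exp(-lp)))

def backward_elimination(X, y, alpha=0.10):
    """Drop the least significant candidate predictor until all P values are below alpha."""
    kept = list(X.columns)
    while kept:
        fit = sm.Logit(y, sm.add_constant(X[kept])).fit(disp=0)
        pvalues = fit.pvalues.drop("const")
        worst = pvalues.idxmax()
        if pvalues[worst] < alpha:
            return fit, kept
        kept.remove(worst)  # remove the least significant candidate and refit
    return None, kept

final_fit, selected = backward_elimination(X, y)
print("selected predictors:", selected)
```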

Presentation format

Presentation is a key factor for successful implementation of a clinical prediction model [3]. For transparency purposes, it is recommended to report the regression coefficient for each predictor in the final model, along with the intercept for a logistic regression model [13]. Several presentation formats are possible for clinical prediction models, including a regression formula, score chart or nomogram [7]. However, very simple presentations inevitably contribute to deterioration of clinical prediction model performance. Computerized presentation and web-based technologies may facilitate the implementation of complex clinical prediction models [3]. Clinical prediction models may be simplified for routine use in clinical practice by multiplying regression coefficients by ten and rounding them to integers that are easy to remember and to add [7, 42] ("Appendix 1").

Predictive performance

Measures to assess model predictive performance include overall, discrimination and calibration measures [7, 27, 43]. Overall model performance can be quantified using various pseudo-R2 measures, which are estimates of explained variation calculated on the log-likelihood scale, ranging from 0 to 1, with higher values indicating better fit. The Brier score is another overall performance measure, which quantifies the average squared difference between the observed outcome and the predicted probability [44]. The Brier score ranges from 0 for a perfect model to a maximum value that depends on the incidence of the outcome for a noninformative model [7].

Discrimination refers to the ability of a model to distinguish individuals with and without the outcome of interest [43]. It can be quantified by the concordance (c) statistic, which is identical to the area under the receiver operating characteristic curve for logistic models [27]. The c statistic can be interpreted as the chance that a patient with the outcome of interest is assigned a higher probability of the outcome by the clinical prediction model than a randomly chosen patient without the outcome [7]. A c statistic value of 0.5 indicates that the model does not perform better than random prediction, whereas a value of 1 indicates perfect discrimination between individuals with and without the outcome [27]. The discrimination slope is another simple measure of how well the clinical prediction model separates individuals with and without the outcome of interest [43]; it is calculated as the absolute difference in the mean predicted probabilities for individuals with and without the outcome [7].

Calibration refers to the agreement between the outcome probability predicted by the model and the observed outcome frequency [13]. It can be investigated graphically by plotting the observed outcome frequencies against the predicted outcome probabilities for subjects grouped by quantiles of predicted probabilities (Fig. 2; [7, 9]). A calibration plot on the 45° line denotes perfect agreement between the observed frequency of the outcome and the predicted probabilities over the whole range of probabilities [9].

Fig. 2 Calibration plot for the Risk of Early Admission to the Intensive Care Unit prediction model in the external validation sample (n = 850)

Yet the visual impression of calibration may be influenced by the choice of quantiles, with higher variability observed for smaller group sizes [7]. Calibration can be formally assessed by modelling a regression line with intercept (a) and slope (b) [7]. These parameters can be estimated in a logistic regression model with the observed outcome as the dependent variable and the linear predictor as the only independent variable; the intercept is 0 and the calibration slope is 1 for well-calibrated models. The unreliability (U) statistic is a joint test of the calibration intercept and slope, testing the null hypothesis a = 0 and b = 1 [7]. The calibration intercept is relevant only for external validation, because the average predicted probability corresponds to the observed frequency of the event of interest in the development sample [7, 45]. Although commonly used, the Hosmer and Lemeshow [41] goodness-of-fit test may lack statistical power to reject poor calibration [9, 13].
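The discrimination and calibration measures described above can be computed in a few lines; the sketch below assumes arrays y (observed binary outcomes) and p (predicted probabilities from some model), which are illustrative names, and demonstrates the calculations on deliberately overconfident simulated predictions.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score, brier_score_loss

def performance(y, p):
    """c statistic, Brier score, discrimination slope, calibration intercept and slope."""
    y = np.asarray(y)
    p = np.clip(np.asarray(p), 1e-6, 1 - 1e-6)
    c_stat = roc_auc_score(y, p)                      # discrimination
    brier = brier_score_loss(y, p)                    # overall performance
    disc_slope = p[y == 1].mean() - p[y == 0].mean()  # discrimination slope
    # Calibration: regress the observed outcome on the linear predictor (logit of p);
    # a well-calibrated model has intercept close to 0 and slope close to 1.
    lp = np.log(p / (1 - p))
    cal = sm.Logit(y, sm.add_constant(lp)).fit(disp=0)
    intercept, slope = cal.params
    return {"c": c_stat, "brier": brier, "disc_slope": disc_slope,
            "cal_intercept": intercept, "cal_slope": slope}

# Example with simulated, deliberately too-extreme predictions:
rng = np.random.default_rng(5)
p_true = 1 / (1 + np.exp(-rng.normal(-2, 1, 5000)))
y = rng.binomial(1, p_true)
p_overconfident = 1 / (1 + np.exp(-1.5 * np.log(p_true / (1 - p_true))))
print(performance(y, p_overconfident))  # calibration slope well below 1
```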

Internal validation

The apparent performance of a clinical prediction model is often optimistic compared with the performance found when the model is applied to patients from the same underlying population who were not included in its derivation [1, 46]. The reason is that the model was designed to optimally fit the derivation data set but predicts the event of interest in new patients less accurately [13, 47]. The potential for optimistic performance increases with multiple testing and with the number of candidate predictors relative to the number of outcome events [3]. Internal validity (also referred to as reproducibility [46]) requires the clinical prediction model to replicate its predictive performance in the setting where the derivation sample originated [7, 47]: basically, internal validation uses no data other than the derivation sample [13]. Importantly, internal validation aims at validating the process used to fit the model rather than a specific model [7, 27]. It is therefore recommended to derive the final model from the full derivation sample and not to waste precious information by developing the model on only a random part of the original data set [7].

Internal validation is commonly done by randomly splitting the derivation sample into two subsets ("Appendix 5"): the clinical prediction model is derived using a derivation subset and its performance is assessed using an internal validation subset [10]. Typical splits are half-sampling or 2/3:1/3 splitting [7]. However, the split-sampling method is statistically inefficient and therefore requires a large sample size [27, 48]. Only part of the data set is used for model derivation, resulting in a less stable model compared with models derived from the full derivation data set [27]. Additionally, the internal validation subset may be relatively small, potentially leading to imprecise performance estimates [7, 47].

Cross-validation is an extension of split-sampling that solves some of the aforementioned problems [7, 27]. In tenfold cross-validation, the original derivation data set is randomly divided into ten equal subsets. The clinical prediction model is derived in nine of the ten subsets and its performance assessed using the remaining subset. The process is repeated ten times to estimate an average performance measure, with each subset used once for assessing model performance. Compared with split-sampling, cross-validation has the advantage of using a larger portion (i.e. 90 %) of the original data set for model derivation [7]. Because cross-validation can yield different performance measure estimates when the whole procedure is replicated, more than 200 replications may be necessary to achieve accurate estimates [7, 13, 27].
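For example, a repeated tenfold cross-validation of the c statistic can be obtained directly from scikit-learn; the simulated data and the choice of ten folds repeated ten times are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Simulated derivation data set (illustrative only).
rng = np.random.default_rng(11)
X = rng.normal(size=(1000, 5))
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + X[:, :3].sum(axis=1)))))

# Tenfold cross-validation, repeated to stabilize the estimate: each patient is
# used once per repetition for testing and nine times for model derivation.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                       scoring="roc_auc", cv=cv)
print(f"cross-validated c statistic: {aucs.mean():.3f} (+/- {aucs.std():.3f})")
```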

Bootstrapping is currently the preferred method for internal validation, especially for studies of limited sample size [13]. Basically, bootstrapping is a nonparametric technique for estimating the sampling distribution of a summary measure by repeatedly drawing samples with replacement from the original data set [49]. Drawing with replacement mimics sampling from an underlying population [7, 27], making bootstrap samples comparable, but not identical, to the original data set [13]. In order to preserve the precision of estimates, bootstrap samples are of the same size as the original data set [7, 13]. Quantifying the optimism in the apparent performance of a clinical prediction model is an important application of bootstrapping [7, 27, 49]. Briefly, a clinical prediction model is derived in each bootstrap sample and its performance assessed both in the bootstrap sample and in the original development data set (Fig. 3). The first reflects the apparent performance and the second the test performance in a new sample; the difference between the two indicates the optimism in performance [7]. The process may require 100–1,000 replications to achieve a reliable estimate of the optimism. The optimism estimate is then subtracted from the apparent performance measure of the prediction model derived using the original development data set [7, 27].

Bootstrapping has at least two advantages over alternate approaches to internal validation. First, all observations from the original derivation sample are used to derive the model, as well as to assess its performance, yielding more stable estimates than those obtained by split-sampling or cross-validation [7]. Second, bootstrapping not only quantifies the potential for optimism in the apparent model performance, but also estimates a uniform shrinkage factor that can be used to adjust the regression coefficients for over-fitting [7, 13, 49].

Fig. 3 Bootstrap estimate of optimism-corrected performance (adapted from [7]). (1) Derive the model on the original derivation sample. (2) Determine the apparent performance of the derived model on data from the original derivation sample. (3) Draw a bootstrap sample from the original derivation sample with replacement. (4) Derive a model in the bootstrap sample, performing every step that was done in the original derivation sample. (5) Determine the bootstrap apparent performance of the bootstrap model on data from the bootstrap sample. (6) Determine the bootstrap test performance by applying the bootstrap model to data from the original derivation sample. (7) Compute the bootstrap optimism as the difference between the bootstrap apparent performance (step 5) and the bootstrap test performance (step 6). (8) Perform 100–1,000 replications of steps 3–7 to obtain a stable bootstrap optimism estimate, then subtract this estimate from the apparent performance (step 2) to obtain the optimism-corrected performance estimate
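A minimal implementation of the optimism correction outlined in Fig. 3, applied to the c statistic of a logistic model, might look as follows; the simulated data and the number of replications are illustrative, and in practice every modelling step (including predictor selection) should be repeated in each bootstrap sample.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def optimism_corrected_c(X, y, n_boot=200, seed=0):
    """Bootstrap optimism correction of the c statistic (steps 1-8 of Fig. 3)."""
    rng = np.random.default_rng(seed)
    model = LogisticRegression(max_iter=1000).fit(X, y)              # step 1
    apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])        # step 2
    optimism = []
    n = len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                                  # step 3: draw with replacement
        Xb, yb = X[idx], y[idx]
        mb = LogisticRegression(max_iter=1000).fit(Xb, yb)           # step 4
        boot_apparent = roc_auc_score(yb, mb.predict_proba(Xb)[:, 1])  # step 5
        boot_test = roc_auc_score(y, mb.predict_proba(X)[:, 1])        # step 6
        optimism.append(boot_apparent - boot_test)                     # step 7
    return apparent - np.mean(optimism)                              # step 8

# Illustrative use on a small simulated data set with many noise predictors:
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + X[:, 0]))))
print(f"optimism-corrected c statistic: {optimism_corrected_c(X, y):.3f}")
```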

External validation of a clinical prediction model

The performance of a clinical prediction model may be reproducible in the same underlying population but degrade when applied to different but plausibly related patients [46, 50, 51]. The reason is that internal validation only examines sampling variability and does not address variations in the study population [46]. Poor performance in new patients may be explained by inadequate model derivation, over-fitting or omission of important predictors [7, 18, 46, 47]. It may also arise from differences in settings, patient selection criteria, or predictor or outcome definitions between the validation and development studies [7, 18, 46, 47]. External validation is therefore essential to support the generalizability (also referred to as transportability [46]) of a clinical prediction model [7, 18]. External validation consists of using the original model, with its inherent predictors and parameter coefficients, to predict the outcome of interest in new patients and comparing these predictions with the observed outcomes (Fig. 1; [18, 47]). Thus, external validation requires documentation of all predictor and outcome values for the new patient sample [12, 47]. The predictive performance of the model should be evaluated in terms of both calibration and discrimination. As emphasized by others [12], external validation is not simply repeating the model development process in a new sample of patients to check whether the same predictors and regression coefficients are found. It is also not refitting the final model in a new sample and determining whether the predictive performance differed from that observed in the development study [12].

Different dimensions of model generalizability can be investigated through external validation [7, 46, 51]. Temporal generalizability requires model performance to remain stable when applied to cohorts from different time periods [46]. It can be assessed through temporal validation studies, which evaluate the performance of the model in more recent patients from the same centres [18]. Conceptually, temporal validation does not differ from splitting a single data set by the moment of inclusion [7, 18]. The derivation and validation sets are rather similar in that they share the same eligibility criteria and the same predictor and outcome definitions [12]. However, temporal validation may involve a prospective study specifically designed for validation purposes and independent of the derivation process [10, 12]. Geographical generalizability refers to stable model performance across physical locations (e.g. hospitals, regions, countries) [7, 46]. Geographical validation can be done retrospectively by non-random splitting of an existing data set by centre in a multicentre study [7]. It can also be conducted prospectively by recruiting new patients in a specifically designed study [51]. Other dimensions of model robustness include methodological, spectrum, domain and follow-up generalizability [7, 46, 51]. Methodological generalizability refers to stable model performance when assessed using data collected by alternative methods (e.g. claims data, retrospective chart review, prospective data collection) [7, 46]. Spectrum generalizability refers to stable model performance in patients with varying disease severity or prevalence of the outcome of interest [7, 46]. Domain generalizability refers to stable model performance across settings (e.g. general versus university hospital) or patient age groups [7, 12, 51]. Follow-up period generalizability refers to stable model performance across varying follow-up periods [46]. Fully independent external validation is performed by investigators independent of those who developed the model, using slightly different predictor or outcome definitions and including patients who were selected differently from those in the derivation setting [7, 12].

Relatively few data exist on sample size requirements for external validation studies of clinical prediction models [52]. For a logistic regression model including six predictors, simulation studies showed that external validation samples with 100 events had approximately 80 % power to detect miscalibration with predicted probabilities 1.5 times too high or too low on an odds scale and a 0.1 decrease in the c statistic [45]. These authors advised that at least 100 events and 100 non-events were required for assessing model performance in an external validation sample. For the detection of smaller but relevant differences in model performance, larger sample sizes (with more than 250 events and 250 non-events) are required [7].
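In code, external validation therefore amounts to computing the linear predictor from the published intercept and coefficients for each new patient, converting it to a probability, and then assessing calibration and discrimination against the observed outcomes without refitting anything. The sketch below uses three of the coefficients reported in Table 2 of Appendix 1 purely as an example; the validation data frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Intercept and a subset of the published regression coefficients from Table 2
# (Appendix 1); only three of the twelve terms are used here, for brevity.
INTERCEPT = -5.29
COEFFICIENTS = {"male_gender": 0.39, "respiratory_rate_ge_30": 0.53, "bun_ge_11": 0.94}

def predicted_probability(df: pd.DataFrame) -> pd.Series:
    """Apply the original model as published, without re-estimating anything."""
    lp = INTERCEPT + sum(beta * df[name] for name, beta in COEFFICIENTS.items())
    return 1 / (1 + np.exp(-lp))

# Two hypothetical validation patients coded as 0/1 indicator variables:
new_patients = pd.DataFrame({"male_gender": [1, 0],
                             "respiratory_rate_ge_30": [1, 0],
                             "bun_ge_11": [0, 1]})
print(predicted_probability(new_patients).round(3))
# The resulting probabilities are then compared with the observed outcomes in
# terms of calibration and discrimination, as in the performance sketch above.
```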

Model updating

When a clinical prediction model yields poorer performance in new patients, researchers are tempted to reject the original model and derive a new prediction model using the validation data set only [51, 53]. Yet this approach has the disadvantage of neglecting valuable information from previous derivation studies [12, 51]. The new model is also likely to be over-fitted and therefore even less generalizable than the original one, because validation studies often include fewer patients than derivation studies [6, 53]. Furthermore, physicians are faced with the impractical situation of having to choose among many concurrent clinical prediction models for the same event of interest [12, 51].

A suitable alternative to developing a new prediction model is updating the original model with data from the validation study [6, 51, 53]. Updated models have the advantage of combining prior information from the original model with the information obtained from new patients enrolled in the validation study [53]. Various updating methods have been proposed, varying from simple recalibration to more extensive model revision [6, 53]. When the prevalence of the outcome of interest differs between the derivation and validation studies, calibration can be improved by adjusting the original model's regression intercept to the individuals in the validation sample, such that the mean predicted probability equals the observed outcome frequency [53]. Overall recalibration is also possible, by applying a single calibration factor estimated from the validation data set to each original model regression coefficient [53]. Although these methods are often sufficient to improve model calibration, they do not alter discrimination, given that the relative ranking of predicted probabilities remains unchanged [12]. When the discrimination of a model needs to be improved, revision methods are necessary [53]; these include re-estimating individual regression coefficients and adding new predictors [3, 6, 53]. Because updated models are adjusted to the validation samples, it is recommended that they be tested for internal and external validity before being applied in routine medical practice [12, 51].
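The simplest of the recalibration methods described above can be implemented in a few lines: the sketch below re-estimates the intercept in a validation sample while holding the original linear predictor as an offset, which forces the mean predicted probability to match the observed outcome frequency; the data are simulated and purely illustrative.

```python
import numpy as np
import statsmodels.api as sm

# Simulated validation sample in which the outcome is rarer than in the
# derivation setting, so the original model overestimates risk
# (miscalibration-in-the-large).
rng = np.random.default_rng(2)
n = 2000
lp_original = rng.normal(-2.0, 1.0, n)                        # original linear predictor
y = rng.binomial(1, 1 / (1 + np.exp(-(lp_original - 0.7))))   # true risk is lower

# Intercept update: fit an intercept-only logistic model with the original
# linear predictor as an offset; the estimated constant is the correction.
fit = sm.GLM(y, np.ones((n, 1)), family=sm.families.Binomial(),
             offset=lp_original).fit()
correction = fit.params[0]
p_updated = 1 / (1 + np.exp(-(lp_original + correction)))
print(f"intercept correction: {correction:.2f}")
print(f"observed frequency: {y.mean():.3f}, mean updated prediction: {p_updated.mean():.3f}")
```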

Model performance comparisons

Direct comparisons of newly developed clinical prediction models with existing models are few. Thus, intensivists often need to judge from indirect comparisons which concurrent model performs best in different situations. It has been recommended that investigators who have large data sets available conduct external validation studies of multiple existing models at once, after recalibration if necessary, in order to determine which one is most useful [54]. Of note, studies that report the development of a new prediction model and then compare it with existing models using the derivation data set are methodologically flawed: prediction models tend to perform better on the data set from which they were derived and, unsurprisingly, better than existing models when validated on that data set [54].

Impact analysis of a clinical prediction model

Clinical prediction models are expected to inform physicians' decision-making and ultimately improve patient outcomes or the cost-effectiveness of care [3, 6, 8]. Many reasons may explain why validated clinical prediction models do not work or are not used by physicians in routine practice [8]. It is therefore important that the effectiveness of validated clinical prediction models be ascertained in clinical impact studies (Fig. 1; [6, 8]).

Impact studies are comparative in design and differ from validation studies in that they typically require a control group [6, 12]. The preferred design for an impact study is a randomized implementation trial, with the control group receiving usual care without the use of the model and the intervention group receiving the model through a prespecified implementation strategy [8]. Failure to show any difference between the two study groups may reflect a lack of effectiveness of the clinical prediction model, of the implementation strategy, or of both. Cluster randomized trials provide the highest level of evidence of a clinical prediction model's impact [51, 55]. Randomization of study centres is preferred to randomization of healthcare providers because it avoids contamination across study groups within a single study centre [6, 12, 56]. To prevent the clinical prediction model effect from being confounded with the cluster effect, the minimum number of clusters per arm should be four [55]. A stepped-wedge trial is a variant of the cluster randomized trial that is particularly suitable for implementing complex clinical prediction model-based interventions [57]; in this study design, all the clusters cross over from the control to the intervention study group at randomly allocated time intervals [57]. Authors have advised against patient-level randomization because the same physician would alternately use and not use the clinical prediction model in consecutive patients, with the potential for a learning effect and a reduced contrast between the two study groups [6, 8]. However, clinical trials with randomization at the patient level may play a role by first establishing the efficacy and safety of using a clinical prediction model.

Although much simpler in design, before-and-after studies are not recommended for assessing the impact of a clinical prediction model because of the potential for bias due to secular trends, regression towards the mean or sudden changes in physician practices. Yet interrupted time series designs can be used to strengthen before-and-after study designs and constitute an appropriate quasi-experimental alternative to randomized designs. In an interrupted time series design, data on healthcare providers' behaviour and/or patient outcomes are collected at multiple instances over time, before and after a clinical prediction model is introduced, to detect whether it has an effect that differs from the underlying secular trend [58]. Finally, if coupled with formal evaluations of the implementation process, impact studies may provide insights into potential facilitators of, or barriers to, prediction model implementation in routine medical practice [51].
Checklists for clinical prediction model development

Three published checklists or guidelines for clinical prediction model development were identified. Of these, two were partially redundant and were proposed by the same authors [27, 35]. The tools identified vary in their objective, content and presentation. In their guidelines, Harrell et al. [35] and Harrell [27] focused on specific modelling strategy issues (i.e. handling of missing values for predictor and outcome variables, modelling of continuous predictors, data reduction, regression coefficient shrinkage, and internal validation) and did not address other aspects of clinical prediction model development. In contrast, the checklist proposed by Steyerberg comprised 16 items related to general considerations (seven items), modelling steps (seven items) and validity (two items). Interestingly, this checklist has been applied by its author for reviewing the development of two clinical prediction models [7].

In this context, we introduce a checklist of 19 items designed to help intensivists develop and transparently report valid clinical prediction models (Table 1). The content addresses all three research phases of clinical prediction model development (derivation, validation and impact analysis). This checklist includes explicit recommendations for issues relevant to intensivists (e.g. model performance comparison). Recommendations are formulated in language adapted for a non-statistician readership. The proposed checklist is not intended to substitute for, but rather to complement, the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement [59], addressing issues specifically related to clinical prediction model development. Although 12 items are similar in content, the proposed checklist includes seven items focusing on model development issues that are not addressed in the STROBE statement. Conversely, ten items in the STROBE statement do not appear in the proposed checklist, although they may apply to studies reporting clinical prediction model development or validation.

Table 1 Checklist for developing and reporting valid clinical prediction models in the intensive care setting

1. Title: Identify the report as the development, validation or impact analysis study of a clinical prediction model in the title
2. Rationale: Explain the scientific background and rationale for developing a clinical prediction model
3. Objectives: State specific study objectives
4. Study design: Describe the study design (cohort, randomized controlled trial or cross-sectional), including whether patient selection was retrospective or prospective
5. Patients: Specify the study inclusion and exclusion criteria. Report the flow of patients throughout model development. If applicable, consider use of flow diagrams complying with the STROBE statement for the derivation and validation steps and with the CONSORT extension to cluster randomized trials for impact analysis
6. Outcomes: Precisely define the outcome of interest in terms of timing and methods of ascertainment ("hard" outcomes are preferred)
7. Candidate predictors: List all candidate predictors initially considered for inclusion in the clinical prediction model (give sources of data and methods of ascertainment where relevant)
8. Sample size: Report the rationale for sample size: the minimum number of events per candidate predictor is at least 10 for derivation studies; the external validation data set should contain at least 100 events and 100 non-events
9. Model specification: Specify the statistical model building strategy, including the type of model and details of candidate predictor selection procedures
10. Continuous predictors: Describe how continuous predictors were handled in the analyses. If relevant, specify how thresholds were determined
11. Missing values: Report the completeness of data for each variable separately and overall for observations. Describe how missing values for predictor and outcome variables were handled in the analyses
12. Internal validation: Specify the internal validation approach (split-sampling, cross-validation or bootstrapping). Bootstrapping is recommended for studies with limited effective sample size
13. External validation: Report all dimensions of external validation (temporal, geographical or fully independent)
14. Model estimation: Derive the model on the full data set. Where relevant, specify the shrinkage method used in order to attenuate over-fitting. Indicate whether external information was used for model updating
15. Model performance: Report the final clinical prediction model, including the regression coefficient for each predictor, along with the model intercept. Specify how calibration and discrimination were evaluated. Report apparent, internal validation and external validation performance
16. Model presentation: Describe model presentation for use in routine clinical practice
17. Model comparison: Where relevant, directly compare models developed for similar outcomes and target populations in the external validation sample
18. Model impact: Report effectiveness in altering intensivist practices, patient outcomes and/or costs of care. A cluster randomized trial is the preferred implementation study design
19. Model validity: Discuss internal (potential for over-fitting) and external (generalizability) validity, and clinical usefulness

That overlap exists between the STROBE statement and the proposed checklist is not surprising, because data for clinical prediction model derivation or validation come from cross-sectional or cohort studies, which are observational in nature. Of note, guidelines for studies reporting the development of clinical prediction models are being developed [54].

Acknowledgments This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors. On behalf of all authors, the corresponding author states that there is no conflict of interest. The authors thank Linda Northrup from English Solutions for her assistance in editing the manuscript.

Conflicts of interest The authors have no conflict of interest to report.

Conclusion

The number of clinical prediction models for use in adult intensive care medicine is steadily growing in the literature, and these models are increasingly incorporated into clinical practice guidelines. Although the more influential clinical prediction models have been properly derived, externally validated and updated, many are used without clear evidence of their validity, and few studies have analysed their impact on physician practices and patient outcomes.

Appendix 1. Development of the Risk of Early Admission to the Intensive Care Unit prediction model

Rationale

Previous studies have reported that patients with progressive CAP but no major pneumonia severity criteria on emergency department presentation may benefit from direct intensive care unit (ICU) admission [60], although identification of these patients remains a daily challenge for clinicians [61, 62]. Therefore, a model that accurately predicts the risk of admission to the ICU might be helpful for early recognition of severe CAP that is not obvious on presentation [63].

Study design

The Risk of Early Admission to the ICU prediction model was derived using data obtained from four prospective multicentre cohort studies conducted in the USA and Europe. With the exception of the EDCAP cluster randomized trial [64], all studies were observational in design.

Outcome of interest

The outcome predicted by the Risk of Early Admission to the ICU model was defined as admission to the ICU within 3 days following emergency department presentation. This time frame was chosen because most sepsis-related organ failures in severe CAP occur early, whereas late ICU admission may be associated with the worsening of a pre-existing comorbid condition or the occurrence of a pneumonia-unrelated adverse event such as hospital-acquired infection or venous thromboembolism. However, ICU admission may be confounded by various factors, including local admission practices, bed availability or the presence of an intermediate care unit. Modelling ICU admission might lead to circular reasoning, with a clinical prediction model fitting observed physician practices (i.e. modelling "what physicians do" rather than "what should be done") [62]. Reassuringly, the Risk for Early Admission to the ICU prediction model also demonstrated satisfactory accuracy in predicting early intensive respiratory or vasopressor support [15], which is considered a more reliable outcome measure of severe CAP than ICU admission across institutions and healthcare systems.

Statistical methods

Given the binary nature of the outcome, multivariable logistic regression was used for model derivation. The model was developed by removing candidate predictors from a full main-effects regression model using a backward approach with a cut-off value of P = 0.10. Overall, 25 prespecified candidate predictors were entered in the model, including baseline demographic characteristics (age and gender), comorbid conditions (eight predictors), and physical (six predictors), radiographic (two predictors) and laboratory (seven predictors) findings [15, 16]. The number of events per candidate predictor was 12 (303/25).

Appendix 2. Modelling continuous candidate predictors

Although not recommended, it is usual to divide the range of a continuous candidate predictor into two groups at a suitable cutpoint. Most pneumonia prognostic models include binary baseline systolic blood pressure, with a prespecified cutpoint at 90 mmHg. Yet the resulting step function may be a poor approximation of the nonlinear relationship between the candidate predictor and the outcome of interest. As illustrated (Fig. 4), the odds of early admission to the ICU rose steadily with decreasing systolic blood pressure values and were poorly modelled by a constant category for systolic blood pressure less than 90 mmHg. In contrast, a constant category might be suitable for modelling the odds of early admission to the ICU for systolic blood pressure values of at least 90 mmHg. In practice, a quadratic function, involving systolic blood pressure and (systolic blood pressure)² terms, or fractional polynomials proved to fit this nonlinear relationship well.

Fig. 4 Modelling of the relationship between systolic blood pressure and the odds of early admission to the intensive care unit in the full development data set (n = 6,560)
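A brief sketch of the quadratic modelling mentioned above, on simulated systolic blood pressure values (the data and coefficients are illustrative, not those of the published model):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated systolic blood pressure and an outcome whose log-odds rise steadily
# as blood pressure falls and flatten above about 120 mmHg (illustrative only).
rng = np.random.default_rng(8)
sbp = rng.normal(125, 22, 4000).clip(60, 180)
logit = -3.0 + 0.002 * (120 - sbp) ** 2 * (sbp < 120)
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
df = pd.DataFrame({"early_icu": y, "sbp": sbp})

step = smf.logit("early_icu ~ I(sbp < 90)", data=df).fit(disp=0)        # step function
quad = smf.logit("early_icu ~ sbp + I(sbp ** 2)", data=df).fit(disp=0)  # quadratic term
print(f"AIC step function: {step.aic:.1f}   AIC quadratic: {quad.aic:.1f}")
```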

Appendix 3. Missing-data mechanisms (adapted from [7, 27])

Three types of missing-data mechanisms exist:

1. Missing completely at random (MCAR): the missing values occur completely at random; data elements are missing for reasons that are unrelated to any observed or unobserved characteristics of the individuals. The individuals with missing values are a simple random sample from the complete population. An example of MCAR is a missing laboratory test resulting from a dropped test tube.

2. Missing at random (MAR): missing values do not occur at random; the probability that a value is missing depends on the values of other observed variables (but does not depend on values of unmeasured variables). The individuals with missing values are no longer a simple random sample from the complete population. Yet they are only randomly different from other subjects, given the values of the other observed variables. As an example, one may consider values that are missing for older subjects.

3. Missing not at random (MNAR): missing values do not occur at random; the probability that a value is missing depends on the values that are missing or on other unobserved predictors. For example, clinicians may choose not to measure a laboratory value (e.g. pH) in individuals suspected of having normal values.

Model simplification

Regression coefficients were divided by the smallest coefficient (which was assigned a value of 1 by definition) and then rounded to the closest integer (Table 2). However, this approach is not optimal because it capitalizes on the estimate of a single coefficient and may lead to unnecessary uncertainty in the converted coefficients [7].

Table 2 Point scoring system of the Risk for Early Admission to the Intensive Care Unit prediction model (n = 6,560)

Characteristic | β (95 % confidence interval) | Points
Intercept | -5.29 (-5.82 to -4.77) | –
Male gender | 0.39 (0.08 to 0.70) | 1
Age <80 | 0.57 (0.18 to 0.95) | 1
One or more comorbid conditions(a) | 0.45 (0.11 to 0.78) | 1
Respiratory rate ≥30/min | 0.53 (0.18 to 0.88) | 1
Pulse rate ≥125/min | 0.55 (0.14 to 0.95) | 1
White blood cell count <3 or ≥20 G/L | 0.54 (0.14 to 0.94) | 1
Blood urea nitrogen ≥11 mmol/L | 0.94 (0.61 to 1.28) | 2
SpO2 <90 % or PaO2 <60 mmHg(b) | 0.85 (0.53 to 1.17) | 2
Arterial pH <7.35 | 0.91 (0.38 to 1.44) | 2
Multilobar infiltrates or pleural effusion | 0.79 (0.48 to 1.09) | 2
Sodium <130 mEq/L | 1.06 (0.58 to 1.53) | 3

A total point score for a patient is obtained by summing the points for each applicable characteristic. Assignment to risk classes I (≤3 points), II (4–6 points), III (7–8 points) and IV (≥9 points) is determined by the patient's total risk score.
(a) Comorbid conditions include neoplastic disease, liver disease, renal dysfunction, cerebrovascular disease, congestive heart failure, coronary artery disease, chronic pulmonary disease and diabetes mellitus
(b) With or without supplemental oxygen
PaO2 partial pressure of arterial oxygen, SpO2 pulse oximetric saturation
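A small helper translating Table 2 into a point score and risk class might look as follows; the dictionary keys and the example patient are hypothetical names introduced for illustration, while the point values and risk-class cut-offs are those given in the table and its footnote.

```python
# Points per characteristic, transcribed from Table 2.
POINTS = {
    "male_gender": 1, "age_lt_80": 1, "comorbid_condition": 1,
    "respiratory_rate_ge_30": 1, "pulse_ge_125": 1, "wbc_lt_3_or_ge_20": 1,
    "bun_ge_11": 2, "spo2_lt_90_or_pao2_lt_60": 2, "arterial_ph_lt_7_35": 2,
    "multilobar_or_pleural_effusion": 2, "sodium_lt_130": 3,
}

def risk_class(characteristics: dict) -> tuple:
    """Sum the points for each applicable characteristic and assign the risk class
    (I: <=3, II: 4-6, III: 7-8, IV: >=9 points), as defined in the Table 2 footnote."""
    total = sum(POINTS[name] for name, present in characteristics.items() if present)
    if total <= 3:
        cls = "I"
    elif total <= 6:
        cls = "II"
    elif total <= 8:
        cls = "III"
    else:
        cls = "IV"
    return total, cls

# Hypothetical patient: male, respiratory rate >=30/min, blood urea nitrogen >=11 mmol/L.
example = {"male_gender": True, "respiratory_rate_ge_30": True, "bun_ge_11": True}
print(risk_class(example))  # (4, 'II')
```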

Appendix 4. Handling missing values in predictors

Of 6,560 patients included in the full development data set, 4,618 (70 %) had missing values for one or more predictors included in the Risk for Early Admission to the Intensive Care Unit prediction model. The percentages of missing values for individual predictors ranged from 5 % for heart rate to 65 % for arterial pH. To assess the robustness of the model, we used the following approaches for handling missing values: case-wise deletion of observations with any missing predictor value, assuming that unknown values were normal, and performing multiple imputation of missing values (Table 3). The predictor and dependent variables were entered into the imputation model. Sixty imputed data sets were created, with a total run length of 60,000 iterations and imputations made every 1,000 iterations. As illustrated, case-wise deletion was inefficient (70 % of the observations in the development data set were discarded) and had the potential for selection bias (as suggested by the 9.5 % prevalence of early ICU admission). In contrast, the two other approaches used the full data set. Yet the c statistic was higher for the approach assuming that unknown values were normal than for multiple imputation of missing values. This might be explained by the missing-data mechanism for laboratory values (see "Appendix 3").

Appendix 5. Internal and external validation

Appendix 5. Internal and external validation

In order to assess the internal and external validity of the Risk for Early Admission to the Intensive Care Unit prediction model, we evaluated its predictive performance in the derivation and external validation samples (Table 4). In the derivation sample, we estimated apparent and internal validation performance measures. Internal validation was performed using split-sampling and bootstrapping approaches. In the split-sampling approach, 70 % of the patients were randomly assigned to a derivation cohort and 30 % to an internal validation cohort. In the bootstrapping approach, 1,000 bootstrap samples were drawn with replacement from the derivation set and optimism-corrected performance estimates were computed (Fig. 3). External validation was done using the original data from a multicentre prospective randomized controlled trial conducted by investigators independent of those who developed the model [15, 16].
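For readers who wish to reproduce the bootstrapping procedure, the following minimal sketch implements optimism correction of the c statistic, assuming `X` and `y` hold the derivation predictors and outcome (hypothetical names). Note that scikit-learn's LogisticRegression applies mild L2 regularisation by default, so this illustrates the resampling logic rather than the study's exact model fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def optimism_corrected_c(X, y, n_boot=1000, seed=0):
    """Harrell-style bootstrap optimism correction of the c statistic."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    model = LogisticRegression(max_iter=1000).fit(X, y)
    apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])
    optimism = []
    for _ in range(n_boot):
        # Sample patients with replacement (assumes both outcome
        # classes are present in every bootstrap sample).
        idx = rng.integers(0, len(y), len(y))
        Xb, yb = X[idx], y[idx]
        mb = LogisticRegression(max_iter=1000).fit(Xb, yb)
        c_boot = roc_auc_score(yb, mb.predict_proba(Xb)[:, 1])  # apparent in bootstrap
        c_orig = roc_auc_score(y, mb.predict_proba(X)[:, 1])    # tested on original data
        optimism.append(c_boot - c_orig)
    return apparent - np.mean(optimism)
```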

Table 3 Apparent performance measures for the Risk for Early Admission to the Intensive Care Unit according to strategies for handling missing values

                            Case-wise deletion   Unknown values assumed to be normal   Multiple imputation of missing values
No. subjects (%)            1,942 (30)           6,560 (100)                           6,560 (100)
No. ICU admissions (%)      184/1,942 (9.5)      303/6,560 (4.6)                       303/6,560 (4.6)
Pseudo-R2                   0.12                 0.16                                  0.13
c statistic (95 % CI)       0.74 (0.70–0.78)     0.81 (0.78–0.83)                      0.74 (0.71–0.77)
Discrimination slope        0.11                 0.10                                  0.08

CI confidence interval, ICU intensive care unit
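The discrimination slope reported in Tables 3 and 4 is the difference between the mean predicted probability in patients with and without the outcome; a minimal sketch follows, assuming `y` is the observed early ICU admission indicator and `p_hat` the model's predicted probabilities (hypothetical names).

```python
import numpy as np

def discrimination_slope(y, p_hat):
    """Mean predicted probability in events minus non-events."""
    y, p_hat = np.asarray(y), np.asarray(p_hat)
    return p_hat[y == 1].mean() - p_hat[y == 0].mean()
```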

Table 4 Apparent, internal and external validation performance measures for the Risk for Early Admission to the Intensive Care Unit

                                               Full derivation    Internal split-sampling   Internal split-sampling   Internal bootstrapping   External validation
                                               sample             derivation cohort         validation cohort         validation               sample
No. subjects                                   6,560              4,593                     1,967                     6,560                    850
No. ICU admissions (%)                         303 (4.6)          201 (4.4)                 102 (5.2)                 303 (4.6)                54 (6.3)
c statistic (95 % CI)                          0.81 (0.78–0.83)   0.80 (0.78–0.83)          0.81 (0.77–0.85)          0.80 (0.78–0.83)         0.76 (0.69–0.83)
Discrimination slope                           0.10               0.09                      0.10                      0.09                     0.10
Calibration intercept                          0.00               0.00                      0.20                      0.00                     -0.35
Calibration slope                              1.00               1.00                      0.96                      0.97                     0.88
P for overall calibration test (U-statistic)   1.00               1.00                      0.17                      0.48                     0.04

CI confidence interval, ICU intensive care unit

Because the full derivation data set was large, the apparent performance measures were likely to be valid, and model optimism appeared to be limited in internal validation for both the split-sampling and bootstrapping procedures. Yet, the calibration intercept estimate in the split-sampling validation cohort was 0.20 (for an expected value of 0), which likely reflected the less stable results of the split-sampling procedure. In the external validation sample, miscalibration was mainly driven by a significant decrease in the intercept (referred to as "calibration-in-the-large"), reflecting that the mean predicted probability was too high in comparison with the observed frequency of early admission to the ICU. Although the calibration slope was not significantly different from 1.00 (P = 0.37), only 54 patients had the outcome of interest in the external validation sample, and the test may lack power to detect relevant miscalibration. In the external validation sample, the Risk for Early Admission to the ICU prediction model performed better than the pneumonia severity assessment tools but failed to demonstrate an accuracy advantage over alternative prediction models in predicting early ICU admission (data not shown).
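Calibration-in-the-large and the calibration slope reported in Table 4 can be estimated by regressing the observed outcomes on the model's linear predictor in the validation sample. The following is a minimal sketch, assuming `y_val` contains the observed outcomes and `lp_val` the linear predictor (logit of the predicted probability) computed with the original model's coefficients; both names are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

def calibration_measures(y_val, lp_val):
    y_val, lp_val = np.asarray(y_val), np.asarray(lp_val)
    # Calibration slope: coefficient of the linear predictor when the
    # observed outcome is regressed on it (expected value 1).
    slope_fit = sm.Logit(y_val, sm.add_constant(lp_val)).fit(disp=0)
    slope = slope_fit.params[1]
    # Calibration-in-the-large: intercept of a logistic model with the
    # linear predictor included as an offset (expected value 0).
    intercept_fit = sm.GLM(y_val, np.ones((len(y_val), 1)),
                           family=sm.families.Binomial(),
                           offset=lp_val).fit()
    intercept = intercept_fit.params[0]
    return intercept, slope
```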

References

1. Laupacis A, Sekar N, Stiell IG (1997) Clinical prediction rules. A review and suggested modifications of methodological standards. JAMA 277:488–494
2. Moons KG, Royston P, Vergouwe Y, Grobbee DE, Altman DG (2009) Prognosis and prognostic research: what, why, and how? BMJ 338:375
3. Steyerberg EW, Moons KG, van der Windt DA et al (2013) Prognosis Research Strategy (PROGRESS) 3: prognostic model research. PLoS Med 10:e1001381
4. Vincent JL, Moreno R (2010) Clinical review: scoring systems in the critically ill. Crit Care 14:207

5. Reilly BM, Evans AT (2006) Translating clinical research into clinical practice: impact of using prediction rules to make decisions. Ann Intern Med 144:201–209
6. Moons KG, Altman DG, Vergouwe Y, Royston P (2009) Prognosis and prognostic research: application and impact of prognostic models in clinical practice. BMJ 338:b606
7. Steyerberg EW (2009) Clinical prediction models: a practical approach to development, validation, and updating. Springer, New York
8. McGinn TG, Guyatt GH, Wyer PC et al (2000) Users' guides to the medical literature: XXII: how to use articles about clinical decision rules. Evidence-Based Medicine Working Group. JAMA 284:79–84
9. Royston P, Moons KG, Altman DG, Vergouwe Y (2009) Prognosis and prognostic research: developing a prognostic model. BMJ 338:b604
10. Altman DG, Vergouwe Y, Royston P, Moons KG (2009) Prognosis and prognostic research: validating a prognostic model. BMJ 338:b605

11. Wasson JH, Sox HC, Neff RK, Goldman L (1985) Clinical prediction rules. Applications and methodological standards. N Engl J Med 313:793–799
12. Moons KG, Kengne AP, Grobbee DE et al (2012) Risk prediction models: II. External validation, model updating, and impact assessment. Heart 98:691–698
13. Moons KG, Kengne AP, Woodward M et al (2012) Risk prediction models: I. Development, internal validation, and assessing the incremental value of a new (bio)marker. Heart 98:683–690
14. Stiell IG, Wells GA (1999) Methodologic standards for the development of clinical decision rules in emergency medicine. Ann Emerg Med 33:437–447
15. Labarere J, Schuetz P, Renaud B et al (2012) Validation of a clinical prediction model for early admission to the intensive care unit of patients with pneumonia. Acad Emerg Med 19:993–1003
16. Renaud B, Labarere J, Coma E et al (2009) Risk stratification of early admission to the intensive care unit of patients with no major criteria of severe community-acquired pneumonia: development of an international prediction rule. Crit Care 13:R54
17. Guyatt GH (2006) Determining prognosis and creating clinical decision rules. In: Haynes RB, Sackett DL, Guyatt GH, Tugwell P (eds) Clinical epidemiology: how to do clinical practice research. Lippincott Williams & Wilkins, New York
18. Altman DG (2009) Prognostic models: a methodological framework and review of models for breast cancer. Cancer Invest 27:235–243
19. Randolph AG, Guyatt GH, Calvin JE, Doig G, Richardson WS (1998) Understanding articles describing clinical prediction tools. Evidence Based Medicine in Critical Care Group. Crit Care Med 26:1603–1612
20. Altman DG (1991) Practical statistics for medical research. Chapman & Hall/CRC, London
21. Aujesky D, Obrosky DS, Stone RA et al (2005) Derivation and validation of a prognostic model for pulmonary embolism. Am J Respir Crit Care Med 172:1041–1046
22. Fine MJ, Auble TE, Yealy DM et al (1997) A prediction rule to identify low-risk patients with community-acquired pneumonia. N Engl J Med 336:243–250
23. Royston P, Altman DG, Sauerbrei W (2006) Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med 25:127–141

24. Altman DG, Lausen B, Sauerbrei W, Schumacher M (1994) Dangers of using "optimal" cutpoints in the evaluation of prognostic factors. J Natl Cancer Inst 86:829–835
25. Steyerberg EW, Schemper M, Harrell FE (2011) Logistic regression modeling and the number of events per variable: selection bias dominates. J Clin Epidemiol 64:1464–1465 (author reply 1463–1464)
26. Sauerbrei W, Royston P, Binder H (2007) Selection of important variables and determination of functional form for continuous predictors in multivariable model building. Stat Med 26:5512–5528
27. Harrell FE Jr (2001) Regression modelling strategies with applications to linear models, logistic regression, and survival analysis. Springer, New York
28. Vergouwe Y, Royston P, Moons KG, Altman DG (2010) Development and validation of a prediction model with missing predictor data: a practical approach. J Clin Epidemiol 63:205–214
29. Altman DG, Bland JM (2007) Missing data. BMJ 334:424
30. Groenwold RH, White IR, Donders AR et al (2012) Missing covariate data in clinical research: when and when not to use the missing-indicator method for analysis. CMAJ 184:1265–1269
31. Groenwold RH, Donders AR, Roes KC, Harrell FE Jr, Moons KG (2012) Dealing with missing outcome data in randomized trials and observational studies. Am J Epidemiol 175:210–217
32. Liublinska V, Rubin DB (2012) Re: "dealing with missing outcome data in randomized trials and observational studies". Am J Epidemiol 176:357–358
33. Janssen KJ, Donders AR, Harrell FE Jr et al (2010) Missing covariate data in medical research: to impute is better than to ignore. J Clin Epidemiol 63:721–727
34. Concato J, Feinstein AR, Holford TR (1993) The risk of determining risk with multivariable models. Ann Intern Med 118:201–210
35. Harrell FE Jr, Lee KL, Mark DB (1996) Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 15:361–387
36. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR (1996) A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 49:1373–1379

37. Courvoisier DS, Combescure C, Agoritsas T, Gayet-Ageron A, Perneger TV (2011) Performance of logistic regression modeling: beyond the number of events per variable, the role of data structure. J Clin Epidemiol 64:993–1000
38. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Chapman & Hall/CRC, New York
39. Marshall RJ (2001) The use of classification and regression trees in clinical epidemiology. J Clin Epidemiol 54:603–609
40. Sun GW, Shook TL, Kay GL (1996) Inappropriate use of bivariable analysis to screen risk factors for use in multivariable analysis. J Clin Epidemiol 49:907–916
41. Hosmer DW, Lemeshow S (2000) Applied logistic regression, 2nd edn. Wiley, New York
42. Sullivan LM, Massaro JM, D'Agostino RB Sr (2004) Presentation of multivariate data for clinical use: the Framingham Study risk score functions. Stat Med 23:1631–1660
43. Steyerberg EW, Vickers AJ, Cook NR et al (2010) Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology 21:128–138
44. Rufibach K (2010) Use of Brier score to assess binary predictions. J Clin Epidemiol 63:938–939
45. Vergouwe Y, Steyerberg EW, Eijkemans MJ, Habbema JD (2005) Substantial effective sample sizes were required for external validation studies of predictive logistic regression models. J Clin Epidemiol 58:475–483
46. Justice AC, Covinsky KE, Berlin JA (1999) Assessing the generalizability of prognostic information. Ann Intern Med 130:515–524
47. Altman DG, Royston P (2000) What do we mean by validating a prognostic model? Stat Med 19:453–473
48. Steyerberg EW, Harrell FE Jr, Borsboom GJ et al (2001) Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol 54:774–781
49. Efron B, Tibshirani RJ (1993) An introduction to the bootstrap. Chapman & Hall/CRC, New York
50. Bleeker SE, Moll HA, Steyerberg EW et al (2003) External validation is necessary in prediction research: a clinical example. J Clin Epidemiol 56:826–832
51. Toll DB, Janssen KJ, Vergouwe Y, Moons KG (2008) Validation, updating and impact of clinical prediction rules: a review. J Clin Epidemiol 61:1085–1094

52. Peek N, Arts DG, Bosman RJ, van der Voort PH, de Keizer NF (2007) External validation of prognostic models for critically ill patients required substantial sample sizes. J Clin Epidemiol 60:491–501
53. Janssen KJ, Moons KG, Kalkman CJ, Grobbee DE, Vergouwe Y (2008) Updating methods improved the performance of a clinical prediction model in new patients. J Clin Epidemiol 61:76–86
54. Collins GS, Moons KG (2012) Comparing risk prediction models. BMJ 344:e3186
55. Campbell MK, Piaggio G, Elbourne DR, Altman DG (2012) Consort 2010 statement: extension to cluster randomised trials. BMJ 345:e5661
56. Donner A, Klar N (2000) Design and analysis of cluster randomized trials in health research. Arnold, London

57. Hussey MA, Hughes JP (2007) Design and analysis of stepped wedge cluster randomized trials. Contemp Clin Trials 28:182–191
58. Ramsay CR, Matowe L, Grilli R, Grimshaw JM, Thomas RE (2003) Interrupted time series designs in health technology assessment: lessons from two systematic reviews of behavior change strategies. Int J Technol Assess Health Care 19:613–623
59. von Elm E, Altman DG, Egger M et al (2007) The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Ann Intern Med 147:573–577
60. Renaud B, Santin A, Coma E et al (2009) Association between timing of intensive care unit admission and outcomes for emergency department patients with community-acquired pneumonia. Crit Care Med 37:2867–2874

61. Chalmers JD, Mandal P, Singanayagam A et al (2011) Severity assessment tools to guide ICU admission in community-acquired pneumonia: systematic review and meta-analysis. Intensive Care Med 37:1409–1420
62. Marti C, Garin N, Grosgurin O et al (2012) Prediction of severe community-acquired pneumonia: a systematic review and meta-analysis. Crit Care 16:R141
63. Ewig S, Woodhead M, Torres A (2011) Towards a sensible comprehension of severe community-acquired pneumonia. Intensive Care Med 37:214–223
64. Yealy DM, Auble TE, Stone RA et al (2005) Effect of increasing the intensity of implementing pneumonia guidelines: a randomized, controlled trial. Ann Intern Med 143:881–894
