Probability Estimation for Biomedical Classification Problems

Richard A. Olshen
Division of Biostatistics, HRP Building, Stanford University School of Medicine, Stanford, CA 94305
[email protected], (415) 725-8666

Michael G. Walker
Section on Medical Informatics, MSOB X215, Stanford University School of Medicine, Stanford, CA 94305
walker@camis.stanford.edu, (415) 723-2916


ABSTRACT

Suppose that we wish to know the probability that an object belongs to a class. For example, we may wish to estimate the probability that a patient has a particular disease, given a set of symptoms, or we may wish to know the probability that a novel peptide binds to a receptor, given the peptide's amino-acid composition. The conventional approach is to first use a classification algorithm to find partitions in feature space and to assign each partition to a class, and then to estimate the conditional probabilities as the proportion of patients or peptides that are correctly and incorrectly classified in each partition. Unfortunately, this estimation method often gives probability estimates that are in error by 20% or more, and thus can cause incorrect decisions. We have implemented and compared alternative methods. In Monte Carlo simulations the alternative methods are substantially more accurate than is the current method.

1. BIOMEDICAL CLASSIFICATION

Classifiers assign objects to classes. For example, suppose that we are developing drugs to relieve pain, and we want to find a peptide (a protein consisting of up to 20 amino acids) that will bind with high affinity to one of the pain receptors in the brain. Using the available information about the binding affinity of various peptides to a particular pain receptor, we attempt to predict what new peptides may bind with yet higher affinity. We predict that the peptide belongs to the class "binds strongly to pain receptor" or to the class "does not bind strongly to pain receptor." Other examples of biomedical classification problems are to assign a segment of DNA to one of the classes (protein-coding region, nonprotein-coding region), or to assign the regions of a magnetic-resonance image of a patient's head to one of the classes (tumor, gray matter, white matter, bone, fat, muscle, CNS fluid, other). There are many classification methods, such as discriminant analysis [1], belief networks [2, 3], criteria tables [4] and neural networks [5]. For the experiments described here, we used the Classification and Regression Tree (CART) software [6]. The CART algorithm has been applied successfully to several biomedical problems [7, 8]. Figure 1-1 shows a hypothetical classification tree.

[Figure: a binary classification tree. The splits ask whether the patient is over 55 years of age, is male, and has blood pressure above 160; one leaf is labeled "High risk" and the remaining leaves "Not high risk".]

Figure 1-1. A hypothetical classification tree for risk of myocardial infarction based on age, sex, and blood pressure.
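The experiments in this paper used the CART program. As a rough modern stand-in (our assumption, not the authors' tooling), scikit-learn's DecisionTreeClassifier fits a binary tree of the same shape as Figure 1-1; the features, thresholds, and risk rule below are hypothetical.

```python
# Sketch: fit and print a small classification tree in the spirit of
# Figure 1-1. DecisionTreeClassifier stands in for CART; the data and
# the "high risk" rule are invented for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 500
age = rng.uniform(30, 80, n)       # years
male = rng.integers(0, 2, n)       # 1 = male
bp = rng.uniform(100, 200, n)      # systolic blood pressure, mm Hg
# Hypothetical rule: high risk if over 55, male, and BP > 160.
y = ((age > 55) & (male == 1) & (bp > 160)).astype(int)

X = np.column_stack([age, male, bp])
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["age", "male", "bp"]))
```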

1.1 Conditional Probability

For many biomedical problems we wish to know the probability that the object belongs to the class. We next examine how large errors arise in estimating conditional probability. Consider a set of 120 data points (Figure 1-2). For this example, suppose that each point in the scattergram corresponds to one person who has undergone a rather dangerous surgical procedure. The black dots (class 1) are individuals who died in surgery, and the white dots (class 2) are individuals who survived.

Figure 1-2. CART partitions for 120 data points.

These 120 data points are a sample from a probability distribution defined as follows. (1) Divide the two-dimensional plane into four quadrants, using the x and y axes as the dividing lines. (2) Define quadrant 1 (positive values for both x and y) to contain points uniformly distributed in the proportions 0.9 class 1 objects and 0.1 class 2 objects. (3) Define quadrant 2 (positive values for y, negative values for x) to contain points uniformly distributed in the proportions 0.4 class 1 and 0.6 class 2. (4) Define quadrant 3 (negative values for both x and y) to contain points uniformly distributed in the proportions 0.6 class 1 and 0.4 class 2. (5) Define quadrant 4 (positive values for x, negative values for y) to contain points uniformly distributed in the proportions 0.1 class 1 and 0.9 class 2. CART produces the partitions shown in Figure 1-2, corresponding to the CART tree in Figure 1-3. We now ask the question, "Given that a patient falls into node 1, what is the conditional probability that the patient will die in surgery?" Because we defined the probability for each class in each quadrant, we can calculate the true conditional probability. The true probability, calculated using the CART tree partitions and the definitions of the class probability functions, is p(die | node 1) = 0.57 and p(survive | node 1) = 0.43.
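For concreteness, the sketch below draws a sample from this distribution. The square support [-1.5, 1.5] x [-1.5, 1.5] is an assumption; the paper does not state the support, and the axis range of Figure 3-1 merely suggests it.

```python
# Sketch: sample from the four-quadrant distribution defined above.
# Points are uniform on [-1.5, 1.5]^2 (assumed support) and labels are
# drawn with the stated per-quadrant probabilities of class 1.
import numpy as np

def sample_quadrant_data(n, rng):
    xy = rng.uniform(-1.5, 1.5, size=(n, 2))
    # p(class 1) by quadrant: Q1 (+,+) 0.9, Q2 (-,+) 0.4,
    #                         Q3 (-,-) 0.6, Q4 (+,-) 0.1
    p1 = np.where(xy[:, 1] > 0,
                  np.where(xy[:, 0] > 0, 0.9, 0.4),
                  np.where(xy[:, 0] > 0, 0.1, 0.6))
    labels = np.where(rng.random(n) < p1, 1, 2)
    return xy, labels

X, y = sample_quadrant_data(120, np.random.default_rng(1))
```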

[Figure: the CART tree for the partitions in Figure 1-2; splits include y < 0.176, x < -0.38, and a split at -0.008. Terminal nodes: node 1, class 2 (survive); node 2, class 1 (die); node 3, class 2 (survive); node 4, class 1 (die).]

Figure 1-3. CART tree corresponding to the CART partitions in Figure 1-2.

In most biomedical applications, we do not know the true probability functions, so we must estimate them from the data. How could we estimate p(die in surgery | node 1) using the 120 data points? The conventional method is to estimate p(die in surgery | node 1) as the proportion of patients in the training data who both fall in node 1 and die in surgery. This method is called the resubstitution estimate, because we resubstitute the same patients used to create the node in our procedure for estimating the conditional probability for that node. In Figure 1-3, we see that CART assigns node 1 to class 2 (survive surgery), and 12 of the 19 patients in node 1 survive, whereas seven of the 19 patients die. Thus, according to the resubstitution estimate, p(die in surgery | node 1) = 7/19 = 0.37. However, we know from the
definitions of the class probability functions that the true conditional probability is p(die in surgery | node 1) = 0.57. This 20% error in the estimated probability (0.37 as estimated by resubstitution versus a true value of 0.57) is substantial, and could readily cause incorrect clinical decisions or wasted laboratory research time. Table 1-1 shows the true probability of misclassification and the resubstitution estimate for all four nodes in the tree. The resubstitution estimate gives errors ranging from 3% to 20%.

Why is the resubstitution estimate so inaccurate? An adaptive classifier, such as CART or a neural network, attempts to find relatively pure partitions, that is, regions where one class is predominant. Because of random sampling, there are regions in which there is a predominance of points from one class over points from another class, regardless of the true underlying probability. Where such regions occur, an adaptive classifier is likely to create a partition. Subsequently, when we attempt to estimate the probability of each class using resubstitution, we calculate biased estimates, because the partitions are biased toward finding pure regions.

To further understand the source of the poor estimates, consider that when we estimate the misclassification rate for the entire CART tree, we use different patients than those with which we built the tree. Why should we do otherwise when we estimate the probability of misclassification for the partitions that comprise the tree? The misclassification rate for the entire CART tree is simply the weighted average of the probability of misclassification for the partitions.

Table 1-1. Probability of misclassification for the nodes in the CART tree.

    Node number    True probability    Resubstitution estimate
    1              .57                 .37
    2              .43                 .24
    3              .22                 .19
    4              .35                 .31
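The bias is easy to reproduce. The sketch below reuses the hypothetical sample_quadrant_data from the earlier sketch, with a scikit-learn tree standing in for CART, and compares each leaf's resubstitution estimate of p(class 1) with the true probability averaged over the points in that leaf.

```python
# Sketch: resubstitution estimates per leaf versus the true conditional
# probabilities. Because the tree places its splits where the sample
# happens to look pure, the resubstitution estimates are biased.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X, y = sample_quadrant_data(120, np.random.default_rng(1))  # earlier sketch
tree = DecisionTreeClassifier(max_leaf_nodes=4).fit(X, y)
leaf = tree.apply(X)                     # leaf id of each training point

# The true p(class 1) at a point is determined by its quadrant.
true_p1 = np.where(X[:, 1] > 0,
                   np.where(X[:, 0] > 0, 0.9, 0.4),
                   np.where(X[:, 0] > 0, 0.1, 0.6))

for t in np.unique(leaf):
    in_t = leaf == t
    resub = np.mean(y[in_t] == 1)        # resubstitution estimate
    truth = np.mean(true_p1[in_t])       # true probability over the leaf
    print(f"leaf {t}: n={in_t.sum():3d}  resub={resub:.2f}  true={truth:.2f}")
```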

2.0 ALTERNATIVE METHODS FOR PROBABILITY ESTIMATION

The problem of unreliable probability estimates has been identified previously [6, 9, 10, 11, 12]. Holte developed methods to remove terminal nodes that contain few data points and that do not pass a significance test [11], but did not directly address the problem of estimating the conditional probability of classes in the terminal nodes of a given tree. Quinlan [12] estimated probabilities using a weighted average of the resubstitution error rate of training data near the terminal node in feature space, and obtained more accurate probability estimates than those obtained with resubstitution. The method is computationally efficient, but does not use independent test data, and thus gives overly optimistic estimates. Breiman proposed a procedure that we refer to as Breiman's method [6], which we evaluated in the experiments reported here. We next outline four of the methods for conditional-probability estimation that we evaluated: repeated cross-validation, Breiman's method, and two variants of Breiman's method. Full descriptions of these and the other methods are given in [13, 14].

2.1 Repeated Cross-Validation

The procedure for the repeated cross-validation estimation method follows: (1) Split the original data into a training set (90% of the points) and a test set (10%). (2) Build a cross-validation tree using the training set, and count how often the points in the test set from each original node are misclassified by the cross-validation tree. (3) Repeat steps 1 and 2 many times, leaving out a different 10% each time. (4) Estimate the error rate for each node as the proportion of times the points in that node were misclassified in the repeated cross-validations. A code sketch of this procedure appears below.
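The following is a minimal sketch of this procedure under assumptions not in the paper: Python with scikit-learn trees standing in for CART, and ShuffleSplit for the repeated 90/10 splits (the paper leaves the splitting scheme to the implementer beyond "a different 10% each time").

```python
# Sketch of the repeated cross-validation node-error estimate described
# above. `tree` is the tree already fit on all of (X, y); its terminal
# nodes are the "original nodes" whose error rates we want.
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier

def repeated_cv_node_error(X, y, tree, n_repeats=100, seed=0):
    node_of = tree.apply(X)                           # original node of each point
    counts = {t: [0, 0] for t in np.unique(node_of)}  # node -> [misclassified, tested]
    splits = ShuffleSplit(n_splits=n_repeats, test_size=0.1, random_state=seed)
    for train_idx, test_idx in splits.split(X):
        # Step 2: build a cross-validation tree on the 90% training part.
        cv_tree = DecisionTreeClassifier(max_leaf_nodes=4).fit(X[train_idx], y[train_idx])
        wrong = cv_tree.predict(X[test_idx]) != y[test_idx]
        for i, w in zip(test_idx, wrong):             # tally by *original* node
            counts[node_of[i]][0] += int(w)
            counts[node_of[i]][1] += 1
    # Step 4: per-node error = proportion of times its points were misclassified.
    return {t: m / n for t, (m, n) in counts.items() if n > 0}
```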

2.2 Breiman's Method

The following is an outline of the method that Breiman proposed for estimating conditional probability in classification trees [6]. See [14] for a detailed explanation. Breiman's method may be viewed as having the flavor of an empirical Bayes estimator [15]. To begin, construct a CART classification tree, T. Denote by r_res(t) the resubstitution estimate of the error rate in a terminal node t. We expect that the bias in r_res(t) is inversely proportional to p(t), the proportion of the original sample that falls into t. That is, a node that contains a large proportion of the sample is less likely to be biased than is a node that contains only a few points. The Breiman estimate seeks to correct for this bias:

    r_Brei(t) = r_res(t) + a correction term.    (2-1)

We estimate the correction term as follows. Denote by R* the true misclassification rate for the entire tree. One possible estimate of R* is the resubstitution estimate, R_res(T). R_res(T) is the weighted average of the resubstitution estimates in each terminal node, where weighting is by the number of points in the training set that fall in each terminal node:

    R_res(T) = Σ_{t ∈ T} r_res(t) p(t).    (2-2)

(The error rate for the entire tree is the weighted average of the error rates in the terminal nodes.) This weighted average of the resubstitution estimates in each partition must equal the misclassification rate for the entire tree, and thus provides a constraint on the estimated misclassification probability in each terminal node. We expect that R_res(T) is biased, so we would like to replace it by a more accurate global error-rate estimate when estimating the true misclassification probability. Breiman constrains the weighted average of the new local estimates, r_Brei(t), to equal the global cross-validation estimate, R_cv(T):

    Σ_{t ∈ T} r_Brei(t) p(t) = R_cv(T).    (2-3)

The correction term in Equation 2-1 is a function of the difference between R_cv(T) and R_res(T), distributed inversely proportional to the number of data points in each partition. We also created two variants of Breiman's method by using two alternative global error-rate estimates, the bootstrap global estimate [9] and the repeated cross-validation global estimate, in place of the single cross-validation estimate.
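The text pins the correction down only up to its form: proportional to 1/n_t in each node, and scaled so that Equation 2-3 holds. The sketch below is one consistent instantiation, not necessarily Breiman's exact formula [6]; the function name and interface are invented.

```python
# Sketch: corrected per-node error rates satisfying Equations 2-1 to 2-3,
# with the correction for node t taken as c / n_t.
import numpy as np

def breiman_node_estimates(r_res, n_t, R_cv):
    """r_res: resubstitution error rate per terminal node;
    n_t: training points per terminal node; R_cv: global CV error estimate."""
    r_res = np.asarray(r_res, dtype=float)
    n_t = np.asarray(n_t, dtype=float)
    p_t = n_t / n_t.sum()
    R_res = np.sum(r_res * p_t)                 # Equation 2-2
    # delta(t) = c / n_t, with sum_t p_t * (c / n_t) = R_cv - R_res:
    c = (R_cv - R_res) / np.sum(p_t / n_t)
    r_brei = r_res + c / n_t                    # Equation 2-1
    # Equation 2-3 now holds by construction. Note that the corrected
    # rates are not forced into [0, 1]; the paper does not discuss clipping.
    assert np.isclose(np.sum(r_brei * p_t), R_cv)
    return r_brei
```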

3. EVALUATION OF THE ESTIMATION METHODS

We used Monte Carlo simulations to evaluate the conditional-probability estimation methods. We used data from 14 distributions, with one to eight dimensions (features), continuous and categorical variables, noise variables, sample sizes from 50 to 1000 data points, and Bayes' error rates ranging from 0.01 to 0.4. One of the data sets consists of DNA sequences from the bacterium E. coli, which provide a practical application of the methods [16]. The remaining 13 data sets are synthetic, and thus provide a reliable gold standard for determining the true misclassification probabilities and comparing the methods.

3.1 Example of Results for One Data Sample

In this section we show an example of the experimental results for one data sample, including additional probability estimation methods.

[Figure: scatter plot of the 200 sample points (class 1 = filled circles, class 2 = open circles), with both axes running from -1.5 to 1.5 and the CART partition boundaries overlaid.]

Figure 3-1. A sample of 200 data points and the partitions for the CART tree. CART assigns node 1 to class 1, node 2 to class 2, and node 3 to class 1.


Figure 3-1 shows a sample of 200 data points drawn from one of the data sets, and the partitions (terminal nodes) for the CART tree created for the sample. The distribution from which we drew these points is the same as the distribution for the points in Figures 1-2 and 1-3, except that it is rotated 45 degrees counter-clockwise. Table 3-1 shows the probability of misclassification error for each estimator for the sample in Figure 3-1. Table 3-2 shows the weighted average absolute error of the per-node misclassification probabilities, which is a summary statistic of the data in Table 3-1 for each estimator. The weighted average absolute error for an estimator is the sum over the three nodes of the absolute value of the true probability minus the estimated probability, weighted by the number of data points in each node:

    Weighted average absolute error = Σ_{nodes t} p(t) |true probability(t) − estimated probability(t)|.    (3-1)

For example, the weighted average absolute error of the Breiman-rcv estimate for the three nodes is |0.4443 − 0.4306| × 54/200 + |0.3493 − 0.2821| × 90/200 + |0.2790 − 0.2726| × 56/200 = 0.0358.
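As a check, Equation 3-1 applied to the Table 3-1 values (reconstructed below) reproduces the Table 3-2 column up to rounding of the displayed digits:

```python
# Sketch: Equation 3-1 on the Table 3-1 values. Output is approximately
# 0.0853, 0.0436, 0.0453, 0.0358 (small deviations come from the
# four-digit rounding of the published probabilities).
import numpy as np

n_t = np.array([54, 90, 56])                 # points per node, n = 200
true_p = np.array([0.4443, 0.3493, 0.2790])
estimates = {
    "Resub":       [0.3704, 0.2444, 0.2143],
    "Repeated cv": [0.4685, 0.2767, 0.2946],
    "Breiman":     [0.4192, 0.2747, 0.2615],
    "Breiman rcv": [0.4306, 0.2821, 0.2726],
}
p_t = n_t / n_t.sum()
for name, est in estimates.items():
    waae = np.sum(p_t * np.abs(true_p - np.asarray(est)))
    print(f"{name:12s} {waae:.4f}")
```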

Table 3-1. p(misclassification error) for one data sample.

    Estimation method    Node 1 (n = 54)    Node 2 (n = 90)    Node 3 (n = 56)
    True                 0.4443             0.3493             0.2790
    Resub                0.3704             0.2444             0.2143
    Repeated cv          0.4685             0.2767             0.2946
    Breiman              0.4192             0.2747             0.2615
    Breiman rcv          0.4306             0.2821             0.2726

For this particular data sample, the Breiman estimate using the repeated cross-validation global error is most accurate; the resubstitution estimate is least accurate. However, we cannot draw reliable conclusions from just one data sample. We must take repeated samples to see whether the differences among the estimates are significant.

Table 3-2. Weighted misclassification probability.

    Estimation method    Weighted p(misclassification)
    Resubstitution       .0853
    Repeated cv          .0436
    Breiman              .0453
    Breiman rcv          .0358

3.2 Example of Results for Multiple Data Samples

Table 3-3 shows the results for 10 data samples from the same distribution as in Table 3-2. Table 3-4 summarizes the results for each estimation method for the 10 data samples in Table 3-3. For these 10 samples from this distribution, Breiman's method using the repeated cross-validation estimate of the global error is the most accurate (RMSE = 0.0340) and the resubstitution estimate is least accurate (RMSE = 0.0655).

Table 3-3. Results for 10 samples from one distribution. Entries are the error in the estimate of the weighted p(misclassification).

    Sample    Resub     Rep cv    Breiman    Brei rcv
    1         0.0853    0.0436    0.0453     0.0358
    2         0.0668    0.0422    0.0198     0.0205
    3         0.0172    0.0488    0.0278     0.0207
    4         0.0779    0.0239    0.0281     0.0285
    5         0.0115    0.0210    0.0285     0.0197
    6         0.0607    0.0507    0.0342     0.0293
    7         0.0757    0.0383    0.0326     0.0347
    8         0.0652    0.0839    0.0127     0.0223
    9         0.0896    0.0663    0.0550     0.0527
    10        0.0544    0.0800    0.0458     0.0536

Table 3-4. Mean, SD, and RMS error for the 10 samples.

    Estimation method    Mean      Standard deviation    Root mean square
    Resub                0.0604    0.0266                0.0655
    Repeated cv          0.0499    0.0213                0.0538
    Breiman              0.0330    0.0127                0.0351
    Breiman rcv          0.0318    0.0127                0.0340
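The Table 3-4 summaries follow directly from the Table 3-3 columns. Reproducing the published values requires the n - 1 denominator for the standard deviation and a plain mean of squares for the RMS; that convention is our inference from the numbers, shown here for the resubstitution column.

```python
# Sketch: Table 3-4 statistics from the Table 3-3 resubstitution column.
import numpy as np

resub = np.array([0.0853, 0.0668, 0.0172, 0.0779, 0.0115,
                  0.0607, 0.0757, 0.0652, 0.0896, 0.0544])
print(f"mean {resub.mean():.4f}")                  # 0.0604
print(f"sd   {resub.std(ddof=1):.4f}")             # 0.0266
print(f"rms  {np.sqrt(np.mean(resub**2)):.4f}")    # 0.0655
```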

3.3 Summary of Experimental Results

In this section, we summarize the results of the experiments. Section 3.1 contained an example of the complete results for one experiment. Chapter 5 of [13] reports the complete results for all data sets. In these experiments, we compared the accuracy of the estimators on samples of varying sizes from each of the synthetic data distributions. The experiments show that, for these sample sizes and distributions, the Breiman method using the repeated cross-validation global error rate is the most accurate overall. Table 3-5 shows the results of the comparison of the estimators across distributions and sample sizes (other estimators described in [13], but not included in this table, are not competitive).


Table 3-5. Comparison of estimators across distributions and sample sizes.

    Estimation method    Number of times most accurate    Number of times second most accurate
    Resubstitution       2                                2
    Repeated cv          1                                0
    Breiman              21                               9
    Breiman rcv          24                               14

The exception to the dominance of the Breiman method using the repeated cross-validation global error rate occurs in distributions with a Bayes' error rate less than 0.1, for which the bootstrap 0.632 estimate or Breiman's method plus the bootstrap are most accurate, although Breiman's method plus repeated cross-validation is closely competitive for such distributions.

4. CONCLUSIONS

The current method for estimating probability in the partitions (terminal nodes) of classification trees is inaccurate. We have used Monte Carlo simulations to compare existing and new methods for estimating probability in the partitions. We evaluated the methods using 14 data distributions, with one to eight dimensions (features), continuous and categorical variables, noise variables, sample sizes from 50 to 1000 data points, and Bayes' error rates ranging from 0.01 to 0.4. For these sample sizes and distributions, the Breiman method using the repeated cross-validation global error rate is the most accurate overall, as measured by mean square error. For data sets with a low Bayes' error rate (less than 0.1), either a local bootstrap 0.632 estimate or Breiman's method modified to use a bootstrap estimate of the global misclassification rate is most accurate, although the Breiman plus repeated cross-validation method is closely competitive for such distributions.

ACKNOWLEDGMENTS

This research was supported by grant LM-07033 from the National Library of Medicine, grant CA55325-01 from the National Cancer Institute, grant DMS-9101548 from the National Science Foundation, and by grants from IBM. Computing facilities were provided to SUMEX-AIM under grant LM-05208 from the NLM. Thanks to Gio Wiederhold and Douglas Brutlag for serving on Michael Walker's dissertation committee. Ted Shortliffe provided the environment at the Section on Medical Informatics in which this work could be performed. Thanks also to our colleagues at Stanford for many useful discussions related to this research.

REFERENCES

[1] M. James. Classification Algorithms. London, England: Collins, 1985.
[2] R. E. Neapolitan. Probabilistic Reasoning in Expert Systems: Theory and Algorithms. New York: Wiley-Interscience, 1990.
[3] G. F. Cooper and E. Herskovits. A Bayesian Method for the Induction of Probabilistic Networks from Data. Section on Medical Informatics, Stanford University. Technical Report KSL-91-02, 1991.
[4] J. R. Metzner and B. H. Barnes. Decision Table Languages and Systems. New York: Academic Press, 1977.
[5] D. E. Rumelhart and J. L. McClelland. Parallel Distributed Processing. Cambridge, MA: MIT Press, 1987.
[6] L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone. Classification and Regression Trees. Monterey, CA: Wadsworth and Brooks/Cole, 1984.
[7] C. Giampaolo, A. T. Gray, R. A. Olshen and S. Szabo. "Predicting Chemically Induced Duodenal Ulcer and Adrenal Necrosis with Classification Trees," vol. 88 (July), pp. 6298-6302, 1991.
[8] M. H. Liebenhaut, R. T. Hoppe, B. Efron, J. Halpern, T. Nelsen and S. A. Rosenberg. "Prognostic Indicators of Laparotomy Findings in Clinical Stage I-II Supradiaphragmatic Hodgkin's Disease," vol. 7, pp. 81-91, 1989.
[9] B. Efron. The Jackknife, the Bootstrap, and Other Resampling Plans. Philadelphia, PA: Society for Industrial and Applied Mathematics, 1982.
[10] D. A. Bloch and M. R. Segal. "Empirical Comparison of Approaches to Forming Strata," Journal of the American Statistical Association, vol. 84(408), pp. 897-905, 1989.
[11] R. C. Holte, L. E. Acker and B. W. Porter. "Concept Learning and the Problem of Small Disjuncts," in Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, Detroit, Michigan. San Mateo, CA: Morgan Kaufmann, vol. 1, pp. 813-818, 1989.
[12] J. R. Quinlan. "Improved Estimates for the Accuracy of Small Disjuncts," Machine Learning, vol. 6, pp. 93-98, 1991.
[13] M. G. Walker. Probability Estimation for Classification Trees and Sequence Analysis. Ph.D. dissertation, Departments of Computer Science and Medicine, Stanford University. Report STAN-CS-92-1422, 1992.
[14] M. G. Walker. Probability Estimation for Classification Trees. Section on Medical Informatics, Stanford University. Technical Report 92-01, 1992.
[15] B. Efron and C. Morris. "Stein's Estimation Rule and Its Competitors: An Empirical Bayes Approach," vol. 68(341), pp. 117-130, 1973.
[16] C. B. Harley and R. P. Reynolds. "Analysis of E. coli Promoter Sequences," Nucleic Acids Research, vol. 15(5), pp. 2343-2361, 1987.

