
Published in final edited form as: Anal Quant Cytopathol Histpathol. 2013 August; 35(4): 181–188.

Classification in karyometry: performance testing and prediction error

PH Bartels1 and HG Bartels2

1College of Optical Sciences and Arizona Cancer Center, University of Arizona, Tucson, Arizona 85724-5024, USA

2College of Optical Sciences and Arizona Cancer Center, University of Arizona, Tucson, Arizona 85724-5024, USA

Abstract

Classification plays a central role in quantitative histopathology. Success is expressed in terms of the accuracy of prediction for the classification of future data points and an estimate of the prediction error. The prediction error is affected by the chosen procedure, e.g., the use of a training set of data points, a validation set, an independent test set, the sample size, and the learning curve of the classification algorithm. For small samples, procedures such as the “jackknife”, the “leave-one-out” and the “bootstrap” are recommended to arrive at an unbiased estimate of the true prediction error. All of these procedures rest on the assumption that the data set used to derive a classification rule is representative for the diagnostic categories involved. It is this assumption that has to be carefully verified in quantitative histopathology before a clinically generally valid classification procedure can be claimed.

Introduction


Classification of nuclei, lesions, and patients plays a central role in quantitative histopathologic studies. There is a rich literature on classification procedures, on the training of classification algorithms, and on the testing of their performance. There is the comprehensive collection of seminal studies in the engineering field edited by Agrawala (1). There are the classical texts by Fukunaga (2) and by Duda and Hart (3). There have been extensive studies of the behavior of classification algorithms. Computer simulations and Monte Carlo studies have led to a thorough understanding of the underlying processes. Most authors in the field agree, though, that there is no general theory guiding method development. The existing procedures and recommendations are essentially based on heuristics. Much of the literature, particularly on error estimation, requires a rather advanced background in mathematics and statistics for a reader to appreciate the recommendations for practical applications (4,5,6).

Corresponding author: Hubert Bartels, University of Arizona Cancer Center, 1515 N Campbell Avenue, POB 245024, Tucson AZ 85724-5024, (520) 319-7836, [email protected].


However, the correct practical use of classification algorithms offered in software packages is fairly straightforward (7, 8).


Karyometry presents some particular challenges to the development, evaluation and application of classification procedures. In many instances the validity of certain assumptions underlying classification procedures is in doubt: clinical materials by their very nature rarely offer entirely homogeneous populations. The idea of a cohort of patients - even when matched for certain anamnestic variables - as offering samples from a “single stochastic source” - in the jargon of the pertinent literature - is, at best, an approximation. It is difficult to state with confidence a priori at what size a truly representative sample for a diagnostic category has been attained. An analysis of nuclear populations usually involves thousands of nuclei. Here again, though, the presence of subpopulations of different phenotypes with often subtle differences in karyometric characteristics raises questions concerning the homogeneity of the clinical samples.


There are three assumptions underlying any discussion of classifier development and performance. These are, first, that the sample used for the training of a classification algorithm is truly representative for its class. Second, the data points - i.e., nuclei, lesions or patients - are assumed to be true random samples. The researcher assembling the data sets must under no circumstances exert any judgment, or preselect nuclei or patients. Third, it is assumed that the data for each category are drawn from a single distribution - as mentioned above, “from a single stochastic source.” The assumption implies that the population from which the objects are taken is homogeneous. For the elements of the training or test sets to be random samples from a homogeneous population, though, is by itself not yet enough. The elements in a training set should also be fully representative of the population “in general”. The literature calls this requirement “that they fully cover the problem space”. This is a condition that in karyometry is often hard to attain or even to verify a priori. Given the notable variability in any biologic entity, one might have to assemble a data set of considerable size to make it fairly representative.


Unless one has a fully representative data set for the classifier, the system is not really trained to finality and the true prediction error is hard to approximate. Exactly what sample size is required to achieve full representation depends on the task at hand. Practical experience suggests that equality of the apparent error from the training set and the prediction error from the test set indicates that full representation has been reached. It is not unusual in karyometry to find this to be true at sample sizes of a few hundred nuclei. It is the objective of this article to provide guidance for the practical use of classification procedures and to explain the underlying rationale.

Basic concepts

The basic process begins with the assembly of two data sets, representing diagnostic categories, to be distinguished by a decision rule. A search is conducted for characteristics,
or “features”, which have differences in value for the two diagnostic categories. The set of selected features is called a “feature vector”. Each feature vector, in karyometry, represents a nucleus, a lesion, or a patient. In the following these shall be referred to as “objects”, or as “data points”. The feature vectors for the two samples to be discriminated are submitted to a classification algorithm. The algorithm derives a decision rule. That rule typically is a linear combination of feature values. Computed for a single data point, it results in a “score”. The score value is compared to a threshold. If the score exceeds the threshold, the data point is assigned to one diagnostic category; if it falls below the threshold, it is assigned to the other diagnostic category. The decision rule is applied to the data sets and the proportion of correctly assigned objects is determined. The result is presented as a classification matrix. The rows represent the true diagnostic category, for data points of known label. The columns present the assignments made by the classification algorithm. Table I shows an example.
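For readers who wish to experiment, the scoring step described above can be sketched in a few lines of Python. The feature values, weights and threshold below are purely hypothetical illustrations, not values from any karyometric study; only the mechanics of a linear score compared against a threshold are shown.

```python
import numpy as np

# Hypothetical feature vectors (rows = objects, columns = karyometric features).
features = np.array([
    [0.42, 1.10, 0.30],
    [0.55, 0.95, 0.50],
    [0.30, 1.30, 0.20],
])

# Hypothetical weights of a linear decision rule and a hypothetical threshold.
weights = np.array([1.5, -0.8, 2.0])
threshold = 0.45

# The score is a linear combination of the feature values.
scores = features @ weights

# A score above the threshold assigns the object to one category ("A"),
# otherwise to the other category ("B").
assignments = np.where(scores > threshold, "A", "B")
print(scores, assignments)
```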


There are objects assigned to their correct category and there are objects that have been misclassified, i.e., assigned to the incorrect category. The correct recognition rate, or “overall accuracy”, here would be (327 + 185)/(389 + 240) = 512/629 = 81.4%. The estimated classification error would be 18.6%. In this example the distinguishing features did not completely separate objects from the two diagnostic categories. The misclassified objects are referred to as classification errors. At this point one might add a feature, or delete a feature that does not carry a notable weight. Then one would run the classification algorithm again, to see whether a better distinction with a lower error rate could be attained. This brings us to an important concept in classification methodology: the estimation of an error rate.
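As a quick check of the arithmetic, the overall accuracy and error rate follow directly from the counts in Table I; the small sketch below simply reproduces the numbers quoted above.

```python
# Classification matrix from Table I: rows are the true categories,
# columns are the assignments made by the algorithm.
matrix = {
    "Class A": {"Class A": 327, "Class B": 62},   # 389 objects in total
    "Class B": {"Class A": 55,  "Class B": 185},  # 240 objects in total
}

correct = sum(matrix[c][c] for c in matrix)                 # 327 + 185 = 512
total = sum(sum(row.values()) for row in matrix.values())   # 389 + 240 = 629

accuracy = correct / total      # ~0.814, i.e., 81.4% overall accuracy
error_rate = 1.0 - accuracy     # ~0.186, the estimated classification error
print(f"accuracy {accuracy:.1%}, error {error_rate:.1%}")
```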

Classifier performance: estimation of error rates


In the basic procedure shown above, objects used in the development of the decision rule were also involved in the estimation of the decision rule's performance. But the rule may have been fitted specifically to the data set from which it was derived. The result might be optimistic. It could not be expected to be as favorable when the rule is applied to new, independent objects not involved in the rule's derivation. The result of the procedure may, in principle, be biased. The error rate is known as the “apparent error rate” Eapp. The bias resulting in an optimistic outcome causes the error rate to be lower than the rate that would be expected in the application of the rule to unknown, new objects from the same diagnostic categories (8). The error rate that the decision rule would attain on new, independent objects is called the “generalization error rate”, or the “true prediction error” Etrue. The reason for the bias in the apparent error rate is that the samples used to derive the result may not have been fully representative for their categories. If they had been, the application to new, independent objects would yield the same misclassification rate as for the original data sets.


This, however, is rarely the case for biologic materials. The conventional wisdom, according to which the procedure leads to bias, is generally accepted. Using the data sets from the formulation of the decision rule to estimate the classifier error rate is known as “resubstitution”, and the classification error Eapp as resubstitution or reclassification error.

The training set/test set procedure

A common method to avoid the resubstitution error and to obtain an unbiased estimate of the true prediction error is the training set/test set procedure.


The original data sets are partitioned. Often, 50% of the objects from category A and 50% of the objects from category B are used to derive a decision rule, as a “training set”. The other 50% of each category are used as an independent “test set”. The decision rule is then applied to the test set. The classification result from the test set is free of bias. The recommendation is to report only the result from the test set. In karyometry the clinical materials representing diagnostic categories usually are a set of nuclei from each case, and a set of cases from each diagnostic category. Typical values would be 100 nuclei per case, and from 10 to 50 cases per diagnostic category. This would allow training sets of a minimum of 500 nuclei from five cases, or up to 2,500 nuclei from 25 cases. The partition into training and test sets should be made at the case level. One should not randomly select, e.g., every other nucleus to be assigned to the training or test set, as this would not result in an independent sample for the test set. The results are now presented by two classification matrices, one for the training set, and one for the test set.
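A case-level partition of the kind recommended here can be sketched as follows. The case identifiers and the 50/50 split fraction are illustrative assumptions; in practice every case would carry its own set of measured nuclei, all of which follow their case into the training or the test set.

```python
import random

def split_cases(case_ids, train_fraction=0.5, seed=1):
    """Partition cases (not individual nuclei) into a training and a test set,
    so that all nuclei of a given case end up on the same side of the split."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(round(train_fraction * len(ids)))
    return ids[:n_train], ids[n_train:]

# Hypothetical example: 10 cases per diagnostic category, 100 nuclei each.
cases_a = [f"A{i:02d}" for i in range(10)]
cases_b = [f"B{i:02d}" for i in range(10)]

train_a, test_a = split_cases(cases_a)
train_b, test_b = split_cases(cases_b)
print("training cases:", train_a + train_b)
print("test cases:", test_a + test_b)
```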


It is to be expected that the overall accuracy 1 - Etest is somewhat lower than that from the training set. Practical experience from the pattern recognition literature suggests a decrease by less than 15%. If the classification error is more than that, one might want to re-examine the selected feature set. It has been customary, finally, to apply the decision rule to the combined training set and test set, thus getting an estimate of the classification error on a larger sample size. This, of course, also involves resubstitution, may introduce bias, and may provide too optimistic a result. Any resubstitution has been criticized and its use discouraged. The categorical rejection of a classification procedure involving resubstitution is, though, not always justified. The resubstitution bias decreases with increasing sample size. The apparent error induced by the training increases, and asymptotically approximates the true prediction error. The test set error decreases in a similar manner, approximating the true prediction error with increasing sample size, as reflected in the classifier's learning curve. This is seen in fig. 1. The relationship between the apparent prediction error and the generalized, true prediction error as a function of sample size is demonstrated by the “learning curve” of a classifier. The learning curve of a classifier usually takes the form of a power law function (9). The estimated apparent prediction error becomes monotonically less optimistic with increasing sample size. For large samples the training error becomes equal to the true prediction error, because the samples for both diagnostic categories have become fully representative for their populations. For the test set error the opposite trend is true. It decreases with increasing sample size and asymptotically approximates the true prediction error.


For large samples both the apparent error and the test set error leave only a negligible bias. The distinction between apparent error and true prediction error is dropped altogether for large samples in the so-called “one-shot” approach (10). Sample size thus plays an important role in assessing classification errors. The literature on classification methodology considers samples of 10,000 and more as very large, and samples in the range of thousands as intermediate. Samples of fewer than 500 are generally considered small in the engineering literature. The question then becomes: what is a “large sample” in karyometry? The heuristic rule here, for a multivariate analysis, is that a sample comprising 10 times as many objects as there are variables is accepted as a large sample, for which resubstitution would not be optimistically biased. The asymptotic approximation of the test set error to the true prediction error as a function of sample size is closely related to the learning curve of the classifier. The learning curve follows a power law and has the form

    Etest(n) = Etrue + C / n^x

The constant C and the exponent x are task specific. They have to do with the dispersion of the test set data and how many data points would be needed to have a representative sample. With increasing sample size n the second term in the sum goes to zero and the true prediction error remains. The exponent x affects the sample size at which a certain difference from the true prediction error is reached, say, 1% or so. The value of x is slightly larger than 1.00, but it has a notable influence on the effective sample size: for n = 200 and x = 1.00, n^x = 200, but for x = 1.05, n^x ≈ 260, and for x = 1.10, n^x ≈ 339.
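The behavior of this power-law learning curve can be reproduced numerically. In the sketch below the constants C and x and the assumed true prediction error are arbitrary illustrative values, not fitted to any data; the formula is the one given above.

```python
def test_set_error(n, e_true=0.15, c=2.0, x=1.05):
    """Expected test set error for a training sample of size n, assuming the
    power-law learning curve Etest(n) = Etrue + C / n**x (illustrative constants)."""
    return e_true + c / n ** x

# The test set error approaches the assumed true prediction error (0.15) from above.
for n in (50, 100, 200, 500, 1000, 2000):
    print(n, round(test_set_error(n), 4))

# The "effective sample size" n**x quoted in the text, for n = 200:
for x in (1.00, 1.05, 1.10):
    print(f"x = {x:.2f}: 200**x = {200 ** x:.1f}")
```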


In karyometry the classification of nuclear populations practically always involves several hundred and often thousands of nuclei. The remaining resubstitution error then is very small.


Assessing overall efficacy of a chemopreventive agent on a treated and a control cohort, even in an exploratory study, may involve about 20 patients/diagnostic category, i.e., there would be 2,000 nuclei/diagnostic category, and typically from four to eight variables. In the development of a criterion indicating risk for the development of an aggressive type of lesion one could expect 50 - 100 patients, i.e., up to 10,000 nuclei recorded and evaluated. The decision rule typically involves from three to eight variables at most. In both instances, the resubstitution error may be negligible. The assessment of nuclei from a single or a small number of cases invariably only provides small samples. This is certainly also so when classification involves nuclear subpopulations of different phenotype, as they occur in single cases. Attention then needs to be paid to possible bias. There is always some uncertainty as to what sample size would be representative for a diagnostic category.


Weiss and Kulikowski (7) point out that the sample size ensuring full representation may not be unreasonably high and may, in fact, sometimes be surprisingly small. One knows the size of the test set. For any classifier the quality of an error estimate depends directly on the number of objects in the test set. And the accuracy of the estimate, on randomly drawn, independent test objects, follows a binomial distribution. This means we know not only the error rate estimated from the test set but also how far off it can be: the highest plausible error rate is given by the confidence limit of the binomial distribution, and there is only a low percentage chance that the error rate is higher. Thus, for example, in a situation where an error rate of 32% had been estimated on a nuclear population of 2,000 nuclei, the true error rate is likely not higher than 32% + 1.04%. The standard error is defined as

    SE = sqrt( p (1 - p) / n )

where p is the estimated error rate and n is the number of test objects; i.e., the sample size is quite adequate for an estimate of the true prediction error. For an estimate of the percentage of nuclei of a certain phenotype in a single case, with n = 100 and the same estimated error rate of 32%, the result would be 32% + 4.66% = 36.7%.

Even for samples of intermediate size the difference between apparent error and true prediction error may not be substantial, though. If the classification error from the training set matches the classification error from the test set, it is an indication that the decision rule had not been “bent” to fit the training set. It indicates that both the training set and the test set are fully representative for the diagnostic categories at hand, and that the apparent error has become practically equal to the true prediction error.
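The two numerical examples above (32% + 1.04% on 2,000 nuclei, 32% + 4.66% on 100 nuclei) follow directly from the binomial standard error formula; the small sketch below recomputes them.

```python
from math import sqrt

def binomial_se(error_rate, n):
    """Standard error of an error-rate estimate from n independent test objects."""
    return sqrt(error_rate * (1.0 - error_rate) / n)

# Estimated error rate of 32% on 2,000 nuclei versus on 100 nuclei.
for n in (2000, 100):
    se = binomial_se(0.32, n)
    print(f"n = {n}: SE = {se:.2%}, 32% + SE = {0.32 + se:.2%}")
```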


In the classification of cases, sample sizes tend to be small. To obtain an unbiased estimate of the true prediction error, partitioning of the data sets into 50% training and 50% test sets is common. The recommendations in the literature tend toward partitionings of 2/3 training set objects and 1/3 test set objects, or even 90% versus 10%. The reasoning is that this makes more information available for the defining of a decision rule.

Use of a validation set

The concern with optimistic bias in the apparent error is justified in the classification of cases. It has been extended to the test set used in the training set/test set procedure as described above. There the result from the test set is generally accepted as unbiased. The argument here is, though, that the training of the system may involve observing the result obtained from the test set. Making adjustments to the decision rule, therefore, actually involves the test set in the process and impairs complete independence.


In response, a procedure is recommended where the data sets are partitioned into three components: a training set and a validation set for the development of the decision rule, and a truly independent test set to which that rule is then applied (4, 6), as shown in fig. 2.

Classification of intermediate and/or small samples

The estimate of the generalization, or true prediction, error is a function of the sample size of the training set. In many studies one might expect that the size of a fully representative sample might be prohibitively large, or might simply not be available. The classifier therefore would have to be tested on a sample of smaller size for an estimate of the true prediction error. For the classification of medium and small size data sets a number of methods are recommended.


The training set/test set sequence, with its partitioning into just two data sets, is expanded. In the “jackknife” procedure the preferred choice is a five-fold to ten-fold cross-validation (11). In the “leave-one-out” procedure a sample of n data points is partitioned into training sets of size n - 1. The “bootstrap” method is recommended especially for small samples, which are resampled with replacement up to several hundred times, followed by the same number of training set/test set procedures.

The jackknife procedure

In this procedure the data sets are divided into a number of subsets. All but one are used to derive a decision rule, which is then tested on the left-out subset. Since there are several subsets, there are several decision rules and several estimates of the error rate. The estimate of the true error rate is their average. Its reliability is ascertained by the standard deviation of this set of estimates.


The partitioning is shown in fig. 3. For a sample of 300 objects and a partitioning into 5 subsets the training set thus has 240 entries.
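The bookkeeping of such a five-fold partition can be sketched as follows. The classification algorithm itself is left abstract, since the article does not prescribe a particular one; only the fold structure and the resulting training set sizes are illustrated.

```python
import random

def k_fold_indices(n_objects, k=5, seed=1):
    """Split object indices into k folds; each fold in turn serves as the test
    set while the remaining k-1 folds form the training set."""
    indices = list(range(n_objects))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# For 300 objects and 5 folds each training set holds 240 objects,
# as in the example in the text.
for train, test in k_fold_indices(300, k=5):
    print(len(train), len(test))   # 240 60, five times
```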


The risk that one encounters with any decrease in the size of the training set is that one may end up on a portion of the learning curve of the classifier well below the asymptotic approach to the true prediction error. This would result in an overestimate of the true prediction error, as shown in fig. 4. One needs to consider the trade-off between the number of partitions and the effect of working with a smaller sample size for the training set. For the example above, the five-fold partition would result in a training set of 240 objects. This would provide an acceptable approximation to the true prediction error. But if for the same task one had only 80 samples to begin with, the training set would only have 64 objects. This may very well place the problem into a range of the learning curve where the slope has still kept it well below the accuracy given by the true prediction error (4).


This can be seen when drawing a line vertically from the abscissa at the sample size of 64 to the learning curve. If one chose to employ a tenfold cross validation this would further reduce the size of the effective training set and it might lead to an overestimate of the true prediction error. Just how much the true prediction error would be overestimated depends not only on the available sample size, but also on the slope of the learning curve. This situation may not become a problem in the processing of nuclear populations. However, when the data points represent cases it is a very relevant consideration.

The leave-one-out procedure

In this procedure, with a sample size of n, training is done on n - 1 data points and testing on the one data point left out. This process is then repeated until every data point has been left out once, i.e., one develops n classification rules.


This is a very labor-intensive procedure. Its single advantage is that for a small sample the leave-one-out method is the only one to provide an unbiased estimate of the true prediction error. For this estimate one uses the average error over the n rules. There are n such estimates, so one obtains an estimate of the variance of the true prediction error as well. The procedure has a number of disadvantages. There is the need to develop n decision rules, and the final estimate of the true prediction error is based on an average over the n classifiers, so which rule does it refer to?
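The leave-one-out bookkeeping looks like this. The decision-rule fitting and scoring functions are placeholders for whatever classifier is actually used; the sketch only shows how the n error indicators and their average arise.

```python
def leave_one_out_error(objects, labels, fit_rule, apply_rule):
    """Train on n-1 objects, test on the single left-out object, repeat n times,
    and return the average error as the estimate of the true prediction error."""
    errors = []
    for i in range(len(objects)):
        train_x = objects[:i] + objects[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        rule = fit_rule(train_x, train_y)          # one of the n decision rules
        predicted = apply_rule(rule, objects[i])   # classify the left-out object
        errors.append(0.0 if predicted == labels[i] else 1.0)
    return sum(errors) / len(errors)

# Toy usage with a hypothetical one-feature threshold rule:
fit = lambda xs, ys: sum(xs) / len(xs)             # "rule" = mean of training values
apply_rule = lambda rule, x: "A" if x > rule else "B"
print(leave_one_out_error([0.2, 0.3, 0.8, 0.9], ["B", "B", "A", "A"], fit, apply_rule))
```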

The bootstrap procedure

For very small samples, say of 30 objects or so, finding the best estimate for the prediction error may be difficult. Traditionally for such samples the leave-one-out method has been used. It is unbiased, but the variance of the prediction error estimate is quite high for small samples. In such small samples the variance has a dominating influence on the result. Thus, if one had a low-variance procedure, for small samples even some bias may be accepted. The bootstrap method offers such a procedure. It was introduced in 1983 by Efron (12). Bootstrapping is a resampling method. If one has a sample of n cases, one draws a resample of n objects, with replacement. In sampling with replacement, an object may be drawn twice or even multiple times for a resample, while other objects are not drawn at all. Sampling theory shows that in such a procedure, on the average, 63.2% of the original objects are drawn for a resample, and 36.8% are not drawn. The objects not drawn are used as the test set. The resampling may be done a very large number of times, such as 100 to 200 times. The resamples are treated as independent data sets in the subsequent (200 or so) training set/test set procedures. The so-called 0.632 procedure results in a low-variance estimate of the prediction error, but it has an optimistic bias.
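The resampling step itself can be sketched as follows. Only the draw with replacement and a numerical check of the 63.2% figure are shown; the subsequent training set/test set runs would use whatever classifier is chosen and are therefore omitted.

```python
import random

def bootstrap_split(n_objects, seed=None):
    """Draw one bootstrap resample of size n with replacement; the objects
    never drawn form the corresponding test set."""
    rng = random.Random(seed)
    drawn = [rng.randrange(n_objects) for _ in range(n_objects)]
    in_sample = set(drawn)
    out_of_sample = [i for i in range(n_objects) if i not in in_sample]
    return drawn, out_of_sample

# On average roughly 63.2% of the original objects appear in a resample
# and roughly 36.8% are left out, as stated in the text.
n, reps = 30, 200
fraction_drawn = sum(
    len(set(bootstrap_split(n, seed=r)[0])) for r in range(reps)
) / (reps * n)
print(f"average fraction of distinct objects drawn: {fraction_drawn:.3f}")
```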


Conclusions

The engineering literature emphasizes that it is useful to have rules discriminating between objects from different classes, but that the real challenge is to have rules that allow an accurate prediction for new objects in the future. This is certainly true; accurate, generally valid prediction rules are more difficult to derive. In karyometry, though, even the ability to distinguish accurately between objects in two data sets plays an important role, e.g., in the assessment of grade or, in general, in an accurate quantitative assessment of a lesion. And it is by no means evident that such a classification rule is simple and straightforward to derive. In karyometry, the resubstitution error might be the least of the problems to worry about, but sample inhomogeneity and inadequate representation can pose big problems.


The literature on automated pattern recognition, machine vision, and classification lists the “representative sample” as a prime requirement for the development of a classification rule. In technology applications this requirement is readily satisfied but, in karyometry, it remains a major problem. In prospective karyometric studies in which material from one and the same institution is used, careful control of processing, i.e., of fixation, sectioning, and staining, is possible. When materials collected at different institutions are used, or even when prepared from archival material at the same institution, the assumption of having a representative sample may need to be examined. Karyometric characteristics generally do not have a clearly perceived visual appearance. Histopathologic preparations that look convincingly the same as others may, in their digital representation, be distinctly different. Consequently, one may well find agreement between the classification success from a training set and a test set for the clinical materials in a given study. But one may also find that the classification rule fails when applied to a set of histopathologic slides from a different institution, even when those were prepared according to a well-defined protocol.


The differences may be subtle. Training on the new material may show the same karyometric features as effective, but the coefficients in the discriminant function may be a little different. A representative sample for a classification algorithm therefore may not evolve until material subject to all small differences in preparation has been included in the training. A clinically generally valid classification rule must be expected to follow from an iterative process. The problem of a representative sample becomes particularly relevant in situations where the original set of cases is small. One has to remember here that nuclei from a given diagnostic category, and especially from a given grade of a lesion, form not crisp, but fuzzy sets. What one has as a “representative sample” is a small number of members of a fuzzy set.


Classification methodologies for small samples have been developed to allow estimates of the prediction error balanced between the variance of the estimate and its bias. Bootstrapping is a good example of this. A small sample is resampled with replacement, possibly hundreds of times, until a large number of these small data sets have been generated. They allow a precise estimate of the prediction error. The bootstrapped data sets have the same stochastic properties as the original small sample. The derived classification rule is generally valid, but only for additional materials with the same stochastic properties as the original data set. As a small sample of a fuzzy set, it is unlikely that the originally included cases represent a diagnostic category in its entirety. The problem of representation therefore is doubly relevant when the original material is but a small sample. Again, this is rarely a problem in technology applications, but in histopathologic materials it is to be taken into serious consideration.

Acknowledgments

This work was supported in part by grant PO1 CA-27502 from the National Institutes of Health, Bethesda, MD, 20892, and a gift from Michael Lewis, Los Angeles, CA.


Grant funding: This study was supported in part by grant number, P01CA027502 from the National Institutes of Health.

References

1. Agrawala AK, editor. Machine recognition of patterns. IEEE Press; 1977.
2. Fukunaga K. Introduction to statistical pattern recognition. Academic Press; New York: 1972.
3. Duda R, Hart P. Pattern classification and scene analysis. Wiley; New York: 1973.
4. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. Springer; New York: 2001.
5. Michie D, Spiegelhalter DJ, Taylor CC. Machine learning, neural and statistical classification. Ellis Horwood; New York: 1994.
6. Schuermann J. Pattern classification. John Wiley; New York: 1996.


7. Weiss SM, Kulikowski CA. Computer systems that learn: classification and prediction methods. Morgan Kaufmann; San Mateo, Calif: 1990.
8. James M. Classification algorithms. Wiley; New York: 1985.
9. Duda RO, Hart PE, Stork DG. Pattern classification. 2nd ed. John Wiley; New York: 2000. p. 492.
10. Henery RJ. Methods for comparison: train and test. In: Michie D, Spiegelhalter DJ, Taylor CC, editors. Machine learning, neural and statistical classification. Ellis Horwood; New York: 1994. p. 107.
11. McLachlan GJ. Discriminant analysis and statistical pattern recognition. John Wiley; New York: 1992.
12. Efron B. Estimating the error rate of a prediction rule: some improvements on cross-validation. J Amer Statist Assoc. 1983; 78:316–331.


Fig. 1. Apparent error and test set error of a classifier as a function of sample size


Fig. 2. Partitioning of data into a training set, a validation set and a test set


Fig. 3. Jackknife procedure with a five-fold cross-validation


Fig. 4. Overestimation of the true prediction error resulting from a sample size so small that the learning curve of the classifier is not yet approximating the true prediction error

Table I
Classification matrix: true diagnostic category (rows) versus assignment by the algorithm (columns)

True dx category    Sample size    Assigned Class A    Assigned Class B
Class A             389            327 (84%)           62 (16%)
Class B             240            55 (23%)            185 (77%)
