
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 4, APRIL 2014

Analyzing Training Information from Random Forests for Improved Image Segmentation

Dwarikanath Mahapatra

Abstract—Labeled training data are used for challenging medical image segmentation problems to learn different characteristics of the relevant domain. In this paper, we examine random forest (RF) classifiers, the knowledge they learn during training, and ways to exploit it for improved image segmentation. Apart from learning discriminative features, RFs also quantify their importance in classification. Feature importance is used to design a feature selection strategy critical for high segmentation and classification accuracy, and also to design a smoothness cost in a second-order MRF framework for graph cut segmentation. The cost function combines the contribution of different image features like intensity, texture, and curvature information. Experimental results on medical images show that this strategy leads to better segmentation accuracy than conventional graph cut algorithms that use only intensity information in the smoothness cost.

Index Terms—Random forests, training information, feature selection, graph cut, segmentation, probability maps, supervoxels, context.

I. INTRODUCTION

MEDICAL image segmentation is challenging due to image modality, organ type, and the segmentation framework. The task is further complicated by varying scanner parameters across modalities, low resolution, image noise, and, in many cases, poor contrast. Two very popular approaches for medical image segmentation are active contours [1] and graph cuts [2]. Initial works using these methods relied primarily on low-level image information like intensity, texture, and edges to segment the target organ. However, image features alone were not sufficient for accurate segmentation, which fostered the exploration of ways to include prior knowledge from training datasets. Subsequently, many works incorporated prior shapes and statistical information into level sets [3]–[5] and graph cut segmentation [6]–[8]. Another approach to exploiting training data is to learn image features that distinguish between organ and background structures, as in active appearance models (AAMs) [9] and active shape models (ASMs) [10]. In this work we propose a method that analyzes the training process of random forest (RF) classifiers and uses the learned knowledge to devise feature selection strategies and to segment medical images.

Manuscript received August 9, 2013; revised December 4, 2013; accepted February 4, 2014. Date of publication February 6, 2014; date of current version February 24, 2014. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Jean-Baptiste Thibault. The author is with the Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2014.2305073

A. Prior Work on Knowledge Based Segmentation and Graph Cuts

Many works use classifiers (e.g., support vector machines (SVMs) and RFs) for segmentation. The existing literature on machine learning techniques for segmentation is quite diverse; therefore we review relevant literature for RFs only. Random forests [11] are used increasingly in many medical applications like tissue segmentation [12], abnormality detection [13], [14], coronary artery stenoses identification [15], grading diseases [16], segmenting Crohn's disease [17], [18], kidney segmentation [19] and multiple lesion structures [20]. They are computationally efficient for large training data, solve multiclass classification problems, and the learned knowledge can be extracted and interpreted to gain deeper insight into the training procedure. RF classifiers also allow a probabilistic interpretation of classification, and their output can be used as a cost function in a level set or graph cut framework. The advantages of machine learning techniques are thus combined with spatial smoothness constraints for improved segmentation accuracy. The feature set is not limited to low-level features, and multiple features can be used to improve the accuracy of the classifiers. Graph cuts have the advantage of being fast, giving globally optimal results, and being insensitive to initialization, while active contours are sensitive to initialization and can get trapped in local minima. Prior shape models learned from training data have also been used in graph cut approaches for medical image segmentation. Prior shape is incorporated in the form of zero level set shape templates [7], elliptical shape priors [6], registration information [21], orientation histograms [22] and flux maximization [23] to segment multiple objects in medical and natural images.

B. Our Contribution

Most methods for RF based segmentation use only the probability maps to segment the desired organ. Further, while imposing smoothness constraints they use only intensity information. However, RFs can provide much more knowledge from the training procedure. RFs allow us to examine the training procedure and quantify the importance of different features to the classification task. This knowledge is useful in many ways, such as designing an appropriate smoothness cost or improving feature selection, both of which can lead to higher segmentation accuracy [18]. With respect to [18], our work makes the following specific contributions: 1) we examine feature selection strategies in greater detail and discuss different factors for choosing a feature set that leads to the best classification performance.

1057-7149 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

MAHAPATRA: ANALYZING TRAINING INFORMATION FROM RANDOM FORESTS


Algorithm 1 Segmentation of Desired Organ

Fig. 1. (a) Partitioning of patch for calculating anisotropy features; (b) template for calculating context features.

We specifically discuss scenarios where the number of samples is small compared to the feature length; 2) our method is shown to successfully segment various categories of medical images. Additionally, we also use the importance measures to design a smoothness cost that improves segmentation accuracy.

The rest of the paper is structured as follows. Sections II–VI describe different parts of our method. We describe our dataset in Section VII, discuss our results in Section VIII, and conclude with Section IX.

II. OVERVIEW OF OUR METHOD

In this work we focus on exploiting the advantages of the RF training process for feature importance quantification and subsequent image segmentation. We first briefly describe the different features used in the segmentation algorithm. Conventional approaches for graph cut segmentation have used only image intensity. However, by using extra features like texture and curvature information we obtain more discriminative information. Weighing these features according to their importance in classification leads to improved image segmentation. Calculating class probabilities for all voxels is computationally intensive. Therefore we devise an automated method for determining a volume of interest (VOI) for 3D volumes or a region of interest (ROI) for 2D images. First, the given test volume is oversegmented into supervoxels, and each supervoxel is classified using RFs to detect the presence of structures of the organ of interest (OOI) we want to segment. Supervoxels containing parts of the OOI constitute the VOI. Probability maps are generated for all VOI voxels using a second set of RF classifiers and subsequently used for graph cut segmentation. Note that for 2D images we use superpixels instead of supervoxels. An overview of our method is given in Algorithm 1.

III. IMAGE FEATURES

Intensity statistics, texture and curvature entropy, and spatial context features are used for voxel classification. For VOI extraction we extract these features from each image patch, while for voxel-wise classification a surrounding 31 × 31 patch is used. Since this paper's focus is the analysis of RF training information, we give a brief description of each feature and refer the reader to [18] for details.

A. Intensity Statistics

By visual examination a trained clinician can identify CD affected regions from T1 MR images due to their high resolution and high signal-to-noise ratio (SNR). Therefore, in addition to the features processed by the human visual system (HVS), i.e., mean and variance, we extract skewness and kurtosis values for each patch.

B. Texture Entropy

Texture maps are obtained from 2D (instead of 3D) Gabor filter banks for each slice (at orientations 0°, 45°, 90°, 135° and scales 0.5, 1). Each map is partitioned into 9 equal parts corresponding to 9 sectors of a circle. For each sector we calculate the texture entropy, which gives an estimate of the degree of disease severity within the patch [18]. The texture anisotropy for sector r is

\chi^r_{ani} = -\sum_{tex} p^r_{tex} \log p^r_{tex},   (1)

where p^r_{tex} denotes the probability distribution of texture values in sector r. This procedure is repeated for all 8 texture maps over 4 orientations and 2 scales to extract an (8 × 9 =) 72 dimensional feature vector.

C. Curvature Entropy

With disease progression, the shape (and hence curvature) of tissues changes. We exploit this irregularity for distinguishing between diseased and normal tissues. Curvature maps are obtained from the gradient maps of the tangent along the 3D surface. The second fundamental form (F2) of these curvature maps is identical to the Weingarten mapping, and the trace of the F2 matrix gives the mean curvature. This mean curvature map is used for calculating anisotropy. Details on curvature calculation are given in [18] and [25]. Similar to texture, curvature entropy is calculated from the 9 sectors of a patch and is given by

Curv^r_{ani} = -\sum_{\theta} p^r_{\theta} \log p^r_{\theta},   (2)

where p^r_{\theta} denotes the probability distribution of curvature values in sector r and \theta denotes the curvature values. Intensity, texture and curvature features combined give an 85 dimensional feature vector. Fig. 1(a) shows the template for partitioning a patch into sectors and extracting entropy features.
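As a concrete illustration of Eqs. (1) and (2), the entropy of a single sector can be estimated from a histogram of the texture (or curvature) values falling inside it. The sketch below is ours, not the author's code; the histogram bin count is an assumption.

```python
import numpy as np

def sector_entropy(values, bins=16):
    """Entropy of the value distribution in one sector (cf. Eqs. 1-2).

    `values` holds the texture (or curvature) responses of the voxels
    inside the sector; `bins` is an assumed discretization level.
    """
    hist, _ = np.histogram(values, bins=bins)
    p = hist / hist.sum()          # probability distribution p^r
    p = p[p > 0]                   # drop empty bins to avoid log(0)
    return float(-np.sum(p * np.log(p)))

# A sector with widely spread values has higher entropy than a
# homogeneous (constant-valued) one.
rng = np.random.default_rng(0)
uniform_h = sector_entropy(rng.uniform(size=1000))
constant_h = sector_entropy(np.full(1000, 0.5))
```

A homogeneous sector yields zero entropy, matching the intuition that entropy here measures irregularity within the patch.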


D. Spatial Context Features

Context information is particularly important for medical images because of the regular arrangement of human organs. Context features have been used to segment brain structures in MRI [26], the prostate from CT images [27], and cardiac structures from MRI [28]–[30], as well as for localizing anatomical structures [31] and registration [32], [33]. Fig. 1(b) shows the template for context information, where the circle center is the voxel in question and the sampled points are identified by a red 'X'. At each point corresponding to an 'X' we extract a 3 × 3 region and calculate the mean intensity, texture and curvature values. The texture values were derived from the texture maps at 90° orientation and scale 1. The 'X's are located at distances of 3, 8, 15, 22 pixels from the center, and the angle between consecutive rays is 45°. The values from the 32 regions are concatenated into a 96 dimensional feature vector, and the final feature vector has (96 + 85 =) 181 values.

IV. RANDOM FOREST CLASSIFIERS

An RF is an ensemble of decision trees, where each tree is typically trained with a different subset of the training set ("bagging"), thereby improving the generalization ability of the classifier. Samples are processed along a path from the root to a leaf in each tree by performing a binary test at each internal node along this path. The binary test compares a certain feature with a threshold. Training a forest amounts to identifying the set of tests at each node that best separate the data into the different training classes. At each node a random feature subset is examined to select the best test. This decreases the correlation between outputs of different trees and improves the performance of the forest as a whole. There also exist different strategies to split the training samples at a node, such as different cost functions or semi-supervised methods [34].
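The bagging, vote-based class probabilities, and per-feature importance measures discussed in this section map naturally onto an off-the-shelf RF implementation. The sketch below is illustrative only: the synthetic data, the grouping of columns into intensity/texture/curvature types, and all parameter values are our assumptions, not the paper's.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for voxel features: columns 0-1 "intensity",
# 2-3 "texture", 4-5 "curvature" (grouping is illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = (X[:, 4] + X[:, 5] > 0).astype(int)   # label driven by "curvature"

# Bagged trees with a random feature subset examined at each split.
rf = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                            random_state=0).fit(X, y)

# Class probability = fraction of tree votes for that class.
proba = rf.predict_proba(X[:5])           # shape (5, 2), rows sum to 1

# Aggregate per-dimension importances by feature type and normalize,
# mirroring the construction used later for Table I.
groups = {"intensity": [0, 1], "texture": [2, 3], "curvature": [4, 5]}
imp = rf.feature_importances_
weights = {k: imp[v].sum() for k, v in groups.items()}
total = sum(weights.values())
weights = {k: w / total for k, w in weights.items()}
```

Since the synthetic label depends only on the "curvature" columns, those dimensions receive the largest normalized weight, illustrating how the importance measures reflect a feature type's contribution to classification.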
Each training sample is sent to the corresponding child depending on the result of the test, and the process is repeated recursively until the number of samples in a node falls below a threshold, a predefined maximum tree depth is reached, or all the samples belong to the same class. In that case the node becomes a leaf, and the most frequent class of the training data at the node is stored for testing. During testing, a new sample is processed by applying the respective tests along its path from the root node to a leaf. When a leaf node is reached, the tree casts a vote corresponding to the class assigned to this node in the training stage. The final decision for a test sample is obtained by selecting the class with the majority of votes. Moreover, the class probability of a test sample is estimated as the fraction of votes for that class cast by all trees.

A. Variable Importance Measures for Segmentation

RFs have the advantage of giving us access to the knowledge learned during the training phase, which can be interpreted by assessing the importance of different variables (or features) to the classification task [35]. Identifying relevant predictor variables is of interest in many medical applications [36], [37]. Although our feature vector has many dimensions,

TABLE I
EXAMPLE ILLUSTRATION FOR CALCULATING NORMALIZED IMPORTANCE MEASURES USING SEMANTIC INFORMATION. CARDIAC DATASETS WERE USED FOR THIS ILLUSTRATION

the features are of three types: intensity, texture and curvature (context information is a combination of the three). By aggregating the importance values of each feature type and normalizing them we obtain their relative importance in the classification task. Table I shows the importance measures of different features obtained after training on a cardiac right ventricle (RV) dataset. Details about the dataset are given in Section VIII-G.1. For convenience we have grouped the values under two categories, "low-level" and "context" features. Low-level features indicate intensity, texture and curvature features, while context features are a combination of the three values sampled at fixed points around a voxel. Note that there are four columns for texture under "low-level" features corresponding to the four orientations of Gabor filters, while "context" has only one value for texture. The importance measures are not normalized. First we sum up all the values under the respective feature categories, namely intensity, texture and curvature. Therefore the importance measure of intensity (Int) is 343 (Col 1 + Col 7). Similarly, the importance measure of texture (Tex) is 488 (Col 2 + Col 3 + Col 4 + Col 5 + Col 8) and that of curvature (Curv) is 701 (Col 6 + Col 9). The sum of all the importance measures is 1532. Dividing the individual importance measures by the sum gives the final normalized importance measures: Int = 0.22 (w_I), Tex = 0.32 (w_T) and Curv = 0.46 (w_C). Note that the importance values depend upon the training set in question. However, their values stay in the following ranges: w_I = [0.19, 0.24], w_T = [0.31, 0.35], w_C = [0.41, 0.47]. Thus the relative importance of the different features remains the same. The role of variable importance is discussed in greater detail through experiments in Section VIII-C. Although explained using the cardiac dataset, our findings are also applicable to other data.

V. VOLUME OF INTEREST IDENTIFICATION

VOI identification is an important part of our method because: 1) it reduces the total computation time since we need not classify each voxel; and 2) the VOI gives an initial selection of likely OOI voxels and reduces false positives in subsequent analysis.

Intensity inhomogeneity correction was performed using the nonparametric nonuniform intensity normalization (N3) method of [38]. This method performs well without requiring a model of the tissue classes present and can be applied at an early stage in automated data analysis. The method is independent of pulse sequence and insensitive to pathological data that might otherwise violate model assumptions. To eliminate the dependence of the field estimate on anatomy, an iterative approach is employed to estimate both the multiplicative bias field and the distribution of the true tissue intensities.


The intensities were normalized using the method in [39]. First, a "standard" intensity histogram is learned from a subset of the training medical images to determine the minimum and maximum percentile intensities (p_1, p_2) and the second mode of the histogram (\mu). The intensities of a test image are rescaled using the formula

x' = s_1 + \frac{(x - p_1)(s_2 - s_1)}{p_2 - p_1},   (3)

where x' is the new intensity obtained from the original intensity value x, and s_1, s_2 are the minimum and maximum intensities of the test image. This approach leads to good contrast in the different images.

The normalized images are oversegmented into supervoxels using the simple linear iterative clustering (SLIC) supervoxel algorithm [24]. The desired number of supervoxels is specified, which is the same as the number of initial supervoxel centers. These centers are initially equally spaced and then moved to the lowest gradient position. The voxels are clustered based on intensity similarity and spatial proximity to the nearest supervoxel centers. After every iteration the cluster centers are updated based on the voxels assigned to that cluster. The iterative procedure continues until the supervoxel centers no longer change. In the original algorithm, the intervals between supervoxels along each spatial dimension are equal, so that the initial supervoxels are cubes. This choice does not work well in our application since the spatial extent in the z direction is short, has low resolution, and the structure of organs also changes significantly along the z-axis. If a supervoxel spans too many slices, it may contain more than one organ, which complicates further analysis. For better spatial coherence we restrict the number of slices which supervoxels may cover. Instead of applying the same step size along the three spatial axes, the average numbers of supervoxels on the xy-plane and along the z-axis are specified separately so that the step sizes are calculated independently.
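The landmark-based intensity standardization of Eq. (3) can be sketched as a one-line rescaling; the landmark values in the example below are made up for illustration.

```python
import numpy as np

def standardize_intensity(x, p1, p2, s1, s2):
    """Rescale intensities as in Eq. (3): x' = s1 + (x-p1)/(p2-p1)*(s2-s1).

    p1, p2 are the percentile landmarks learned from training images;
    s1, s2 are the minimum and maximum intensities of the test image.
    """
    x = np.asarray(x, dtype=float)
    return s1 + (x - p1) * (s2 - s1) / (p2 - p1)

# Illustrative landmark values (not from the paper): intensities in
# [10, 110] are mapped linearly onto [0, 255].
out = standardize_intensity([10.0, 60.0, 110.0],
                            p1=10.0, p2=110.0, s1=0.0, s2=255.0)
```

Values at the landmarks map exactly to the endpoints of the target range, which is what makes the contrast of different images comparable after normalization.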
Details of the algorithm can be found in [24]. Supervoxel classification for VOI identification needs to be fast and accurate. From the training images we identify supervoxels that contain OOI and background voxels, and extract the relevant features from them. For every supervoxel class we train RF classifiers with 50 trees. Each supervoxel of the test image is labeled by the RF classifier as 'OOI' or 'background', with the 'OOI' supervoxels making up the VOI. For 2D images, superpixel segmentation is achieved using a '2D version' of the algorithm with no computation along the z-axis.

VI. GRAPH CUT SEGMENTATION

Probability maps are generated for all VOI voxels using a second RF classifier that analyzes only the VOI voxels. Approximately equal numbers of samples from OOI and background voxels are taken from the training datasets. Intensity, texture, curvature and context features derived from these samples were used to train the classifier. The features were extracted from a 31 × 31 × 3 neighborhood of each voxel. The training set varies with each round of cross validation. The trained classifier is used to generate probability maps for every


Fig. 2. Probability maps for (a) cardiac RV; (b) background. Higher values indicate greater likelihood of belonging to that region. Red indicates maximum probability while blue indicates zero probability. (c) Final segmentation output; red: manual segmentation, green: our segmentation.

voxel within the identified VOI. Each voxel has two probability values, corresponding to OOI and background. The negative logarithm of the probability maps serves as the penalty cost in a second order MRF cost function. Fig. 2(a) and (b) show the probability maps of an example image of the cardiac RV. The final segmentation is obtained by optimizing a second order Markov random field (MRF) energy function, written as

E(L) = \sum_{s \in P} D(L_s) + \lambda \sum_{(s,t) \in N} V(L_s, L_t),   (4)

where P denotes the set of pixels and N is the set of neighboring pixel pairs. L_s is the label of pixel s, and L is the set of all labels. The cost function is optimized using graph cuts [40]. \lambda is a weight that determines the relative contribution of the penalty cost (D) and the smoothness cost (V). D(L_s) is given by

D(L_s) = -\log(Pr(L_s) + \epsilon),   (5)

where Pr is the likelihood (from the probability maps) previously obtained using the RF classifiers and \epsilon = 0.00001 is a very small value ensuring that the cost is a real number. The higher the probability for a class, the lower the corresponding data penalty for that class.
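As a small illustration of Eq. (5), the data penalty can be computed directly from the RF probability maps; the probability values below are made up for the example.

```python
import numpy as np

EPS = 1e-5  # the paper's epsilon = 0.00001 in Eq. (5)

def data_penalty(prob_maps):
    """Negative-log data cost of Eq. (5) from RF probability maps.

    `prob_maps` has shape (num_labels, num_voxels): one probability
    map per label (OOI and background). A high class probability
    yields a low penalty for assigning that label.
    """
    return -np.log(np.asarray(prob_maps) + EPS)

# Two labels (OOI, background) over three voxels.
probs = np.array([[0.9, 0.5, 0.0],
                  [0.1, 0.5, 1.0]])
D = data_penalty(probs)
```

Note that the epsilon keeps the cost finite even where a class receives zero probability, exactly the role it plays in Eq. (5).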

A. Training Information for Smoothness Cost

V ensures a smooth solution by penalizing spatial discontinuities. Let the weights (importance measures) of the different features (as explained in Section IV-A) be w_I (intensity), w_T (texture) and w_C (curvature), where w_I + w_T + w_C = 1. The smoothness cost V is given by

V(L_s, L_t) = \begin{cases} w_I V_I + w_T V_T + w_C V_C, & L_s \neq L_t \\ 0, & L_s = L_t, \end{cases}   (6)

where V_I, V_T, V_C are the individual contributions to the smoothness cost by intensity, texture and curvature. V_I is defined as

V_I(L_s, L_t) = e^{-\frac{(I_s - I_t)^2}{2\sigma^2}} \cdot \frac{1}{\|s - t\|},   (7)

where I is the intensity. V_T and V_C are similarly defined using texture and curvature.
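A minimal sketch of the weighted smoothness term of Eqs. (6) and (7) for one neighboring voxel pair follows; the feature values, sigma, and inter-voxel distance are illustrative assumptions, and the example weights are the normalized importance measures reported for the cardiac data.

```python
import numpy as np

def smoothness_cost(ls, lt, fs, ft, weights, dist=1.0, sigma=1.0):
    """Smoothness term of Eqs. (6)-(7) for one neighboring pair.

    fs, ft: per-voxel feature dicts with intensity 'I', texture 'T'
    and curvature 'C' values; `weights` are the normalized importance
    measures (w_I, w_T, w_C) learned during RF training; `sigma` and
    `dist` (inter-voxel distance) are illustrative defaults.
    """
    if ls == lt:
        return 0.0                     # no penalty for equal labels
    v = 0.0
    for key in ("I", "T", "C"):
        diff = fs[key] - ft[key]
        v += weights[key] * np.exp(-diff**2 / (2 * sigma**2)) / dist
    return float(v)

w = {"I": 0.22, "T": 0.32, "C": 0.46}  # example weights from Table I
a = {"I": 1.0, "T": 1.0, "C": 1.0}
b = {"I": 5.0, "T": 5.0, "C": 5.0}
similar = smoothness_cost(0, 1, a, dict(a), w)   # identical features
dissimilar = smoothness_cost(0, 1, a, b, w)      # very different
same_label = smoothness_cost(0, 0, a, b, w)      # equal labels
```

Assigning different labels across a pair with similar features is heavily penalized, while a genuine feature discontinuity makes the label change cheap, which is the intended edge-preserving behavior.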


TABLE II
CHANGE IN SEGMENTATION ACCURACY WITH DIFFERENT VALUES OF λ (EQN. 4)

TABLE III
EFFECT OF NUMBER OF TREES IN RF CLASSIFIERS (N_T) ON SEGMENTATION ACCURACY AND TRAINING TIME

VII. DATASETS

We test our method on cardiac, liver and kidney MR images. Our whole pipeline was implemented in MATLAB on a 2.66 GHz quad core CPU running Windows 7; the code was not optimized to take advantage of parallel processing. The random forest code was a MATLAB interface to the code of [41] written in the R programming language. The tree depth was fixed at 20. Details on computation time are given in Section VIII-H.

VIII. RESULTS AND DISCUSSION

A. MRF Regularization Strength λ (Eqn. 4)

To choose the MRF regularization strength λ we take a separate group of 7 cardiac datasets and perform segmentation with cross validation using our method, with λ taking different values from 10 to 0.001. The results are summarized in Table II. The maximum average segmentation accuracy using the Dice Metric (DM) was obtained for λ = 0.02, which was fixed for all subsequent experiments. Note that these 7 datasets were not part of the test dataset used for evaluating our algorithm.

B. Influence of Number of Trees

We also examine the effect of varying the number of trees (N_T) in the second RF classifier (used for generating probability maps) on overall performance. The results for RF_Sem on cardiac RV segmentation are summarized in Table III in terms of DM values. When N_T < 10, DM < 0.71. With increasing N_T, DM increases along with the time taken for training. Table III shows the training time (T_Tr) for different N_T as a multiple of the training time for N_T = 50. For N_T > 50 there is no significant increase in DM (p > 0.1) but the training time increases significantly. The best trade-off between N_T and DM is achieved for 50 trees, which is why we use 50 trees in our RF ensemble.
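The Dice Metric used for evaluation throughout this section can be computed from two binary masks as follows; the masks in the example are toy stand-ins, not data from the paper.

```python
import numpy as np

def dice_metric(seg, gt):
    """Dice overlap DM = 2|A ∩ B| / (|A| + |B|) between a binary
    segmentation and the ground truth; 1.0 is a perfect match."""
    seg, gt = np.asarray(seg, bool), np.asarray(gt, bool)
    denom = seg.sum() + gt.sum()
    if denom == 0:
        return 1.0                     # both masks empty: trivially equal
    return 2.0 * np.logical_and(seg, gt).sum() / denom

gt  = np.array([0, 1, 1, 1, 0], bool)  # toy ground-truth mask
seg = np.array([0, 1, 1, 0, 0], bool)  # toy segmentation result
dm = dice_metric(seg, gt)              # 2*2 / (2 + 3) = 0.8
```

DM rewards overlap relative to the combined size of the two regions, which makes it insensitive to the large background common in medical volumes.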

C. Variable Importance Measures for Feature Selection

Table I shows importance measures for the three feature categories (intensity, texture and curvature). However, within the same category the features have been extracted using different mathematical operations. Here we analyze each dimension of the feature vector to infer its individual importance. This approach is particularly relevant in, e.g., object detection and segmentation, where a large feature set is commonly used. Recall that the low-level features are obtained from one intensity map, four texture maps and a curvature map. Table I shows T_0 and T_90 to have similar importance measures, and the same is true for T_45 and T_135. Interestingly, these importance values are considerably lower than those of intensity and curvature. To check the influence of the different texture maps we conduct the following experiments on cardiac RV datasets:
1) Expt_Tex1: use texture maps T_0, T_90 with intensity and curvature for segmentation;
2) Expt_Tex2: use texture maps T_45, T_135 with intensity and curvature;
3) Expt_Tex3: use texture maps T_0, T_45 with intensity and curvature;
4) Expt_Tex4: use texture maps T_0, T_45, T_90 with intensity and curvature.
The average DM values for the four experiments on cardiac datasets are 90.6, 90.2, 90.9 and 91.6. Note that the DM value using the entire feature set is 93.2 (Table VI). These results lead to the following conclusions: 1) T_0 and T_90 provide similar information, as do T_45 and T_135; 2) T_0 and T_45 provide complementary information which improves segmentation accuracy, although it is still short of the accuracy obtained by using all the texture maps; 3) each texture map has a relatively small role in the segmentation performance, but their combined influence is significant enough not to be neglected.

Similar experiments for the context information show that T_90 alone (or T_0) gives good performance, and adding context information from other texture maps does not lead to any gain in segmentation accuracy. It is interesting to note that for context information alone, curvature features have a significant advantage over intensity and texture. This advantage is higher than that of the low-level curvature features (i.e., curvature anisotropy) over the low-level intensity and texture features. Overall, context information influences the segmentation more than the low-level features. This is also reflected in Table I, where the normalized importance measure of context information is 0.53 compared to 0.47 for low-level information. Similarly, for intensity we find the skewness and kurtosis values combined to be more influential (61% of intensity's contribution) than the mean and variance (the remaining 39%).

The above experiments give a systematic way of determining which features are important for the overall classification and segmentation task. This is particularly useful when a large number of features has been extracted per voxel and we are interested in identifying the most important features for the final classification task. Even if the segmentation framework does not include classification, RFs may be used for feature selection, which in turn reduces the need for complex and redundant feature extraction methods.

D. Training of RFs

In this section we discuss how the variable importance measures can be used to devise an effective strategy for training RFs, particularly when the number of samples is low.


TABLE IV
QUANTITATIVE MEASURES FOR SUPERVOXEL CLASSIFICATION UNDER DIFFERENT FEATURE COMBINATIONS. Sen: SENSITIVITY; Spe: SPECIFICITY; Int: INTENSITY; Tex: TEXTURE; Curv: CURVATURE

Fig. 3. Plot showing the relationship of classification accuracy with (a) number of training samples and (b) optimal feature length. The numbers on the y-axis of the two graphs are for the same number of training samples (x-axis).

If the number of samples is small compared to the number of features, the RF does not generalize well to new samples. Hence, if we can reduce the feature set through feature selection, the trained classifier may generalize better to unseen test samples. Recall that our final feature vector has 181 values (85 from low-level features and 96 from context features). From each class we choose a set of 200 samples to train an RF and test on 100 samples. The average classification accuracy so obtained is 86.4%. Reducing the number of training samples results in lower classification accuracy. However, if we reduce the number of features then classification accuracy improves, although it remains lower than 86.4%. Fig. 3(a) plots the highest classification accuracy obtained for different numbers of training samples (x-axis). The corresponding length of the optimal feature vector that gives the highest classification accuracy is shown in Fig. 3(b) (right y-axis). The plot indicates that feature selection may also be applied to trim the feature vectors and obtain higher classification accuracy when the number of samples is small compared to the number of features.

E. VOI Identification

Since VOI identification is a classification problem, it is difficult to get 100% classification/detection accuracy. To overcome this shortcoming, we adopt the following strategy. After the initial classification of OOI supervoxels, we choose the largest cluster and change the labels of all neighboring supervoxels to 'OOI' irrespective of their initial labels. Although this approach increases the VOI area, it ensures that we do not miss any OOI regions overlooked by the initial classification. Fig. 4 shows an example cardiac image where this strategy is particularly effective. The first column shows the image with the manually drawn boundaries in red and supervoxels in green.
The second column shows the detected ROI in each slice (shown by the yellow supervoxels) overlaid on the previous image. Parts of the OOI are missed by supervoxel classification. However when we include all the neighboring supervoxels (third column), the ROI encompasses all parts of the OOI. Table IV summarizes the average classification performance over all medical datasets for different feature combinations.

Fig. 4. VOI detection. (a) supervoxels (green) and manually annotated OOI (red); (b) identified OOI supervoxels (in yellow); and (c) the final VOI after labels change of neighboring supervoxels from (b).

All Feat indicates the combination of intensity, texture and curvature features. All Feat_change denotes the measures after the label change of neighboring supervoxels. The accuracy for the individual features is lower than for their combinations. The combination of texture and curvature features produces results closest to All Feat. A t-test on the values for Tex + Curv and All Feat gives p < 0.021, which indicates a statistically significant improvement after including intensity. We also conduct t-tests for Tex versus Tex-Int, and Curv versus Curv-Int. In all cases we find that p < 0.028, clearly showing that the inclusion of intensity features improves classification accuracy without extra computational cost. These values also support our observations in Table I and our conclusion in Section VIII-D about the importance of curvature.

F. Effect of Supervoxel Size

Small supervoxels are homogeneous, and their extracted features are more representative of a single class than those of large supervoxels. However, they may not always provide sufficient voxels to estimate stable features. Large supervoxels mostly contain enough voxels to calculate stable features but may contain voxels from more than one class, which generates label ambiguities. Consequently, the extracted features may not be representative of one class. Thus training with features from very large supervoxels leads to low classification accuracy. Table V provides Sen and Spe measures for different supervoxel sizes. These values are averaged over datasets of different organ types. Our experiments demonstrate that an empirically optimal tradeoff between accuracy and homogeneous samples is achieved when the number of voxels in a supervoxel is in the range 1800–2200. The corresponding number for superpixels is 600–900. Note that the performance is independent of the



TABLE V
QUANTITATIVE MEASURES FOR DIFFERENT SUPERVOXEL SIZES. N: NUMBER OF VOXELS IN A SUPERVOXEL

TABLE VI
QUANTITATIVE MEASURES FOR RV (BLOOD POOL AND MYOCARDIUM) SEGMENTATION ACCURACY. DM: DICE METRIC IN % AND HD: HAUSDORFF DISTANCE IN mm

original volume dimensions, and depends only on the number of voxels (or pixels) within each supervoxel (or superpixel) required to obtain reliable features. Depending upon the volume dimensions, we set the supervoxel parameters such that the resulting supervoxels contain an appropriate number of voxels.

G. Segmentation of Desired OOI

Comparative results are presented for the following methods:
1) RF: our proposed method using RF classifiers and graph cut segmentation;
2) RFnC: RF without context information from images; w_I, w_T, w_C are weighted according to their importance measures learned from the training step;
3) RFnV: RF without importance measures in V but using context information; w_I = w_T = w_C = 0.33;
4) RFnVI: RF with w_I = 0;
5) RFnVT: RF with w_T = 0;
6) RFnVC: RF with w_C = 0.
For 4), 5), and 6) above, the weights are normalized by the sum of the remaining two values. Quantitative evaluation of segmentation performance is given in terms of the Dice metric (DM) and Hausdorff distance (HD).

1) Cardiac RV Segmentation: Cardiac MR examinations were performed at 1.5T (Symphony Tim, Siemens Medical Systems, Erlangen, Germany) using an eight-element phased-array cardiac coil and repeated breath-holds of 10–15 s. A total of 8–12 contiguous cine short axis slices were acquired. Sequence parameters were as follows: TR = 50 ms; TE = 1.7 ms; flip angle = 55°; slice thickness = 7 mm; matrix size = 256 × 216; field of view = 360–420 mm; 20 images per cardiac cycle; spatial resolution of 0.75 mm/pixel. There were 32 datasets, and we use a leave-one-out strategy to evaluate our method. Table VI summarizes the performance of the previously mentioned methods and MAH, the shape prior based LV segmentation method in [22] applied to our RV datasets. Fig. 5 shows segmentation results for Patients 18, 11, and 4 as obtained by RF, RFnC, RFnVC and RFnVT. To ensure clarity we show only the results for the endocardium (or the blood pool).
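The importance-derived weights w_I, w_T, w_C used in the smoothness cost can be obtained from a trained RF as sketched below. The feature grouping and the synthetic data are illustrative assumptions, and sklearn's Gini-based importances stand in for whatever importance measure the training step actually uses:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for per-supervoxel feature vectors. The grouping of
# columns into intensity (0-1), texture (2-4) and curvature (5-6) features
# is an illustrative assumption.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))
y = (X[:, 2] + X[:, 5] > 0).astype(int)   # texture and curvature drive the label

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
imp = rf.feature_importances_              # Gini importances, sum to 1

# Sum the importances within each feature group and renormalize, so that
# w_I + w_T + w_C = 1 as in the smoothness cost.
groups = {"w_I": slice(0, 2), "w_T": slice(2, 5), "w_C": slice(5, 7)}
w = {name: imp[s].sum() for name, s in groups.items()}
total = sum(w.values())
w = {name: v / total for name, v in w.items()}
```

Setting one group's weight to zero and renormalizing the remaining two reproduces the RFnVI, RFnVT and RFnVC variants above.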
2) Liver Segmentation: The datasets comprised 15 patient MR volumes, each consisting of 28 slices.

Fig. 5. RV segmentation results: (a) RF; (b) RFnC; (c) RFnVC; and (d) RFnVT. The manual segmentations are in red while the algorithm segmentations are in green. Areas of inaccurate segmentation are highlighted by yellow arrows. The three rows show results for Patients 18, 11, and 4.

TABLE VII
QUANTITATIVE MEASURES FOR LIVER SEGMENTATION ACCURACY. DM: DICE METRIC IN % AND HD: HAUSDORFF DISTANCE IN mm

The slice thickness was 4 mm, the pixel spacing was 1.6 mm, and the dimensions were 110 × 80 pixels. A leave-one-out approach was adopted for segmentation. Table VII shows segmentation results on the liver datasets, and visual results are shown in Fig. 6.

3) Kidney Segmentation: We test our algorithm on 12 kidney MR datasets of dimensions 120 × 110 × 40 pixels. The images were acquired on a GE 1.5T Signa scanner, and the pixel resolution was 1.3 × 1.3 × 2.1 mm. Table VIII shows segmentation results on the kidney datasets, and visual results are shown in Fig. 7.

In all the above cases RF performs best, followed by RFnV and RFnVI. The low DM values for RFnC highlight the fact that context plays a very important role in our method. Although all the low level features are used, without context information it is very difficult to discriminate between low-contrast edges, as many voxels get inaccurate probability values, leading to lower segmentation accuracy. This also supports our observations in Section VIII-D about the greater importance of context information compared with low level features.
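The evaluation measures used throughout (Dice metric in %, symmetric Hausdorff distance, and the paired t-tests behind the reported p-values) can be sketched as follows; the masks and Dice scores are illustrative toy data, not the paper's results:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from scipy.stats import ttest_rel

def dice(a, b):
    """Dice metric between two boolean masks, in %."""
    a, b = a.astype(bool), b.astype(bool)
    return 200.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two masks, in pixels."""
    pa, pb = np.argwhere(a), np.argwhere(b)
    return max(directed_hausdorff(pa, pb)[0], directed_hausdorff(pb, pa)[0])

# Toy masks: a 4x4 square and the same square shifted down by one row.
a = np.zeros((10, 10), bool); a[2:6, 2:6] = True
b = np.zeros((10, 10), bool); b[3:7, 2:6] = True
dm, hd = dice(a, b), hausdorff(a, b)

# Paired t-test over per-dataset Dice scores (illustrative numbers), as used
# for the significance claims between methods and feature combinations.
dm_rf = np.array([90.1, 88.5, 91.2, 87.9])
dm_nc = np.array([88.0, 87.1, 90.0, 86.5])
p = ttest_rel(dm_rf, dm_nc).pvalue
```

A paired test is appropriate here because the compared methods are evaluated on the same leave-one-out datasets.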



classification, although intensity also plays a significant role. t-tests between the results yield p < 0.035 in all cases, indicating that none of the features can be discarded without a significant drop in performance, and that their combination achieves the best results.

H. Computational Cost

Fig. 6. Liver segmentation results: (a) RF; (b) RFnC; (c) RFnVC; and (d) RFnVT. The manual segmentations are in red while the algorithm segmentations are in green. Areas of inaccurate segmentation are highlighted by yellow arrows.

TABLE VIII
QUANTITATIVE MEASURES FOR KIDNEY SEGMENTATION ACCURACY. DM: DICE METRIC IN % AND HD: HAUSDORFF DISTANCE IN mm

Our method consists of supervoxel segmentation and classification to get the VOI, probability map generation, and graph cut segmentation. The average computation time for the entire method in the case of cardiac RV segmentation on a 256 × 216 image was 528 seconds. The automatic ROI identification stage, including subsampling, feature extraction, classification, segmentation and upsampling to get the ROI, took 224 seconds on average. Further segmentation of the ROI took 304 seconds on average, inclusive of the time taken for classification and segmentation. For liver and kidney images, VOI identification requires between 216 and 251 seconds. Probability map generation is more time consuming, as features have to be extracted for each voxel. Depending upon the size of the VOI, a further 311–345 seconds are required to get the probability maps. Thus, on average, the entire method from start to finish takes 520–600 seconds.

IX. CONCLUSION

Fig. 7. Kidney Segmentation results: (a) R F; (b) R FnC ; (c) R FnVC ; and (d) R FnVT . The manual segmentations are in red while the algorithm segmentations are in green. Areas of inaccurate segmentation are highlighted by yellow arrows.

The importance measures used in the smoothness cost also matter; p < 0.025 from t-tests between RF and RFnV supports this observation. RFnV weights intensity, texture and curvature equally in the smoothness cost, thereby neglecting the importance information learned from the classifier. This lowers the DM and increases the HD compared to RF, although the magnitude of the change is smaller than for RFnC. These results also highlight the fact that weights based on importance measures are much more effective than constant or ad-hoc weights. This is an important observation, as it supports our findings in Section VIII-C that we can derive an effective feature selection strategy from trained RFs. Comparing RFnVI, RFnVT and RFnVC highlights the relative importance of the different features. Curvature and texture features are deemed most important in the

We have proposed an image segmentation method that exploits knowledge from the training process of Random Forest classifiers. Apart from learning discriminative features, RFs allow us to examine the role of different features and their contribution toward classification accuracy. Hence we are able to quantify the importance of different features to the classification task. These importance measures are used to design a smoothness cost for graph cut segmentation by weighting different features like intensity, texture and curvature according to their importance in classification. This strategy allows us to segment challenging cases where the desired organ in MR images has a similar appearance to its surrounding regions. It results in higher segmentation accuracy than conventional graph cut approaches that use only intensity information in the smoothness cost. Thus we are able to weight different features automatically and avoid the shortcomings of ad-hoc weighting strategies.

Another advantage of quantifying feature importance is the design of an appropriate feature selection strategy, which enables us to decide which features are crucial in a segmentation or classification task. This knowledge is particularly useful when the number of samples is low compared with the number of feature elements; in such a case the trained RF may not generalize well to novel samples. We can choose to discard those feature elements that have minimal influence on classification performance, and in doing so achieve better generalization of the RF in spite of very few samples.

ACKNOWLEDGMENT

The author thanks Dr. Joachim Buhmann, Peter Schüffler, and Dr. Frans Vos for important and helpful discussions on this work.



REFERENCES

[1] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active contour models," Int. J. Comput. Vis., vol. 1, no. 4, pp. 321–331, 1988.
[2] Y. Boykov and G. Funka-Lea, "Graph cuts and efficient N-D image segmentation," Int. J. Comput. Vis., vol. 70, no. 2, pp. 109–131, 2006.
[3] M. Leventon, E. Grimson, and O. Faugeras, "Statistical shape influence in geodesic active contours," in Proc. IEEE CVPR, Jun. 2000, pp. 316–322.
[4] D. Cremers, F. Tischhauser, J. Weickert, and C. Schnorr, "Diffusion snakes: Introducing statistical shape knowledge into the Mumford-Shah functional," Int. J. Comput. Vis., vol. 50, no. 3, pp. 295–313, 2002.
[5] N. Paragios, "A variational approach for the segmentation of the left ventricle in cardiac image analysis," Int. J. Comput. Vis., vol. 50, no. 3, pp. 345–362, 2002.
[6] G. Slabaugh and G. Unal, "Graph cuts segmentation using an elliptical shape prior," in Proc. IEEE ICIP, Sep. 2005, pp. 1222–1225.
[7] D. Freedman and T. Zhang, "Interactive graph cut based segmentation with shape priors," in Proc. IEEE CVPR, Jun. 2005, pp. 755–762.
[8] N. Vu and B. Manjunath, "Shape prior segmentation of multiple objects with graph cuts," in Proc. IEEE CVPR, Jun. 2008, pp. 1–8.
[9] T. Cootes, G. Edwards, and C. Taylor, "Active appearance models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 681–685, Jun. 2001.
[10] T. Cootes, C. Taylor, D. Cooper, and J. Graham, "Active shape models—Their training and application," Comput. Vis. Image Understand., vol. 61, no. 1, pp. 38–59, 1995.
[11] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[12] A. Montillo, J. Shotton, J. Winn, J. Iglesias, D. Metaxas, and A. Criminisi, "Entangled decision forests and their application for semantic segmentation of CT images," in Proc. IPMI, 2011, pp. 184–196.
[13] D. Mahapatra, P. J. Schüffler, J. Tielbeek, J. M. Buhmann, and F. M. Vos, "A supervised learning based approach to detect Crohn's disease in abdominal MR volumes," in Proc. MICCAI Workshop Comput. Clin. Appl. Abdominal Imag., 2012, pp. 97–106.
[14] F. M. Vos, J. Tielbeek, R. Naziroglu, Z. Li, P. J. Schüffler, D. Mahapatra, et al., "Computational modeling for assessment of IBD: To be or not to be?" in Proc. IEEE EMBC, Sep. 2012, pp. 3974–3977.
[15] B. Kelm, S. Mittal, Y. Zheng, A. Tsymbal, D. Bernhardt, F. Vega-Higuera, et al., "Detection, grading and classification of coronary stenoses in computed tomography angiography," in Proc. MICCAI, 2011, pp. 25–32.
[16] P. Schüffler, D. Mahapatra, J. Tielbeek, F. Vos, J. Makanyanga, D. Pendsé, et al., "A model development pipeline for Crohn's disease severity assessment from magnetic resonance images," in Proc. MICCAI-ABD, 2013, pp. 1–10.
[17] D. Mahapatra, P. Schüffler, J. Tielbeek, F. Vos, and J. Buhmann, "Crohn's disease tissue segmentation from abdominal MRI using semantic information and graph cuts," in Proc. IEEE 10th ISBI, Apr. 2013, pp. 358–361.
[18] D. Mahapatra, P. Schüffler, J. Tielbeek, J. Makanyanga, J. Stoker, S. Taylor, et al., "Automatic detection and segmentation of Crohn's disease tissues from abdominal MRI," IEEE Trans. Med. Imag., vol. 32, no. 12, pp. 2332–2348, Dec. 2013.
[19] R. Cuingnet, R. Prevost, D. Lesage, L. Cohen, B. Mory, and R. Ardon, "Automatic detection and segmentation of kidneys in 3D CT images using random forests," in Proc. MICCAI, 2012, pp. 66–74.
[20] E. Geremia, B. Menze, O. Clatz, E. Konukoglu, A. Criminisi, and N. Ayache, "Spatial decision forests for MS lesion segmentation in multi-channel MR images," in Proc. MICCAI, 2010, pp. 111–118.
[21] D. Mahapatra and Y. Sun, "Joint registration and segmentation of dynamic cardiac perfusion images using MRFs," in Proc. MICCAI, 2010, pp. 493–501.
[22] D. Mahapatra and Y. Sun, "Orientation histograms as shape priors for left ventricle segmentation using graph cuts," in Proc. MICCAI, 2011, pp. 420–427.
[23] D. Chittajallu, S. Shah, and I. Kakadiaris, "A shape driven MRF model for the segmentation of organs in medical images," in Proc. IEEE CVPR, Jun. 2010, pp. 3233–3240.
[24] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, Nov. 2012.
[25] (2012, Feb.) [Online]. Available: http://www.cs.ucl.ac.uk/staff/S.Arridge/teaching/ndsp/
[26] Z. Tu and X. Bai, "Auto-context and its application to high-level vision tasks and 3D brain image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 10, pp. 1744–1757, Oct. 2010.
[27] W. Li, S. Liao, Q. Feng, W. Chen, and D. Shen, "Learning image context for segmentation of prostate in CT-guided radiotherapy," in Proc. MICCAI, 2011, pp. 570–578.
[28] D. Mahapatra, "Cardiac LV and RV segmentation using mutual context information," in Proc. MICCAI-MLMI, 2012, pp. 201–209.
[29] D. Mahapatra and Y. Sun, "Integrating segmentation information for improved MRF-based elastic image registration," IEEE Trans. Image Process., vol. 21, no. 1, pp. 170–183, Jan. 2012.
[30] Y. Zheng, A. Barbu, B. Georgescu, M. Scheuering, and D. Comaniciu, "Four chamber heart modeling and automatic segmentation for 3D cardiac CT volumes using marginal space learning and steerable features," IEEE Trans. Med. Imag., vol. 27, no. 11, pp. 1668–1681, Nov. 2008.
[31] A. Criminisi, J. Shotton, and S. Bucciarelli, "Decision forests with long range spatial context for organ localization," in Proc. MICCAI/PMMIA, Sep. 2009, pp. 1–12.
[32] D. Mahapatra and Y. Sun, "Registration of dynamic renal MR images using neurobiological model of saliency," in Proc. 5th IEEE ISBI, May 2008, pp. 1119–1122.
[33] D. Mahapatra and Y. Sun, "Nonrigid registration of dynamic renal MR images using a saliency based MRF model," in Proc. MICCAI, 2008, pp. 771–779.
[34] X. Liu, M. Song, D. Tao, Z. Liu, L. Zhang, C. Chen, et al., "Semi-supervised node splitting for Random forest construction," in Proc. IEEE CVPR, Jun. 2013, pp. 492–499.
[35] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, "Conditional variable importance for Random forests," BMC Bioinformat., vol. 9, no. 1, p. 307, Jul. 2008.
[36] M. van der Laan, "Statistical inference for variable importance," Int. J. Biostat., vol. 2, no. 2, p. 188, 2006.
[37] M. Segal, M. Cummings, and A. Hubbard, "Relating amino acid sequence to phenotype: Analysis of peptide-binding data," Biometrics, vol. 57, no. 2, pp. 632–643, 2001.
[38] J. Sled, A. Zijdenbos, and A. Evans, "A nonparametric method for automatic correction of intensity nonuniformity in MRI data," IEEE Trans. Med. Imag., vol. 17, no. 1, pp. 87–97, Feb. 1998.
[39] L. Nyúl and J. Udupa, "On standardizing the MR image intensity scale," Magn. Reson. Med., vol. 42, no. 6, pp. 1072–1081, 1999.
[40] Y. Boykov and O. Veksler, "Fast approximate energy minimization via graph cuts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 11, pp. 1222–1239, Nov. 2001.
[41] A. Liaw and M. Wiener, "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.

Dwarikanath Mahapatra received the B.Tech. degree in electrical engineering from the National Institute of Technology, Rourkela, India, and the Ph.D. degree from the National University of Singapore, Singapore, in 2006 and 2011, respectively. He is currently a Research Fellow with ETH Zurich, Zurich, Switzerland. His current research interests include medical image analysis, computer vision, and machine learning.
