Int J CARS DOI 10.1007/s11548-017-1663-9

REVIEW ARTICLE

Breast cancer cell nuclei classification in histopathology images using deep neural networks

Yangqin Feng1 · Lei Zhang1 · Zhang Yi1

Received: 27 April 2017 / Accepted: 18 August 2017 © CARS 2017

Abstract
Purpose Cell nuclei classification in breast cancer histopathology images plays an important role in effective diagnosis, since breast cancer can often be characterized by its expression in cell nuclei. However, due to the small and variable sizes of cell nuclei and the heavy noise in histopathology images, traditional machine learning methods cannot achieve desirable recognition accuracy. To address this challenge, this paper presents a novel deep neural network which performs representation learning and cell nuclei recognition in an end-to-end manner.
Methods The proposed model hierarchically maps raw medical images into a latent space in which robustness is achieved by employing a stacked denoising autoencoder. A supervised classifier is further developed to improve the discrimination of the model by maximizing inter-subject separability in the latent space. The proposed method is a cascade model which jointly learns a set of nonlinear mappings and a classifier from the given raw medical images. Such an end-to-end learning strategy makes it possible to obtain discriminative features, thus leading to better recognition performance.
Results Extensive experiments with benign and malignant breast cancer datasets were conducted to verify the effectiveness of the proposed method. Better performance was obtained when compared with other feature extraction methods, and a higher recognition rate was achieved when compared with eight other classification methods.
Conclusions We propose an end-to-end DNN model for cell nuclei and non-nuclei classification of histopathology images. It demonstrates that the proposed method achieves promising performance and is well suited to the cell nuclei classification task.

Keywords Cell nuclei classification · Deep neural network · Denoising autoencoder · Medical image processing · Representation learning


Lei Zhang
[email protected]

1 Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu 610065, China


Introduction

Breast cancer (BC) is a serious public health problem for women around the world. Nowadays, a biopsy is the only way to diagnose with confidence whether cancer is really present. Diagnosis from a histopathology image is the gold standard for BC and almost all other types of cancer. Histopathological analysis is a time-consuming task that demands considerable experience from pathologists, so automatic histopathological image analysis can assist in diagnosing BC. Cell nuclei classification plays a significant role in histopathology image analysis since diseases, especially cancers, can often be characterized by their expression in cell nuclei. In the past, pathologists had to obtain cytological descriptions of cells (e.g., cell size, shape, distribution, and cellular spaces) from histopathology images manually, and then classify the cell nuclei based on these human-specified features. Clearly, such expert-involved "feature learning" is exhausting and time-consuming. In contrast to manually designed feature detectors, numerous automatic feature learning and classification methods have been developed. However, many challenges remain: (1) small and variably sized cell nuclei increase classification difficulty under a unitary quality criterion; (2) noise and ambiguity created by imperfections in the staining and imaging



processes decrease discrimination between the nuclei and associated tissue; (3) artifacts and unwanted objects introduced during the slide preparation process may lead to poor image quality in local areas; and (4) published breast cancer datasets are usually small-scale and lack sufficient ground truth.

Many cell nuclei classification algorithms have been proposed over the past few decades in attempts to overcome these challenges. Ojansivu et al. [21] proposed using support vector machines (SVM) for histopathological image classification. Kowal et al. [13] employed a K-nearest neighbor (KNN) classifier [7] to handle features extracted from nuclei. Sertel et al. [30] utilized principal component analysis (PCA) to extract low-dimensional features from medical images and then obtained classification results by ensembling multiresolution inference results. Fatakdawala et al. [6] presented methods based on a probabilistic model and expectation maximization for medical image classification [1,20]. Recently, inspired by the huge successes of deep learning (DL) [11], some DL-based models have been proposed for nuclei classification. For example, [10] and [22] proposed to utilize a convolutional neural network (CNN) for cell nuclei classification. We note that algorithms based on CNNs require large-scale labeled data; however, labeled histopathological images are expensive and time-consuming to generate. If a large amount of unlabeled data could be utilized, the performance of current supervised models could be improved significantly. One feasible solution to this problem is feature learning, which captures the structures of histopathological images from unlabeled data; a discriminative model is then trained on this new feature space for cell nuclei classification. Unsupervised feature learning neural networks, such as stacked sparse autoencoders (SSAEs), have been employed to learn high-level representations of hematoxylin and eosin (H&E) stained images, in order to discover worthwhile information from a large number of unlabeled medical images. Hematoxylin is a blue-staining basic dye that stains genetic material, which is mainly found in cell nuclei, although some cytoplasmic components and extracellular material are also stained. Eosin is a pink-staining acidic dye that stains membranes and fibers, most obviously in the cytoplasm and connective tissues. Routine histology uses the stain combination of hematoxylin and eosin, commonly referred to as H&E. The method in [38,39] used a two-hidden-layer SSAE to learn breast cancer image representations in an unsupervised manner and then trained a classifier to recognize nuclei and non-nuclei images. Notably, that method learns high-level representations of gray-scale images without considering the color information in the H&E images.

In this paper, a novel end-to-end deep neural network (DNN)-based model is proposed for breast cancer nuclei classification. It is a cascade model and learns hierarchical features


from raw data. First, it hierarchically obtains high-level representations of patches sampled from histopathology images by employing a stacked denoising autoencoder (SDAE) [37]. To improve the discrimination of the proposed model, these high-level features and the label information are employed to train a classifier to separate nuclei and non-nuclei image patches. Finally, supervised end-to-end fine-tuning is conducted to optimize the model. The proposed method is suitable and useful for cell nuclei classification tasks in which only a few labeled images and a large number of unlabeled images are available for training the classifier. Moreover, it is a patch-based learning method, which overcomes the challenge of small and variably sized cell nuclei.

The remainder of the paper is organized as follows: section "Preliminaries" introduces related work. Section "The proposed method" presents our cell nuclei classification method based on an SDAE. Experimental results illustrating the effectiveness of the proposed algorithm are shown in section "Experimental results". Section "Conclusions and future works" concludes the paper.

Preliminaries

Notations

Unless specified otherwise, lower-case bold letters represent column vectors and upper-case bold letters represent matrices and collections. M^T denotes the transpose of the matrix M. Table 1 summarizes the notations used throughout the paper.

Image classification

Image classification is one of the most important aspects of digital image analysis. Many image classification algorithms have been proposed for applications in specific image domains over the past decade, such as handwritten numerical digits, remote sensing, and medical image processing [17,23,24,32]. For example, a KNN classifier has been employed for cell phase identification [3]. Furthermore, several classifiers (e.g., Bayesian, KNN, and SVM) have been used for prostate cancer recognition based on fractal features [12]. However, these traditional methods are primarily based only on color, texture, and other feature descriptors, or aggregations of them. Usually, these feature descriptors are quite complicated and domain-specific. To address these problems, automatic feature representation learning using deep neural networks has become an attractive and popular alternative. In deep learning models, the neural network can be treated as a feature extractor that obtains better representations, i.e., feature representations with stronger classification discrimination. For example, LeNet-5 [16] is a convolutional network designed for handwritten and machine-printed character classification. GoogLeNet [34] was the most effective method for classification and detection in the ImageNet Large Scale Visual Recognition Challenge 2014.

Table 1 Notations

Notation        Definition
d               Dimension of input sample
n               Data size
m               Layers of neural network
k               Number of hidden units
d_p             Dimension of patches
λ               Weight decay parameter
β               Sparsity parameter
ρ               Target average activation of the hidden units
ρ̂_j             Average activation of the j-th hidden unit over the n training data points
r_a             Additive noise ratio for white Gaussian noise
r_c             Ratio of random pixel corruption
‖·‖             Result of the ℓ2-norm
x ∈ R^d         Data point
h^(1), h^(2)    Representations of x in the hidden layers
y ∈ R^d         Output for x in the output layer
x̃               Corrupted version of x
W               Weight matrix
P               Collection of labeled data points
X               Collection of unlabeled data points
T               Ground truth collection for X
Y               Output collection from the neural network

Inspired by the success of deep neural networks in other domains, many attempts have been made in the medical imaging domain. Deep learning and its applications to medical image analysis are introduced next.

Deep learning

Deep learning has received increasing interest in a number of research areas, including image classification, object detection, speech recognition, and human action recognition. Existing DL architectures consist of multiple linear and nonlinear transformation layers. Specifically, DL methods aim at learning feature hierarchies in which features at higher levels of the hierarchy are formed by the composition of lower-level features. A more abstract and more useful representation of the raw data can thus be extracted. Many deep learning methods have been applied to medical image processing in recent years, including restricted Boltzmann machines (RBMs) [28] for the classification of functional magnetic resonance imaging (fMRI) images. A stacked autoencoder in [31] was trained to encode the spatial and temporal features in dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) images, and a classifier was trained to classify those representations into different organs. Convolutional neural networks [4] have been trained directly on raw RGB data sampled from source images to differentiate patches with mitotic nuclei close to the center from other patches. More recently, a DL-based model

was developed to classify nuclei patches in breast cancer histopathology images [38]. In contrast, our proposed method employs an SDAE to implicitly learn robust features of breast cancer nuclei, with full exploitation of the color information.

Denoising autoencoders

The denoising autoencoder is a variant of the autoencoder (AE), a symmetrical neural network that learns the features of a dataset in an unsupervised manner. This is done by minimizing the reconstruction error between the input data at the encoding layer and its reconstruction at the decoding layer, such that the correlation between the input features is learned, in an EM-like fashion [5,26], in the mapping weight vectors. The architecture of a traditional autoencoder is shown in Fig. 1. Let X = {x(1), x(2), . . . , x(n)} ∈ R^{d×n} be a collection of n unlabeled training samples. The autoencoder takes an input vector x(i) and encodes it to a hidden representation h(i) ∈ R^k by applying a linear mapping and a nonlinear activation function:

h(i) = f(W^(1) x(i) + b^(1)),   (1)

where W^(1) ∈ R^{k×d} is a weight matrix, b^(1) ∈ R^k is the encoding bias, and f(·) is the nonlinear activation function, typically a sigmoid, tanh, or hardlim function. After obtaining the hidden representation h(i), the autoencoder decodes it using a decoding matrix as:



Fig. 1 Basic architecture of a traditional autoencoder. It contains two parts: the encoder and decoder. The notation x is the input, which is being used to reconstruct itself (y is the reconstruction of x) by minimizing the reconstruction error between x and y, and h is the feature representation of x in the hidden layer. W(1) is the weight matrix representing features



y(i) = f(W^(2) h(i) + b^(2)),   (2)

where b^(2) ∈ R^d is the decoding bias and W^(2) ∈ R^{d×k} is the decoding matrix, which may optionally be constrained by W^(2) = (W^(1))^T. Features of the training data are obtained by optimizing the parameters to minimize the reconstruction error of the following cost function:

Φ(W, x, y) = (1/n) Σ_{i=1}^{n} L(x(i), y(i)) + (λ/2) (‖W^(1)‖^2 + ‖W^(2)‖^2),   (3)

where the first term is the likelihood function, such as the (one-half) squared-error cost L(x(i), y(i)) = (1/2) ‖x(i) − y(i)‖_2^2, and the second term is a regularization term that decreases the magnitude of the weights and prevents overfitting. The weight decay parameter λ controls the relative importance of the two terms. More interesting structure can be discovered by imposing other constraints on the network. In particular, a sparsity constraint can be imposed on the hidden units by adding an extra penalty term to the optimization objective. To this end, the sparse autoencoder (SAE) [25] minimizes the reconstruction error with a sparsity constraint:

J(W, x, y) = Φ(W, x, y) + β Σ_{j=1}^{k} KL(ρ‖ρ̂_j),   (4)

where β is the sparsity parameter, ρ is the target average activation of h, ρ̂_j = (1/n) Σ_{i=1}^{n} h_j(i) is the average activation of the j-th hidden unit over the n training samples, and KL(ρ‖ρ̂_j) is the Kullback–Leibler divergence [14], defined as


Fig. 2 Basic architecture of a denoising autoencoder. The denoising autoencoder maps the corrupted version of x, i.e., x˜ , to its hidden representation h via W(1) , and aims to recover x from h via W(2) . Its reconstruction error is measured by J (W, x, y)

KL(ρ‖ρ̂_j) = ρ log(ρ/ρ̂_j) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_j)).   (5)
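To make Eqs. (3)–(5) concrete, the following is a minimal NumPy sketch of the sparse-autoencoder objective. The function name, the column-per-sample batch layout, and the default hyperparameter values (borrowed from the experimental settings reported later) are our own illustrative choices, not code from the paper.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sae_cost(W1, b1, W2, b2, X, lam=0.003, beta=3.0, rho=0.1):
    """Sparse-autoencoder objective of Eqs. (3)-(5) for a batch X of shape (d, n)."""
    n = X.shape[1]
    H = sigmoid(W1 @ X + b1[:, None])            # hidden activations, Eq. (1)
    Y = sigmoid(W2 @ H + b2[:, None])            # reconstructions, Eq. (2)
    recon = 0.5 * np.sum((X - Y) ** 2) / n       # average squared-error term
    decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))  # weight decay
    rho_hat = H.mean(axis=1)                     # average activation of each hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))  # Eq. (5)
    return recon + decay + beta * kl             # Eq. (4)
```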

The denoising autoencoder [36] is a variant of the basic autoencoder that aims to reconstruct a clean input from a corrupted version of the input sample. The architecture of a denoising autoencoder is shown in Fig. 2. It rests on an additional criterion: robustness can be improved by partial destruction of the inputs, i.e., partially destroyed inputs should yield almost the same representations. When training a denoising autoencoder, the raw input x is corrupted to obtain its corrupted version x̃, and then x̃ is taken as the input of the neural network to compute the activations of the hidden units, h(i) = f(W^(1) x̃(i) + b^(1)). Then, h(i) is fed to the output layer to compute the outputs of the denoising autoencoder, y(i) = f(W^(2) h(i) + b^(2)); y(i) is the reconstruction of x(i) rather than of x̃(i). The weight matrices W^(1) and W^(2) are trained to minimize the average reconstruction error over the training collection X. Here, Eq. (3) or (4) can be taken as the cost function of the denoising autoencoder. A denoising autoencoder thus still minimizes a cost function between the clean x and its reconstruction y, but learns to reconstruct a raw datum from its corrupted version x̃, so that robust features are learned. After training the single-hidden-layer denoising autoencoder shown in Fig. 2, a clean input is mapped to the hidden representation directly, without any corruption process. The procedure of this mapping is shown in Fig. 3.
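The denoising variant differs from the plain autoencoder only in which input the encoder sees; below is a hedged sketch of one forward pass, where the corruption level and helper names are our illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def dae_forward(W1, b1, W2, b2, X, noise_std=0.05):
    """One denoising-autoencoder pass: corrupt x to x~, encode x~, reconstruct,
    and measure the reconstruction error against the CLEAN x, not x~."""
    X_tilde = X + noise_std * rng.standard_normal(X.shape)  # corrupted input x~
    H = sigmoid(W1 @ X_tilde + b1[:, None])                 # hidden representation
    Y = sigmoid(W2 @ H + b2[:, None])                       # reconstruction of x
    err = 0.5 * np.mean(np.sum((X - Y) ** 2, axis=0))       # error vs clean x
    return H, Y, err
```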

The proposed method

In this section, we propose an end-to-end deep neural network for histopathology image nuclei classification.



Fig. 3 Procedure for obtaining the hidden representations of uncorrupted samples with a trained denoising autoencoder

The proposed architecture for classification

Our cell nuclei classification approach for histopathology image analysis contains two major modules, as shown in Fig. 4. In the first stage, it maps the histopathology image patches through the encoding layers of our SDAE. In the second stage, it classifies the nuclei and non-nuclei patches using a softmax (SM) classifier. The SDAE in this architecture learns high-level features of the sampled image patches using only unlabeled data. Next, the high-level feature representations of the labeled image patches are used to train the softmax classifier to recognize nuclei and non-nuclei image patches.

Unsupervised feature learning via SDAE

A stacked denoising autoencoder is a neural network consisting of multiple hidden layers of denoising autoencoders, in which the outputs of each layer are wired to the inputs of the successive layer. Input corruption is used only for the initial denoising training of each individual layer; once a layer has been trained, uncorrupted samples are taken as the inputs of the SDAE, yielding robust feature representations of the inputs. The complete procedure of learning and stacking several hidden layers of denoising autoencoders is shown in Fig. 5.

Given a set of image patches X = {x(1), x(2), . . . , x(n)} sampled from breast cancer histopathology images, where each x(i) (i = 1, 2, . . . , n) is a column vector representing a d_p × d_p square histopathology image patch with three channels (RGB), a two-hidden-layer SDAE (see Fig. 5) is constructed to learn the features of the nuclei and non-nuclei patches. The SDAE has d_p × d_p × 3 input units, k_1 neurons in the first hidden layer, and k_2 neurons in the second hidden layer. Here, X is taken as the training set. First, each initial input x(i) is corrupted into x̃(i) by means of a stochastic mapping. Then, a greedy layer-wise method is employed to pre-train the SDAE. This produces a weight matrix W^(1) = (w_1^(1), . . . , w_{k_1}^(1)), which represents the image features learned by the SDAE in the first hidden layer, and a weight matrix W^(2) = (w_1^(2), . . . , w_{k_2}^(2)), which can be interpreted as the image features learned by the SDAE in the second hidden layer.

Nuclei and non-nuclei classifier

After the unsupervised pre-training, the encoding layers of the trained SDAE are employed to compute nuclei and non-nuclei patch representations. In detail, for input raw data p(i), the output of the SDAE in the second hidden layer, h^(2)(i), is fed to a softmax classifier (SM) to compute the corresponding label l(i) for the input p(i). Specifically, a four-layer neural network is constructed: the input layer, two hidden layers from the SDAE, and the output layer. Note that the first three layers form the encoder of the trained SDAE, and the last layer is a softmax classifier. The representations of the labeled data P = {p(1), p(2), . . . , p(n)} and its ground truth T = {t(1), t(2), . . . , t(n)} are then used to pre-train the last layer. After pre-training, all four layers are combined to form the DNN model, which is capable of classifying nuclei and non-nuclei patches as desired. Then, all connection weight parameters of the DNN model are fine-tuned using the labeled data P to improve performance. Finally, the trained end-to-end DNN model is used to classify new input image patches. In the testing step, for a new image patch q, the corresponding representations h^(1) in the first hidden layer and h^(2) in the second hidden layer are computed. The high-level representation h^(2) is then input to the softmax layer to give a prediction label g.

Implementation details

This subsection presents the details of the nonlinear activation function, the initialization of W and b, and the training strategy.

Activation function

Many nonlinear activation functions are available to determine the node outputs of a deep neural network. In our experiments, the sigmoid function was used as the activation function:

sigmoid(s) = 1 / (1 + e^{−s}).   (6)

Initialization

The initialization of all W and b variables is important for the gradient descent-based methods used in our deep neural networks. Random initialization is popular in deep learning. In the experiments, a simple normalized random initialization method was used [9], where the bias b is initialized to 0 and the weight matrix of each layer is initialized from the following uniform distribution:

W^(l) ∼ U[−√6 / √(d^(l) + d^(l−1)), +√6 / √(d^(l) + d^(l−1))],   (7)

where d^(l) is the dimension of the l-th layer, and d^(0) is the dimension of the input layer.
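A small NumPy sketch of the normalized initialization of Eq. (7); the helper name is ours, and the layer sizes in the usage lines are those reported in the experimental settings.

```python
import numpy as np

def init_layer(d_prev, d_cur, rng=np.random.default_rng(0)):
    """Normalized random initialization of Eq. (7): weights uniform in
    [-sqrt(6)/sqrt(d_cur + d_prev), +sqrt(6)/sqrt(d_cur + d_prev)], zero bias."""
    r = np.sqrt(6.0) / np.sqrt(d_cur + d_prev)
    W = rng.uniform(-r, r, size=(d_cur, d_prev))
    b = np.zeros(d_cur)
    return W, b

# Layer sizes used in the experiments: 15*15*3 inputs, then 400 and 200 hidden units
W1, b1 = init_layer(15 * 15 * 3, 400)
W2, b2 = init_layer(400, 200)
```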



Fig. 4 Proposed method for nuclei classification. For the given breast cancer images, a number of image patches are sampled as the inputs to the SDAE to learn the representations in an unsupervised manner. Then, the SDAE representations are used as input to train a softmax classifier on labeled data. Finally, the model is fine-tuned and applied to classify new input histopathology image patches



Fig. 5 Architecture of a stacked denoising autoencoder. After training the first single hidden layer denoising autoencoder, the hidden representation h(1) is treated as the input to the successive denoising autoencoder. Then, the second level representation h(2) can be learned by reconstructing the h(1) from its corrupted version and minimizing the cost function J (W, h(1) , y). y is the reconstruction for h(1) . Repeating the procedure, more hidden layers can be added into the SDAE model

Training strategy

A greedy layer-wise procedure has been shown to yield significantly better local minima than random initialization of deep networks, achieving better generalization on a number of tasks [15]. This greedy, layer-wise approach is employed to pre-train our SDAE; Algorithm 1 summarizes the detailed procedure. After pre-training, the trained SDAE is employed to obtain representations of nuclei and non-nuclei patches. Then, an additional softmax classifier layer, capable of classifying the nuclei and non-nuclei patches as desired, is added as the last layer. Finally, the parameters of the whole network are fine-tuned using a classical backpropagation (BP) algorithm with labeled data. To increase the convergence rate, either the simple momentum method or more advanced optimization techniques, such as L-BFGS or the conjugate gradient method, can be applied. Algorithm 2 summarizes the detailed procedure of supervised end-to-end fine-tuning of the DNN model.


Algorithm 1 Greedy layer-wise training of the proposed SDAE.
Input: training set X = {x(1), x(2), . . . , x(n)}; hyperparameter λ; sparsity parameter β; iteration number I_t.
Output: weights and biases {W^(l), b^(l)}_{l=1}^{m+1}.
1: Initialize {W^(l), b^(l)}_{l=1}^{m} according to Eq. (7).   // initialization
2: Set h^(0)(i) = x(i), i ∈ [1, n].
3: for each l ∈ [1, m] do   // greedy layer-wise pre-training of the SDAE
4:   Set x^(l)(i) = h^(l−1)(i).
5:   Corrupt x^(l)(i) to obtain x̃^(l)(i).
6:   for each t ∈ [1, I_t] do
7:     Perform forward propagation to calculate y^(l)(i) according to Eqs. (1) and (2), using x̃^(l)(i) as the input.
8:     Solve the optimization problem in Eq. (4) to acquire W^(l), b^(l).
9:   end for
10:  Perform forward propagation to obtain the representation h^(l)(i) for each x^(l)(i).
11: end for
12: return {W^(l), b^(l)}_{l=1}^{m+1}.
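A compact sketch of Algorithm 1's control flow, assuming a `train_dae` routine (not given in the paper) that trains a single denoising autoencoder on its input and returns the learned weights together with an encoding function.

```python
def pretrain_sdae(X, layer_sizes, train_dae):
    """Greedy layer-wise pre-training (Algorithm 1, sketched). Each layer is a
    denoising autoencoder trained on the previous layer's UNCORRUPTED output;
    corruption happens only inside train_dae."""
    params, H = [], X
    for k in layer_sizes:                  # e.g. [400, 200] as in the experiments
        W, b, encode = train_dae(H, k)     # corrupts H internally, minimizes Eq. (4)
        params.append((W, b))
        H = encode(H)                      # clean inputs mapped forward (cf. Fig. 3)
    return params, H
```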

Experimental results

In this section, we evaluate the effectiveness of our proposed method for cell nuclei classification on a breast cancer cell database, considering results in terms of accuracy, computational cost, and robustness.

Fig. 6 Example of stained histopathology images and ground truth. a 896 × 768 stained histopathology image. b 200 × 200 ground truth window

Algorithm 2 Supervised pre-training of the softmax classifier and fine-tuning of the DNN model.
Input: training set P = {p(1), p(2), . . . , p(n)}; training labels T = {t(1), t(2), . . . , t(n)}; hyperparameter λ; sparsity parameter β; iteration number I_t.
Output: weights and biases {W^(l), b^(l)}_{l=1}^{s+1}.
1: Initialize {W^(l), b^(l)}_{l=1}^{s} with the weights gained from the trained SDAE.   // initialization
2: Take h^(m)(i) as raw input to pre-train the softmax classifier and obtain W^(s+1), b^(s+1).
3: Build a deep neural network from the hidden layers of the trained SDAE and a final trained softmax classifier layer.
4: for each t ∈ [1, I_t] do   // fine-tune all parameters
5:   Set h^(0)(i) = p(i), i ∈ [1, n].
6:   for each l ∈ [1, s + 1] do
7:     Perform forward propagation.
8:   end for
9:   for each l ∈ [1, s + 1] do
10:    Fine-tune W^(l), b^(l) by minimizing the cost function of the softmax classifier.
11:  end for
12: end for
13: return {W^(l), b^(l)}_{l=1}^{s+1}.
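Once the softmax layer has been attached and the whole network fine-tuned, classifying a new patch is a single forward pass; a sketch follows, where the nuclei/non-nuclei label encoding is our assumption.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=0, keepdims=True)      # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=0, keepdims=True)

def predict(encoder_params, Ws, bs, Q):
    """Classify patches Q (one column per patch) with the fine-tuned DNN:
    the SDAE encoder layers followed by the softmax output layer."""
    H = Q
    for W, b in encoder_params:               # the two encoding layers of the SDAE
        H = 1.0 / (1.0 + np.exp(-(W @ H + b[:, None])))
    P = softmax(Ws @ H + bs[:, None])         # class probabilities
    return P.argmax(axis=0)                   # assumed: 1 = nuclei, 0 = non-nuclei
```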

Database

The experiments were conducted using the Breast Cancer Cell (BCC) database of the University of California, Santa Barbara Bio-Segmentation Benchmark [8].¹ This database contains 58 H&E-stained histopathology images of size 896 × 768: 26 malignant images and 32 benign images. For each image, a pixel-level markup is specified within a ground truth window of about 200 × 200. Figure 6 shows a stained image and the ground truth window. Figure 7 shows two examples of benign and malignant images and their associated ground truth. These images are

¹ The BCC database can be downloaded from http://bioimage.ucsb.edu/research/bio-segmentation.

H&E stained, as most sample cells are essentially transparent, with little or no intrinsic pigment; unstained tissue sections have no color pigments. Special stains, as described previously, which bind selectively to particular components, are used to identify biological structures. Figure 8 clearly illustrates the size and shape of the nuclei.

Experimental settings

The BCC database is divided into two subsets: a benign image database and a malignant image database. For each subset, two classes of square patches (15 × 15) were sampled from the breast cancer cell histopathology images: nuclei and non-nuclei patches. Because the sizes of the nuclei in the images differ, each 15 × 15 nuclei patch either contains one small nucleus or lies inside a large nucleus. The non-nuclei patches neither contain any complete cell nucleus nor lie inside a large nucleus; they contain cytoplasm, connective tissue, the gaps between nuclei, or other non-nuclei areas and distractors. Specifically, 33,000 patches were sampled from the benign subset (8700 nuclei patches and 24,300 non-nuclei patches), and 25,000 patches were sampled from the malignant subset (12,200 nuclei patches and 12,800 non-nuclei patches). The training and test sets were created separately according to image, so that images used to build the training set were not used for the test set. Sample details are shown in Table 2.

The L-BFGS method is utilized to update the connection parameters of the DNN model. First, an SDAE with two hidden layers is trained to extract features, and then a softmax classifier is connected to classify nuclei and non-nuclei patches. There are 15 × 15 × 3 input units and 400 neurons in the first hidden layer; the number of units is then halved, giving 200 neurons in the second hidden layer. To train the SDAE, Gaussian noise with a variance of 0.05 is added to obtain the corrupted input. The regularization parameter λ, the sparsity parameter β, and the target average activation ρ are empirically set to 0.003, 3, and 0.1, respectively.
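For illustration, a patch sampled as described above can be flattened into the 675-dimensional column vector the SDAE expects; the helper below is a sketch, and the scaling of pixel values to [0, 1] is our assumption.

```python
import numpy as np

def extract_patch(image, cy, cx, size=15):
    """Cut a size x size RGB patch centered at (cy, cx) from an H&E image array
    of shape (H, W, 3) and flatten it to a column vector of length size*size*3."""
    h = size // 2
    patch = image[cy - h:cy + h + 1, cx - h:cx + h + 1, :]   # (15, 15, 3)
    return patch.reshape(-1, 1).astype(np.float64) / 255.0   # assumed [0, 1] scaling
```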



Fig. 7 Examples of benign and malignant images and their associated ground truth. a Benign histopathology image. b Benign ground truth label. c Malignant histopathology image. d Malignant ground truth label

Fig. 8 Eight histopathology images. Top row: samples of benign images; bottom row: samples of malignant images. The red circles are distractors (artifacts or dust introduced during the slide preparation process that degrade image quality), the red boxes mark cell nuclei, and the green boxes mark non-nuclei

Table 2 Details of the breast cancer cell database samples used in the experiment. For simplicity, n_i denotes the total number of images, and s_i stands for the number of samples in each database

Dataset     n_i   Original size   s_i      Patch size
Benign      32    896 × 768       33,000   15 × 15
Malignant   26    896 × 768       25,000   15 × 15

To obtain convincing results, 10-fold cross-validation was performed. In each trial, the benign training set contains about 90% of the patches, all from 29 benign images, with the patches from the other 3 benign images used for testing; the malignant training set contains about 90% of the patches, all from 23 malignant images, with the patches from the other 3 malignant images used for testing. All experiments were run on a computer with a 2.10 GHz Intel quad-core processor and 8 GB of RAM under the Windows 10 operating system.

Performance metrics

Classification performance was evaluated in terms of Precision (Prec), F1 Score (F1), and Accuracy (Acc), which are defined respectively as:

Prec = TP / (TP + FP),   (8)

F1 = 2 × (Prec × RC) / (Prec + RC),   RC = TP / (TP + FN),   (9)

Acc = (TP + TN) / (TP + TN + FP + FN),   (10)

where TP is the number of nuclei patches correctly classified according to the ground truth, FP is the number of non-nuclei patches wrongly classified as nuclei, TN is the number of non-nuclei patches correctly classified, and FN is the number of nuclei patches wrongly classified as non-nuclei.
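These metrics are straightforward to compute from the patch-level predictions; a short sketch follows, treating nuclei as the positive class.

```python
import numpy as np

def metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy of Eqs. (8)-(10); nuclei patches
    are the positive class (label 1), non-nuclei the negative class (label 0)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    prec = tp / (tp + fp)
    rc = tp / (tp + fn)
    f1 = 2 * prec * rc / (prec + rc)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return prec, rc, f1, acc
```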

Results and analysis

Experiment 1 Feature extraction performance was investigated by comparing our proposed method with three other feature extraction methods: principal component analysis (PCA) [33], kernel principal component analysis (KPCA) [27], and dense scale-invariant feature transform (DSIFT) [18].

Table 3 Comparison of the mean accuracy and standard error (%) using the original data and features extracted by PCA, KPCA, DSIFT, and the DNN on the benign and malignant subsets

Method          Benign         Malignant
Raw data + SM   97.05 ± 0.19   82.58 ± 0.77
PCA + SM        94.97 ± 0.35   80.01 ± 0.83
KPCA + SM       96.02 ± 0.17   84.03 ± 0.76
DSIFT + SM      85.23 ± 0.67   81.62 ± 0.87
SDAE1 + SM      95.49 ± 0.34   86.73 ± 0.53
SDAE2 + SM      98.27 ± 0.15   90.54 ± 0.45
SDAE3 + SM      98.28 ± 0.12   90.49 ± 0.52

Once low-dimensional features were extracted, the softmax classifier was used to verify the performance of each tested method. For the PCA method, the dimensionality of the data was reduced by preserving 98% of the energy. This experiment aimed to examine the feature extraction performance of our SDAE against other state-of-the-art feature extraction methods. As a baseline, the raw patches were classified with a softmax classifier (Raw data + SM). In addition, to explore how the feature hierarchy affects performance, we trained the SDAE with different numbers of hidden layers (1, 2, and 3) separately. Mean classification accuracies are reported in Table 3. To summarize:

– The KPCA method outperforms the PCA method for feature extraction on the BCC database, which indicates that the patches lie in nonlinear subspaces;
– Our proposed method competes well against the other four feature extraction algorithms examined, showing that it is much more robust

than the other tested methods on this challenging cell nuclei classification task;
– Our proposed method obtains good performance with both two and three hidden layers, indicating that it would be practical for real-world applications;
– All methods achieved better performance on the benign subset than on the malignant subset. Possible causes include the fact that the features implicit in malignant images are more complex than those in benign images, and that the number of benign samples was larger than that of malignant samples;
– Our proposed method with three hidden layers does not gain better performance on the malignant subset and brings only a slight improvement on the benign subset. Therefore, the two-hidden-layer model is more suitable for the task.

The feature detectors corresponding to the first hidden layer of a network trained on the benign and malignant databases are straightforward to visualize: Fig. 9 shows a set of 400 image features learned from benign (a) and malignant (b) samples. The features learned from both subsets capture visual patterns related to colors, the edges of large nuclei in different orientations, and perhaps some small dots related to common nuclei patterns.

Experiment 2 The performance of our proposed method was compared with that of eight other methods: the K-nearest neighbor classifier [7], Support Vector Machine [35], Kernel Support Vector Machine (KSVM) [29], Random Forest Classifier (RFC) [2], Discriminant Analysis Classifier (DAC) [19], Multi-layer Perceptron (MLP) [40], the SSAE method used in [39], and the CNN model used in [10], for nuclei classification on both the benign and malignant subsets. The KNN classifier is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure.

Fig. 9 Color features in the first hidden layer of the proposed DNN model. a Features on the benign database. b Features on the malignant database


Fig. 10 Examples of classification results. a Nuclei in the benign subset. b Non-nuclei in the benign subset. c Nuclei in the malignant subset. d Non-nuclei in the malignant subset

Table 4 Comparison of the mean Precision (Prec), F1 Score (F1), Accuracy (Acc) (%), and Mean Execution Time (MET, in minutes) with other popular nuclei classification methods and neural network models on the benign subset

Method   Prec    F1      Acc     MET
KNN      77.56   86.63   93.12   9.53
SVM      94.61   94.21   96.56   2.15
K-SVM    94.43   94.94   96.67   1.23
DAC      92.56   93.57   95.35   0.15
RFC      95.45   92.78   96.42   0.30
MLP      95.98   95.16   96.28   19.55
SSAE     96.29   96.55   96.84   80.63
CNN      97.03   96.97   97.13   72.67
Ours     97.88   97.92   98.27   81.43

Table 5 Comparison with the popular methods and neural network models on the malignant subset

Method   Prec    F1      Acc     MET
KNN      83.31   88.31   88.02   6.12
SVM      83.21   82.59   82.10   4.40
K-SVM    82.83   82.67   82.56   5.32
DAC      82.66   81.67   81.71   0.17
RFC      84.29   84.56   84.66   0.27
MLP      86.36   84.13   85.43   14.45
SSAE     88.02   88.69   89.06   65.71
CNN      89.57   89.95   89.83   60.62
Ours     90.04   90.17   90.54   68.76

The SVM and K-SVM are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. The RFC is an ensemble learning method for classification; it operates by constructing a multitude of decision trees at training time and outputs the class that is the mode of all classes (classification) or the mean prediction (regression) of the individual trees.


The DAC is a linear classifier that tries to find a linear combination of features that characterizes or separates two or more classes. The MLP, the SSAE used in [39], and the CNN used in [10] are existing deep learning methods. Tables 4 and 5 present the classification performance results on the benign and malignant subsets, respectively. The following conclusions can be drawn:

Table 6 Comparison of the mean accuracy and standard error (%) with the popular methods and neural network models on the benign and malignant subsets

Method   Benign         Malignant
KNN      91.15 ± 1.80   82.56 ± 2.78
SVM      93.53 ± 1.45   81.78 ± 2.12
K-SVM    92.35 ± 1.38   81.64 ± 1.54
DAC      90.89 ± 1.73   73.22 ± 2.23
RFC      94.15 ± 1.97   79.65 ± 4.20
MLP      95.78 ± 1.43   80.83 ± 4.79
SSAE     96.40 ± 1.25   87.69 ± 4.27
CNN      96.02 ± 1.74   87.33 ± 4.64
Ours     97.98 ± 0.69   88.37 ± 1.90

– All methods achieve higher scores on the benign subset than on the malignant subset (except Prec and F1 for the KNN method). This is consistent with the results of Experiment 1.
– The K-SVM method works well on the benign subset, while it is inferior to the other seven methods on the malignant subset (except the DAC method). The KNN method works well on the malignant subset, while it is inferior to all other methods on the benign subset.
– Our proposed method outperforms all other methods on both subsets. Specifically, the Acc of our proposed method on the benign database is at least 1.14% higher than that of the second-best method, and its Prec is 0.85% higher than that of the second-best method.

Experiment 3 One-tenth of the benign and malignant samples used in the second experiment

were extracted to investigate the performance of the same methods on the benign and malignant subsets under the condition of an insufficient number of samples. In this experiment, the number of hidden units was the same as in the model employed in the second experiment. Mean classification accuracy and standard error are reported in Table 6, which shows that:

– Our proposed method outperforms all other examined classification algorithms on both subsets.
– Compared with the results in Tables 4 and 5, the accuracy of our proposed method decreases only slightly under the insufficient-sample condition: from 98.27% to 97.98% on the benign subset, and from 90.54% to 88.37% on the malignant subset. This means the proposed method still works well despite this challenging condition.

Experiment 4 Two popular types of noise, white Gaussian noise and random pixel corruption, were added to the input image patches to verify the robustness of the proposed method. White Gaussian noise was added to an individual sample x as x_G = x + r_a e, where r_a is the additive noise ratio and e is noise following a standard normal distribution. For the random pixel corruption experiment, randomly selected pixels of the image were replaced with values following a uniform distribution over [0, p_max] at a corruption ratio of r_c, where p_max is the largest pixel value of x. Classification accuracy on the benign and malignant databases with additional Gaussian noise is reported in Fig. 11, and classification accuracy with random pixel corruption is reported in Fig. 12.
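The two corruption schemes are simple to reproduce; the sketch below follows the formulas above, with the random number generator and the helper names as our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(x, ra):
    """White Gaussian noise: x_G = x + ra * e, with e drawn from a standard normal."""
    return x + ra * rng.standard_normal(x.shape)

def random_pixel_corruption(x, rc):
    """Replace a fraction rc of pixels with values uniform on [0, p_max]."""
    xc = x.copy()
    mask = rng.random(x.shape) < rc                       # pixels to corrupt
    xc[mask] = rng.uniform(0.0, x.max(), size=mask.sum())
    return xc
```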


Fig. 11 Comparison of the tested methods' performance on the benign and malignant databases with additional white Gaussian noise. a Benign samples. b Malignant samples



Fig. 12 Comparison of the tested methods' performance on the benign and malignant databases with random pixel corruption. a Benign samples. b Malignant samples

The following observations can be made from our results:

– The proposed method is superior to all other examined classification algorithms on both subsets for additive noise ratios r_a = 0.1 to 0.5.
– The proposed method outperforms all other methods under different random pixel corruption ratios (r_c = 0.1 to 0.5), on both the benign and the malignant database.

Conclusions and future works

In this paper, we propose an end-to-end DNN model for cell nuclei and non-nuclei classification of histopathology images. The model utilizes an SDAE to learn robust features from a large number of H&E-stained image patches in an unsupervised manner. It is suitable for cell nuclei classification where only a small amount of labeled data is available. Patch-based feature learning helps to overcome the challenge of variable cell nuclei size and is robust to noise and unwanted objects. We have introduced a simple and effective method that achieves competitive verification performance on the well-established breast cancer cell dataset of the UCSB Bio-Segmentation Benchmark.

Further investigation of our model may improve its performance. For example, several free parameters (λ, β, the patch size, the number of hidden units, and the number of hidden layers) need to be set individually for different databases


in the present implementation. Determining optimal values for these free parameters is a challenge. Therefore, the exploration of additional theoretical results on parameter selection, and further investigation of the performance of the DNN model by extending it to a complete architecture or combining it with advanced neural networks, is warranted.

Acknowledgements This work is supported by the Fok Ying Tung Education Foundation under Grant 151068, the Foundation for Youth Science and Technology Innovation Research Team of Sichuan Province 2016TD0018, and the National Natural Science Foundation of China under Grant 61332002.

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest. Human and animal rights No animal or human experiments were conducted as part of this research. Informed consent Informed consent was obtained from all individual participants included in the study.

References

1. Basavanhally A, Xu J, Madabhushi A, Ganesan S (2009) Computer-aided prognosis of ER+ breast cancer histopathology and correlating survival outcome with Oncotype DX assay. In: IEEE international symposium on biomedical imaging: from nano to macro (ISBI'09). IEEE, pp 851–854
2. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
3. Chen X, Zhou X, Wong ST (2006) Automated segmentation, classification, and tracking of cancer cell nuclei in time-lapse microscopy. IEEE Trans Biomed Eng 53(4):762–766

4. Cireşan DC, Giusti A, Gambardella LM, Schmidhuber J (2013) Mitosis detection in breast cancer histology images with deep neural networks. Springer, Berlin, pp 411–418
5. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc B 39(1):1–38
6. Fatakdawala H, Xu J, Basavanhally A, Bhanot G, Ganesan S, Feldman M, Tomaszewski JE, Madabhushi A (2010) Expectation maximization-driven geodesic active contour with overlap resolution (EMaGACOR): application to lymphocyte segmentation on breast cancer histopathology. IEEE Trans Biomed Eng 57(7):1676–1689
7. Fukunaga K, Hostetler LD (1975) K-nearest-neighbor Bayes-risk estimation. IEEE Trans Inf Theory 21(3):285–293
8. Gelasca ED, Byun J, Obara B, Manjunath B (2008) Evaluation and benchmark for biological image segmentation. In: 15th IEEE international conference on image processing (ICIP 2008). IEEE, pp 1816–1819
9. Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. AISTATS 9:249–256
10. Hatipoglu N, Bilgin G (2014) Classification of histopathological images using convolutional neural network. In: 2014 4th international conference on image processing theory, tools and applications (IPTA). IEEE, pp 1–6
11. Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554
12. Huang PW, Lee CH (2009) Automatic classification for pathological prostate images based on fractal analysis. IEEE Trans Med Imag 28(7):1037–1050
13. Kowal M (2014) Computer-aided diagnosis for breast tumor classification using microscopic images of fine needle biopsy. Springer, Berlin, pp 213–224
14. Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
15. Larochelle H, Erhan D, Courville A, Bergstra J, Bengio Y (2007) An empirical evaluation of deep architectures on problems with many factors of variation. In: Proceedings of the 24th international conference on machine learning. ACM, pp 473–480
16. LeCun Y, Jackel L, Bottou L, Brunot A, Cortes C, Denker J, Drucker H, Guyon I, Muller U, Sackinger E (1995) Comparison of learning algorithms for handwritten digit recognition. In: International conference on artificial neural networks, vol 60, pp 53–60
17. Liang M, Li Z, Chen T, Zeng J (2015) Integrative data analysis of multi-platform cancer data with a multimodal deep learning approach. IEEE/ACM Trans Comput Biol Bioinf 12(4):928–937. doi:10.1109/TCBB.2014.2377729
18. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
19. McLachlan G (2004) Discriminant analysis and statistical pattern recognition. Wiley, Hoboken
20. Naik S, Doyle S, Agner S, Madabhushi A, Feldman M, Tomaszewski J (2008) Automated gland and nuclei segmentation for grading of prostate and breast cancer histopathology. In: 5th IEEE international symposium on biomedical imaging: from nano to macro (ISBI 2008). IEEE, pp 284–287
21. Ojansivu V, Linder N, Rahtu E, Pietikäinen M, Lundin M, Joensuu H, Lundin J (2013) Automated classification of breast cancer morphology in histopathological images. Diagn Pathol 8(1):1–4
22. Pang B, Zhang Y, Chen Q, Gao Z, Peng Q, You X (2010) Cell nucleus segmentation in color histopathological imagery using convolutional networks. In: 2010 Chinese conference on pattern recognition (CCPR). IEEE, pp 1–5
23. Peng X, Yi Z, Tang H (2015) Robust subspace clustering via thresholding ridge regression. In: AAAI, pp 3827–3833

24. Peng X, Zhao B, Yan R, Tang H, Yi Z (2016) Bag of events: an efficient probability-based feature extraction method for AER image sensors. IEEE Trans Neural Netw Learn Syst 99:1–13
25. Poultney C, Chopra S, Cun YL (2007) Efficient learning of sparse representations with an energy-based model. In: Advances in neural information processing systems, pp 1137–1144
26. Ranzato MA, Huang FJ, Boureau YL, LeCun Y (2007) Unsupervised learning of invariant feature hierarchies with applications to object recognition. In: IEEE conference on computer vision and pattern recognition (CVPR'07). IEEE, pp 1–8
27. Schölkopf B, Smola A, Müller KR (1997) Kernel principal component analysis. Springer, Berlin, pp 583–588
28. Schmah T, Hinton GE, Small SL, Strother S, Zemel RS (2009) Generative versus discriminative training of RBMs for classification of fMRI images. In: Advances in neural information processing systems, pp 1409–1416
29. Scholkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge
30. Sertel O, Kong J, Shimada H, Catalyurek U, Saltz JH, Gurcan MN (2009) Computer-aided prognosis of neuroblastoma on whole-slide images: classification of stromal development. Pattern Recognition 42(6):1093–1103
31. Shin HC, Orton MR, Collins DJ, Doran SJ, Leach MO (2013) Stacked autoencoders for unsupervised feature learning and multiple organ detection in a pilot study using 4D patient data. IEEE Trans Pattern Anal Mach Intell 35(8):1930–1943
32. Shin M, Jang D, Nam H, Lee KH, Lee D (2016) Predicting the absorption potential of chemical compounds through a deep learning approach. IEEE/ACM Trans Comput Biol Bioinf 99:1. doi:10.1109/TCBB.2016.2535233
33. Shlens J (2014) A tutorial on principal component analysis. arXiv preprint arXiv:1404.1100
34. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9
35. Vapnik VN, Vapnik V (1998) Statistical learning theory, vol 1. Wiley, New York
36. Vincent P, Larochelle H, Bengio Y, Manzagol PA (2008) Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th international conference on machine learning. ACM, pp 1096–1103
37. Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol PA (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11:3371–3408
38. Xu J, Xiang L, Hang R, Wu J (2014) Stacked sparse autoencoder (SSAE) based framework for nuclei patch classification on breast cancer histopathology. In: 2014 IEEE 11th international symposium on biomedical imaging (ISBI). IEEE, pp 999–1002
39. Xu J, Xiang L, Liu Q, Gilmore H, Wu J, Tang J, Madabhushi A (2016) Stacked sparse autoencoder (SSAE) for nuclei detection on breast cancer histopathology images. IEEE Trans Med Imag 35(1):119–130
40. Zhang Z, Lyons M, Schuster M, Akamatsu S (1998) Comparison between geometry-based and Gabor-wavelets-based facial expression recognition using multi-layer perceptron. In: Proceedings of the third IEEE international conference on automatic face and gesture recognition. IEEE, pp 454–459


