YNIMG-12289; No. of pages: 14; 4C: 3, 4, 6, 8, 9, 10, 11, 12 NeuroImage xxx (2015) xxx–xxx

Contents lists available at ScienceDirect

NeuroImage journal homepage: www.elsevier.com/locate/ynimg

Automatic segmentation of MR brain images of preterm infants using supervised classification Pim Moeskops a,b, Manon J.N.L. Benders b, Sabina M. Chiţǎ a, Karina J. Kersbergen b, Floris Groenendaal b, Linda S. de Vries b, Max A. Viergever a, Ivana Išgum a a b

Image Sciences Institute, University Medical Center Utrecht, The Netherlands Department of Neonatology, University Medical Center Utrecht, The Netherlands

a r t i c l e

i n f o

Article history: Received 22 January 2015 Accepted 2 June 2015 Available online xxxx Keywords: Automatic brain segmentation Supervised voxel classification Preterm neonatal brain MRI

a b s t r a c t Preterm birth is often associated with impaired brain development. The state and expected progression of preterm brain development can be evaluated using quantitative assessment of MR images. Such measurements require accurate segmentation of different tissue types in those images. This paper presents an algorithm for the automatic segmentation of unmyelinated white matter (WM), cortical grey matter (GM), and cerebrospinal fluid in the extracerebral space (CSF). The algorithm uses supervised voxel classification in three subsequent stages. In the first stage, voxels that can easily be assigned to one of the three tissue types are labelled. In the second stage, dedicated analysis of the remaining voxels is performed. The first and the second stages both use two-class classification for each tissue type separately. Possible inconsistencies that could result from these tissue-specific segmentation stages are resolved in the third stage, which performs multi-class classification. A set of T1- and T2-weighted images was analysed, but the optimised system performs automatic segmentation using a T2-weighted image only. We have investigated the performance of the algorithm when using training data randomly selected from completely annotated images as well as when using training data from only partially annotated images. The method was evaluated on images of preterm infants acquired at 30 and 40 weeks postmenstrual age (PMA). When the method was trained using random selection from the completely annotated images, the average Dice coefficients were 0.95 for WM, 0.81 for GM, and 0.89 for CSF on an independent set of images acquired at 30 weeks PMA. When the method was trained using only the partially annotated images, the average Dice coefficients were 0.95 for WM, 0.78 for GM and 0.87 for CSF for the images acquired at 30 weeks PMA, and 0.92 for WM, 0.80 for GM and 0.85 for CSF for the images acquired at 40 weeks PMA. Even though the segmentations obtained using training data from the partially annotated images resulted in slightly lower Dice coefficients, the performance in all experiments was close to that of a second human expert (0.93 for WM, 0.79 for GM and 0.86 for CSF for the images acquired at 30 weeks, and 0.94 for WM, 0.76 for GM and 0.87 for CSF for the images acquired at 40 weeks). These results show that the presented method is robust to age and acquisition protocol and that it performs accurate segmentation of WM, GM, and CSF when the training data is extracted from complete annotations as well as when the training data is extracted from partial annotations only. This extends the applicability of the method by reducing the time and effort necessary to create training data in a population with different characteristics. © 2015 Elsevier Inc. All rights reserved.

Introduction The brain develops rapidly during the third trimester of pregnancy; most sulci are formed during this phase of development. Neurodevelopmental impairments are a risk for extremely preterm infants (Saigal and Doyle, 2008) because an essential part of their brain development takes place ex utero. The brain development of preterm infants can be evaluated using MR brain images. These images can provide

E-mail address: [email protected] (P. Moeskops).

quantitative descriptors such as volume, surface area, and morphology of the cortex, which may help in identifying which children are at risk of complications due to preterm birth (Dubois et al., 2008a; Rathbone et al., 2011), especially when evaluated longitudinally. The accurate segmentation of different brain tissue types in these MR images is a prerequisite for obtaining quantitative descriptors. Manual segmentation of brain tissue volumes is a very time-consuming task and prone to inter- and intra-observer variability. Therefore, accurate automatic algorithms are needed. Most methods in the literature focus on brain segmentation of either full-term infants or preterm infants scanned at term-equivalent age

http://dx.doi.org/10.1016/j.neuroimage.2015.06.007 1053-8119/© 2015 Elsevier Inc. All rights reserved.

Please cite this article as: Moeskops, P., et al., Automatic segmentation of MR brain images of preterm infants using supervised classification, NeuroImage (2015), http://dx.doi.org/10.1016/j.neuroimage.2015.06.007

2

P. Moeskops et al. / NeuroImage xxx (2015) xxx–xxx

(Prastawa et al., 2005; Song et al., 2007; Weisenfeld and Warfield, 2009; Shi et al., 2010; Gui et al., 2012; Srhoj-Egekher et al., 2013; Cardoso et al., 2013; Anbeek et al., 2013; Wang et al., 2014). Only a few methods were applied to images acquired at earlier ages (Xue et al., 2007; Kuklisova-Murgasova et al., 2011) and a few were applied to foetal images (Habas et al., 2010; Gholipour et al., 2011; Wright et al., 2012). A popular segmentation approach is the use of (population specific) atlases (Prastawa et al., 2005; Weisenfeld and Warfield, 2009; Shi et al., 2010; Habas et al., 2010; Kuklisova-Murgasova et al., 2011; Srhoj-Egekher et al., 2013; Cardoso et al., 2013; Anbeek et al., 2013; Makropoulos et al., 2014; Wang et al., 2014). While these methods generally obtain good segmentation results, and are typically robust to certain abnormalities, their reliance on atlases might limit their application to data sets that are not well represented by the atlas. This might affect the segmentation of tissue types not included in the available atlas or having tissue definitions different from those in the atlas. Moreover, application of an atlas might be suboptimal if the patient population or the acquisition protocol yields images that differ from the atlas to the extent that sufficiently accurate registrations cannot be obtained. Since the brain of premature infants develops rapidly, it is advantageous to use an atlas obtained at the same age as the target population. Therefore, Kuklisova-Murgasova et al. (2011) and Serag et al. (2012) developed 4D atlases, visualising temporal changes between 28 and 44 weeks postmenstrual age (PMA). Other approaches to overcome differences between images were presented by Xue et al. (2007), who simulated a subject-specific atlas by initial segmentation using k-means clustering, and Cardoso et al. (2013), who proposed an iterative relaxation approach that adapts the atlas priors to the data. A different supervised segmentation approach was described by Anbeek et al. (2013). This method uses a pattern recognition technique where brain voxels are labelled based on spatial and intensity features in a population-specific atlas space. However, this method uses a large set of manually segmented images as training data, and is therefore limited in application to data that is well represented by the training set. In contrast to this, Gui et al. (2012) described an unsupervised method that thus does not require training data, but relies on knowledge of brain morphology. The advantage of this method is that it does not depend on segmentation of a (large) set of representative training images, nor on a specific atlas. However, this method has only been tested on coronal images of full-term infants and preterm infants scanned at term-equivalent age. The assumptions about morphology might not hold when the infant is imaged at different ages or using different scan directions, therefore likely specific parameter tuning would be required in these cases. Many of the recently published methods were evaluated within the NeoBrainS12 challenge. While all methods obtained accurate results, many problems still exist, especially the segmentation of cortical grey matter is shown to be difficult (Išgum et al., 2015). Compared to an atlas-based approach, a supervised voxel classification approach might be beneficial in the local analysis of intensity characteristics. This is especially advantageous in the segmentation of details of the cortex because of the large variability in cortical morphology among patients. The accurate segmentation of cortical grey matter (GM) and its bordering tissue types, unmyelinated white matter (WM) and cerebrospinal fluid in the extracerebral space (CSF) is particularly important in the evaluation of cortical characteristics and brain development in preterm infants (Dubois et al., 2008b; Rodriguez-Carranza et al., 2008; Pienaar et al., 2008; Hill et al., 2010; Awate et al., 2010; Moeskops et al., 2013; Moeskops et al., in press). Hence, the major goal of the currently presented work is the development of a method for the accurate automatic segmentation of WM, GM, and CSF. The method employs intensity and spatial characteristics and is based on sequential supervised voxel classification. In contrast to other supervised approaches, voxel classification allows training of the method using a subset of the labelled voxels per training image. Because manual annotation of whole brain images is an extremely time-

consuming task, being able to only partially annotate the images could substantially reduce the time and effort in creating training data. In this paper, the training data was first constructed by using a number of randomly selected voxels from a set of representative, manually annotated images, and second by only using training data from a number of selected regions in these annotated images. The method was evaluated on two different datasets: coronal MR images of preterm infants acquired at 30 weeks PMA and coronal images acquired at term-equivalent age (40 weeks PMA). A preliminary version of this method was presented by Chiţǎ et al. (2013). Data The segmentation algorithm was evaluated on a set of preterm born infants imaged at 30 and 40 weeks PMA. Acquisition Ten preterm infants with gestational ages of 26.9 ± 1.1 (mean ± standard deviation) weeks at birth were imaged at 30.9 ± 0.6 weeks PMA on a Philips Achieva 3 T scanner. Five of these infants had a follow-up scan at the term-equivalent age of 41.3 ± 1.3 weeks PMA. All patients were sedated with oral chloral hydrate (30–60 mg/kg). For each patient, coronal T1- and T2-weighted images were acquired and reconstructed with a resolution of 0.34 × 0.34 × 2.0 mm3 at 30 weeks PMA and 0.35 × 0.35 × 1.2mm3 at 40 weeks PMA, in accordance with standard clinical imaging protocols at the neonatal intensive care unit of the University Medical Center Utrecht, The Netherlands. Table 1 lists detailed scanning parameters. No brain pathology was visible in these images, and the cognitive performance of all infants at 15 months was within the normal range when assessed using the Griffiths mental development scales. Reference standard The reference standard for each tissue type was provided by manual annotations in the T2-weighted images. This means that in each T2weighted scan, WM, GM and CSF voxels were manually assigned their corresponding tissue label by mouse painting. Details about the manual segmentation protocol were described by Išgum et al. (2015) and can be found at http://neobrains12.isi.uu.nl. These annotations have been performed either by MDs who were working towards a PhD in neonatology, or by trained medical students, and were verified independently by three neonatologists with at least ten years of experience with neonatal MRI scans. Example reference segmentations are shown in Fig. 1. Evaluation To evaluate the results of the automatic segmentation, the overlap between the obtained results and the reference standard was determined with the Dice coefficient (Zou et al., 2004). Furthermore, to evaluate how close the boundaries of the tissues were between the automatic segmentation and the reference segmentation, the modified Hausdorff distance was determined (Huttenlocher et al., 1993). The modified Hausdorff distance was defined as the 95th-percentile of all shortest Euclidean distances between the reference segmentation and the automatic segmentation and vice versa. This measure was chosen instead of the Hausdorff distance to determine whether the segmentation had a tendency to have multiple voxels further from the reference boundary, rather than to determine the possible existence of a single outlier. Second human expert annotations To estimate inter-observer variability, five of the images acquired at 30 weeks PMA and all five images acquired at 40 weeks PMA were manually annotated by an additional human expert. In these scans, the three

Please cite this article as: Moeskops, P., et al., Automatic segmentation of MR brain images of preterm infants using supervised classification, NeuroImage (2015), http://dx.doi.org/10.1016/j.neuroimage.2015.06.007

P. Moeskops et al. / NeuroImage xxx (2015) xxx–xxx

3

Table 1 The acquisition parameters for the coronal T1- and T2-weighted images acquired at 30 and 40 weeks PMA. The images are acquired with Turbo Field Echo (TFE) and Turbo Spin Echo (TSE) sequences. The dimensions and voxel sizes are presented in the right–left, inferior–superior, and anterior–posterior directions, respectively. 30 weeks PMA

Sequence Acquisition protocol Repetition time [ms] Echo time [ms] Scan time [min] Field of view [mm3] Acquisition matrix Reconstruction matrix Acquired voxel size [mm3] Reconstructed voxel size [mm3] Coil

40 weeks PMA

T1

T2

T1

T2

TFE 3D 9.4 4.6 4.44 101 × 130 × 100 108 × 140 × 50 384 × 384 × 50 0.93 × 0.93 × 2.0 0.34 × 0.34 × 2.0 Philips SENSE flex coils

TSE Multislice 2D 10 085 120 6.23 104 × 130 × 100 162 × 244 × 50 384 × 384 × 50 0.64 × 0.53 × 2.0 0.34 × 0.34 × 2.0

TFE 3D 9.5 4.6 7.02 200 × 200 × 132 220 × 256 × 110 256 × 256 × 110 0.91 × 0.78 × 1.2 0.78 × 0.78 × 1.2 Philips SENSE head coil

TSE Multislice 2D 4847 150 5.05 180 × 180 × 132 202 × 232 × 110 512 × 512 × 110 0.89 × 0.78 × 1.2 0.35 × 0.35 × 1.2

tissue types were segmented in three slices. The selected slices were those numbered as 25th percentile, median and 75th percentile of the slices containing brain tissue. To evaluate the performance of the second human expert, the Dice coefficient and the modified Hausdorff distance were calculated. Note that because these annotations have been performed in three slices only, the modified Hausdorff distance was calculated for each of these three slices in 2D. These manual annotations were performed according to the NeoBrainS12 protocol by five medical students trained for this task. None of them had been involved in the annotation of the reference standard of the same images. All annotations were checked and corrected by expert neonatologists. Methods The proposed segmentation method consists of three stages, each using supervised voxel classification. In the first stage, voxels that

can easily be assigned to one of the tissue classes are labelled. In the second stage, dedicated analysis of the remaining voxels is performed. Both the first and the second stages use two-class classifiers for each tissue type separately. As a consequence, voxels might be assigned to multiple tissue types. This is resolved in the third stage, which performs multi-class classification of such voxels. The classifications are based on intensity values and spatial characteristics of each voxel within the brain. Prior to these segmentation steps, a brain mask was automatically generated based on the T2-weighted image using the Brain Extraction Tool (BET) from the FMRIB Software Library (FSL) (Smith, 2002); the same settings were used for all images. Only voxels within this mask were further analysed. To determine the parameters for the segmentation method while allowing independent evaluation of the segmentation, the set of scans acquired at 30 weeks PMA was divided into two subsets: Set A and Set B, each containing five images. Set A was used for the method development

Fig. 1. Example reference segmentations for images acquired at 30 weeks PMA (top) and 40 weeks PMA (bottom). Left: the T2-weighted image. Middle: the corresponding reference segmentation, where WM is shown in yellow, GM in blue and CSF in red. Right: an enlarged view of the lateral sulcus in the same image with the corresponding reference segmentation for GM shown as contours.

Please cite this article as: Moeskops, P., et al., Automatic segmentation of MR brain images of preterm infants using supervised classification, NeuroImage (2015), http://dx.doi.org/10.1016/j.neuroimage.2015.06.007

4

P. Moeskops et al. / NeuroImage xxx (2015) xxx–xxx

Brain mask

T2 image

Stage1

and parameter tuning, while Set B was used for independent evaluation. The patients in Set A did not have a follow-up image acquired at 40 weeks PMA, while the images in Set B did have a follow-up image acquired at 40 weeks PMA. This set-up ensured fully independent evaluation of both the images acquired at 30 weeks PMA in Set B and the images acquired at 40 weeks PMA. The images acquired at 40 weeks PMA were not included in the method development, nor in the parameter tuning. The segmentation parameters were determined in leave-one-subjectout cross-validation. This means that for each image in Set A, the segmentation was performed using the remaining four images in this set as training data. An overview of the determined parameters is shown in Table 2. No movement was present between the T1- and T2-weighted images, and therefore no alignment between these images was performed. An overview of the method is shown in Fig. 2.

1st stage features

WM GM CSF classifier classifier classifier

1st stage training data

First segmentation stage

WM probability GM probability CSF probability

Stage 2

2nd stage features Voxels not selected for classification in 2nd stage

WM GM CSF classifier classifier classifier

2nd stage training data

WM probability GM probability CSF probability

Stage 3

The aim of the first stage was to label the voxels that can easily be assigned to a tissue class within the brain mask. WM, GM and CSF can be differentiated based on their spatial characteristics and their intensity values. Therefore, each voxel was classified based on a set of features describing these properties. First, because WM, GM and CSF appear at characteristic locations within the brain, normalised coordinates (x, y, and z) starting from the edge of a bounding box around the brain mask were calculated. Second, because CSF, GM and WM are most likely to appear in that specific order from the skull inwards, the shortest Euclidean distance to the boundary of the brain mask was computed. Next, because of their importance in differentiating between WM, GM and CSF (Anbeek et al., 2013), the intensities from both the T1- and the T2-weighted image were added to the set of features. Finally, because not only voxel intensity, but also contextual intensity characteristics can provide valuable information for differentiating tissue types, Gaussian filters were applied to the T1- and T2-weighted images, using scales σ = 0.5, 1.0, 1.5, and 2.0 mm. To further characterise the local texture, derivatives up to and including the second order: Lx, Ly, Lz, Lxx, Lyy, Lzz, Lxy, Lxz, and Lyz were computed using the same scales. The x-direction was defined from right to left patient side, the ydirection from inferior to superior, and the z-direction from anterior to posterior. In total this comprised a set of 86 features: 4 spatial and 82 intensitybased features. Prior to classification, normalisation to zero mean and unit variance was performed for each image. These features were

3rd stage features Voxels not selected for classification in 3rd stage

Multiclass classifier

3rd stage training data

Table 2 Overview of the segmentation parameters, and the corresponding values determined in the experiments. The same parameter values were used to segment the images acquired at 30 and 40 weeks PMA. The feature sets are separately presented in Table 3. Brain mask (BET) Fractional intensity threshold

0.6

Segmentation

1st stage Classifiers Training set size Lower and upper thresholds

kNN, 101 neighbours WM: 40%, GM: 20%, CSF: 50% (of 4 images) WM: 0.35 and 0.80, GM: 0.35 and 0.95, CSF: 0.15 and 0.90

2nd stage Classifiers Training set size Thresholds Morphological operations

SVM, Radial basis function, σ = 16, C = 4 10% of 4 images WM: 0.45, GM: 0.35, CSF: 0.45 WM: opening (0.7 mm), GM: closing (0.35 mm), CSF: closing (0.35 mm)

3rd stage Region of interest Classifier Training set size

Distance of 1.5 mm or less to at least two other tissue types Multi-class kNN, 55 neighbours 30% of 4 images

Fig. 2. Proposed pipeline showing the final optimised segmentation method, which utilises the T2-weighted image only.

input for three independent two-class classifiers, one for each of the three tissue types. This means that for each classifier, the positive class was one of the three tissue types, while all remaining voxels represented the negative class. In preliminary experiments, several supervised classifiers were evaluated. The best performance was obtained using k-nearest neighbour (kNN) classifiers. The posterior probabilities1 for

1 Note that in this paper, the term posterior probability refers to the posterior probability per classifier.

Please cite this article as: Moeskops, P., et al., Automatic segmentation of MR brain images of preterm infants using supervised classification, NeuroImage (2015), http://dx.doi.org/10.1016/j.neuroimage.2015.06.007

P. Moeskops et al. / NeuroImage xxx (2015) xxx–xxx

tissue type t, P1t (x), of the samples analysed in the first stage, S1, were computed based on the kNN classification as: P t1 ðxÞ ¼

Nt ðxÞ k

ð1Þ

where Nt(x) is the number of neighbours of sample x ∈ S1 labelled as tissue type t in the training set. After evaluating the performance using values of k between 1 and 1000, k was set to 101 for all three tissue types. For each of the classifiers, sequential forward floating selection (SFFS) (Pudil et al., 1994; Jain and Zongker, 1997) with accuracy as evaluation criterion was performed in leave-one-subject-out crossvalidation. SFFS is an iterative algorithm that at every iteration adds the best single feature to an initially empty feature set and removes features as long as that improves performance. The maximum number of features that was allowed to be selected per experiment was set to 25. The final set of features for the task was composed of those features that were selected in at least 50% of the experiments. The features selected for segmentation of each tissue type are listed in Table 3. Note that no features calculated from the T1-weighted image were selected, making the classification independent of that image. In each leave-one-subject-out cross-validation experiment, training data from four images were used. Each voxel in the brain masks obtained using BET represented one sample, adding up to a total of about 800 000 to 1 000 000 samples per image. Maximum performance according to the average Dice coefficients was obtained by randomly selecting 40%, 20% and 50% of the training samples for WM, GM and CSF, respectively, see Fig. 3, top row. This allowed keeping computational costs relatively low. The classification resulted in posterior probabilities for every voxel to be WM, CSF or GM (Fig. 4, top row). To identify voxels that can directly be assigned to a tissue class and those that need to be reclassified, two Table 3 The features selected using SFFS for the first and third segmentation stages. T2 indicates that the feature was computed using the T2-weighted image, and Pw, Pg and Pc indicate that the feature was computed on the posterior probability image of the second segmentation stage for WM, GM and CSF, respectively. The subscripts x, y, and z indicate the direction of the derivative calculated at various scales σ. Note that features used in the second segmentation stage are not provided. This stage utilised all spatial and intensity features computed using the T2-weighted images and all features based on the probabilistic results of the first segmentation stage. 1st stage

Normalised x Normalised y Normalised z Distance to brain mask T2-image intensity T2 with σ = 0.5mm T2,x with σ = 0.5 mm T2,y with σ = 0.5 mm T2,z with σ = 0.5 mm T2,xx with σ = 0.5 mm T2,yy with σ = 0.5 mm T2,zz with σ = 0.5 mm T2,xy with σ = 0.5 mm T2 with σ = 1.0mm T2,zz with σ = 1.0 mm T2 with σ = 1.5mm T2 with σ = 2.0mm T2,x with σ = 2.0 mm Distance to WM Distance to CSF Pw,xx with σ = 1.0 mm Pw,yy with σ = 1.0 mm Probability for GM Pg,xx with σ = 0.5 mm Number of selected features

3rd stage

WM

GM

CSF

× × × × × × × × × × ×

× × × × × × × × × × × ×

× × × × × × × × × × × ×

×

× × × × ×

× × × ×

15

×

14

17

thresholds on the posterior probabilities were determined for each of the tissue types: A lower threshold assigning voxels with a posterior probability for all tissue types below that value to the background class, and an upper threshold labelling voxels above that threshold to the tissue class. The remaining voxels were selected for further analysis. The thresholds were set such that the percentage of voxels that were incorrectly labelled in this stage was less than 5% for Set A. The determined thresholds were: 0.35 and 0.80 for WM, 0.35 and 0.95 for GM, and 0.15 and 0.90 for CSF. Second segmentation stage The first stage identified voxels for further analysis based on the posterior probability for each class. The total number of such samples was 200 000 to 300 000 per image, which corresponds to 27–31% of the samples analysed in the first stage. For these voxels, in addition to the previously computed features, the output of the first stage, i.e. the probabilities of a voxel to belong to each of the three tissues, was transformed into features. To this end, the posterior probabilities for each tissue type and their Gaussian derivatives (up to and including the second order) with σ = 0.5 and 1.0 mm were computed and added to the initial set of features. Higher scales were not considered because this segmentation stage focuses mostly on voxels along the tissue boundaries, and therefore local characteristics of the image were more important than in the previous stage. The total feature set comprised 149 features. All features were normalised to zero mean and unit variance. Preliminary feature selection experiments with a kNN-classifier showed that the T1-features were not important for the classification in the second stage either. Therefore, only intensity features derived from the T2weighted scans were used, resulting in a total of 108 features. These features were input for three independent two-class support vector machine (SVM) classifiers (Cortes and Vapnik, 1995) with radial basis functions as their respective kernels. The LIBSVM (Chang and Lin, 2011) implementation was used. To determine the optimal number of training samples for the classification, the classification performance was evaluated with respect to the training set size (Fig. 3, middle row). Considering the very limited increase in performance and the substantial increase in training time for the SVM classifier with increasing the training set size, we set the number of samples to 10% of the initial number of samples in the second stage, which corresponded to 20 000 to 30 000 samples selected randomly from each training image. The posterior probabilities for tissue type t, P2t (x), were computed as: 

Multi-class

P t2 ðxÞ ¼

×

× × × × × × 7

5

t P SVM ðxÞ P t1 ðxÞ

if x ∈ S2 otherwise

ð2Þ

t was the probability resulting from the SVM classification where PSVM computed according to the method by Platt, (1999) using the improvement by Lin et al. (2007), and S2 was the set of samples analysed in the second segmentation stage. An example of this combined probabilistic output of the first and second stages for WM, GM and CSF, is shown in Fig. 4, bottom row. To obtain a binary segmentation result based on these probabilities, the optimal thresholds were determined by the segmentation accuracy for WM, GM, and CSF, evaluated using the Dice coefficient. Based on these results, the thresholds in the second stage were set to 0.45 for WM, 0.35 for GM, and 0.45 for CSF. As this could result in isolated (clusters of) voxels, only the largest connected component in 3D was retained per tissue type. As a consequence of voxel classification, the tissue boundaries could appear somewhat unsmooth. Therefore, to slightly smooth this result, in-plane morphological operations were performed: opening using a circular structuring element with a radius of 0.70 mm for WM, and closing using a circular structuring element with a radius of 0.35 mm for GM and CSF.

Please cite this article as: Moeskops, P., et al., Automatic segmentation of MR brain images of preterm infants using supervised classification, NeuroImage (2015), http://dx.doi.org/10.1016/j.neuroimage.2015.06.007

6

P. Moeskops et al. / NeuroImage xxx (2015) xxx–xxx WM

0.94

GM

0.8

CSF 0.84

0.935 0.75

0.925 0.92 0.915 0.91

Patient 1 Patient 2 Patient 3 Patient 4 Patient 5

0.905 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.65 Patient 1 Patient 2 Patient 3 Patient 4 Patient 5

0.6

0.7

0

0.1

0.2

0.3

0.4

0.5

0.6

Ratio of samples used from every training image

WM

GM 0.82

0.955

0.8

0.95 0.945 0.94 0.935

Patient 1 Patient 2 Patient 3 Patient 4 Patient 5

0.93

0.76 Patient 1 Patient 2 Patient 3 Patient 4 Patient 5

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Ratio of samples used from every training image CSF 0.9 0.88

0.78 0.76 0.74 0.72

Patient 1 Patient 2 Patient 3 Patient 4 Patient 5

0.7 0.68

0.925 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5

0.78

0.72

Dice coefficient

0.96

0.8

0.74

0.7

Ratio of samples used from every training image

Dice coefficient

Dice coefficient

0.9

0.82

Dice coefficient

Dice coefficient

Dice coefficient

0.93

0.86 0.84 0.82 0.8

Patient 1 Patient 2 Patient 3 Patient 4 Patient 5

0.78 0.76

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5

Ratio of samples used from every training image

Ratio of samples used from every training image

Ratio of samples used from every training image

WM

GM

CSF

0.82

0.955

0.8

0.95 0.945 0.94

Patient 1 Patient 2 Patient 3 Patient 4 Patient 5

0.935 0.93

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Ratio of samples used from every training image

0.88

Dice coefficient

0.96

Dice coefficient

Dice coefficient

0.9

0.78 0.76 0.74 0.72

Patient 1 Patient 2 Patient 3 Patient 4 Patient 5

0.7 0.68

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Ratio of samples used from every training image

0.86 0.84 0.82 0.8 0.78 0.76

Patient 1 Patient 2 Patient 3 Patient 4 Patient 5

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Ratio of samples used from every training image

Fig. 3. Dice coefficients obtained in the first (top), second (middle) and third (bottom) segmentation stages using different training set sizes for WM, GM, and CSF, from left to right. The horizontal axes show the ratios of randomly selected samples in every training image. Note that the range of the vertical axis is scaled separately per graph.

Third segmentation stage Owing to the independent segmentation of each tissue in the first and second stages, some voxels might have been assigned to more than one tissue class, while others, in between tissue classes might have remained unclassified. These regions were resolved in the third segmentation stage. Only unlabelled voxels which were in the vicinity (Euclidean distance of 1.5 mm or less) of more than one of the already segmented tissue types were considered. The total number of samples selected for reclassification was about 30 000 to 50 000 per image in the third stage, which corresponded to 3–5% of the samples analysed in the first stage. The classification was performed based on the same features as in the previous stage, but using the updated posterior probabilities of the second stage. In addition, the shortest Euclidean distances to the binary segmentation of the second stage for each tissue type (i.e. the distances to WM, GM and CSF) were computed. All features were normalised to zero mean and unit variance. Based on this set of features, a four-class classification (WM, GM, CSF and background) was performed using a kNN classifier. The maximum performance of this classifier was reached with 55 neighbours. This was evaluated in terms of Dice coefficient for the range of k from 1

until 95 neighbours. The best features were selected using SFFS with accuracy as evaluation criterion (Table 3). Also here, the performance of the classification was evaluated against the size of the training set (Fig. 3, bottom row). Based on this, the size of the training set was set to 30% of the initial samples in the third stage, which corresponded to 10 000 to 15 000 samples selected randomly from each training image. The posterior probability, Pt3, of the samples analysed in the third stage, S3, was computed as: P t3 ðxÞ ¼

N t ðx Þ k

ð3Þ

where x ∈ S3. The voxels were assigned to the tissue with the highest posterior probability. Experiments and results To evaluate the performance of the algorithm and its robustness to variations in the data, multiple experiments were performed. First, the optimised method was evaluated for the images acquired at 30 (Set A and Set B) and 40 weeks PMA. Second, the method was evaluated

Please cite this article as: Moeskops, P., et al., Automatic segmentation of MR brain images of preterm infants using supervised classification, NeuroImage (2015), http://dx.doi.org/10.1016/j.neuroimage.2015.06.007

P. Moeskops et al. / NeuroImage xxx (2015) xxx–xxx

7

Fig. 4. An example slice from one of the T2-weighted images, and the corresponding probabilistic output of the first (top) and second (bottom) segmentation stages for WM, GM, and CSF, from left to right.

using the images within the NeoBrainS12 challenge. Third, the method was evaluated on an image of a patient with brain injury and an image containing acquisition artefacts. Next, the results were compared to the performance of a second human expert. Finally, to motivate our choice for multiple two-class classifiers, an (iterative) multi-class classification approach was evaluated. Evaluation in images acquired at 30 weeks PMA Set A and Set B were segmented using training data selected in two different ways: in the first run, random selection from all available training samples was performed, whereas in the second run, samples were only selected from a limited number of manually selected regions. Training the classifiers using random selection from all samples First, the segmentation performance was tested on the unseen set of images (Set B) using training data derived from Set A. The average Dice coefficients were: 0.95 for WM, 0.81 for GM and 0.89 for CSF (Fig. 5, Experiment SetB30). The results in terms of average modified Hausdorff distances were: 0.14 mm for WM, 0.60 mm for GM and 0.51 mm for CSF. Second, to illustrate the performance when a larger number of images are used as training data, leave-one-subject-out cross-validation was performed over all ten images (Set A and Set B); the average Dice coefficients were: 0.96 for WM, 0.82 for GM and 0.90 for CSF (Fig. 5, Experiment LOSO30). The average modified Hausdorff distances were: 0.03 mm for WM, 0.49 mm for GM and 0.49 mm for CSF. Note that, in this experiment, the images used in determining the parameters (Set A) were mixed with the independent images (Set B), which could result in a positive bias with respect to the performance. Training the classifiers using samples from annotated regions In every segmentation stage, only a percentage of randomly selected samples from each image was used for training data. This suggests that it might not be necessary to manually annotate complete volumes in multiple images to build a training set; annotating only part of the images might suffice. The manual annotation of WM, GM and CSF in the complete scans took about 45 h per image for the images acquired at 30 weeks PMA and about 85 h per image for the images acquired at 40 weeks PMA when done by a trained expert. Therefore, limiting the

manual annotation to a few regions would substantially reduce the time needed to create a training set. To determine whether it would be feasible to train the algorithm using only several small representative manually annotated regions in the images and thus reduce the effort in creating training data, manual selection of such regions was performed. The regions were evenly spread in each image, and especially parts were chosen that are known to be difficult to segment based on the previous experiments. In each of the five images contained in Set A, 150 circular regions with a diameter of 35 voxels (11.8 mm) were selected. This diameter was chosen as it is large enough to capture the important structures in the images, yet small enough to yield regions that can be manually segmented in a reasonable time. The number of samples within these regions was about 125 000 per image, which corresponded to 12–16% of all available samples within the brain mask. An example of the selected regions is shown in Fig. 7, left. In the first segmentation stage, all samples contained in the selected regions in the five training images were used as training data for all three classifiers. In the second stage, the training set size was reduced, by random selection, to the same number of samples as determined in the optimisation experiments (Methods: Second segmentation stage). In the third stage, because of the small number of available training samples, all training samples were used. The average Dice coefficients were: 0.94 for WM, 0.78 for GM and 0.87 for CSF (Fig. 5, Experiment Regions30). The average modified Hausdorff distances were: 0.34 mm for WM, 0.81 mm for GM and 0.81 mm for CSF. Finally, to compare the segmentation of Experiment Regions30 with random sampling from the complete brain, additional experiments were performed where the exact same number of samples was randomly selected from all voxels within the brain mask. The average Dice coefficients were: 0.95 for WM, 0.80 for GM and 0.89 for CSF (Fig. 5, Experiment Random30). The average modified Hausdorff distances were: 0.14 mm for WM, 0.64 mm for GM and 0.47 mm for CSF. Evaluation in images acquired at 40 weeks PMA The segmentation parameters were determined previously using Set A, thus only using scans acquired at 30 weeks PMA. To evaluate the performance of the proposed method on a different set of images, the same method, with the exact same parameters, was applied to the five images

Please cite this article as: Moeskops, P., et al., Automatic segmentation of MR brain images of preterm infants using supervised classification, NeuroImage (2015), http://dx.doi.org/10.1016/j.neuroimage.2015.06.007

8

P. Moeskops et al. / NeuroImage xxx (2015) xxx–xxx WM

GM

0.84

0.97

0.82

Dice coefficient

Dice coefficient

0.965 0.96 0.955 0.95 0.945

0.8

0.78

0.76

0.94 0.74

0.935 0.93

SetB30

LOSO30

Regions30 Random30 Experiment

NB30

Manual30

0.72

SetB30

LOSO30

Regions30 Random30 Experiment

NB30

Manual30

CSF Patient 1 Patient 2 Patient 3 Patient 4 Patient 5 Patient 6 Patient 7 Patient 8 Patient 9 Patient 10

0.92

Dice coefficient

0.9

0.88

0.86

0.84

0.82

0.8

SetB30

LOSO30

Regions30 Random30 Experiment

NB30

Manual30

Fig. 5. The segmentation performance evaluated using Dice coefficients for each of the ten images acquired at 30 weeks PMA. Six different experiments were performed. Experiment SetB30 was performed using the unseen Set B as test set and Set A as training set. Experiment LOSO30 was performed by leave-one-subject-out cross-validation over all ten images (Set A and Set B). Experiment Regions30 was performed with Set B as test set using training samples originating only from specific manually annotated regions in Set A. Experiment Random30 was performed using the same settings and the exact same number of training samples as in Experiment Regions30, but the samples were selected randomly. Experiment NB30 was performed using only the two training images provided in the NeoBrainS12 challenge. Experiment Manual30 was performed by manual annotation in three slices by a second human expert. In addition to the results per image, boxplots for each experiment are shown in grey.

acquired at 40 weeks PMA. Owing to the limited number of available images with manual annotations, the experiments were performed in a leave-one-subject-out manner. Because the goal was to perform segmentation using only parts of the training images, thus circumventing the need for expensive manual annotations, these scans were segmented using training data originating from the annotated regions only. To obtain the same number of training samples as in the experiment with the images acquired at 30 weeks PMA, 188 regions were annotated. These regions contained only 4–5% of the samples from the brain mask of the images acquired at 40 weeks PMA. An example of the selected regions is illustrated in Fig. 7, right. The average Dice coefficients were: 0.92 for WM, 0.80 for GM and 0.85 for CSF (Fig. 8, Experiment LOSO40). The average modified Hausdorff distances were: 0.35 mm for WM, 0.50 mm for GM and 0.82 mm for CSF. An example of the resulting segmentation for one of the images is shown in Fig. 9.

Evaluation in NeoBrainS12 images The NeoBrainS12 challenge compares the performance of automatic methods developed for the segmentation of neonatal MR images. Note that the images used in the NeoBrainS12 challenge are a subset of the images used in this paper.

The NeoBrainS12 challenge provides only two images acquired at 30 weeks PMA as training data. To directly compare the results in this paper with the results in the challenge,2 the five test images acquired at 30 weeks PMA as well as the five test images acquired at 40 weeks PMA that are available in the challenge were segmented using only these two training images. This means that also for the images acquired at 40 weeks PMA, the training data were extracted from images acquired at 30 weeks PMA, which, in addition to their large anatomical differences, were made with a different acquisition protocol (see Table 1). The average Dice coefficients for the images acquired at 30 weeks PMA were: 0.94 for WM, 0.76 for GM and 0.85 for CSF (Fig. 5, Experiment NB30). The average modified Hausdorff distances were: 0.34 mm for WM, 0.97 mm for GM and 1.28 mm for CSF. The average Dice coefficients for the images acquired at 40 weeks PMA were: 0.88 for WM, 0.76 for GM, 0.76 for CSF (Fig. 8, Experiment NB40). The average modified Hausdorff distances were: 0.89 mm for WM, 0.87 mm for GM and 1.75 mm for CSF. For all three tissue types, these segmentations achieved the best performance in terms of Dice coefficient for the images acquired at 30 weeks PMA, and the second best performance in terms of Dice coefficient for the images acquired at 40 weeks PMA.

2

http://neobrains12.isi.uu.nl/mainResults.php.

Please cite this article as: Moeskops, P., et al., Automatic segmentation of MR brain images of preterm infants using supervised classification, NeuroImage (2015), http://dx.doi.org/10.1016/j.neuroimage.2015.06.007

P. Moeskops et al. / NeuroImage xxx (2015) xxx–xxx

9

Fig. 6. Segmentation of four slices of one image acquired at 30 weeks PMA. The left column shows the T2-images, the middle column presents the automatic segmentation results, and in the right column shows the reference segmentation.

Evaluation in images with artefacts or abnormalities

Evaluation of second human expert segmentations

To estimate the performance on images containing imaging artefacts or abnormalities, two additional images were segmented. First, segmentation was performed for an image suffering from motion artefacts. Second, segmentation was performed for an image of a patient with a porencephalic cyst following periventricular haemorrhagic infarction, which occurs typically in preterm born infants. Because manual reference annotations were not available for these data, the results were only qualitatively evaluated. The results are shown in one slice for each of the images in Fig. 10.

The average Dice coefficients of a second human expert were: 0.96 for WM, 0.80 for GM, and 0.89 for CSF for the images acquired at 30 weeks PMA (Fig. 5, Experiment Manual30), and 0.93 for WM, 0.79 for GM, and 0.86 for CSF for the images acquired at 40 weeks PMA (Fig. 8, Experiment Manual40). The average modified Hausdorff distances were: 0.18 mm for WM, 0.42 mm for GM and 1.11 mm for CSF for the images acquired at 30 weeks PMA and 0.35 mm for WM, 0.55 mm for GM and 0.79 mm for CSF for the images acquired at 40 weeks PMA.

Please cite this article as: Moeskops, P., et al., Automatic segmentation of MR brain images of preterm infants using supervised classification, NeuroImage (2015), http://dx.doi.org/10.1016/j.neuroimage.2015.06.007

10

P. Moeskops et al. / NeuroImage xxx (2015) xxx–xxx

Fig. 7. Example illustrating the selected training regions in two T2-weighted images. Left: image acquired at 30 weeks PMA. Right: image acquired at 40 weeks PMA. WM samples are shown in yellow, GM samples in blue, CSF samples in red, and background samples in green.

Evaluation of an (iterative) multi-class classification approach To demonstrate that classification in several subsequent stages employing dedicated two-class classifiers provides more accurate segmentation than single or iterative multi-class classification, two experiments were performed using scans acquired at 30 weeks PMA. First, the method by Anbeek et al. (2013), which utilised a single multi-class classifier was tested on these images. A population atlas

was created using nine independent but representative images. The experiment was performed in leave-one-subject-out manner with the five images of Set A. The resulting average (± standard deviation) Dice coefficients were as follows: 0.85 ± 0.02 for WM, 0.60 ± 0.02 for GM, and 0.63 ± 0.10 for CSF. Second, to compare the performance of our sequential multiple twoclass classifier approach with an iterative multi-class classifier, a 101NN classifier was used, as in the proposed first segmentation stage. Instead of three 2-class classifiers, a single 4-class classifier (WM, GM, CSF, and background) was utilised, using the intersection of the features selected for all three tissue types in the first segmentation stage (see Table 3), totalling 13 features. In each iteration, a threshold was defined where all voxels with a posterior probability above this threshold were assigned to that tissue, and the remaining voxels were reclassified. In each iteration, the threshold was lowered, and a new classifier was trained accordingly. To make the experiment similar to our proposed method, three iterations were performed where the thresholds on the posterior probability of the classifiers were set to 0.8 and 0.6, in the first and second iteration, respectively. In the last iteration, the remaining voxels were assigned to the class with the highest posterior probability. The experiments were performed in leave-one-subject-out manner with the five images of Set A. The resulting average (± standard deviation) Dice coefficients were as follows: 0.91 ± 0.01 for WM, 0.60 ± 0.03 for GM, and 0.74 ± 0.04 for CSF. To give an indication of the difference in performance, the average Dice coefficients achieved by the proposed method on the same images (Fig. 3, bottom row) were 0.95 ± 0.01 for WM, 0.78 ± 0.03 for GM, and 0.87 ± 0.04 for CSF.

WM

GM

0.95 0.83 0.94

0.82 0.81

0.92

Dice coefficient

Dice coefficient

0.93

0.91 0.9 0.89

0.8 0.79 0.78 0.77 0.76

0.88

0.75

0.87 0.86

0.74 LOSO40

NB40 Experiment

Manual40

0.73

LOSO40

NB40 Experiment

Manual40

CSF 0.9

Patient 6 Patient 7 Patient 8 Patient 9 Patient 10

0.88

Dice coefficient

0.86 0.84 0.82 0.8 0.78 0.76 0.74 0.72 LOSO40

NB40 Experiment

Manual40

Fig. 8. The segmentation performance evaluated using Dice coefficients for each of the five images acquired at 40 weeks PMA. Two different experiments were performed. Experiment LOSO40 was performed in leave-one-subject-out manner, where the training data was selected only from several manually annotated regions in the images. Experiment NB40 was performed using training data only from the two images acquired at 30 weeks PMA that are available within the NeoBrainS12 challenge. Experiment Manual40 was performed by manual annotation in three slices by a second human expert. In addition to the results per image, boxplots for each experiment are shown in grey.

Please cite this article as: Moeskops, P., et al., Automatic segmentation of MR brain images of preterm infants using supervised classification, NeuroImage (2015), http://dx.doi.org/10.1016/j.neuroimage.2015.06.007

P. Moeskops et al. / NeuroImage xxx (2015) xxx–xxx

11

Fig. 9. Segmentation of four slices of one image acquired at 40 weeks PMA. The left column shows the T2-images, the middle column presents the automatic segmentation results, and the right column shows the reference segmentations.

Discussion We have presented a method for the automatic segmentation of WM, GM and CSF in neonatal MR brain images, based on supervised voxel classification. The method was evaluated with two different sets of images: ten images acquired at 30 weeks PMA and five images acquired at 40 weeks PMA; accurate segmentations were obtained in both sets. In addition, comparable segmentation performance could be

achieved when using a number of small, manually annotated regions in training images, as well as when using training data from the images acquired at 30 weeks PMA to segment the images acquired at 40 weeks PMA. This substantially reduced the time and effort required to generate training data, which might extend its applicability. The method's performance was similar to that of a second human expert for the images acquired at 30 weeks PMA as well as for those acquired at 40 weeks PMA.

Please cite this article as: Moeskops, P., et al., Automatic segmentation of MR brain images of preterm infants using supervised classification, NeuroImage (2015), http://dx.doi.org/10.1016/j.neuroimage.2015.06.007

12

P. Moeskops et al. / NeuroImage xxx (2015) xxx–xxx

Fig. 10. Segmentation result in one slice of an image suffering from motion artefacts acquired at 30 weeks PMA (left), and in one slice of an image acquired at 40 weeks PMA of a patient with a porencephalic cyst following periventricular haemorrhagic infarction (right).

Segmentation performance The method accurately segmented all three tissue types on an independent test set (Set B), see Fig. 5, Experiment SetB30. The highest performance was obtained for WM, with a Dice coefficient of 0.95, and the lowest performance was obtained for GM, with a Dice coefficient of 0.81. The lower performance for GM could be caused by its complex shape, bordering with both WM and CSF. Both these boundaries are often difficult to identify, even visually, because of partial volume effects and the fact that the tissues are not yet fully developed at the time of imaging. The boundary between GM and CSF is especially difficult to identify inside the sulci, where it is often poorly visible. Furthermore, errors often occurred in the regions around the hippocampi (e.g., see the third row of Fig. 6). Overall, the segmentations for each of the three tissue types resulted in a modified Hausdorff distance smaller than two voxels, which indicates that the boundaries of the automatic segmentations were close to the reference segmentations.

Parameters Several parameters of the presented segmentation method had an influence on the segmentation results.

Features Feature selection experiments demonstrated that the best performance was achieved using only features extracted from the T2weighted scans. This was likely caused by the lower signal-to-noise ratio and the lower contrast of the T1-weighted images. These images therefore did not provide additional information for the classification. Moreover, the manual segmentations were performed using only the T2-weighted images, suggesting that these images provided sufficient information for segmentation to a human expert as well. This simplifies the algorithm by removing the need for registration and, for this purpose, even the acquisition of a T1-weighted image. The spatial features in the presented algorithm were computed relative to the brain mask and were normalised per patient. This does mean that these features were dependent on the quality of the brain mask. The generated brain masks have not provided perfect brain segmentations but they were sufficiently accurate to provide meaningful features. This was supported by the results of feature selection where all spatial features were selected, as well as by the final segmentation results. Even though ventricular CSF and extracerebral CSF consist of the same fluid and therefore present the same contrast in the image, the segmentation method was able to distinguish between them. The results demonstrate that the features describing the contextual information per voxel were sufficient for the discrimination between ventricular and extracerebral CSF voxels.

Size of the training set The optimal number of training samples per segmentation stage was investigated (Fig. 3). The best performance was achieved using a subset of randomly selected voxels from the whole brain mask. Moreover, evaluation using only the two images acquired at 30 weeks PMA available in the NeoBrainS12 challenge and leave-one-subject-out crossvalidation over all ten images suggest that the performance increases when training samples are extracted from a larger number of images. Comparable but slightly lower performance could be achieved using a subset of training voxels selected from a number of manually selected regions only (Fig. 5, Experiments Regions30 and Random30). The possibility of building a training set by only manually annotating several small parts in a few images can greatly reduce expert time if new representative training data needs to be created for the segmentation of a set of images with different characteristics. In these experiments, errors occurred mostly on the boundaries between two tissue types. The slightly lower performance suggests that the variance of the available samples is too small to reduce these errors. This is a direct consequence of using only several training regions because training samples only extracted from these regions could never result in a training set that is perfectly representative for the whole image. These results could therefore be improved by selecting more regions with a smaller diameter, adding up to the same number of samples. The training regions could then be better spatially distributed, which would better approach random sampling over the whole brain. Further investigation is needed to reveal whether different sizes or shapes of those regions, a different spatial distribution, or a correction for the specific training sample selection procedure might improve the segmentation accuracy. Multi-stage classification Two experiments were performed to demonstrate that sequential two-class classifiers may be advantageous over a single multi-class classifier or sequential multi-class classifiers (Results: Evaluation of an (iterative) multi-class classification approach). Independent two-class (oneversus-all) classifiers divide the problem into multiple easier tasks, i.e. the classifier only needs to be able to distinguish the tissue type at hand from all other voxels and can therefore be specifically optimised and trained for this task. Sequential classifiers offer a possibility for relabelling voxels that were potentially incorrect in the previous stage. A similar idea for relabelling voxels based on the results of an initial segmentation method was described earlier by Wang et al. (2011). Robustness to age The method was developed using coronal scans acquired at 30 weeks PMA, and it was subsequently tested in images acquired 40 weeks PMA, using the exact same parameter settings. These images were segmented using training data from images acquired at 40 weeks PMA and from the selected regions only, which comprised just 5% of all available training samples. The high Dice coefficients for these

Please cite this article as: Moeskops, P., et al., Automatic segmentation of MR brain images of preterm infants using supervised classification, NeuroImage (2015), http://dx.doi.org/10.1016/j.neuroimage.2015.06.007

P. Moeskops et al. / NeuroImage xxx (2015) xxx–xxx

segmentations (Fig. 8, Experiment LOSO40) suggest that no specific parameter tuning is required to achieve accurate performance in the segmentation of images acquired at 40 weeks PMA. Nevertheless, dedicated parameter tuning might further improve these segmentation results. Furthermore, we showed that the segmentation of the images acquired at 40 weeks PMA also produced accurate results when trained with only the two training images acquired at 30 weeks PMA that are publicly available in the NeoBrainS12 challenge (Fig. 8, Experiment NB40). This suggests that the training set in this method does not necessarily have to be composed of images acquired at the same age and does not necessarily have to be acquired with the same protocol. As a result of cortical development, GM has a more complex morphology in the images acquired at 40 weeks PMA than in the images acquired at 30 weeks PMA, which makes segmentation more challenging. Errors in the images acquired at 40 weeks occurred mostly in the sulci (see Fig. 9, third row). Furthermore, the segmentation results sometimes contained small holes (see Fig. 9, second and third rows). These errors might be reduced by simply increasing the number of voxels that are reanalysed in the third segmentation stage by loosening the criterion that allows voxels to be selected for reclassification. Relation to previous work A preliminary version of this approach was presented by Chiţǎ et al. (2013). The currently presented work describes a substantial extension to this initial study with respect to parameter optimisation and evaluation. The parameter optimisation experiments in the current work demonstrated that GM, WM and CSF can be accurately segmented based on T2-weighted images only. In addition, the method was evaluated on images acquired at 40 weeks PMA without further parameter modification or optimisation. Finally, this work is the first to demonstrate that training data for supervised voxel classification in neonatal brain imaging does not necessarily require very time consuming and tedious manual annotations of complete images, but that small annotated regions or training data at a different age are sufficient. Previously, another supervised classification approach was presented by Anbeek et al. (2013). In contrast to our method, that method used a single kNN-classification based only on raw intensities of the T1- and T2-weighted images and coordinates in an atlas space. This approach is therefore very sensitive to noise. Moreover, it requires a populationspecific atlas as well as multiple fully segmented training images and is therefore not easily applicable to images acquired at different ages or with a different protocol. Practical applicability This segmentation method used features based on intensity and spatial information. In combination with the limited amount of required training data, not even necessarily acquired from the same set of images, it could likely straightforwardly be applied to new image sets, e.g. images acquired at 1.5 T, axial images, or images acquired at different ages. The method may, however, be limited in application to scans with severe pathology. A qualitative impression of the segmentation performance on images with characteristics not included in the training data (Fig. 10) suggests that the method is robust to minor abnormalities. In our experiments, no bias field correction was performed as this effect was not visible. This might, however, not be the case for images acquired with different acquisition protocols and could be included as a preprocessing step when evaluating such images. A common problem in automatic neonatal brain segmentation is the appearance of voxels segmented as WM on the border between GM and CSF, which is caused by the similar intensity of these voxels (Anbeek et al., 2013; Cardoso et al., 2013; Išgum et al., 2015). The results of the presented method demonstrate that such voxels appeared in only

13

very few cases. In most of these cases, such voxels were connected to the WM in a neighbouring slice. Furthermore, the method focuses on the segmentation of WM, GM and CSF because of their relevance in evaluating cortical morphology and its development in preterm infants. The other tissues were labelled as background. Myelinated white matter was not included in the WM segmentation because of its unclear definition and hence large interobserver variability at these ages (Išgum et al., 2015). Extending this segmentation approach to other tissue types, such as basal ganglia, cerebellum and ventricles, is an interesting topic for future work, and could possibly be achieved by straightforward extension of the described approach to more tissue types. The segmentations can be used for volumetric measurements of WM, GM, and CSF in images acquired at 30 and 40 weeks PMA. Furthermore, both the boundary between GM and WM and the boundary between GM and CSF were accurately segmented, as shown by the small modified Hausdorff distances. This suggests that no large errors along the boundary were present. The accurately segmented cortical surfaces allow quantification of cortical morphology and development. Conclusion The presented method accurately segmented WM, GM, and CSF in T2-weighted images and is robust to differences in age and acquisition protocol. Furthermore, the method accurately segmented the images when trained with a limited number of training samples from the given population, without any additional parameter tuning. This reduces the time and effort required to create reference annotations and may therefore extend the applicability of the method. The resulting segmentations can be used for volumetric measurements and quantification of cortical characteristics. References Anbeek, P., Išgum, I., van Kooij, B.J., Mol, C.P., Kersbergen, K.J., Groenendaal, F., Viergever, M.A., de Vries, L.S., Benders, M.J., 2013. Automatic segmentation of eight tissue classes in neonatal brain MRI. PLoS One 8 (12), e81895. Awate, S.P., Yushkevich, P.A., Song, Z., Licht, D.J., Gee, J.C., 2010. Cerebral cortical folding analysis with multivariate modeling and testing: studies on gender differences and neonatal development. NeuroImage 53 (2), 450–459. Cardoso, M.J., Melbourne, A., Kendall, G.S., Modat, M., Robertson, N.J., Marlow, N., Ourselin, S., 2013. AdaPT: an adaptive preterm segmentation algorithm for neonatal brain MRI. NeuroImage 65, 97–108. Chang, C.C., Lin, C.J., 2011. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2 (3), 27. Chiţǎ, S., Benders, M.J., Moeskops, P., Kersbergen, K.J., Viergever, M.A., Išgum, I., 2013. Automatic segmentation of the preterm neonatal brain with MRI using supervised classification. SPIE Medical Imaging. International Society for Optics and Photonics (86693X-86693X). Cortes, C., Vapnik, V., 1995. Support-vector networks. Mach. Learn. 20 (3), 273–297. Dubois, J., Benders, M., Borradori-Tolsa, C., Cachia, A., Lazeyras, F., Ha-Vinh Leuchter, R., Sizonenko, S.V., Warfield, S.K., Mangin, J.F., Hüppi, P.S., 2008a. Primary cortical folding in the human newborn: an early marker of later functional development. Brain 131, 2028–2041. Dubois, J., Benders, M., Cachia, A., Lazeyras, F., Leuchter, R.H.V., Sizonenko, S.V., BorradoriTolsa, C., Mangin, J., Hüppi, P.S., 2008b. Mapping the early cortical folding process in the preterm newborn brain. Cereb. Cortex 18 (6), 1444–1454. Gholipour, A., Estroff, J.A., Barnewolt, C.E., Connolly, S.A., Warfield, S.K., 2011. Fetal brain volumetry through MRI volumetric reconstruction and segmentation. Int. J. Comput. Assist. Radiol. Surg. 6 (3), 329–339. Gui, L., Lisowski, R., Faundez, T., Hüppi, P.S., Lazeyras, F., Kocher, M., 2012. Morphologydriven automatic segmentation of MR images of the neonatal brain. Med. Image Anal. 16 (8), 1565–1579. Habas, P.A., Kim, K., Rousseau, F., Glenn, O.A., Barkovich, A.J., Studholme, C., 2010. Atlasbased segmentation of developing tissues in the human brain with quantitative validation in young fetuses. Hum. Brain Mapp. 31 (9), 1348–1358. Hill, J., Dierker, D., Neil, J., Inder, T., Knutsen, A., Harwell, J., Coalson, T., Van Essen, D., 2010. A surface-based analysis of hemispheric asymmetries and folding of cerebral cortex in term-born human infants. J. Neurosci. 30 (6), 2268–2276. Huttenlocher, D., Klanderman, G., Rucklidge, W., 1993. Comparing images using the Hausdorff distance. IEEE Trans. Pattern Anal. Mach. Intell. 15, 850–863. Išgum, I., Benders, M.J., Avants, B., Cardoso, M.J., Counsell, S.J., Gomez, E.F., Gui, L., Hüppi, P., Kersbergen, K.J., A., et al., 2015. Evaluation of automatic neonatal brain segmentation algorithms: the NeoBrainS12 challenge. Med. Image Anal. 20 (1), 135–151. Jain, A.K., Zongker, D., 1997. Feature selection: evaluation, application and small sample performance. IEEE Trans. Pattern Anal. Mach. Intell. 19, 153–158.

Please cite this article as: Moeskops, P., et al., Automatic segmentation of MR brain images of preterm infants using supervised classification, NeuroImage (2015), http://dx.doi.org/10.1016/j.neuroimage.2015.06.007

14

P. Moeskops et al. / NeuroImage xxx (2015) xxx–xxx

Kuklisova-Murgasova, M., Aljabar, P., Srinivasan, L., Counsell, S.J., Doria, V., Serag, A., Gousias, I.S., Boardman, J.P., Rutherford, M.A., Edwards, A.D., Hajnal, J.V., Rueckert, D., 2011. A dynamic 4D probabilistic atlas of the developing brain. NeuroImage 54 (4), 2750–2763. Lin, H.T., Lin, C.J., Weng, R.C., 2007. A note on Platt's probabilistic outputs for support vector machines. Mach. Learn. 68, 267–276. Makropoulos, A., Gousias, I., Ledig, C., Aljabar, P., Serag, A., Hajnal, J., Edwards, A.D., Counsell, S., Rueckert, D., 2014. Automatic whole brain MRI segmentation of the developing neonatal brain. IEEE Trans. Med. Imaging 33 (9), 1818–1831. Moeskops, P., Benders, M.J., Pearlman, P.C., Kersbergen, K.J., Leemans, A., Viergever, M.A., Išgum, I., 2013. Assessment of quantitative cortical biomarkers in the developing brain of preterm infants. SPIE Medical Imaging. International Society for Optics and Photonics (867011-867011). Moeskops, P., Benders, M.J.N.L., Kersbergen, K.J., Groenendaal, F., de Vries, L.S., Viergever, M.A., Išgum, I., 2015. Development of cortical morphology evaluated with longitudinal MR brain images of preterm infants. PLOS ONE. http://dx.doi.org/10.1371/journal. pone.0131552 (in press). Pienaar, R., Fischl, B., Caviness, V., Makris, N., Grant, P.E., 2008. A methodology for analyzing curvature in the developing brain from preterm to adult. Int. J. Imaging Syst. Technol. 18 (1), 42–68. Platt, J.C., 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers. MIT Press, pp. 61–74. Prastawa, M., Gilmore, J.H., Lin, W., Gerig, G., 2005. Automatic segmentation of MR images of the developing newborn brain. Med. Image Anal. 9 (5), 457–466. Pudil, P., Novovicova, J., Kittler, J., 1994. Floating search methods in feature selection. Pattern Recogn. Lett. 15 (11), 1119–1125. Rathbone, R., Counsell, S.J., Kapellou, O., Dyet, L., Kennea, N., Hajnal, J., Allsop, J.M., Cowan, F., Edwards, A.D., 2011. Perinatal cortical growth and childhood neurocognitive abilities. Neurology 77 (16), 1510–1517. Rodriguez-Carranza, C.E., Mukherjee, P., Vigneron, D., Barkovich, J., Studholme, C., 2008. A framework for in vivo quantification of regional brain folding in premature neonates. NeuroImage 41 (2), 462–478. Saigal, S., Doyle, L.W., 2008. An overview of mortality and sequelae of preterm birth from infancy to adulthood. Lancet 371, 261–269.

Serag, A., Aljabar, P., Ball, G., Counsell, S.J., Boardman, J.P., Rutherford, M.A., Edwards, A.D., Hajnal, J.V., Rueckert, D., 2012. Construction of a consistent high-definition spatiotemporal atlas of the developing brain using adaptive kernel regression. NeuroImage 59 (3), 2255–2265. Shi, F., Fan, Y., Tang, S., Gilmore, J.H., Lin, W., Shen, D., 2010. Neonatal brain image segmentation in longitudinal MRI studies. NeuroImage 49, 391–400. Smith, S.M., 2002. Fast robust automated brain extraction. Hum. Brain Mapp. 17 (3), 143–155. Song, Z., Awate, S.P., Licht, D.J., Gee, J.C., 2007. Clinical neonatal brain MRI segmentation using adaptive nonparametric data models and intensity-based Markov priors. Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 883–890. Srhoj-Egekher, V., Benders, M.J., Viergever, M.A., Išgum, I., 2013. Automatic neonatal brain tissue segmentation with MRI. SPIE Medical Imaging, organizationInternational Society for Optics and Photonics (86693X-86693X). Wang, H., Das, S.R., Suh, J.W., Altinay, M., Pluta, J., Craige, C., Avants, B., Yushkevich, P.A., 2011. A learning-based wrapper method to correct systematic errors in automatic image segmentation: consistently improved performance in hippocampus, cortex and brain segmentation. NeuroImage 55 (3), 968–985. Wang, L., Shi, F., Li, G., Gao, Y., Lin, W., Gilmore, J.H., Shen, D., 2014. Segmentation of neonatal brain MR images using patch-driven level sets. NeuroImage 84, 141–158. Weisenfeld, N.I., Warfield, S.K., 2009. Automatic segmentation of newborn brain MRI. NeuroImage 47 (2), 564–572. Wright, R., Vatansever, D., Kyriakopoulou, V., Ledig, C., Wolz, R., Serag, A., Rueckert, D., Rutherford, M.A., Hajnal, J.V., Aljabar, P., 2012. Age dependent fetal MR segmentation using manual and automated approaches. MICCAI workshop on Perinatal and Paediatric Imaging, pp. 97–104. Xue, H., Srinivasan, L., Jiang, S., Rutherford, M., Edwards, A.D., Rueckert, D., Hajnal, J.V., 2007. Automatic segmentation and reconstruction of the cortex from neonatal MRI. NeuroImage 38 (3), 461–477. Zou, K.H., Warfield, S.K., Bharatha, A., Tempany, C., Kaus, M.R., Haker, S.J., Wells III, W.M., Jolesz, F.A., Kikinis, R., 2004. Statistical validation of image segmentation quality based on a spatial overlap index. Acad. Radiol. 11 (2), 178–189.

Please cite this article as: Moeskops, P., et al., Automatic segmentation of MR brain images of preterm infants using supervised classification, NeuroImage (2015), http://dx.doi.org/10.1016/j.neuroimage.2015.06.007

Automatic segmentation of MR brain images of preterm infants using supervised classification.

Preterm birth is often associated with impaired brain development. The state and expected progression of preterm brain development can be evaluated us...
3MB Sizes 0 Downloads 10 Views