
Physiological Measurement
Physiol. Meas. 35 (2014) 1357–1371
doi:10.1088/0967-3334/35/7/1357
Institute of Physics and Engineering in Medicine

Feature selection using genetic algorithms for fetal heart rate analysis

Liang Xu 1,2,3, Christopher W G Redman 2, Stephen J Payne 3 and Antoniya Georgieva 2

1 Doctoral Training Centre, University of Oxford, Rex Richards Building, South Parks Road, Oxford OX1 3QU, UK
2 Nuffield Department of Obstetrics and Gynaecology, University of Oxford, Level 3, Women’s Centre, John Radcliffe Hospital, Oxford OX3 9DU, UK
3 Department of Engineering Science, Institute of Biomedical Engineering, University of Oxford, Headington, Oxford OX3 7DQ, UK

E-mail: [email protected]

Received 20 December 2013, revised 7 April 2014
Accepted for publication 25 April 2014
Published 22 May 2014

Abstract

The fetal heart rate (FHR) is monitored on a paper strip (cardiotocogram) during labour to assess fetal health. If necessary, clinicians can intervene and assist with a prompt delivery of the baby. Data-driven computerized FHR analysis could help clinicians in the decision-making process. However, selecting the best computerized FHR features that relate to labour outcome is a pressing research problem. The objective of this study is to apply genetic algorithms (GA) as a feature selection method to select the best feature subset from 64 FHR features and to integrate these best features to recognize unfavourable FHR patterns. The GA was trained on 404 cases and tested on 106 cases (both balanced datasets) using three classifiers, respectively. Regularization methods and backward selection were used to optimize the GA. Reasonable classification performance is shown on the testing set for the best feature subset (Cohen’s kappa values of 0.45 to 0.49 using different classifiers). This is, to our knowledge, the first time that a feature selection method for FHR analysis has been developed on a database of this size. This study indicates that different FHR features, when integrated, can show good performance in predicting labour outcome. It also gives the importance of each feature, which will be a valuable reference point for further studies.

Keywords: fetal heart rate, cardiotocogram, genetic algorithms, support vector machines


1. Introduction

During birth, a baby’s oxygen supply can be reduced by the stress caused by uterine contractions. Some babies therefore suffer from birth asphyxia, which may lead to seizures, permanent brain damage or even death. Cerebral palsy occurs in about two cases per 1000 births, of which birth asphyxia accounts for 10–30% (Alberry et al 2009). To prevent birth asphyxia, it is crucial to intervene in a timely manner to assist with immediate delivery. On the other hand, interventions such as Caesarean sections, forceps and ventouse deliveries may cause complications and are best avoided. Timely and accurate diagnosis of birth asphyxia is therefore essential to minimize damage while avoiding unnecessary interventions.

In order to monitor fetal health, the fetal heart rate (FHR) and uterine contractions are electronically recorded during labour on a paper strip called a cardiotocogram. The complicated FHR patterns are usually assessed by eye, which is tedious, error-prone and associated with high inter- and intra-observer variability (Chauhan et al 2008, Westgate 2009). Although computerized analysis of FHR patterns in fetal monitoring has the potential to improve decision making for clinical interventions (Westgate 2009), it has been difficult to develop because of the lack of large databases that include not only the original digital signals from the FHR monitor but also reliable and comprehensive clinical outcome data. Such a database has been created in Oxford, and we are currently using it to develop a computerized system (OxSys) to recognize unfavourable FHR patterns automatically in labour (Georgieva et al 2013b). Our aim is not to simulate the practices of expert clinicians but to use the database to define features, novel or classical, that improve diagnosis and prediction of the fetal state (a data-driven approach). At present, the OxSys system extracts 64 FHR features.

The influence of FHR features on the assessment of the fetus has previously been investigated in 2124 cases (Czabański et al 2013), where it was found that FHR features can be used for fetal state assessment using a fuzzy inference method. Other studies have shown that the combination and integration of different FHR features can perform better than using the features univariately (Georgieva et al 2013b, Chudáček 2011, Georgoulas et al 2007). On the other hand, the number of all possible combinations of the 64 features used here is 2^64 ≈ 1.8 × 10^19, which is too large for an exhaustive search. Feature selection methods therefore need to be applied to the task of selecting a subset of FHR features with the best classification performance.

Various methods of feature selection have been compared in other fields (Guyon and Elisseeff 2003), and categorized as filters, wrappers and embedded methods. Genetic algorithms (GA), one of the wrapper methods, select a subset of the features based on its performance with certain classifiers. We chose GA for this study because of their competitive ability to fully explore the feature space (Mitchell 2001). To our knowledge, feature selection for FHR features has not previously been investigated on data at this scale (7568 cases). This is because, so far, only the Oxford data archive (comprising digital FHR traces and labour outcomes) has been large enough to allow such analyses. The objective of this study is to select an FHR feature subset that has the best performance in labour outcome classification.

2. Data

2.1. Case selection

The OxSys database includes more than 100 000 records stored on a digital archive, collected over a period of 15 years at the John Radcliffe Hospital, Oxford, UK, and linked to clinical outcome data (Georgieva et al 2013b). A subset of these records is considered here, selected for the completeness and timing of the FHR and outcome data. To ensure that all cases are analysed at comparable stages of labour, only the last 30 min before birth were examined. The assumption of this study is that, in the last 30 min of the second stage of labour, an adverse labour outcome is detectable from the FHR. Only FHR records taken after the onset of active maternal pushing and with adequate signal quality (>50%) were considered. These selection criteria are consistent with our previous study using an artificial neural network (ANN) for FHR classification: a total of 7568 FHR records were selected (Georgieva et al 2013b). A detailed patient flow chart is provided in that reference.

The features were extracted in 15 min windows using 5 min sliding steps; thus, in the last 30 min of labour, there were four 15 min windows, from which 64 FHR features were extracted. For each feature, the median value over the four windows was used. This was done to reduce the data complexity and the possible effect of an outlier feature value in one window caused by artefactual FHR signals (Georgieva et al 2013b). Each feature was normalized (over the full set of 7568 cases) to zero mean and unit standard deviation (STD). The analysis package Matlab 2012b (The MathWorks, Inc.) was used to develop the algorithm and analyse the data.
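As an illustration of this preprocessing step, the sketch below median-pools the four windows and z-scores each feature over the cohort. Python/NumPy is used purely for illustration (the study itself used Matlab), and the array layout is an assumption:

```python
import numpy as np

def pool_and_normalize(window_features):
    """Median-pool per-window features and z-score them across cases.

    window_features: assumed array of shape (n_cases, 4, 64) -- one row per
    case, four 15-min windows per case, 64 FHR features per window.
    """
    # Median over the four windows reduces the effect of a single
    # artefact-corrupted window (one value per case and feature).
    pooled = np.median(window_features, axis=1)      # shape (n_cases, 64)

    # Normalize each feature over the whole cohort to zero mean, unit STD.
    mean = pooled.mean(axis=0)
    std = pooled.std(axis=0)
    return (pooled - mean) / std
```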

2.2. GA training set and testing set

An adverse outcome was defined as umbilical arterial pH < 7.05, which is equivalent to fetal acidemia, one of the conditions for diagnosis of birth asphyxia (Alberry et al 2009). A clearly distinguishable category of normal outcome was then defined as 7.27 < arterial pH < 7.33 (considered here to be a normal range of arterial pH, based on other studies (Georgieva et al 2013a)) and no form of labour compromise (959 out of 7568 cases). This was done to clearly separate normal and adverse outcomes in the training and testing sets. Fetal acidemia is related to certain FHR patterns (Parer et al 2006), but it is rare, so the dataset is heavily imbalanced: only 255 out of 7568 cases have an adverse outcome (3.37%). Training a classifier on the entire dataset would be heavily biased towards recognizing the healthy cases. Therefore, a balanced dataset was selected to train the classifier.

To obtain a balanced dataset of 50% normal and 50% adverse outcomes, 255 cases were randomly selected out of the 959 normal outcomes. Hence a total of 510 cases, with 255 normal and 255 adverse outcomes, was studied. Figure 1 shows the distribution of the training and testing data. From the 510 cases, 404 (80%; 202 normal and 202 adverse cases) were randomly selected for feature selection using the GA. The remaining 106 cases (20%; 53 normal and 53 adverse cases) were used as a testing set to evaluate the performance of the selected features. In addition, in each GA run, the 404 cases were further separated into a training set and a validation set (70%–30%). To avoid confusion, the 404 cases used in the GA training are referred to as the GA training set, and the 106 cases used to evaluate the performance of the GA are referred to as the testing set.

The widely accepted ‘rule of thumb’ (Van Niel et al 2005) is that at least ten training samples of each class per input feature dimension are needed. With about 140 cases per class in each 70% training split, the maximal number of features that could be selected was therefore 14. This was incorporated as a constraint in the algorithms.
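A minimal sketch of how such a balanced set and split could be constructed is given below; the index variables, the fixed random seed and the per-class test size of 53 cases are taken from, or assumed to be consistent with, the description above, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed only for reproducibility of the sketch

def build_balanced_split(adverse_idx, normal_candidate_idx, n_test_per_class=53):
    """Build the balanced 510-case set and the GA-training/testing split.

    adverse_idx: indices of the 255 adverse-outcome cases.
    normal_candidate_idx: indices of the 959 strictly normal cases.
    """
    # Randomly draw 255 of the 959 normal cases to balance the classes.
    normal_idx = rng.choice(normal_candidate_idx, size=len(adverse_idx),
                            replace=False)

    def split(idx):
        idx = rng.permutation(idx)
        # 202 GA-training and 53 testing cases per class, as in the paper.
        return idx[n_test_per_class:], idx[:n_test_per_class]

    adv_train, adv_test = split(np.asarray(adverse_idx))
    norm_train, norm_test = split(normal_idx)
    ga_training = np.concatenate([adv_train, norm_train])   # 404 cases
    testing = np.concatenate([adv_test, norm_test])         # 106 cases
    return ga_training, testing
```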

2.3. OxSys features

OxSys extracts a total of 64 different features for each sliding 15 min window (table 1). Different categories of FHR features are extracted using different methods. There are clinical features related to specific clinical phenomena, such as contraction, acceleration, deceleration, shoot and lag features. There are also other features derived using statistical methods (Georgieva et al 2011a, 2011b, 2014, Fulcher et al 2012, Xu et al 2012, 2013).

Figure 1. Distribution of all 510 cases used for the genetic algorithm.

3. Methods

3.1. The GA method

The standard GA algorithm was used (Goldberg 1989). The GA method consists of four major steps: (1) candidate feature subsets were coded as binary genomes, i.e. ‘1’ means the feature is selected and ‘0’ means it is not, and a population of genomes was randomly generated; (2) each genome was evaluated using a fitness value, and the best genomes were selected; (3) the selected genomes were modified using crossover and mutation to form a new generation; (4) the new generation returned to step (2) until the stopping criteria were met. A flow chart of the GA method is shown in figure 2.

The population size was set to 100. Ranking was used as the selection strategy, where the individuals ranking in the best 50% by fitness were selected to create offspring. Single-point crossover with a probability of 0.8 and single-point mutation with a probability of 0.2 were applied to generate the next population. Elitism was set to 2, i.e. the best two individuals of the current generation were carried over into the next population. The maximum number of generations with the same best fitness value was set to 20, and the maximum number of generations was set to 200. The number of GA runs with different initial conditions (but the same data splits) was set to 100.
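The sketch below reproduces the general shape of this procedure (binary genomes over the 64 features, rank selection of the top 50%, single-point crossover with probability 0.8, single-point mutation with probability 0.2, elitism of two, the 14-feature cap and the two stopping criteria). It is a simplified reconstruction rather than the authors' implementation, and the fitness function is assumed to be supplied separately (see section 3.2):

```python
import numpy as np

rng = np.random.default_rng()

N_FEATURES, POP_SIZE, MAX_FEATURES = 64, 100, 14
P_CROSSOVER, P_MUTATION, N_ELITE = 0.8, 0.2, 2
MAX_GENERATIONS, MAX_STALL = 200, 20

def enforce_cap(genome):
    """Drop randomly chosen features until at most MAX_FEATURES are selected."""
    on = np.flatnonzero(genome)
    if on.size > MAX_FEATURES:
        genome[rng.choice(on, on.size - MAX_FEATURES, replace=False)] = 0
    return genome

def crossover(a, b):
    """Single-point crossover applied with probability P_CROSSOVER."""
    if rng.random() < P_CROSSOVER:
        cut = rng.integers(1, N_FEATURES)
        a, b = (np.concatenate([a[:cut], b[cut:]]),
                np.concatenate([b[:cut], a[cut:]]))
    return a.copy(), b.copy()

def mutate(genome):
    """Single-point mutation applied with probability P_MUTATION."""
    if rng.random() < P_MUTATION:
        i = rng.integers(N_FEATURES)
        genome[i] = 1 - genome[i]
    return genome

def run_ga(fitness):
    """fitness(genome) -> scalar, e.g. the median kappa of section 3.2."""
    pop = [enforce_cap(rng.integers(0, 2, N_FEATURES)) for _ in range(POP_SIZE)]
    best, best_fit, stall = None, -np.inf, 0
    for _ in range(MAX_GENERATIONS):
        scores = np.array([fitness(g) for g in pop])
        order = np.argsort(scores)[::-1]                     # rank by fitness
        if scores[order[0]] > best_fit:
            best, best_fit, stall = pop[order[0]].copy(), scores[order[0]], 0
        else:
            stall += 1
            if stall >= MAX_STALL:                           # 20 stalled generations
                break
        parents = [pop[i] for i in order[:POP_SIZE // 2]]    # best 50% breed
        children = [pop[i].copy() for i in order[:N_ELITE]]  # elitism of two
        while len(children) < POP_SIZE:
            a, b = rng.choice(len(parents), 2, replace=False)
            for child in crossover(parents[a], parents[b]):
                children.append(enforce_cap(mutate(child)))
        pop = children[:POP_SIZE]
    return best, best_fit
```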

3.2. Fitness function

In this study, Cohen’s kappa value (Cohen 1960) was used to evaluate the fitness of the classifier. Kappa is a statistical measure of agreement between predicted and actual values. In a balanced two-class dataset it reduces to kappa = 2 × (proportion of agreement) − 1. The performance of the classifier depends on how the training set is specified (figure 1). Owing to the limited size of the data, a repeated random sub-sampling strategy was used for cross-validation. Before each GA analysis, the data were randomly split ten times into 70% training and 30% validation sets (figure 2). The performance on each split was recorded as a kappa value, and the median of these ten kappa values was then recorded as the fitness value of the genome.
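A sketch of this fitness measure, assuming scikit-learn is available (the paper itself used Matlab and LIBSVM) and assuming the splits are stratified so that both classes stay balanced, could look like this:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

def genome_fitness(genome, X, y, make_classifier, n_splits=10):
    """Median Cohen's kappa over repeated random 70/30 splits of the GA
    training set, using only the features selected by the binary genome."""
    cols = np.flatnonzero(genome)
    kappas = []
    for seed in range(n_splits):
        X_tr, X_val, y_tr, y_val = train_test_split(
            X[:, cols], y, test_size=0.3, stratify=y, random_state=seed)
        clf = make_classifier().fit(X_tr, y_tr)
        kappas.append(cohen_kappa_score(y_val, clf.predict(X_val)))
    return np.median(kappas)
```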


Table 1. List of OxSys features.

General features
1 Baseline
5 Lowest dip of fetal heart rate
60 Sinusoidal pattern
64 Simulation of the clinical guidelines of the American Congress of Obstetricians and Gynaecologists (ACOG 2009)

Contraction features (Georgieva et al 2009)
15 Number of contractions
16 Median of contraction duration
17 Median of resting time between contractions
18 Resting/contraction time ratio

Acceleration and deceleration features
20 Number of accelerations
21 Mean acceleration duration
22 Mean acceleration amplitude
23 Number of decelerations
24 Mean deceleration duration
25 Maximal deceleration duration
26 Median deceleration amplitude
27 Maximal deceleration amplitude
28 Mean time to deceleration
29 Mean time to recover
30 Maximal time to recover
31 Number of quick recoveries
32 Number of slow recoveries
33 Resting/deceleration time ratio
34 Mean deceleration area
35 Total number of lost beats
36 Median onset slope
37 Median recovery slope
38 Maximal baseline shift after deceleration
39 Number of prolonged decelerations (3 min)
40 Number of decelerations with right and/or left shoulder
41 Number of decelerations with only right shoulder (overshooting)

Lag features
42 Number of early decelerations
43 Number of late decelerations
44 Number of variable decelerations (42–44 based on ACOG 2009)
45 Median recovery time after contraction end
46 Mean lag time (late decelerations only)
47 Deceleration/contraction time ratio

Variability features
2 Percentage of zero difference between neighbour points
3 Approximate entropy (Dawes et al 1992)
4 Signal stability index (SSI) (Georgieva et al 2011a)
8 Long-term variability
9 SSI of the residual signal
10 Short-term variability (STV)
11 SSI of STV tracker
12 Range of STV tracker
13 STV tracker trend
14 Median of the absolute derivative values of the STV tracker
19 STV (accelerations included)
61 Phase rectified signal averaging (PRSA) (Georgieva et al 2014)
62 Bivariate phase rectified signal averaging (BPRSA), AC component
63 BPRSA, DC component

Other statistically derived features (Fulcher et al 2012, Xu et al 2012)
6 Skewness
7 Kurtosis
48 Auto-mutual information feature
49 Ratio of standard deviation
50 Mean of local approximate entropy
51 Standard deviation (STD) of local sample entropy
52 Goodness of exponential fit
53 2-dim time delay embedding space
54 Median absolute deviation (MAD)
55 (STD/mean)^2
56 Alphabet feature
57 Pattern readjustment feature
58 Interquartile range of smoothed signal
59 STD of Gaussian filtered signal

3.3. Classifiers

Figure 2. A flow chart of the genetic algorithm.

Three classifiers were used in this study to compare the performance: linear regression, linear support vector machine (SVM) and radial basis function (RBF) SVM.

(1) Linear regression: the prediction is defined as an adverse outcome if the output value of the linear regression is >0.5, and as normal if it is ≤0.5.

(2) Linear SVM: the SVM is widely used in data analysis owing to its intuitive definition and simple implementation (Burges 1998). Its detailed principles can be found elsewhere (Cristianini and Shawe-Taylor 2000, Vapnik 1999). The least squares method is used to find the separating hyperplane. The threshold for the classifier output value is chosen here to be 0.5.

(3) RBF SVM: the nonlinear SVM is similar to the linear SVM, except that every dot product is replaced by a nonlinear kernel function. In this study, we used the Gaussian RBF kernel, whose feature space is a Hilbert space of infinite dimensions. Intuitively, the RBF SVM determines the category of a case based on its spatial distance to other points. The threshold for the classifier output value is again chosen to be 0.5.

The gamma value in the kernel function of the RBF SVM was tuned on the GA training set in terms of classification performance. No significant difference in classification performance was found for gamma between 0.05 and 0.2 (kappa = 0.49), so the gamma value was set to 0.1 throughout. The linear SVM and RBF SVM were implemented using LIBSVM (Chang and Lin 2011). The influence of the classifier parameters on classification was investigated but is not shown here owing to space limitations.
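For illustration only, the three classifiers could be instantiated roughly as follows. This sketch uses scikit-learn rather than LIBSVM, and its two SVMs are standard soft-margin formulations rather than the least-squares variant described above, so treat it as an approximation; only the least-squares linear regression with the 0.5 threshold follows the text directly:

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

def make_linear_regression():
    """Least-squares linear regression on 0/1 labels, thresholded at 0.5."""
    class LstSqClassifier:
        def fit(self, X, y):
            A = np.hstack([X, np.ones((len(X), 1))])       # add intercept column
            self.w, *_ = np.linalg.lstsq(A, y, rcond=None)
            return self
        def predict(self, X):
            A = np.hstack([X, np.ones((len(X), 1))])
            return (A @ self.w > 0.5).astype(int)
    return LstSqClassifier()

def make_linear_svm():
    return LinearSVC()                   # standard linear SVM (approximation)

def make_rbf_svm():
    return SVC(kernel="rbf", gamma=0.1)  # Gaussian kernel, gamma fixed at 0.1
```

These factory functions can be passed as the make_classifier argument of the fitness sketch in section 3.2.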

3.4. Overfitting prevention

Previous analysis of the FHR features has shown that most of them are only weakly linearly correlated with adverse outcomes (Fulcher et al 2012, Georgieva et al 2013b), which makes the dataset prone to overfitting, i.e. some of the features might be selected by chance. To minimize this problem, two approaches were used: (1) regularization methods, for linear regression and linear SVM; and (2) sequential backward selection (SBS), for RBF SVM.

Regularization offers a relative measure of the goodness of fit, describing the trade-off between model fitness and model complexity. The Akaike information criterion (AIC) (Akaike 1974) and the Bayesian information criterion (BIC) (Schwarz 1978) are two commonly used regularization methods (Bishop 2006). The intuitive interpretation of the AIC is that it balances classification performance against model complexity. The BIC is based on the AIC but uses a Bayesian formalism, which penalizes the number of features more strongly than the AIC. Since the AIC and BIC assess the performance of the classifier using the whole dataset instead of training–validation splits, the whole GA training set (404 cases) was used (figure 1). The AIC and BIC were found to be consistent and robust for linear regression and linear SVM, but the penalty terms were not large enough to relieve the overfitting issue for the RBF SVM. Therefore, another approach was applied for the RBF SVM: backward selection using the feature ranking from the GA.

GA are often criticized because they are sensitive to tiny changes in classifier performance (Mitchell 2001). Coarser greedy search strategies, such as sequential forward selection (SFS) and SBS, are computationally advantageous and particularly robust against overfitting (Guyon and Elisseeff 2003), but their ability to explore the full feature space is limited. This disadvantage is reduced by using the feature ranking of the GA, since the GA can effectively explore the feature space. Therefore, it is possible that utilizing the feature ranking from the GA in SFS and SBS will improve on the performance of the GA, SFS or SBS, each on its own. To test this hypothesis, SFS and SBS were trained and tested on the same dataset as the GA, using the best features selected by the GA (feature frequency >10%).
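For reference, the standard definitions of the two criteria are recalled below (Akaike 1974, Schwarz 1978); exactly how the likelihood term was evaluated for the classifiers in this study is not stated in the text, so the expressions are given only in their generic form:

$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L},$$

where $k$ is the number of model parameters (here, the number of selected features), $n$ is the number of training cases (404 for the GA training set) and $\hat{L}$ is the maximized likelihood of the fitted model. Since $\ln n > 2$ for $n = 404$, the BIC penalizes each additional feature more heavily than the AIC, which is consistent with its selecting the smaller subsets in table 2.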

4. Results

4.1. Classification performance

The best classifier from each of the 100 runs of the GA was applied independently to the testing set. The classification performance of each classifier was measured by a kappa value, and the performances are compared in figure 3. There is no obvious difference between the three classifiers, except that the RBF SVM appears to have lower variance than the other two.

4.2. Feature ranking

Since the GA was run 100 times with different splits of training–validation data, there are 100 sets of best features. The importance of each feature is given by the frequency with which it was selected. Figure 4 shows the feature ranking given by the different classifiers. Similarities are found between the rankings for linear regression, linear SVM and nonlinear SVM. For example, the most frequently selected feature, no. 61 (PRSA deceleration capacity, DC), is the same for all three classifiers. These feature tables, indicating the relative importance of different features, could be very useful for future studies.

4.3. Evaluation of robustness using artificial features

To assess the robustness of the method, 32 artificial features (50% of the original number of features) were generated, indexed as features no. 65 to no. 96. These features were randomly generated, normally distributed with zero mean and STD equal to 1.
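A short sketch of this robustness check (array names are illustrative) is:

```python
import numpy as np

rng = np.random.default_rng()

def add_artificial_features(X, n_artificial=32):
    """Append N(0, 1) noise features (unrelated to outcome) to the
    n_cases x 64 feature matrix, giving features no. 65 to no. 96."""
    noise = rng.standard_normal((X.shape[0], n_artificial))
    return np.hstack([X, noise])
```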


Figure 3. Boxplots of the classification performance for different classifiers over 100 runs on the testing set.

Table 2. Best feature subsets selected by the AIC and BIC.

Linear regression, AIC: features 5, 9, 10, 16, 38, 40, 41, 44, 48, 50, 51, 61, 63 (13 features); kappa on the testing set 0.42
Linear regression, BIC: features 5, 9, 10, 48, 51, 61, 63 (7 features); kappa on the testing set 0.45
Linear SVM, AIC: features 4, 5, 9, 10, 16, 26, 28, 40, 41, 51, 61, 63 (12 features); kappa on the testing set 0.41
Linear SVM, BIC: features 5, 9, 10, 16, 44, 48, 50, 51, 61 (9 features); kappa on the testing set 0.47

The dataset containing 96 features was then used for the GA. Other properties of this dataset, including the separation of the GA training set and testing set, were the same as described in section 2. The robustness of the method was evaluated by observing the frequency with which artificial features were selected under the different strategies, as shown in figure 5. For all three classifiers, artificial features were selected only with low frequency: none of the top features is artificial. This is encouraging, but it also suggests that some of the original features (those selected less frequently than the artificial features) are likely to have been selected by chance. This confirms that measures to prevent overfitting are needed.

4.4. Prevention of overfitting: linear regression and linear SVM

The linear classifiers use the AIC and BIC to select a subset of the best features. The best feature subsets selected using the AIC and BIC are shown in table 2. The BIC generally performs better than the AIC: it selects fewer features and gives better classification performance (kappa value). Table 2 shows that the BIC can provide reasonable classification performance while reducing the number of features. The robustness of the AIC and BIC was also tested with the artificial features. The AIC selects varying numbers of artificial features for the three classifiers, while the BIC selects identical feature subsets with or without artificial features for the linear classifiers (results not shown). In conclusion, the BIC is more robust and accurate than the AIC.

Figure 4. Feature frequency (top 20 most frequently selected features) for: (A) linear regression, (B) linear SVM, (C) RBF SVM.

4.5. Overfitting prevention: RBF SVM

Four strategies were compared to determine whether GA + SFS or GA + SBS can relieve the overfitting intrinsic to the GA when the RBF SVM is used: GA-only, SFS-only, SFS using the GA ranking (GA + SFS) and SBS using the GA ranking (GA + SBS). SBS-only was also studied, but it always selected many more features than the maximum number allowed (14). The results are shown in figure 6. GA-only outperforms SFS-only, while the performance of GA + SFS is similar to that of GA-only. Moreover, GA + SBS has the best classification performance on the testing set (one-tailed t-test, p < 0.05), which supports the hypothesis in section 3.4 that GA + SBS significantly improves on the performance of GA-only and SBS-only. The feature subset {1, 10, 19, 21, 48, 50, 51, 55, 61} was most frequently selected.
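A sketch of backward selection seeded with the GA ranking is given below; the stopping rule (stop when no single removal improves the score) is an assumption, since the text does not spell it out:

```python
def sbs_from_ga_ranking(candidate_features, score_subset):
    """Sequential backward selection seeded with the GA's most frequently
    selected features (frequency > 10%).

    candidate_features: feature indices ranked by GA selection frequency.
    score_subset(features) -> kappa of the classifier using those features,
    e.g. the median-kappa fitness of section 3.2 with the RBF SVM.
    """
    current = list(candidate_features)
    best_score = score_subset(current)
    while len(current) > 1:
        # Try removing each remaining feature; keep the best single removal.
        trials = [(score_subset([f for f in current if f != drop]), drop)
                  for drop in current]
        trial_score, drop = max(trials)
        if trial_score < best_score:
            break                      # no removal helps any more (assumed rule)
        best_score = trial_score
        current.remove(drop)
    return current, best_score
```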

4.6. Classification performance of the best feature subsets

The best feature subsets picked out by the different methods are summarized in table 3, and their classification performances are compared in table 4.

Figure 5. Feature frequency (top 20 most frequently selected features) for: (A) linear regression, (B) linear SVM, (C) RBF SVM, on the mixture of actual features and artificial features.

Table 3. Best feature subsets for the different classifiers.

Linear regression: features 5, 9, 10, 48, 51, 61, 63 (7 features); method GA + BIC
Linear SVM: features 5, 9, 10, 16, 44, 48, 50, 51, 61 (9 features); method GA + BIC
RBF SVM: features 1, 10, 19, 21, 48, 50, 51, 55, 61 (9 features); method GA + SBS

The RBF SVM performs slightly better than the other two classifiers, with higher sensitivity but lower specificity. In terms of kappa, all three classifiers perform better using the selected feature subset than using all 64 features, and the RBF SVM is slightly better than the other two classifiers when using the GA-selected feature subset. The ROC curves of the different classifiers on all 7568 cases, and on all cases excluding the GA training set (7164 cases), are shown in figure 7. The points on the ROC curves are produced by varying the threshold on the classifier output. For both ROC curves, the RBF SVM generally performs better than the other two classifiers, while linear regression and the linear SVM are similar to each other. The area under the curve (AUC) values were also calculated for each classifier and are displayed in table 5. Again, the RBF SVM gave a slightly better AUC than the other two classifiers.


Figure 6. Classification performance (kappa value comparing the prediction output and actual result) of different methods (using RBF SVM as the classifier) on the testing set.

Table 4. The classification performance of the different methods on the testing set.

Linear regression: sensitivity 64.15%, specificity 81.13%, kappa 0.45 (kappa using all 64 features: 0.33)
Linear SVM: sensitivity 66.83%, specificity 81.13%, kappa 0.47 (kappa using all 64 features: 0.30)
RBF SVM: sensitivity 83.02%, specificity 66.03%, kappa 0.49 (kappa using all 64 features: 0.01)
RF-classification: sensitivity 67.92%, specificity 77.36%, kappa 0.45
RF-regression: sensitivity 64.15%, specificity 73.58%, kappa 0.38
LASSO: sensitivity 66.83%, specificity 78.25%, kappa 0.45
PCA + RBF SVM (dimension reduced to 7 features): sensitivity 67.92%, specificity 73.58%, kappa 0.40
PCA + RBF SVM (dimension reduced to 14 features): sensitivity 64.15%, specificity 73.58%, kappa 0.38

Table 5. AUC values for the different classifiers.

Linear regression: AUC 0.7498 (all 7568 cases), 0.7531 (excluding the GA training set)
Linear SVM: AUC 0.7516 (all 7568 cases), 0.7352 (excluding the GA training set)
RBF SVM: AUC 0.7644 (all 7568 cases), 0.7447 (excluding the GA training set)

Note that, when the GA training set is excluded, the dataset becomes even more imbalanced, which causes a lower resolution of the ROC curve. The AUC values here (0.74–0.75) can be compared with the AUC (0.64–0.74) reported by the study using a fuzzy inference method (Czabański et al 2013). Although the pH threshold and the data selection criteria differ, this does suggest that feature selection can improve the quality of the classification. The classification results were also compared with other feature selection and reduction methods, such as random forests (RF) (Breiman 2001), principal component analysis (PCA) (Bishop 1996) and the least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996), in table 4.


Figure 7. Receiver operating characteristic (ROC) curves of the different classifiers for: (A) all 7568 cases; (B) all cases excluding the GA training set (7164 cases).

The classification performance of the GA is found to be comparable to, or slightly better than, that of these other feature selection methods. A two-proportion test was used to test whether the GA works significantly better (in terms of kappa) than the other methods; the differences are not significant (p > 0.05). In conclusion, the GA was able to select the best feature subset for each classifier using the different methods, with reasonable classification performance.

5. Discussion

In this study, 64 FHR features were able to predict the outcome of labour (in terms of fetal acidemia). The objective of this study was to find the best feature subset using GA based on three different classifiers. To our knowledge, this is the first time that a feature selection method has been used on such a large FHR dataset. Clear and intuitive clinical interpretation is needed to assist clinicians in decision making; the GA was therefore chosen for its ability to explore the whole feature space and to give the best feature subset. In addition, linear regression, linear SVM and RBF SVM were chosen as the classifiers for the GA.

The GA, as a fitness-based global optimization tool, is prone to overfitting, especially when used with a distance-based classifier such as the RBF SVM. To limit this problem, regularization and backward selection were used, which improved the effectiveness and robustness of the GA. Reasonable classification performance was shown on the balanced 510-case dataset, with kappa values of 0.45 to 0.49 on the testing set using the different classifiers. The classification performance was comparable to, or slightly (not statistically significantly) better than, that of traditional feature selection methods such as RF and LASSO. In addition, unlike RF, which uses an integrated classifier over all features, the GA gives a specific feature subset and the most important features. This is valuable for intuitive clinical interpretation. The classification performance was better than in our previous study using an ANN on a similar dataset (Georgieva et al 2013b), where kappa = 0.28 on the testing set; it should be noted, however, that the ANN study used a different training–testing dataset, so no direct comparison is possible.

Each classifier selected 7 to 9 features using the GA; the features selected by all three classifiers are: feature no. 10 (short-term variability) (Dawes et al 1992, Cazares 2002), feature no. 48 (auto-mutual information), feature no. 51 (STD of local sample entropy)


(Fulcher et al 2012) and feature no. 61 (phase rectified signal averaging, DC component) (Georgieva et al 2014). These features therefore appear to be useful in predicting labour outcome when used in multivariate analysis, and this information will be a valuable reference for future studies. It should be noted that, in previous work, several of the features were extracted and tested independently on different subsets, so it is difficult to compare the performance of the feature subset with the performance of single features.

The top-ranked feature for all three classifiers was feature no. 61 (PRSA-DC component). PRSA is a relatively new time-series method that measures heart rate variability and quantifies separately the acceleration capacity (AC) and deceleration capacity (DC) (Bauer et al 2006). In a previous study using low arterial pH as an adverse outcome, it was found that the DC component compared well with, or outperformed, short-term variability (Georgieva et al 2014). Feature no. 10 (short-term variability), feature no. 48 (auto-mutual information) and feature no. 51 (STD of local sample entropy) also measure different aspects of variability. This suggests that FHR variability, as suggested clinically, has an important relationship with low arterial pH.

It should be noted that this study considers only the last 30 min of labour. To provide better performance, more information gathered during the process of labour, especially time-series information, should be integrated into the classifier. There are also a number of clinical parameters that need to be investigated, such as gestation, oxytocin augmentation and maternal infection. Further studies will be carried out to estimate the risk of compromise, based on the classifier prediction and its patient-specific time-series trend. The next step is to apply the classifiers throughout the duration of labour, which will provide an online objective measurement of the fetal condition during different stages of labour.

6. Conclusions

Genetic algorithms, as a feature selection method, were used to find the best feature subset out of 64 FHR features. To our knowledge, this is the first time that a feature selection method for FHR diagnostic analysis has been tested on a database of this size. Reasonable classification performance was shown on the balanced 510-case dataset, with kappa values of 0.45 to 0.49 on the testing set using different classifiers. Regularization methods and backward selection were used to determine the best feature subset for each classifier. Based on these results, we conclude that GA can be successfully applied to FHR features to integrate and optimize their predictive power. Further analysis and clinical interpretation of the classifiers will be the subject of future work.

Acknowledgment

The authors would like to acknowledge the use of the Oxford Supercomputing Centre (OSC) in carrying out this work. L Xu is funded by the Clarendon Fund. A Georgieva is funded by the Henry Smith Charity and Action Medical Research. We thank Ms M Moulden for her continuous work on the database acquisition and data maintenance over the years.

References

ACOG 2009 ACOG practice bulletin no. 106: intrapartum fetal heart rate monitoring: nomenclature, interpretation, and general management principles Obstet. Gynecol. 114 192–202
Akaike H 1974 A new look at the statistical model identification IEEE Trans. Autom. Control 19 716–23


Alberry M, Fuente S and Soothill P 2009 Prediction of asphyxia with fetal gas analysis Fetal and Neonatal Neurology and Neurosurgery 4th edn ed M I Levene and F A Chervenak (London: Churchill Livingstone)
Bauer A, Kantelhardt J W, Barthel P, Schneider R, Mäkikallio T, Ulm K, Hnatkova K, Schömig A, Huikuri H and Bunde A 2006 Deceleration capacity of heart rate as a predictor of mortality after myocardial infarction: cohort study Lancet 367 1674–81
Bishop C 1996 Neural Networks for Pattern Recognition (New York: Oxford University Press)
Bishop C 2006 Pattern Recognition and Machine Learning (Berlin: Springer)
Breiman L 2001 Random forests Mach. Learn. 45 5–32
Burges C J C 1998 A tutorial on support vector machines for pattern recognition Data Min. Knowl. Discovery 2 121–67
Cazares S M 2002 Automated identification of abnormal patterns in the intrapartum cardiotocogram PhD Thesis University of Oxford
Chang C-C and Lin C-J 2011 LIBSVM: a library for support vector machines ACM Trans. Intell. Syst. Technol. 2 1–27
Chauhan S P, Klauser C K, Woodring T C, Sanderson M, Magann E F and Morrison J C 2008 Intrapartum nonreassuring fetal heart rate tracing and prediction of adverse outcomes: interobserver variability Am. J. Obstet. Gynecol. 199 623.e1–5
Chudáček V A 2011 Assessment of features for automatic CTG analysis based on expert annotation EMBS'11: 33rd Annu. Int. Conf. IEEE Engineering in Medicine and Biology Society (Boston, MA) p 1347
Cohen J 1960 A coefficient of agreement for nominal scales Educ. Psychol. Meas. 20 37
Cristianini N and Shawe-Taylor J 2000 An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods (Cambridge: Cambridge University Press)
Czabański R, Jeżewski J, Horoba K and Jeżewski M 2013 Fetal state assessment using fuzzy analysis of fetal heart rate signals—agreement with the neonatal outcome Biocybern. Biomed. Eng. 33 145–55
Dawes G, Moulden M, Sheil O and Redman C 1992 Approximate entropy, a statistic of regularity, applied to fetal heart rate data before and during labor Obstet. Gynecol. 80 763
Fulcher B, Georgieva A, Redman C and Jones N 2012 Highly comparative fetal heart rate analysis EMBC'12: Annu. Int. Conf. IEEE Engineering in Medicine and Biology Society pp 3135–8
Georgieva A, Moulden M and Redman C W G 2013a Umbilical cord gases in relation to the neonatal condition: the EveREst plot Eur. J. Obstet. Gynecol. Reprod. Biol. 168 155–60
Georgieva A, Papageorghiou A T, Payne S J, Moulden M and Redman C W G 2014 Phase rectified signal averaging for intrapartum electronic fetal heart rate monitoring is related to acidaemia at birth Br. J. Obstet. Gynaecol. 121 889–94
Georgieva A, Payne S, Moulden M and Redman C 2011a Computerized fetal heart rate analysis in labor: detection of intervals with un-assignable baseline Physiol. Meas. 32 1549
Georgieva A, Payne S, Moulden M and Redman C 2013b Artificial neural networks applied to fetal monitoring in labour Neural Comput. Appl. 22 85–93
Georgieva A, Payne S and Redman C W G 2009 Computerised electronic foetal heart rate monitoring in labour: automated contraction identification Med. Biol. Eng. Comput. 47 1315–20
Georgieva A, Payne S J, Moulden M and Redman C W G 2011b Computerized intrapartum electronic fetal monitoring: analysis of the decision to deliver for fetal distress EMBC'11: Annu. Int. Conf. IEEE Engineering in Medicine and Biology Society (30 Aug.–3 Sept. 2011) pp 5888–91
Georgoulas G, Gavrilis D, Tsoulos I G, Stylios C, Bernardes J and Groumpos P P 2007 Novel approach for fetal heart rate classification introducing grammatical evolution Biomed. Signal Process. Control 2 69–79
Goldberg D 1989 Genetic Algorithms in Search, Optimization, and Machine Learning (Reading, MA: Addison-Wesley)
Guyon I and Elisseeff A 2003 An introduction to variable and feature selection J. Mach. Learn. Res. 3 1157–82
Mitchell M 2001 An Introduction to Genetic Algorithms (Cambridge, MA: MIT Press)
Parer J T, King T, Flanders S, Fox M and Kilpatrick S J 2006 Fetal acidemia and electronic fetal heart rate patterns: is there evidence of an association? J. Matern. Fetal Neonatal Med. 19 289–94
Schwarz G 1978 Estimating the dimension of a model Ann. Stat. 6 461–4
Tibshirani R 1996 Regression shrinkage and selection via the lasso J. R. Stat. Soc. B 58 267–88



Van Niel T G, McVicar T R and Datt B 2005 On the relationship between training sample size and data dimensionality: Monte Carlo analysis of broadband multi-temporal classification Remote Sens. Environ. 98 468–80
Vapnik V N 1999 An overview of statistical learning theory IEEE Trans. Neural Netw. 10 988–99
Westgate J 2009 Computerizing the cardiotocogram (CTG) Medical Informatics in Obstetrics and Gynecology ed D Parry and E Parry (Hershey, PA: IGI Global Snippet)
Xu L, Georgieva A, Payne S J, Moulden M and Redman C W G 2012 Detection and analysis of pattern readjustment in fetal heart rate signal MEDSIP'12 (Liverpool, UK)
Xu L, Georgieva A, Payne S J, Moulden M and Redman C W G 2013 Feature selection for computerized fetal heart rate analysis using genetic algorithms IEEE EMBS (Osaka, Japan) pp 445–8
