Clinical Neurophysiology 126 (2015) 1692–1702

Contents lists available at ScienceDirect

Clinical Neurophysiology journal homepage: www.elsevier.com/locate/clinph

Automated analysis of multi-channel EEG in preterm infants Keelin Murphy ⇑, Nathan J. Stevenson, Robert M. Goulding, Rhodri O. Lloyd, Irina Korotchikova, Vicki Livingstone, Geraldine B. Boylan Neonatal Brain Research Group, University College Cork, Cork, Ireland Department of Paediatrics and Child Health, University College Cork, Cork, Ireland

a r t i c l e

i n f o

Article history: Accepted 29 November 2014 Available online 9 December 2014 Keywords: EEG Electroencephalography Neonatal Preterm Premature Burst Interburst Detection Automatic

h i g h l i g h t s  Automatic detection of burst and interburst periods in multi-channel preterm EEG.  EEG marked by 3 independent observers, detailed interobserver analysis.  Mathematical features of the EEG are calculated and correlated with gestational age.

a b s t r a c t Objective: To develop and validate two automatic methods for the detection of burst and interburst periods in preterm eight-channel electroencephalographs (EEG). To perform a detailed analysis of interobserver agreement on burst and interburst periods and use this as a benchmark for the performance of the automatic methods. To examine mathematical features of the EEG signal and their potential correlation with gestational age. Methods: Multi-channel EEG from 36 infants, born at less than 30 weeks gestation was utilised, with a 10 min artifact-free epoch selected for each subject. Three independent expert observers annotated all EEG activity bursts in the dataset. Two automatic algorithms for burst/interburst detection were applied to the EEG data and their performances were analysed and compared with interobserver agreement. A total of 12 mathematical features of the EEG signal were calculated and correlated with gestational age. Results: The mean interobserver agreement was found to be 77% while mean algorithm/observer agreement was 81%. Six of the mathematical features calculated (spectral entropy, Higuchi fractal dimension, spectral edge frequency, variance, extrema median and Hilberts transform amplitude) were found to have significant correlation with gestational age. Conclusions: Automatic detection of burst/interburst periods has been performed in multi-channel EEG of 36 preterm infants. The algorithm agreement with expert observers is found to be on a par with interobserver agreement. Mathematical features of EEG have been calculated which show significant correlation with gestational age. Significance: Automatic analysis of preterm multi-channel EEG is possible. The methods described here have the potential to be incorporated into a fully automatic system to quantitatively assess brain maturity from preterm EEG. Ó 2015 International Federation of Clinical Neurophysiology. Published by Elsevier Ireland Ltd. All rights reserved.

1. Introduction The survival rate of very preterm infants is increasing (Costeloe et al., 2012) and it is recognised that EEG is an extremely useful and non-invasive tool for monitoring neurological well being and for the prediction of neurological outcome (LeBihannic et al., 2012; ⇑ Corresponding author. Tel.: +353 21 4205940. E-mail address: [email protected] (K. Murphy).

Biagioni et al., 1994, 1996; Holmes and Lombroso, 1993; Marret et al., 1997; Hayashi-Kurahashi et al., 2012). With recent advances in technology it is now much more feasible to use preterm EEG monitoring for longer periods of time (Schumacher et al., 2011). The very preterm EEG typically exhibits a discontinuous pattern of high voltage bursts of activity interspersed with low voltage periods known as interburst intervals (IBIs). As the infant matures, IBIs become gradually shorter, bursts become longer and by the time the infant reaches full term the EEG has a continuous pattern

http://dx.doi.org/10.1016/j.clinph.2014.11.024 1388-2457/Ó 2015 International Federation of Clinical Neurophysiology. Published by Elsevier Ireland Ltd. All rights reserved.

1693

K. Murphy et al. / Clinical Neurophysiology 126 (2015) 1692–1702

(Vecchierini et al., 2007). The appropriate length of IBIs at particular gestational ages varies widely according to different reports (Hahn et al., 1989; Biagioni et al., 1994; Selton et al., 2000; Hayakawa et al., 2001; Vecchierini et al., 2003; Vecchierini et al., 2007), however it has been shown that long IBIs (>1 min) are abnormal at any gestational age (Vecchierini et al., 2007). Neurophysiologists generally assess brain maturity in the preterm EEG by measuring the duration of burst and interburst intervals and assessing age specific patterns. Since the opinion of the neurophysiologist is subjective and only uses selected representative periods of EEG for analysis, this task results in an output which is not reproducible or quantitative. For this reason, some recent studies have investigated automated methods of detecting bursts and IBIs in preterm EEG (Palmu et al., 2010; Jennekens et al., 2011; Koolen et al., 2014). In this way, measurements can be made automatically and virtually instantaneously. Previous methods of automated burst detection have been developed but have been limited to either single channel recordings or to very small numbers of subjects. Palmu et al. (2010) use data from 18 preterm infants with single channel EEG. Jennekens et al. (2011) use a database of just four infants, and Koolen et al. (2014) just 10, including seven with confounding conditions such as suspected seizures, mild asphyxia or intracranial haemorrhage. Although these studies provided promising results, larger databases are required to determine whether they are reproducible. Furthermore, these studies lack detailed comparisons with multiple observers and interobserver analysis. The aim of this study was to develop and validate two automatic methods for the detection of bursting patterns in eight-channel preterm EEG. Furthermore, we perform a detailed analysis of inter-observer agreement between three expert observers and use this as a benchmark for the performance of the automatic methods. Finally, we examine mathematical features of the EEG signal and their potential correlation with gestational age.

F4–C4, C4–O2, F3–C3, C3–O1, T4–C4, C4–Cz, Cz–C3, C3–T3. Sampling rate was 256 Hz for 32 of the infants and 1024 Hz for the remaining 4. A 10 min epoch was selected from each recording for analysis in this study. The criteria for selecting the epoch were as follows: –the EEG section was free from artifact, –a clear example of a burst/interburst pattern was visible. The resulting selected EEG epochs were recorded at a median age of 8 h and 15 min (minimum 3 h 5 min, maximum 41 h 19 min). 2.2. Burst detection Two methods to automatically detect burst activity on multi-channel EEG were developed and compared. The aim of automatic burst detection is to provide a fast, objective and quantitative analysis of the EEG recording. The methods developed were based on the single-channel detection method of Palmu et al. (2010) and extended to handle multi-channel EEG recordings. A brief summary of the methods is given here, however, the technical details are described in Appendix A Both methods begin by processing each channel independently, using the non-linear energy operator (NLEO). This is a filter sensitive to both frequency and amplitude which provides a first estimate of burst and interburst regions. Thereafter, two different methods were tested to combine the individual channel detections into a final result. The first, the ‘fixed-threshold’ method applies a fixed threshold to the NLEO filtered signal on each channel. Each time point is marked as burst activity if two or more channels are above the threshold at that time. The second method, the ‘variable threshold’ method calculates two different thresholds (upper and lower) for each channel, dependent on the amplitude values in that channel. Each time point is then marked as burst activity if two channels are above the upper threshold, or if four channels are above the lower. For comparison of expert observers with automatic algorithms we use only the better of the two automatic methods. As will be described in detail in the results Section (3.1), we are unable to demonstrate a significant difference between the two methods on our dataset, and choose the ‘fixed-threshold’ method as it shows marginally better results.

2. Methods 2.1. Material The data used consists of eight-channel EEG recordings on 36 preterm infants born at less than 30 weeks gestation. Fig. 1 shows the distribution of the subjects across gestational ages. EEG recording was commenced at a median age of 6 h and 15 min (minimum 3 h, maximum 23 h 11 min). The recordings lasted between 7 h 55 min and 71 h 14 min (median 46 h 57 min). All EEGs were recorded over eight channels as follows:

2.3. Interobserver analysis To validate an automatic analysis it is necessary to compare it with an expert opinion. In most applications, including burst

15 36 35 28 26 25

10

# Subjects

20 19 31

17

27

16

23

14

21

11

32

10

9

30 22

5

13

15

33

4

8

6

29

3

7

18

5

24

2

1

12

34

0 23

24

25

26

27

28

29

# Gestational Age (full weeks only)

Fig. 1. The distribution of subjects across gestational ages. Gestational ages are listed by number of full weeks gestation. The numbers shown within the bars are baby IDs.

1694

K. Murphy et al. / Clinical Neurophysiology 126 (2015) 1692–1702

detection, experts are known to disagree to some extent (Palmu et al., 2010). Therefore our aim is that the algorithm should agree with the observers at least as well as they agree with each other. It is important thus, that the interobserver disagreements are thoroughly analysed and compared in detail with the algorithm/ observer disagreeements. For clarity we consider only one automatic algorithm from this point onwards, the fixed threshold algorithm, which, as shown in the Results section has a slightly better performance overall. Throughout this section and the remainder of the paper, the term ‘‘sample’’ or ‘‘data sample’’ is used to refer to the value of the signal at a single time-point. For example if the signal is recorded at 256 Hz then there are 256 samples (signal values) captured every second. Three expert observers, experienced in neonatal EEG, examined the 10 min epochs independently of each other and marked the burst and interburst periods using NicoletOne EEG software with their preferred viewer settings. The observers were blinded to the clinical details of the infants. As in Palmu et al. (2010) the observers were not restricted to specific amplitude or frequency criteria but were free to use their own experience in determining what constituted a burst or an IBI. The level of agreement between observers, and between each observer and the algorithm was calculated for each subject in a number of different ways. Observer opinions are always considered individually and we do not attempt to combine their markings into a single ‘truth’. The combination of observers and algorithm means that each measure is calculated six times, resulting in a very detailed comparison. This detailed analysis allows us to determine whether the performance of the algorithm is considerably worse or similar compared to that of an additional expert observer. Agreement is considered in terms of sensitivity (percentage agreement over burst periods) and specificity (percentage agreement over IBIs.) We also measure overall agreement on the entire signal (percentage overlap, based on a per-data sample analysis). Details of how the overlap is calculated are provided below. The results are presented and described in Section (3.1). –‘Percentage Overlap Unanimous’ refers to the percentage of agreement of the algorithm with the observer annotations, only considering regions in which all three observers had the same opinion. This is calculated on a per sample basis. While this measure is a useful benchmark, testing only on regions where the reference standard (observer annotations) is most trustworthy, we do not believe it is sufficient on its own as a measure of algorithm performance. Firstly, the regions where all observers are unanimous are likely to be the easiest regions to categorise and therefore we are not testing our algorithm on the more challenging locations. Secondly, it is necessary to acknowledge that

expert observers always differ in their opinions and that the ultimate standard we hope for from our algorithm is that it differs from experts only to the same degree that they normally differ among themselves. A thorough analysis of interobserver differences is required to determine whether this is the case. Finally, reducing the data to only those regions where observers were unanimous involves a considerable loss of information. Over our 36 subjects a median of 34% of data was lost when considering only unanimous regions (min. 17%, max. 61%). –‘Percentage Overlap’ refers to the percentage of agreement between one set of annotations and another. This is measured on a per-sample basis. For each recorded sample, to compare the algorithm with observer 1, the annotation (burst or IBI) at every sample time point, t is obtained from both observer 1 annotations (obs1t ) and the algorithm output (alg t ). The percentage overlap is the percentage of samples for which obs1t and alg t are the same. Although the Percentage Overlap with individual observers is likely to be lower than the Percentage Overlap Unanimous this measure provides us with a realistic assessment of performance over all data regions. To determine whether the overlap values are acceptable in practice they will be compared to overlap values between the expert observers.

2.4. Feature analysis There is enormous potential for a computer-aided analysis of a signal to be able to make calculations and observations which are not possible by visual interpretation alone. At present, the most commonly considered feature of preterm EEG is the length of the interburst interval, with longer intervals considered as representing a more immature state of brain development or heralding an abnormality (Vecchierini et al., 2007). Our aim was to investigate whether a more complex analysis of the signal would yield new features which are correlated with gestational age, even on analysis of just a 10 min epoch. The features calculated are listed in Table 1. Each of these features is calculated firstly over the full 10 min recording, next over the periods marked as ‘burst’ only, and finally over the periods marked as interburst intervals only. The exception is the feature ‘IBI’ which is the interburst interval length and clearly can only be calculated on the interburst periods. In each case the feature value is averaged over all channels – again with the exception of ‘IBI’ since interburst periods are not marked per channel. The correlation (Spearman) for each feature with gestational age (in days) is calculated over the whole group of subjects. Correlation was also investigated separately for infants above and below 27 weeks gestation, however since this did not yield any significant information the results have been omitted for brevity.

Table 1 The list of features calculated. All features (except ‘IBI’) are calculated per channel and averaged over all channels. Additionally, all features except ‘IBI’ are calculated over the entire signal, over just the burst periods, and over just the interburst periods. The last column provides the abbreviations which are used throughout the remainder of this work. Name

Description

Abbrev.

Spectral Entropy Spectral Edge Frequency Relative Delta Power Moment 0 Moment 1 Moment 2 Moment 3 Higuchi Fractal Dimension Extrema Median Amplitude (Hilbert Transform) Symmetry

A measurement of the ‘disorder’ and unpredictability of the signal. Calculated as per Viertiö-Oja et al. (2004) The frequency below which 90% of the total power of the signal is located. The power in the Delta band ([0.5–4.0 Hz]) divided by the total power in the spectrum Mean value of the signal Variance of the signal Skewness – the extent to which the probability distribution leans to one side of the mean Kurtosis – a measure of the ‘peakedness’ of the signal A measure of how the signal pattern changes upon rescaling. Calculated as per (Higuchi, 1988). The median of the absolute values of the extrema peaks/troughs in the signal A measure of the signal amplitude, calculated via the Hilbert Transform The symmetry between right and left hemispheres. Calculated as per (van Putten, 2007). Symmetry is calculated on 5 s epochs and averaged over the full recording The length of the interburst intervals

SE SEF RDP M0 M1 M2 M3 HFD EM AHT SYM

Interburst interval length

IBI

1695

K. Murphy et al. / Clinical Neurophysiology 126 (2015) 1692–1702

3. Results 3.1. Burst detection In this section the results of the burst detection algorithm are shown and a comparison between the ‘fixed threshold’ method and the ‘variable threshold’ method is made. Fig. 2 shows the performance of the two automatic algorithms based on overlap in regions where observers were unanimous. The first group of boxplots, shown in green, give results for the ‘fixed threshold’ method, using various fixed thresholds, and the second group (in blue) show the results for the ‘variable threshold’ method using various percentile value thresholds (see appendix A

for details of these methods). The optimal performance using a fixed threshold was obtained by setting the threshold to 1.0 while the variable threshold method performed best with the two parameter percentile values set at 70% and 60%. Performance appears to vary smoothly with the threshold in the fixed method while the relationship is not so clear in the variable method. A paired t-test was performed to compare the data from the optimal fixed threshold performance (boxplot ‘fixed1.0’) and the optimal variable threshold performance (boxplot ‘var70/60’). This showed no significant difference between the methods (p = 0.34, mean difference = 0.56, confidence interval [1.3 to 2.4]). A further breakdown of these performances is provided in Fig. 3. The values in this figure refer to the ‘improvement’ in overlap

%overlap in regions where observers were unanimous

100

95

90

85

80

75

70

65

60 fixed0.5

fixed1.0

fixed1.5

fixed2.0

fixed2.5

var60/40 var60/50 var70/40 var70/50 var70/60 var80/40 var80/50 var80/60

Fig. 2. A comparison of the performance with fixed thresholds (first five boxes) compared with variable thresholds (last eight boxes). The box labels give the values of the fixed threshold used (thfixed ) or the upper and lower percentiles perc ui and perc li . Each boxplot represents the percentage overlap of the algorithm output with the human observer output, using only regions where all three observers were unanimous.

20 15 10

increase in %overlap

5 0 −5 −10 −15 −20 −25 −30

Obs1 Obs2 Obs3 Unanimous 13 6

5 15 29 33 24 21 3

4 31 10 27 23 2 34 14 26 28 16 19 20 7 25 35 36 1 baby ID

8

9 11 17 12 30 18 32 22

Fig. 3. The improvement in percentage overlap caused by using a variable threshold (70% and 60% quantiles) instead of a fixed threshold (1.0). Results are shown for the annotations of each observer individually and also for the annotations considering only regions where the three observers were unanimous. The baby IDs on the x-axis are ordered by gestational age (in days), youngest to oldest. The heavy vertical line indicates the division between babies 627 full weeks and babies >27 full weeks.

1696

K. Murphy et al. / Clinical Neurophysiology 126 (2015) 1692–1702

Fig. 4. An example of results from baby ID 28 (28 w + 2 d). Observer and algorithm opinions are shown at the top of the figure – peaks indicate burst, troughs indicate IBI. In this example we see that the variable threshold method performs generally better than the fixed threshold method. The exception is in the comparison with observer 3, where the fixed threshold method has better agreement. This is indicative of the overall trend for this subject, as shown in Fig. 3.

90 85 80

% overlap

75 70 65 60 55 50 45 40

Obs1/Obs2 Obs3/Obs1 Obs2/Obs3 Obs1/Alg Obs2/Alg Obs3/Alg 13 6 5 15 29 33 24 21 3 4 31 10 27 23 2 34 14 26 28 16 19 20 7 25 35 36 1 8 9 11 17 12 30 18 32 22

Baby ID Fig. 5. Interobserver and observer-algorithm agreement (as % overlap) shown for each baby and each pair-wise comparison. The baby IDs on the x-axis are ordered by gestational age (in days), youngest to oldest. The heavy vertical line indicates the division between babies 627 full weeks and babies >27 full weeks.

values obtained by using the variable threshold method in place of the fixed threshold method. Results are shown compared to each observer individually as well as for the combination of observers using only regions where they were unanimous. It can be seen that for the majority of subjects there is improvement compared with some observers and disimprovement compared with others. Cases 24, 25 and 26 show improvement for all comparisons while case 34 shows only disimprovements. Overall the performance of the two methods is very similar and on this dataset we cannot demonstrate a significant difference between them. Therefore for the purposes of the investigations into interobserver and algorithm/observer agreement (Section 3.2) we consider only the fixed threshold algorithm (with thfixed ¼ 1:0) which showed the highest overall performance in Fig. 2.

Fig. 4 shows an example of observer and algorithm findings on a section of EEG data from a subject at 28 weeks. The results are described in the figure caption. 3.2. Interobserver analysis The interobserver agreement rates (%overlap overall) are shown per subject in Fig. 5. Subjects are arranged left to right in order of gestational age. The dots in shades of red indicate overlap between observers while the triangles in shades of grey indicate overlap between the algorithm and an observer. There is no evidence that there is better agreement between individual observers than between the algorithm and the observers. In fact many of the poorest agreements are between observers, while many of the best

1697

K. Murphy et al. / Clinical Neurophysiology 126 (2015) 1692–1702

levels of agreement are found to be between algorithm and observer. For each baby the average interobserver agreement was compared with the average observer/algorithm agreement using the Wilcoxon signed rank test. The median (interquartile range) for interobserver agreements was 71.1 (73.5–83.1) while for algorithm/observer agreements it was 82.1 (77.0–84.3). The difference was statistically significant (p = 0.0006) showing that in fact for this data the algorithm/observer agreement is better than the interobserver agreement. As an additional statistical measure of agreement, Kappa values were also calculated per-baby for each paired comparison (comparing on a per-sample basis). The minimum, maximum and median over all babies is shown in Table 2. Although the agreement levels in general are only moderate (median Kappa from 0.53 to 0.66), (Landis and Koch, 1977) these figures reinforce the fact that the algorithm agrees better with the observers than they do with each other on this data. Agreement appears to be better for most babies below 27 weeks gestational age although this does not hold true for the two youngest babies in the study. The Wilcoxon rank sum test was performed for each interobserver and algorithm/observer comparison to determine whether the agreement figures in the two age groups are significantly different from each other. The difference between the younger and older age-groups was significant for observer 2/ algorithm (p = 0.0036) and observer 3/algorithm (p = 0.0466) only. In Fig. 6 we plot the agreement in terms of sensitivity (agreement over periods classified as burst) and specificity (agreement over periods classified as IBI). For the algorithm/observer comparisons (grey triangles), this enables us to determine whether there is an obvious tendency for the algorithm to over-segment – i.e. to find bursts where none exist, or vice versa. In fact neither of these

Table 2 Kappa values for each interobserver and algorithm/observer comparison. Kappa was calculated per-baby, this table shows minimum, maximum and median over all babies. The interobserver agreement values are seen to be no better than those for the algorithm/observer agreement. Kappa

Min

Max

Median

obs1/obs2 obs1/obs3 obs2/obs3 obs1/alg obs2/alg obs3/alg

0.1859 0.1066 0.1819 0.2534 0.3444 0.2863

0.7611 0.8111 0.8436 0.7901 0.814 0.7765

0.5275 0.63995 0.5422 0.65405 0.6604 0.58555

is the case and we see a roughly equal spread of over-segmentation (bottom right of graph) and under-segmentation (top left of graph). Furthermore, this figure also illustrates that over-segmentation and under-segmentation are no more common for the algorithm than for an expert observer (when compared to another) in that the spread of red circles (interobserver agreements) is similar to that of grey triangles (algorithm/observer agreements) throughout the figure. In the top right region of the graph, where agreement is highest, we again see an equal distribution of red and grey with some of the highest rates of agreements being found between algorithm and observer. These results reinforce our findings that the algorithm performance is indistinguishable from that of an additional observer. 3.3. Feature analysis Table 3 shows the relationship between gestational age of the subject and each of the features calculated (see Table 1 for details about the features). Analysis of features on groups divided by gestational age (less than 27 weeks and 27 weeks or older) did not show any results of significance and has therefore been omitted for brevity. For the overall 10 min recording only one feature is shown to have a significant correlation with gestational age – AHT which is a feature of amplitude. For the burst periods we see that SE and HFD, both features representing the signal complexity are significantly correlated with age for at least four of the five algorithm and observer opinions. Finally in the interburst periods we observe that features SEF,M1, EM and AHT, again features of complexity and amplitude, are correlated with gestational age for all observers and algorithms. In addition we note that features SE, and M3 show correlation for four of the five observer and algorithm opinions. It is particularly interesting that the features in the interburst periods are more often correlated with gestational age than in the burst periods, since conventional preterm EEG analysis frequently considers only the length of the interburst periods. A final point to note regarding Table 3 is that the feature IBI, the average length of the interburst interval periods does not show any significant correlation with gestational age in most cases (with the exception of the fixed threshold algorithm results). This may seem surprising since it is the feature most often used by experts in visual interpretation of the preterm EEG signal to determine developmental maturity. However there are a number of reasons why it

100 95 90

Specificity (% IBI agreement)

85 80 75 70 65 60 55 50 45 40 35 30 25 20

Obs1/Obs2 Obs3/Obs1 Obs2/Obs3 Obs1/Alg Obs2/Alg Obs3/Alg 35

40

45

50

55

60

65

70

75

80

85

90

95

100

Sensitivity (% burst agreement) Fig. 6. Agreement shown in terms of sensitivity (agreements in burst periods) and specificity (agreement in interburst periods). In each item in the legend the observer that is listed first is considered as the ‘truth’ for the purposes of comparison.

1698

K. Murphy et al. / Clinical Neurophysiology 126 (2015) 1692–1702

Table 3 Correlation values (Spearman) between gestational age and calculated features. The first section of the table shows results for features calculated on the overall 10 min epoch, while the subsequent sections show results for burst periods and interburst periods respectively. Since burst and interburst are defined differently by each observer (and by the algorithms) results are shown for each observer and algorithm individually within the burst and interburst sections. Bold font indicates significant correlation (p < 0.05), the 0 character indicates p < 0.001. SE

SEF

RDP

M0

M1

M2

M3

HFD

EM

AHT

SYM

Overall 0.208

0.193

0.198

0.044

0.183

0.137

0.348

0.344

0.502

0:5350

0.382

0.265

0.235

0.053

0.029

0.272

0.261

0:6440

0.067

0.161

0.13

0

Burst Obs1 0:6880 Obs2 0.425 Obs3 0.453 Alg-fixed 0:5570 Alg-variable 0.295 Interburst Obs1 0.325 Obs2 0.51 Obs3 0.507 Alg-fixed 0.443 Alg-variable 0:5350

0.325

0.303

0.046

0.065

0.186

0.007

0:618

0.067

0.042

0.035

0.183

0.212

0.243

0.055

0.198

0.069

0.441

0.124

0.118

0.227

0

IBI

0.107

0.152

0.069

0.144

0.199

0.239

0:629

0.259

0.156

0.132

0.236

0.187

0.274

0.156

0.187

0.185

0.363

0.267

0.266

0.224

0.075

0.173

0.48

0.341

0.376

0.193

0.483

0.493

0.341

0.202

0:547

0.09

0.106

0.446

0.197

0.466

0.303

0.523

0.505

0.394

0.269

0:5360

0.109

0.271

0.399

0.234

0.181

0.215

0.419

0:550 0

0

0.478

0.24

0.171

0

0

0:566

0.321

0.419

0:5880

0.171

0.249

0.432

0.047

0.105

0:586

0.33

0.518

0.298

0:535

0:60

0.057

0.196

0:5590

0.394

0.413

0.339

0:5560

is unlikely to show correlation in our data, which will be presented in the Discussion section. 4. Discussion 4.1. Burst detection In this study, a method for burst detection in multi-channel EEG of 36 preterm infants at less than 30 weeks gestation has been presented. The NLEO operator was utilised to measure activity in the EEG signal and two different methods of thresholding the NLEO output were implemented. In our dataset of 36 preterms we were unable to detect a significant difference between the output of the two methods (fixed threshold or variable threshold on NLEO). A paired t-test on the best fixed-threshold results versus the best variable-threshold results showed no significant difference (p=0.34). As shown in Fig. 3 the variable threshold improves performance for some observers and some subjects, but disimproves it for others. There are only three subjects (baby IDs 34, 26 and 25) for which one method outperformed the other in all comparisons. In fact an improvement is seen in 55% (79/144) of the comparisons shown in this figure overall, and for the individual comparisons with observers 1–3 and the unanimous markings there is improvement in 53%, 50%, 67% and 50% of cases respectively. Grouping the subjects by age (527 weeks and >27 weeks) we find that for the younger preterms there is improvement in 56% (36/64) of cases, while for the older group the corresponding figure is 54% (43/80). These figures illustrate that there is no clear pattern, either per observer or per age-group to determine which method will perform best. This confirms that there is rich variability in this dataset, even within narrow age ranges, in the types of signals encountered and the way in which these are annotated by the observers. Fig. 3 already begins to suggest that there is disagreement between the observers, and between algorithm and observers and that there is no single correct annotation of burst and interburst periods. This is analysed further in the interobserver and algorithm/observer comparisons.

4.2. Interobserver and algorithm/observer agreement Having determined that the two tested algorithms perform in different ways but with similar performance, we proceed to determine whether the algorithm performance is acceptable. The benchmark for this analysis is the interobserver agreement between our three expert observers with the aim being that the algorithm should agree with the observers at least as well as they agree with each other. For clarity of presentation only the algorithm which had the best performance by a very small margin (fixed threshold thfixed ¼ 1:0, see Fig. 2) is considered. Fig. 5 shows the agreement for each subject, for each pair of observers and for the algorithm compared with each observer. The first notable observation is that both the red and grey markers are well distributed, even when considered on a per-subject basis. Statistical analysis using the Wilcoxon signed rank test shows that the algorithm/observer agreements are in fact significantly better than the interobserver agreements (p = 0.0006) on this data. Previous studies have provided a more limited analysis of algorithm performance. Jennekens et al. (2011) asked observers to examine the output of the algorithm and indicate whether the event indicated took place during the segment marked by the algorithm. The accuracy of the onset/offset of the event marked was not considered. Only sensitivity values are reported, without specificity, and while these are high at around 80–95% the limitations of their analysis, and the fact that only four infants were included must be considered. Palmu et al. (2010) use three independent observers and consider only regions where they are unanimous when analysing the algorithm performance. As described previously, this limits the analysis to the regions of EEG recordings which are easiest to analyse, and does not provide a realistic estimate of how it will perform on a full recording. In addition, they dealt with single channel EEG which is by its nature easier to analyse. An average detection rate (average of sensitivity and specificity) of 96.4% is reported, which is comparable with the results from our algorithms when compared with periods of observer unanimity only (see Fig. 2). In Koolen et al. (2014) a comparison

K. Murphy et al. / Clinical Neurophysiology 126 (2015) 1692–1702

with interobserver agreement is made, however only two observers and sub-groups of either five or eight infants are included. The mean accuracy for interobserver agreement is found to be 86.2% and mean algorithm accuracy is reported at 84.27%. It should be noted that a limitation of our work is that the algorithm parameters were optimised for this dataset and it may be expected that performance would decrease on new unseen data. For an unbiased performance the parameters should be optimised on a large independent dataset in advance of testing. Cases 24 and 25 are the only subjects where there is a clear division showing algorithm/observer agreement to be lower than interobserver agreement in all comparisons. An example of EEG from baby 26 (28 w + 1 d) is provided in Fig. 7. The signal here is relatively suppressed and both algorithms omit the burst marked by all observers at 445 s. The algorithms also tend to shorten the burst durations, compared to the observer annotations, and the fixed threshold method inserts interburst periods (458 s, 487 s) within regions which are marked as continuing bursts by all observers and the variable threshold algorithm. It is expected that burst/interburst patterns may be clearer in younger preterms, therefore we illustrate the division at 27 weeks gestation on Fig. 5. As described in the Results section, however, this difference is statistically significant for only observer 2/algorithm and observer 3/algorithm. This lack of statistical significance may be related to the smaller size of the groups when divided by age in this way. In Fig. 6 the interobserver agreement and algorithm/observer agreement figures are divided into sensitivity (agreement over burst periods) and specificity (agreement over interburst periods). The general distribution of red and grey markers in this plot shows that there is no obvious tendency for the algorithm to either oversegment (detect too much burst activity) or under-segment (detect too little burst activity) in comparison with the observers. At the top-left of the graph, the two worst examples of under-segmentation (low sensitivity, high specificity) are found to be algorithm/ observer comparisons. Both of these under-segmentation examples come from baby ID 28 (28 w + 2 d) whose EEG showed long IBIs. Fig. 3 shows that once again, this is an example of a case where the variable threshold algorithm outperformed the fixed threshold algorithm in most instances (except in the comparison with observer 3). Fig. 8 shows a typical section from this EEG where the fixed algorithm under-segments the bursts, particularly in comparison with observers 1 and 2. The comparisons in Fig. 6 where specificity is very low, with a very high sensitivity

1699

(bottom-right of graph, indicating over-segmentation) are all found to be interobserver. The four worst examples of this all come from different subjects with no apparent common reason why the observers disagree on those cases. 4.3. Feature analysis Table 3 provides a detailed analysis of the correlation between the features calculated and the gestational age of the subjects. As described in Section 3.3 a number of features were shown to be significantly correlated with gestational age over the entire 10 min epoch, over the burst periods and over the interburst periods. It is extremely important to show, for the first time, that there are signal features which can be automatically calculated that have significant correlation with the gestational age of the preterm infant. Although EEG features of brain maturity are analysed in Holthausen et al. (2000) the infants included were both preterm and term and EEG was recorded at post-conceptional ages (PCA) from 28 up to 112 weeks. The features calculated do not show distinct levels of brain maturity within the group of infants who are less than 35 weeks PCA. We anticipate that the features presented here could be used to assess brain maturity in preterm infants in a fully independent and automatic manner. Furthermore these features have shown significant correlation with age even in a recording of just 10 min duration. It is important to be able to use short duration recordings as the preterm EEG is subject to sudden change, particularly if the baby has other clinical problems such as hypertension or apnoea for example. Other features such as burst-length, for example, may also be considered in future work to determine whether these correlate with brain maturity in preterm EEG. It was noted that the average length of the interburst periods (feature IBI) was not significantly correlated with age in most cases, in spite of the fact that this is the feature most commonly examined by neurophysiologists in assessing preterm EEG. In separate experiments the maximum or 90th percentile value of the IBI was also found to have no significant correlation with age. However, it must be observed that this is a 10 min recording which is exceptionally short and would not be considered sufficient for a neurophysiologist to make a reliable estimate of IBI length. In addition, the selection of the 10 min epoch used was based on it being artifact free, and showing a clear example of a

Fig. 7. EEG example from baby ID 26 (28 w + 1 d), where algorithm/observer agreement is lower than interobserver agreement in all instances. This example shows that the fixed threshold method in particular picks up less burst activity than the observers, marking regions with relatively low amplitude activity as interburst.

1700

K. Murphy et al. / Clinical Neurophysiology 126 (2015) 1692–1702

Fig. 8. EEG example from baby ID 28 (28 w + 2 d), where the fixed threshold algorithm under-segments burst activity.

burst/interburst pattern if possible. This may have lead to the selection of epochs where the interburst periods were very clear but perhaps not representative of the interburst periods in the total recording. In future work the feature analysis should be performed on longer data sections and preferably also on an independent dataset in order to verify the features which correlate well with age and to determine how significant the correlation with IBI length is in longer recordings. In addition, future work will incorporate correlation of features with long term outcomes, in the form of neurodevelopmental tests at 2 years of age. The end goal of our research is to produce a fully automatic system which can analyse preterm EEG, detect burst activity (with the accuracy that would be expected of a human observer), and calculate features of the signal which will provide a measure of brain maturity. Although we have taken the first crucial steps towards this goal, we acknowledge that there is much work to be done before this system can operate in the absence of any human intervention. The automatic algorithm must be further tested (and altered if necessary) on larger (and independent/unseen) datasets with much longer epochs of EEG which represent various changes in infant alertness, medication etc. An artifact removal system would need to be employed to allow the algorithm to work fully automatically on unseen data which may contain artifacts of various types. We recommend that during this development and testing phase the data is always annotated by at least two expert observers, so that we may continue to improve our knowledge of how expert observers differ, and set a realistic benchmark for the algorithm performance. In later phases of development the level of human intervention may be reduced, for example, by having an expert oberver simply mark ‘agree’ or ‘disagree’ with each burst event that is marked by the automatic system. This provides reassurance that the system is still functioning well while reducing the workload for expert observers. Only when the algorithm has been throroughly proven might the automatic system be allowed to work alone without any human intervention. Such rigorous testing, with more data and longer epochs, is also needed to verify whether features such as SYM (correlation seen only for observers 1 and 2 in this dataset) truly are related to gestational age. The difficulty of working in a field where interobserver differences are naturally relatively high is that the real ‘truth’ is difficult to determine. We propose that only features which have proven correlation, on a large dataset, for all observers/algorithm would be considered for inclusion in any system which might be developed to assist clinicians on future unseen data.

4.4. Conclusion We have shown that there is considerable variability between observers when analysing preterm EEG. However, automatic algorithms perform just as well, if not better than, different human observers, indicating a strong need to pursue this area of research. The preterm EEG is a complicated signal that contains very useful clinical information. Unfortunately it is very difficult for most clinicians to interpret. Automatic analysis methods will certainly help in this endeavour and detailed feature analysis is likely to uncover much previously unrecognised useful information that could be used for enhanced cerebral monitoring in this very vulnerable group. Acknowledgements This research was supported by a Science Foundation Ireland Research Centre Award (INFANT – 12/RC/2272). Conflict of interest: None of the authors have potential conflicts of interest to be disclosed. Appendix A. Burst detection Two methods to automatically detect burst activity on multi-channel EEG were developed and compared. The aim of automatic burst detection is to provide a fast, objective and quantitative analysis of the EEG recording. In this appendix the technical details of the methods developed are described. The values for all parameters mentioned are listed in Table A.1. A.1. Per-channel processing Each of the EEG channels, i; i ¼ 1 . . . 8, was first processed individually as described below. This initial per-channel process was based on the work of Palmu et al. (2010) which detected burst activity in single-channel EEG. The per-channel process is illustrated in Fig. A.1. The parameter values used (see Table A.1) are the same as those optimised in that work. –The channel was filtered using a Kaiser bandpass filter passing frequencies between k1 and k2 Hz with a stop-band attenuation of k3 dB (Fig. A.1(2)). The resulting filter order korder was 3.6 s. –Since the application of the Kaiser window filter causes the data to shift by korder =2 s to the right, the resulting filtered data was next shifted back to the original time position (Fig. A.1(3)).

1701

K. Murphy et al. / Clinical Neurophysiology 126 (2015) 1692–1702

This ensured that the data was correctly aligned with the expert annotations which were previously stored. –The non-linear energy operator (NLEO) was applied to the data (Fig. A.1(4)). The NLEO of a signal xðtÞ is defined as

NLEOðxðtÞÞ ¼ xðt  lÞxðt  pÞ  xðt  qÞxðt  sÞ

ð1Þ

This operator was first introduced in a general form in Plotkin EI (1992) and has been used in several EEG processing applications since then – e.g. (Agarwal et al., 1998; Särkelä et al., 2002; Deburchgraeve et al., 2008; Palmu et al., 2010). The output from the NLEO is sensitive to both frequency and amplitude of the original signal and as such is well suited to distinguishing activity bursts from IBIs. –The output of the NLEO was smoothed using a sliding average window with width w. (Fig. A.1(5)). –A baseline artifact correction was performed to remove the effect from any continuous artifact. This produced the final th processed signal, si ðtÞ. for the i channel (Fig. A.1(6)). Typically si ðtÞ is elevated during an activity burst and suppressed otherwise.

A.2. Multi-channel processing The NLEO signals si ðtÞ from each individual channel were now required to be combined in some way in order to produce a final burst/ interburst classification for each time point . In Palmu et al. (2010) a fixed threshold value for the processed signal s(t) (from single-channel EEG) was chosen in order to distinguish activity bursts from IBIs. The idea of a fixed threshold is intuitive and typical in many automatic and visual EEG processing methods. However, it is questionable whether a single threshold is suitable for many different subjects, particularly with different gestational ages and brain maturities. In multi-channel EEG a single threshold may not even be suited to all individual channels in a particular subject. It was therefore decided to experiment with two different methods to obtain a final burst/interburst classification. The first method, referred to hereafter as the ‘fixed threshold method’ emulated that used by Palmu et al. (2010). The second method, the ’variable threshold method’ used thresholds derived from the NLEO signals for each individual channel. Fixed Threshold Method: The ‘fixed threshold’ method emulated that used by Palmu et al. (2010). A single fixed threshold, thfixed , was applied to the NLEO signal si ðtÞ in every channel. Each channel was processed individually. As in Palmu et al. (2010), if si ðtÞ exceeded thfixed for longer than a minimum time duration tdur , then a period of burst activity was marked. The burst activity started at the point where si ðtÞ rose above thfixed and ended when si ðtÞ dropped below thfixed again. The output for a single channel is shown in Fig. A.1 (8). To combine the data from the eight channels and obtain an overall classification for the EEG, each time-point, t, was then considered individually. If two or more channels were marked as having burst activity at that time point then the final classification for that time point was of burst. Otherwise the time-point was classified as interburst. The cut off of two channels was selected on the advice of an experienced neonatal neurophysiologist. Variable Threshold Method: The NLEO signal si ðtÞ for each channel was used to obtain two threshold values for that channel. Each one was a percentile value, based on all the values from the full 10 min NLEO signal si ðtÞ. The higher percentile for the

Table A.1 Parameters used in the burst detection method. freq refers to the sampling frequency of the signal. Where a range of values are specified the results for different experiments are shown in the results section. Name

Description

Value

k1 k2 k3 l p q s w thfixed perc ui perc li tdur

Minimum pass frequency in Kaiser filter Maximum pass frequency in Kaiser filter Stop band attenuation in Kaiser filter Parameter for NLEO calculation Parameter for NLEO calculation Parameter for NLEO calculation Parameter for NLEO calculation Window width for smoothing NLEO signal Fixed threshold Upper percentile (threshold) Upper percentile (threshold) Minimum Duration for classification

0.5 Hz 10 Hz 60 dB 0 3  ðfreq=256Þ 1  ðfreq=256Þ 2  ðfreq=256Þ 1.5 s 0.5–2.5 60th–80th 40th–60th 1.5 s

100 µv 1 sec

1 2 3

4 5 6 7 8

95

100

105

110

115

120

125

130

135

140

Time (seconds)

Fig. A.1. An illustration of the processing steps performed on each channel i of EEG. (See Section A.1.) (1) The original EEG data xi ðtÞ. This section starts 95 s (baby ID 1, channel C3–O1) into the 10 min recording and lasts for 50 s. (2) The filtered data with the resulting phase shift. (3) The filtered data with the phase shift corrected. (4) The NLEO of the signal in (3). (5) The smoothed NLEO signal. (6) The smoothed NLEO si ðtÞ after baseline correction. (7)Peaks indicate where si ðtÞ is above 75th percentile p75 ðiÞ (upper line) and above 50th percentile p50 ðiÞ (lower line) (percentile values calculated over full 10 min of si ðtÞ). (8) Peaks indicate where si ðtÞ rises above a fixed threshold of 1.0 for longer than 1.5 s.

1702

K. Murphy et al. / Clinical Neurophysiology 126 (2015) 1692–1702

ith channel is referred to as perc ui and the lower as perc li . These thresholds vary depending on the amplitudes observed in si ðtÞ and are specific to the data of an individual channel for an individual subject. The output for a single channel using perc ui and perc li as thresholds is shown in Fig. A.1 (7). Every time point, t, in the EEG data was then considered individually. The time point was marked as being part of an activity burst if –si ðtÞ > perc ui in two or more of the channels or if –si ðtÞ > perc li in four or more of the channels As a post-processing step, interburst intervals that were less than tdur long were re-classified as bursts and subsequently any bursts less than tdur were re-classified as interburst intervals. Both methods of burst activity detection (fixed threshold and variable threshold) were applied to all 36 subjects. The threshold parameters thfixed ; perc ui and perc li were varied (see Table A.1) to determine which values gave optimal performance on our data. (Note that for an application to perform well on previously unseen data, parameters should best be determined on a large independent dataset with similar characteristics.) The results of these experiments can be seen in Section 3.1. References Agarwal R, Gotman J, Flanagan D, Rosenblatt B. Automatic EEG analysis during longterm monitoring in the ICU. Electroencephalogr Clin Neurophysiol 1998;107(1):44–58. Biagioni E, Bartalena L, Biver P, Pieri R, Cioni G. Electroencephalographic dysmaturity in preterm infants: a prognostic tool in the early postnatal period. Neuropediatrics 1996;27(6):311–6. Biagioni E, Bartalena L, Boldrini A, Cioni G, Giancola S, Ipata A. Background EEG activity in preterm infants: correlation of outcome with selected maturational features. Electroencephalogr Clin Neurophysiol 1994;91(3):154–62. Costeloe KL, Hennessy EM, Haider S, Stacey F, Marlow N, Draper ES. Short term outcomes after extreme preterm birth in England: comparison of two birth cohorts in 1995 and 2006 (the EPICure studies). Brit Med J 2012;345:e7976. Deburchgraeve W, Cherian PJ, De Vos M, Swarte RM, Blok JH, Visser GH, et al. Automated neonatal seizure detection mimicking a human observer reading EEG. Clin Neurophysiol 2008;119(11):2447–54. Hahn JS, Monyer H, Tharp BR. Interburst interval measurements in the EEGs of premature infants with normal neurological outcome. Electroencephalogr Clin Neurophysiol 1989;73(5):410–8. Hayakawa M, Okumura A, Hayakawa F, Watanabe K, Ohshiro M, Kato Y, et al. Background electroencephalographic (EEG) activities of very preterm infants

born at less than 27 weeks gestation: a study on the degree of continuity. Arch Dis child Fetal Neonatal Ed 2001;84(3):F163–7. Hayashi-Kurahashi N, Kidokoro H, Kubota T, Maruyama K, Kato Y, Kato T, et al. EEG for predicting early neurodevelopment in preterm infants: an observational cohort study. Pediatrics 2012;130(4):e891–7. Higuchi T. Approach to an irregular time series on the basis of the fractal theory. Phys D Nonlinear Phenomena 1988;31(2):277–83. Holmes GL, Lombroso CT. Prognostic value of background patterns in the neonatal EEG. J Clin Neurophysiol 1993;10(3):323–52. Holthausen K, Breidbach O, Scheidt B, Frenzel J. Brain dysmaturity index for automatic detection of high-risk infants. Pediatric Neurol 2000;22(3):187–91. Jennekens W, Ruijs LS, Lommen CML, Niemarkt HJ, Pasman JW, van KranenMastenbroek VHJM, et al. Automatic burst detection for the EEG of the preterm infant. Physiol Measur 2011;32(10):1623–37. Koolen N, Jansen K, Vervisch J, Matic V, De Vos M, Naulaers G, et al. Line length as a robust method to detect high-activity events: Automated burst detection in premature EEG recordings. Clin Neurophysiol 2014;125(10):1985–94. Landis J, Koch G. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–74. LeBihannic A, Beauvais K, Busnel A, de Barace C, Furby A. Prognostic value of EEG in very premature newborns. Arch Dis Child Fetal Neonatal Ed 2012;97(2):F106–9. Marret S, Parain D, Ménard J-F, Blanc T, Devaux A-M, Ensel P, et al. Prognostic value of neonatal electroencephalography in premature newborns less than 33 weeks of gestational age. Electroencephalogr Clin Neurophysiol 1997;102(3):178–85. Palmu K, Stevenson N, Wikström S, Hellström-Westas L, Vanhatalo S, Palva JM. Optimization of an NLEO-based algorithm for automated detection of spontaneous activity transients in early preterm EEG. Physiol Measur 2010;31(11):N85–93. Plotkin EI, S.M. (1992). Nonlinear signal processing based on parameter invariant moving average modelling. In IEEE Canadian conference on electrical and computer engineering, p. TM3.11.1–4. Särkelä M, Mustola S, Seppänen T, Koskinen M, Lepola P, Suominen K, et al. Automatic analysis and monitoring of burst suppression in anesthesia. J Clin Monit Comput 2002;17(2):125–34. Schumacher EM, Westvik AS, Larsson PlG, Lindemann R, Westvik J, Stiris TA. Feasibility of long-term continuous EEG monitoring during the first days of life in preterm infants: an automated quantification of the EEG activity. Pediat Res 2011;69(5 Pt 1):413–7. Selton D, Andre M, Hascoët JM. Normal EEG in very premature infants: reference criteria. Clin Neurophysiol 2000;111(12):2116–24. van Putten MJAM. The revised brain symmetry index. Clin Neurophysiol 2007;118(11):2362–7. Vecchierini M-F, André M, D’Allest AM. Normal EEG of premature infants born between 24 and 30 weeks gestational age: terminology, definitions and maturation aspects. Clin Neurophysiol 2007;37(5):311–23. Vecchierini M-F, D’Allest A-M, Verpillat P. EEG patterns in 10 extreme premature neonates with normal neurological outcome: qualitative and quantitative data. Brain Dev 2003;25(5):330–7. Viertiö-Oja H, Maja V, Särkelä M, Talja P, Tenkanen N, Tolvanen-Laakso H, et al. Description of the Entropy algorithm as applied in the Datex-Ohmeda S/5 Entropy Module. Acta Anaesthesiol Scand 2004;48(2):154–61.

Automated analysis of multi-channel EEG in preterm infants.

To develop and validate two automatic methods for the detection of burst and interburst periods in preterm eight-channel electroencephalographs (EEG)...
1MB Sizes 1 Downloads 6 Views