JCM Accepted Manuscript Posted Online 22 March 2017 J. Clin. Microbiol. doi:10.1128/JCM.00162-17 Copyright © 2017 American Society for Microbiology. All Rights Reserved.
Robust Microbiota-Based Diagnostics for Inflammatory Bowel Disease
A. Eck1,*, E.F.J. de Groot2, T.G.J. de Meij3, M. Welling4,5, P.H.M Savelkoul1,6, A.E. Budding1
1Department of Medical Microbiology and Infection Control, VU University medical center, Amsterdam, The Netherlands; 2Department of Gastroenterology and Hepatology, VU University medical center, Amsterdam, The Netherlands; 3Department of Pediatric Gastroenterology, VU University medical center, Amsterdam, The Netherlands; 4Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands; 5Canadian Institute for Advanced Research, Toronto, Canada; 6Department of Medical Microbiology, Maastricht University medical center, Maastricht, The Netherlands.
Running Head: Robust Microbiota-Based IBD Diagnostics
*Address correspondence to A. Eck,
[email protected]
Downloaded from http://jcm.asm.org/ on March 24, 2017 by UNIV OF CALIF SAN DIEGO
Abstract
Strong evidence suggests that the gut microbiota is altered in inflammatory bowel disease (IBD), indicating its potential role in non-invasive diagnostics. However, no clinical applications are currently used for routine patient care. The main obstacle to implementing a gut microbiota test for IBD is the lack of standardization, which leads to high inter-laboratory variation. We studied between-hospital and between-platform batch effects and their effect on predictive accuracy for IBD.

Fecal samples from 91 pediatric IBD patients and 58 healthy children were collected. IS-pro, a standardized technique designed for routine microbiota profiling in clinical settings, was used to characterize microbiota composition. Additionally, a large synthetic dataset was used to simulate various perturbations and study their effect on the accuracy of different classifiers. Perturbations were validated in two replicate datasets: one processed in another laboratory and the other analyzed on a different platform.

The type of perturbation determined its effect on predictive accuracy. Real-life perturbations induced by between-platform variation were significantly larger than those caused by between-laboratory variation. Random forest was robust to both simulated and observed perturbations, even when these perturbations had a dramatic effect on other classifiers. It achieved high accuracy both when cross-validated within the same dataset and when trained and tested on datasets analyzed in different laboratories.

Robust clinical predictions based on the gut microbiota can be made even when samples are processed in different hospitals. This study contributes to the efforts towards development of a universal IBD test that would enable simple diagnostics and disease activity monitoring.
Introduction
42
Inflammatory bowel disease (IBD), which encompasses ulcerative colitis (UC) and Crohn’s
43
disease (CD), is characterized by chronic inflammation of the gastrointestinal (GI) tract. IBD is a
44
multifactorial disease attributed to a combination of genetic predisposition and environmental
45
risk factors (1-3). The apparent increase of some of these risk factors is thought to have led to the
46
emergence of IBD as a global disease with increasing incidence and prevalence (4). Among the
47
overall newly diagnosed cases, pediatric patients account for 20% to 25%, underlining the
48
importance of IBD diagnosis by pediatricians (5). This diagnosis is complex, requiring a
49
combination of clinical, biochemical, radiological, endoscopic, and histological tests (6), which
50
often leads to a delayed diagnosis that may result in reduced response to medical therapy and
51
higher incidence of surgical intervention (7). However, there is another emerging indicator of
52
IBD, which can be obtained quickly and easily: the intestinal microbiota.
53
Strong evidence links the microbial colonization of the gut to the development of IBD,
54
suggesting that intestinal inflammation results from the interaction of the gut microbiota with the
55
mucosal immune compartments (8-10). Substantial alterations in the composition and decreased
56
complexity of the gut microbial ecosystem are common features in IBD patients (11). The utility
57
of microbial signatures for prediction of clinical outcomes was recently demonstrated in pediatric
58
IBD patients (12, 13) and, more specifically, in CD patients (14) . As a non-invasive, cost-
59
effective technique, microbiota-based diagnostics hold a great potential for prevention of
60
exacerbations and early stage disease detection. However, despite the recent developments in
61
molecular methods that allow circumventing the inability to cultivate most gut microbiota
62
species, microbiota-based diagnostics are not applied in clinical practice yet.
One of the main obstacles to dissemination of this application is the lack of reproducibility of microbiota signatures between hospitals and laboratories (15, 16). Advanced molecular techniques commonly used for microbiota analysis, such as next-generation sequencing, have been shown to be highly sensitive to every step in the analysis, from sample collection, through storage conditions, to molecular procedures such as DNA isolation, primer selection, PCR settings, and choice of sequencing platform (15-18). Since these techniques are currently still difficult to implement in clinical practice in a cost- and time-efficient way, a standardized, reproducible, and easily applied technique is needed for microbiota analysis to become part of routine diagnostics.

An alternative to sequencing is the IS-pro technique, which allows detection of bacterial DNA and its taxonomic classification down to species level based on the length of the 16S-23S ribosomal interspace fragment (19). Carrying out the IS-pro procedure requires infrastructure and skills that are commonly found in every diagnostic laboratory. Moreover, the procedure is suited for routine diagnostic use, and results are available within hours. Over recent years, IS-pro has been extensively optimized and validated, and has become an effective method for standardized gut microbiota profiling (20, 21). While IS-pro profiles have already been shown to be effective clinical predictors (22), for universal application in routine diagnostics, clinical predictions need to be robust to the variation introduced by differing laboratories, hospitals, technicians, and equipment.
In this study we investigated the between-hospital reproducibility of IS-pro data and the effect of different perturbations on the outcome of clinical prediction models. We formulated the most probable perturbations, simulated them in the microbial profiles of a pediatric IBD cohort, and assessed their effect on predictive accuracy. Furthermore, we compared this effect across different classifiers to identify the classification algorithm that is least sensitive to potential perturbations and thus optimal for practical application. To assess the extent of perturbations in real life, we analyzed replicate datasets in two independent laboratories and estimated the predictive accuracy when a model was trained on one replicate and tested on the other, as would occur in clinical practice. Based on our simulations and validation in practice, we identified which perturbations should be avoided and which classifiers are most robust. Establishing a reproducible tool for microbiota-based clinical predictions is a crucial step towards a universal diagnostic application.
Results
Datasets assessment

This study was based on a cohort of 149 subjects, of which 91 were children diagnosed with IBD (53 CD and 38 UC patients) and 58 were healthy children (denoted A1). Two replicate datasets were generated based on subsets of this dataset: A2, consisting of 116 samples, of which 70 were IBD samples (28 CD and 42 UC patients) and the rest were controls, and M2, consisting of 30 samples, of which 20 were IBD samples (11 CD and 9 UC patients) and the rest were controls. A2, which was analyzed in the same laboratory as the original dataset but on a different analysis platform, served as a platform replicate, whereas M2, which was analyzed in a different laboratory but on the same platform as A2, served as a hospital replicate (Fig. 1). Between-hospital replicate sample pairs clustered together (Fig. 2A) and were highly correlated (n=28, median R²=0.89 [IQR=0.3]). The variation introduced by the analysis platform was larger than that introduced by the hospital, as shown in Fig. 2B.
To improve the quality of the prediction models and gain better predictive performance, the original dataset was extended with synthetic samples. This allowed us to avoid overfitting, which is one of the main concerns when fitting a high-dimensional model with only a small number of samples. We generated a complementary synthetic dataset (denoted "SIM"; for details, see Methods), consisting of 10,000 samples in total and balanced between the two classes, IBD and healthy. The original and synthetic datasets are displayed in Fig. 3A, demonstrating segregation between the two classes, while synthetic samples group together with the original samples of the same class.

To validate the synthetic dataset, we compared the distribution of each peak in the original dataset to that in the synthetic dataset. All resulting p-values were 1 following false discovery rate correction, meaning that the synthetic peak distributions were not statistically different from those of the original data. We also calculated the concordance per class between the average original profile and the average synthetic profile using Kendall's coefficient of concordance. High concordance values were obtained for both classes (0.9977 and 0.9951 for the healthy and IBD classes, respectively), meaning that the synthetic profiles and the original profiles generally agree. Based on this assessment, the synthetic dataset was regarded as valid for further experiments.
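The paper does not give its implementation of Kendall's W; as a sketch, the coefficient for two averaged profiles over n peaks can be computed from midranks as below. The toy intensity vectors are invented for illustration and are not the study's data.

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(profiles):
    """Kendall's coefficient of concordance W for m profiles over n peaks.

    W = 1 means all profiles rank the peaks identically; W = 0 means no
    agreement. Ties get midranks; no tie correction is applied in this sketch.
    """
    ranks = np.array([rankdata(p) for p in profiles])  # shape (m, n)
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)                      # R_i per peak
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()    # squared deviations of R_i
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# Toy check: two profiles with identical peak ordering give W = 1.
avg_original = np.array([5.0, 3.0, 1.0, 0.5])
avg_synthetic = np.array([50.0, 30.0, 10.0, 5.0])
print(kendalls_w([avg_original, avg_synthetic]))  # → 1.0
```

Because W depends only on the peak rankings, it measures whether the synthetic profile preserves the relative ordering of peaks, not their absolute intensities.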
The predictive accuracy of the original dataset was assessed using seven different classifiers, which obtained similar performance, with area under the curve (AUC) ranging from 0.85 for L2-regularized logistic regression to 0.92 for nearest shrunken centroids (NSC) (Fig. 3B). Specifically, both IBD subtypes could be discriminated from healthy controls with high accuracy. UC classification achieved a maximal AUC of 0.96 (for NSC, random forest (RF), and L2-regularized logistic regression), and CD classification achieved a maximal AUC of 0.87 (for NSC and RF). UC and CD were taken together as a single IBD group, since both presented similar variation (see below for details).

We used a learning curve of increasing training set sizes to assess the predictability of the synthetic data and to determine the optimal training set size to sample. The accuracy of all classifiers increased as more samples were considered for training (Fig. 3C). Based on the moderating accuracy improvement observed towards the larger training sets, we determined that 400 training samples would be a valid size for subsequent simulations. For this sample size, the accuracy was also comparable to that achieved by cross-validation of the original dataset.
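A learning curve of this kind can be sketched with scikit-learn. Here `make_classification` merely stands in for the study's SIM generator, and the classifier, feature count, and candidate training sizes are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Placeholder for the balanced synthetic SIM dataset.
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

# Cross-validated AUC at increasing training set sizes; one looks for the
# point where the curve flattens (around 400 samples in the study).
sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y,
    train_sizes=[50, 100, 200, 400],
    cv=5, scoring="roc_auc", n_jobs=-1,
)
for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(f"n={n}: mean AUC={score:.3f}")
```

Choosing the training set size at the knee of the curve keeps simulations cheap while sacrificing little accuracy, which matches the rationale given for fixing 400 samples.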
Probing variation boundaries
This section of the study consists of synthetic data experiments aiming to explore the effect of different sources of variation on predictive accuracy. We estimated the effect of five simulated perturbations on the accuracy of seven classifiers, and identified which classifiers were more robust and therefore may outperform others when applied in diagnostics.

Single perturbations

Each sample is represented as a microbial profile (Fig. 4). We formulated the perturbations that are most likely to occur between profiles processed in different runs (Table 1). Each of these was simulated with increasing extent as a variation of the SIM dataset (see Methods). The following procedure was repeated 100 times: a balanced training set (400 samples) was subsampled from the SIM dataset, and the rest of the samples were perturbed and used as a test set. With this setting, we assumed that a clinical prediction model would be trained on a reference dataset but used to predict data generated in various diagnostic centers (therefore incorporating different sources of variation).
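The exact perturbation procedures of Table 1 are defined in the Methods; a minimal sketch of how peak shifts, intensity noise, and low-intensity peak deletion might be applied to a single profile is shown below. The parameterizations are our own assumptions, not the study's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_profile(profile, n_shift=0, noise_sd=0.0, n_delete=0):
    """Apply illustrative versions of the perturbations to one profile.

    profile:  1-D array of peak intensities (0 = peak absent).
    n_shift:  peaks moved one position over (a sketch of P#2, peak shift).
    noise_sd: multiplicative log-normal intensity noise (a sketch of P#3).
    n_delete: lowest-intensity peaks set to zero (a sketch of P#4).
    """
    p = profile.astype(float).copy()
    if n_shift > 0:
        present = np.flatnonzero(p > 0)
        for i in rng.choice(present, size=min(n_shift, present.size), replace=False):
            j = min(i + 1, p.size - 1)   # shift one bin (may overwrite a neighbor)
            p[j], p[i] = p[i], 0.0
    if noise_sd > 0:
        mask = p > 0
        p[mask] *= rng.lognormal(0.0, noise_sd, size=mask.sum())
    if n_delete > 0:
        present = np.flatnonzero(p > 0)
        weakest = present[np.argsort(p[present])[:n_delete]]
        p[weakest] = 0.0
    return p

profile = np.array([0.0, 8.0, 0.0, 3.0, 1.0, 6.0])
print(perturb_profile(profile, n_delete=2))  # → [0. 8. 0. 0. 0. 6.]
```

Applying such a function to every test sample, with increasing `n_shift`, `noise_sd`, or `n_delete`, reproduces the train-on-clean, test-on-perturbed setting described above.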
Fig. 5 displays how predictions were affected by each individual perturbation. An intensity cutoff (P#1) did not have any effect on predictive accuracy across all classifiers (Fig. 5A). Peak shifts (P#2) and addition of intensity noise (P#3) led to a gradual decrease in accuracy as more peaks were perturbed (Fig. 5B, 5C). Still, the most pronounced effect was caused by peak deletion. When deleted peaks were mainly low-intensity peaks (P#4), we observed a substantial decrease in predictive accuracy for all classifiers except RF (Fig. 5D). We further noted that deletion of 4 to 7 peaks led to substantial variation between simulations, suggesting that it matters which particular peaks are deleted. With more peaks deleted, the accuracy converged and the variation between simulations was reduced.

Random forest (RF) stood out with a distinct response to the simulated perturbations compared to the rest of the classifiers: it remained stable under deletion of low-intensity peaks, while deletion of even a limited number of peaks led to a considerable decrease in accuracy for the other classifiers (Fig. 5D). Even though RF was affected when dominant peaks were deleted (P#5), its accuracy declined more slowly than that of the other classifiers (Fig. 5E). On the other hand, the drop in its accuracy was more pronounced in response to the intensity noise simulation than for the other classifiers (Fig. 5C).
Mixed perturbations
We further explored the effect of mixed perturbations on predictive accuracy using 100 variations of the SIM dataset, each combining a random extent of every perturbation (excluding P#5, as this perturbation is unlikely to occur in real data). Ten balanced training sets of 400 samples each were subsampled from every dataset, while the rest of the samples were used as a test set. For each dataset, we calculated the average cosine distance between the perturbed version of each sample and its unperturbed origin. Increased prediction error was observed as perturbed samples became less similar to their corresponding original samples, reflected by higher mean cosine dissimilarity (Fig. S1). We identified several outlier datasets, characterized by relatively high error rates for samples that were on average still quite similar to their origin (reflected by low mean cosine dissimilarity). Further investigation revealed that these outliers corresponded to datasets in which peak deletion (P#4) affected more peaks, which led to a larger effect on prediction error. This is demonstrated in Fig. S1, where marker size corresponds to the average number of deleted peaks and the outlier datasets are encircled. The remaining perturbations did not exhibit a similar relationship. RF, which proved robust to peak deletion, did not have any outlier datasets.
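The mean cosine dissimilarity used above can be computed directly with SciPy; `scipy.spatial.distance.cosine` returns 1 minus the cosine similarity. The two toy profile pairs below are invented for illustration.

```python
import numpy as np
from scipy.spatial.distance import cosine

# Cosine dissimilarity between each perturbed profile and its unperturbed
# origin, averaged over the dataset (toy peak intensities).
originals = np.array([[8.0, 3.0, 1.0, 6.0],
                      [2.0, 7.0, 0.0, 4.0]])
perturbed = np.array([[8.0, 0.0, 1.0, 6.0],   # one peak deleted
                      [2.0, 7.0, 0.0, 4.0]])  # unchanged replicate

mean_dissim = np.mean([cosine(o, p) for o, p in zip(originals, perturbed)])
print(round(mean_dissim, 4))
```

A value of 0 means the perturbed samples are directionally identical to their origins; larger values track the overall perturbation strength, which is the x-axis quantity in Fig. S1.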
Between-hospital and between-platform variation
Following our simulations, we investigated batch effects on predictive accuracy using replicate datasets, re-processed and analyzed in two independent hospitals and on two different analysis platforms.
Perturbations assessment
Variation was estimated based on the perturbations defined in Table 1 (P#2, P#3, and P#4) using pairwise comparisons of replicates, and compared to the calculated simulated values (gray bars in Fig. 5B, 5C, and 5D). The between-platform batch effect (light gray bars in Fig. 5) appeared to be in the range where predictive accuracy is diminished for all three perturbations. The total number of peaks detected in a reference profile but missing from its corresponding replicate is shown in Fig. 6A, separated according to the type of perturbation. On average, more than half of the peaks (65%) in each pairwise comparison were missing, of which 35% were shifted peaks and 30% were deleted peaks.

The effect of between-hospital variation, however, depended on the type of perturbation: the number of shifted peaks (P#2) and the difference in measured intensity (P#3) were in the range where they had only a minor effect on predictive accuracy based on our simulation, but the number of deleted peaks (P#4) was in the range where predictive accuracy decreased for all classifiers except RF (dark gray bars in Fig. 5). However, the proportion of missing peaks per pairwise comparison was significantly lower than that observed for between-platform replicates (Fig. 6A, p