JCM Accepted Manuscript Posted Online 22 March 2017 J. Clin. Microbiol. doi:10.1128/JCM.00162-17 Copyright © 2017 American Society for Microbiology. All Rights Reserved.


Robust Microbiota-Based Diagnostics for Inflammatory Bowel Disease


A. Eck1,*, E.F.J. de Groot2, T.G.J. de Meij3, M. Welling4,5, P.H.M. Savelkoul1,6, A.E. Budding1

1Department of Medical Microbiology and Infection Control, VU University medical center, Amsterdam, The Netherlands; 2Department of Gastroenterology and Hepatology, VU University medical center, Amsterdam, The Netherlands; 3Department of Pediatric Gastroenterology, VU University medical center, Amsterdam, The Netherlands; 4Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands; 5Canadian Institute for Advanced Research, Toronto, Canada; 6Department of Medical Microbiology, Maastricht University medical center, Maastricht, The Netherlands.


Running Head: Robust Microbiota-Based IBD Diagnostics


*Address correspondence to A. Eck, [email protected]


Abstract

Strong evidence suggests that the gut microbiota is altered in inflammatory bowel disease (IBD), indicating its potential role in non-invasive diagnostics. However, no clinical application is currently in routine use for patient care. The main obstacle to implementing a gut microbiota test for IBD is the lack of standardization, which leads to high inter-laboratory variation. We studied between-hospital and between-platform batch effects and their impact on predictive accuracy for IBD.

Fecal samples from 91 pediatric IBD patients and 58 healthy children were collected. IS-pro, a standardized technique designed for routine microbiota profiling in clinical settings, was used to characterize microbiota composition. Additionally, a large synthetic dataset was used to simulate various perturbations and study their effect on the accuracy of different classifiers. Perturbations were validated in two replicate datasets: one processed in another laboratory and the other analyzed on a different platform.

The type of perturbation determined its effect on predictive accuracy. Real-life perturbations induced by between-platform variation were significantly larger than those caused by between-laboratory variation. Random forest was robust to both simulated and observed perturbations, even when these perturbations had a dramatic effect on other classifiers. It achieved high accuracy both when cross-validated within the same dataset and when trained and tested on datasets analyzed in different laboratories.

Robust clinical predictions based on the gut microbiota can be made even when samples are processed in different hospitals. This study contributes to the efforts towards development of a universal IBD test that would enable simple diagnostics and disease activity monitoring.

Introduction

Inflammatory bowel disease (IBD), which encompasses ulcerative colitis (UC) and Crohn’s disease (CD), is characterized by chronic inflammation of the gastrointestinal (GI) tract. IBD is a multifactorial disease attributed to a combination of genetic predisposition and environmental risk factors (1-3). The apparent increase of some of these risk factors is thought to have led to the emergence of IBD as a global disease with increasing incidence and prevalence (4). Pediatric patients account for 20% to 25% of all newly diagnosed cases, underlining the importance of IBD diagnosis by pediatricians (5). This diagnosis is complex, requiring a combination of clinical, biochemical, radiological, endoscopic, and histological tests (6), which often leads to a delayed diagnosis that may result in reduced response to medical therapy and a higher incidence of surgical intervention (7). However, there is another emerging indicator of IBD, one which can be obtained quickly and easily: the intestinal microbiota.

Strong evidence links the microbial colonization of the gut to the development of IBD, suggesting that intestinal inflammation results from the interaction of the gut microbiota with the mucosal immune compartments (8-10). Substantial alterations in the composition and decreased complexity of the gut microbial ecosystem are common features in IBD patients (11). The utility of microbial signatures for prediction of clinical outcomes was recently demonstrated in pediatric IBD patients (12, 13) and, more specifically, in CD patients (14). As a non-invasive, cost-effective technique, microbiota-based diagnostics hold great potential for prevention of exacerbations and early-stage disease detection. However, despite recent developments in molecular methods that circumvent the inability to cultivate most gut microbiota species, microbiota-based diagnostics are not yet applied in clinical practice.

One of the main obstacles to dissemination of this application is the lack of reproducibility of microbiota signatures between hospitals and laboratories (15, 16). Advanced molecular techniques commonly used for microbiota analysis, such as next-generation sequencing, have been shown to be highly sensitive to every step in the analysis, from sample collection, through storage conditions, to molecular procedures such as DNA isolation, primer selection, PCR settings, and choice of sequencing platform (15-18). Since these techniques are currently still difficult to implement in clinical practice in a cost- and time-efficient way, a standardized, reproducible, and easily applied technique is needed for microbiota analysis to become part of routine diagnostics.

An alternative to sequencing is the IS-pro technique, which allows detection of bacterial DNA and its taxonomic classification down to species level based on the 16S-23S ribosomal interspace fragment length (19). Carrying out the IS-pro procedure requires infrastructure and skills that are commonly found in every diagnostic laboratory. Moreover, the procedure is suited for routine diagnostic use, and results are available within hours. Over recent years, IS-pro has been extensively optimized and validated, and has become an effective method for standardized gut microbiota profiling (20, 21). While IS-pro profiles have already been shown to be effective as clinical predictors (22), universal application in routine diagnostics requires clinical predictions that are robust to variations introduced by the laboratories, hospitals, technicians, and equipment employed.

In this study, we investigated the between-hospital reproducibility of IS-pro data and the effect of different perturbations on the outcome of clinical prediction models. We formulated the most probable perturbations, simulated them in the microbial profiles of a pediatric IBD cohort, and assessed their effect on predictive accuracy. Furthermore, we compared this effect across different classifiers to identify the classification algorithm that is least sensitive to potential perturbations and thus optimal for practical application. To assess the extent of perturbations in real life, we analyzed replicate datasets in two independent laboratories and estimated the predictive accuracy when a model was trained on one replicate and tested on the other, as would occur in clinical practice. Based on our simulations and validation in practice, we were able to identify which perturbations should be avoided and which classifiers are most robust. Establishing a reproducible tool for microbiota-based clinical predictions is a crucial step towards a universal diagnostic application.


Results


Datasets assessment

This study was based on a cohort of 149 subjects, of which 91 were children diagnosed with IBD (53 CD and 38 UC patients) and 58 were healthy children (denoted A1). Two replicate datasets were generated based on subsets of this dataset: A2, consisting of 116 samples, of which 70 were IBD samples (28 CD and 42 UC patients) and the rest were controls, and M2, consisting of 30 samples, of which 20 were IBD samples (11 CD and 9 UC patients) and the rest were controls. A2, which was analyzed in the same laboratory as the original dataset but using a different analysis platform, served as a platform replicate, whereas M2, which was analyzed in a different laboratory but using the same platform as A2, served as a hospital replicate (Fig. 1). Between-hospital replicate sample pairs clustered together (Fig. 2A) and were highly correlated (n=28, median R2=0.89 [IQR=0.3]). The variation introduced by the analysis platform was larger than that introduced by the hospital, as shown in Fig. 2B.
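As an illustration of how the replicate agreement above can be quantified, the sketch below computes the squared Pearson correlation for each pair of between-hospital replicate profiles and summarizes the result as a median and IQR. It assumes profiles are stored as row-aligned intensity vectors of equal length; the array names and the example data are placeholders rather than the study's actual data handling.

```python
import numpy as np

def replicate_r2(profiles_a: np.ndarray, profiles_b: np.ndarray) -> np.ndarray:
    """Squared Pearson correlation per replicate pair.

    profiles_a, profiles_b: arrays of shape (n_pairs, n_peaks), where row i in
    both arrays is the same sample processed in two different hospitals.
    """
    return np.array([np.corrcoef(a, b)[0, 1] ** 2
                     for a, b in zip(profiles_a, profiles_b)])

# Illustrative usage with random stand-in data (28 pairs, 200 peaks).
rng = np.random.default_rng(0)
reference = rng.random((28, 200))
replicate = reference + rng.normal(scale=0.1, size=reference.shape)  # noisy copy
scores = replicate_r2(reference, replicate)
q1, median, q3 = np.percentile(scores, [25, 50, 75])
print(f"median R2 = {median:.2f}, IQR = {q3 - q1:.2f}")
```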

To improve the prediction models’ quality and to gain better predictive performance, the original dataset was extended with synthetic samples. This allowed us to avoid overfitting, which is one of the main concerns when fitting a high-dimensional model with only a small number of samples. We generated a complementary synthetic dataset (denoted “SIM”; for details, see Methods), consisting of 10,000 samples in total and balanced between the two classes, IBD and healthy. The original and the synthetic datasets are displayed in Fig. 3A, demonstrating segregation between the two classes, with synthetic samples grouping together with the original samples of the same class.

To validate the synthetic dataset, we compared the distribution of each peak in the original dataset to that in the synthetic dataset. All resulting p-values were 1 following a false discovery rate correction, meaning that the synthetic peak distributions were not statistically different from those of the original data. We also calculated the per-class concordance between the average original profile and the average synthetic profile using Kendall’s coefficient of concordance. For both classes, high concordance values were obtained (0.9977 and 0.9951 for the healthy class and the IBD class, respectively), meaning that the synthetic and original profiles generally agree. Based on this assessment, the synthetic dataset was regarded as valid for further experiments.
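The statistical test used for the per-peak comparison is not specified in this section, so the sketch below uses a Mann-Whitney U test per peak with Benjamini-Hochberg correction as one plausible stand-in, followed by Kendall's coefficient of concordance (without tie correction) between the class-averaged original and synthetic profiles. Function and array names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    adjusted = np.empty(n)
    running_min = 1.0
    for k in range(n - 1, -1, -1):        # walk from the largest p-value down
        i = order[k]
        running_min = min(running_min, p[i] * n / (k + 1))
        adjusted[i] = running_min
    return adjusted

def kendalls_w(ratings):
    """Kendall's coefficient of concordance (no tie correction).

    ratings: shape (m_raters, n_items); here m=2, the average original and
    average synthetic profile of one class.
    """
    m, n = ratings.shape
    ranks = np.vstack([rankdata(row) for row in ratings])
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

def compare_peak_distributions(original, synthetic):
    """FDR-corrected p-values comparing each peak's distribution across datasets."""
    pvals = [mannwhitneyu(original[:, j], synthetic[:, j]).pvalue
             for j in range(original.shape[1])]
    return bh_adjust(pvals)

# Per-class concordance of the averaged profiles:
# w = kendalls_w(np.vstack([original.mean(axis=0), synthetic.mean(axis=0)]))
```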

The predictive accuracy of the original dataset was assessed using seven different classifiers, which obtained similar performance, with areas under the curve (AUC) ranging from 0.85 for L2-regularized logistic regression to 0.92 for nearest shrunken centroids (NSC) (Fig. 3B). Specifically, both IBD subtypes could be discriminated from healthy controls with high accuracy: UC classification achieved a maximal AUC of 0.96 (for NSC, random forest (RF), and L2-regularized logistic regression), and CD classification obtained a maximal AUC of 0.87 (for NSC and RF). UC and CD were taken together as a single IBD group since both presented similar variation (see below for details).
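Cross-validated AUC comparisons of this kind can be set up with scikit-learn. The seven classifiers and their exact settings are not detailed in this section, so the sketch below includes only two generic examples (random forest and L2-regularized logistic regression) with arbitrary hyperparameters; it is an outline under those assumptions, not the authors' configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def cross_validated_auc(X: np.ndarray, y: np.ndarray) -> dict:
    """Mean ROC AUC per classifier.

    X: (n_samples, n_peaks) IS-pro intensity matrix; y: 1 for IBD, 0 for healthy.
    """
    classifiers = {
        "random forest": RandomForestClassifier(n_estimators=500, random_state=0),
        "L2 logistic regression": LogisticRegression(penalty="l2", C=1.0, max_iter=5000),
    }
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    return {name: cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()
            for name, clf in classifiers.items()}
```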

We used a learning curve of increasing training set sizes to assess the predictability of the synthetic data and to determine the optimal training set size that should be sampled. The accuracy of all classifiers increased as more samples were considered for training (Fig. 3C). Based on the diminishing accuracy improvement observed towards the larger training sets, we determined that 400 training samples would be a valid size for subsequent simulations. For this sample size, the accuracy was also comparable to the accuracy achieved by cross-validation of the original dataset.
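A learning curve of this kind can be produced by repeatedly drawing stratified training sets of increasing size and scoring the held-out remainder. The sketch below does so for a single random forest classifier with arbitrary sizes and repeat counts; it is an assumption-laden outline rather than the study's exact procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def learning_curve_accuracy(X_sim, y_sim, sizes=(50, 100, 200, 400, 800),
                            repeats=10, seed=0):
    """Mean held-out accuracy as a function of (class-balanced) training-set size."""
    rng = np.random.default_rng(seed)
    results = {}
    for n_train in sizes:
        accuracies = []
        for _ in range(repeats):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X_sim, y_sim, train_size=n_train, stratify=y_sim,
                random_state=int(rng.integers(1 << 31)))
            clf = RandomForestClassifier(n_estimators=300).fit(X_tr, y_tr)
            accuracies.append(accuracy_score(y_te, clf.predict(X_te)))
        results[n_train] = float(np.mean(accuracies))
    return results
```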


Probing variation boundaries

This part of the study consists of synthetic data experiments aiming to explore the effect of different sources of variation on predictive accuracy. We estimated the effect of five simulated perturbations on the accuracy of seven classifiers, and identified which classifiers were more robust and may therefore outperform others when applied in diagnostics.


Single perturbations

Each sample is represented as a microbial profile (Fig. 4). We formulated the perturbations that are most likely to occur between profiles processed in different runs (Table 1). Each of these was simulated with increasing extent as a variation of the SIM dataset (see Methods). The following procedure was repeated 100 times: a balanced training set (400 samples) was subsampled from the SIM dataset, and the rest of the samples were perturbed and used as a test set. With this setting, we assumed that a clinical prediction model would be trained on a reference dataset but would be used to predict data generated in various diagnostic centers, thereby incorporating different sources of variation.
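Table 1 and the Methods define the actual perturbations and their parameters. Purely to illustrate the experimental loop described above, the sketch below trains a classifier on 400 unperturbed SIM samples and evaluates it on a perturbed test set, with two simplified stand-ins for low-intensity peak deletion (P#4) and intensity noise (P#3). The function names, noise model, and parameter values are assumptions, not the published definitions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

def delete_low_intensity_peaks(profile, n_delete=5):
    """Simplified stand-in for P#4: remove the n_delete weakest detected peaks."""
    out = profile.copy()
    detected = np.flatnonzero(out)
    weakest = detected[np.argsort(out[detected])][:n_delete]
    out[weakest] = 0.0
    return out

def add_intensity_noise(profile, scale=0.2):
    """Simplified stand-in for P#3: multiplicative noise on peak intensities."""
    return np.clip(profile * rng.normal(1.0, scale, size=profile.shape), 0.0, None)

def perturbation_experiment(X_sim, y_sim, perturb, n_train=400, n_runs=100):
    """Train on a balanced, unperturbed subsample; test on the perturbed remainder."""
    aucs = []
    for _ in range(n_runs):
        train_idx = np.concatenate([
            rng.choice(np.flatnonzero(y_sim == c), n_train // 2, replace=False)
            for c in (0, 1)])
        train_mask = np.zeros(len(y_sim), dtype=bool)
        train_mask[train_idx] = True
        clf = RandomForestClassifier(n_estimators=300).fit(
            X_sim[train_mask], y_sim[train_mask])
        X_test = np.apply_along_axis(perturb, 1, X_sim[~train_mask])
        aucs.append(roc_auc_score(y_sim[~train_mask],
                                  clf.predict_proba(X_test)[:, 1]))
    return np.array(aucs)

# Example: effect of deleting five low-intensity peaks per test profile.
# aucs = perturbation_experiment(X_sim, y_sim, delete_low_intensity_peaks)
```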

Fig. 5 displays how predictions were affected by each individual perturbation. An intensity cutoff (P#1) did not have any effect on predictive accuracy across all classifiers (Fig. 5A). Peak shifts (P#2) and addition of intensity noise (P#3) led to a gradual decrease in accuracy as more peaks were perturbed (Fig. 5B, 5C). Still, the most pronounced effect was caused by peak deletion. When the deleted peaks were mainly low-intensity peaks (P#4), we observed a substantial decrease in predictive accuracy for all classifiers except RF (Fig. 5D). We further noted that deletion of 4 to 7 peaks led to substantial variation between simulations, suggesting that which particular peaks are deleted matters. With more peaks deleted, the accuracy converged and the variation between simulations was reduced.

Random forest stood out with a distinct response to the simulated perturbations compared to the rest of the classifiers: it remained stable under deletion of low-intensity peaks, while deletion of even a limited number of peaks led to a considerable decrease in accuracy for the other classifiers (Fig. 5D). Even though RF was affected when dominant peaks were deleted (P#5), its accuracy declined more slowly than that of the other classifiers (Fig. 5E). On the other hand, the drop in its accuracy in response to the intensity noise simulation was more pronounced than for the rest of the classifiers (Fig. 5C).


Mixed perturbations


We further explored the effect of mixed perturbations on predictive accuracy using 100 variations of the SIM dataset, each combining a random extent of every perturbation (excluding P#5, as this perturbation is not likely to occur in real data). Ten balanced training sets of 400 samples each were subsampled from every dataset, while the rest of the samples were used as a test set. For each dataset, we calculated the average cosine distance between the perturbed version of each sample and its unperturbed origin. Increased prediction error was observed as perturbed samples became less similar to their corresponding original samples, reflected by a higher mean cosine dissimilarity (Fig. S1). We identified several outlier datasets, characterized by relatively high error rates for samples that were on average still quite similar to their origin (reflected by a low mean cosine dissimilarity). Further investigation revealed that these outliers corresponded to datasets in which peak deletion (P#4) affected more peaks, which had a larger effect on prediction error. This is shown in Fig. S1, where the marker size corresponds to the average number of deleted peaks and the outlier datasets are encircled. The other perturbations did not exhibit a similar relationship. RF, which proved robust to peak deletion, did not produce any outlier datasets.
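The per-dataset dissimilarity summarized in Fig. S1 is the mean cosine distance between each perturbed profile and its unperturbed origin. A minimal sketch, assuming row-aligned profile matrices of equal shape (array names are placeholders):

```python
import numpy as np

def mean_cosine_dissimilarity(original: np.ndarray, perturbed: np.ndarray) -> float:
    """Average cosine distance between perturbed profiles and their origins.

    original, perturbed: arrays of shape (n_samples, n_peaks), row-aligned so
    that row i of `perturbed` is the perturbed version of row i of `original`.
    """
    dot = np.einsum("ij,ij->i", original, perturbed)
    norms = np.linalg.norm(original, axis=1) * np.linalg.norm(perturbed, axis=1)
    cosine_similarity = dot / norms
    return float(np.mean(1.0 - cosine_similarity))
```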


Between-hospital and between-platform variation


Following our simulations, we investigated the batch effects on predictive accuracy using replicate datasets that were re-processed and analyzed in two independent hospitals and on two different analysis platforms.


Perturbations assessment

Variation was estimated based on the perturbations defined in Table 1 (P#2, P#3, and P#4) using pairwise comparisons of replicates, and compared to the simulated values (gray bars in Fig. 5B, 5C, and 5D). The between-platform batch effect (light gray bars in Fig. 5) appeared to lie in the range where predictive accuracy is diminished for all three perturbations. The total number of peaks detected in a reference profile but missing from its corresponding replicate is shown in Fig. 6A, separated according to the type of perturbation. On average, more than half of the peaks in each pairwise comparison were missing (65%), of which 35% were shifted peaks and 30% were deleted peaks.

The effect of between-hospital variation, however, depended on the type of perturbation: the number of shifted peaks (P#2) and the difference in measured intensity (P#3) were in the range where, based on our simulation, they had only a minor effect on predictive accuracy, but the number of deleted peaks (P#4) was in the range where predictive accuracy decreased for all classifiers except RF (dark gray bars in Fig. 5). However, the proportion of missing peaks per pairwise comparison was significantly lower than that observed for between-platform replicates (Fig. 6A, p
