Woods: A fast and accurate functional annotator and classifier of genomic and metagenomic sequences.

YGENO-08729; No. of pages: 6; 4C: Genomics xxx (2014) xxx–xxx

Contents lists available at ScienceDirect

Genomics journal homepage: www.elsevier.com/locate/ygeno


Ashok K. Sharma 1, Ankit Gupta 1, Sanjiv Kumar, Darshan B. Dhakan, Vineet K. Sharma ⁎


MetaInformatics Laboratory, Metagenomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research, Bhopal, Madhya Pradesh, India

Article info

Article history: Received 22 November 2014; Accepted 2 April 2015; Available online xxxx

Keywords: Metagenome; Functional annotation; Machine learning; Random Forest

Abstract

Functional annotation of very large metagenomic datasets is one of the most time-consuming and computationally demanding tasks, and is currently a bottleneck for efficient analysis. The commonly used homology-based methods to functionally annotate and classify proteins are extremely slow. Therefore, to achieve faster and accurate functional annotation, we have developed an orthology-based functional classifier, ‘Woods’, using a combination of machine learning and similarity-based approaches. Woods displayed a precision of 98.79% on an independent genomic dataset, 96.66% on a simulated metagenomic dataset and >97% on two real metagenomic datasets. In addition, it performed >87 times faster than BLAST on the two real metagenomic datasets. Woods can be used as a highly efficient and accurate classifier with high-throughput capability, which facilitates its use on large metagenomic datasets. © 2014 Published by Elsevier Inc.

⁎ Corresponding author. E-mail addresses: [email protected] (A.K. Sharma), [email protected] (A. Gupta), [email protected] (S. Kumar), [email protected] (D.B. Dhakan), [email protected] (V.K. Sharma).
1 The authors have contributed equally.

1. Introduction

Functional annotation of proteins and their classification into functional categories using an orthology-based approach is commonly performed for the annotation of newly sequenced genomes and metagenomes [1–3]. Functional annotation and classification provide valuable insights into the functional profile of a genome or of a metagenomic community, which are immensely useful for comparative genomics and metagenomics [4]. Until recently, COGs was the main reference database for functional annotation and classification of proteins into different functional categories. However, it is a relatively old database consisting of 138,458 proteins classified into 4873 COGs derived from 66 unicellular genomes [5]. With the rapid increase in genomic information, eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups), a database of orthologous groups of genes constructed by combining complete proteomes from RefSeq [6], Ensembl [7], UniProt [8], GiardiaDB [9], JGI (http://genome.jgi-psf.org/) and TAIR [10], has recently updated the classification of proteins into different orthologous groups using the COGs/KOGs/arCOGs databases [11]. It has emerged as a comprehensive database of 1.7 million orthologous groups of more than 7.7 million proteins derived from 3686 species and classified into 25 functional categories. eggNOG is now commonly used as a reference database for homology-based functional annotation of genomic proteins and metagenomic ORFs using BLAST [12]. Though this approach is immensely useful, it suffers from the limitations of the homology-based approach (BLAST) and requires an enormous amount of time to annotate genomic proteins or the millions of ORFs predicted in metagenomic datasets.

In this scenario, machine learning approaches could provide valuable alternatives to homology-based approaches for the functional annotation of genomic and metagenomic data because of favorable trade-offs among automation, speed and accuracy. In the present work, a combined approach has been implemented which integrates Random Forest (RF) [13] as the machine learning-based method and RAPSearch2 [14] as the similarity-based method in a tool named ‘Woods’. The combined approach can perform the functional annotation and classification of complete (genomic proteins) and partial (metagenomic ORFs) protein sequences, and can dramatically reduce the time required for the functional annotation of large metagenomic datasets.

http://dx.doi.org/10.1016/j.ygeno.2015.04.001 0888-7543/© 2014 Published by Elsevier Inc.

2. Results and discussion

2.1. Selection of prediction-based method

A randomly chosen ~10% of the sequences (231,255) from the training dataset constructed in this study was used to evaluate the performance of different prediction methods, namely the Naïve Bayesian classifier, Random Forest (RF), AdaBoost, the Multiclass classifier and LibSVM, using the Weka package [15]. Amino acid composition (AAC) and dipeptide composition (DPC) were used as input features for training these methods. The performance of LibSVM could not be evaluated using the Weka package because it could not handle such a large dataset. RF performed better than the Naïve Bayesian classifier, AdaBoost and the Multiclass classifier (Fig. 2). RF was therefore selected for constructing the final model, as it displayed the highest accuracy using both AAC (68.63%) and DPC (65.69%) on the sample dataset.

2.2. Optimization of parameters

To optimize RF for the lowest out-of-bag (OOB) error, two parameters are important: mtry (the number of variables randomly selected at each node) and ntree (the number of trees in the forest). A smaller mtry reduces the correlation between trees but also reduces the strength of the individual trees. Multiple RF models (ntree = 100) with amino acid and dipeptide composition as feature input were created to select the best feature input and the optimum mtry. To construct the RF models, mtry values of 4, 6, 8, 10, 12 and 14 were used for amino acid composition (Fig. S1), and mtry values of 20, 40, 80, 160, 240, 320 and 400 were used for dipeptide composition (Fig. 3). The mtry value with the lowest OOB error was selected for each input. The OOB error was lower for DPC (OOB = 29.72%, mtry = 240) than for AAC (OOB = 30.35%, mtry = 8). Therefore, DPC and mtry = 240 were selected as the input parameters.

To select the minimum number of variables required for an accurate prediction, the importance of each variable at the optimum mtry value was measured according to permutation variable importance (Fig. S2). Of the total 400 dipeptide combinations (variables), subsets were created by successively removing 20% of the least important variables from the total number of variables. The three subsets formed this way, consisting of 320, 240 and 160 variables, and the complete set of 400 variables were used as input to RF at mtry = 80. The OOB error using 400 input variables was nearly identical to the OOB error using 320 input variables. The sets with 400 and 320 variables were therefore further examined at mtry values of 160 and 240 to select the best input variables (Fig. 4). At the selected value of mtry = 240, the RF with 400 input variables showed a lower OOB error than the RF with 320 input variables. Hence, all 400 variables were selected as the input variables for constructing the RF.

To examine the decrease in the OOB error of RF with an increasing number of trees (ntree), the value of ntree was gradually increased from 100 to 1000 at mtry = 240. On increasing ntree from 100 to 500, a significant decrease in OOB error was observed (Fig. 5). The OOB error, sensitivity, specificity, accuracy and MCC values of RF at mtry = 240 and ntree = 1000 using all 400 variables were 27.36%, 65.19%, 98.30%, 96.96% and 0.75, respectively (Table S2), and at ntree = 500 the values were 27.60%, 65.19%, 98.26%, 96.90% and 0.75, respectively (Table S3). The decrease in the OOB error was marginal (0.24%) on increasing the number of trees from 500 to 1000. However, the model size at ntree = 500 was 4.1 GB, which is half of the model size at ntree = 1000 (8.2 GB). Thus, using ntree = 1000 would double the memory requirement relative to ntree = 500, which is not justified considering the marginal increase in accuracy. Therefore, the RF model with ntree = 500 was selected.
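The construction of the nested variable subsets (400, 320, 240 and 160 variables) can be illustrated with a short sketch. It assumes that each step drops 20% of the total number of variables, which is the reading that reproduces the subset sizes reported above. The function name, the placeholder variable names and the toy importance scores are hypothetical and are not taken from the Woods implementation.

```python
def importance_subsets(importance, drop_fraction=0.20, n_drops=3):
    """Return nested variable subsets, most important variables first.

    importance: dict mapping variable name -> importance score.
    Each step removes `drop_fraction` of the TOTAL number of variables
    from the least-important end of the ranking (an assumption).
    """
    ranked = sorted(importance, key=importance.get, reverse=True)
    step = round(len(ranked) * drop_fraction)  # e.g. 80 of 400 variables
    return [ranked[: len(ranked) - k * step] for k in range(n_drops + 1)]


# Hypothetical importance scores for 400 placeholder variables.
toy_importance = {f"dp{i:03d}": 1.0 / (i + 1) for i in range(400)}
subsets = importance_subsets(toy_importance)
print([len(s) for s in subsets])  # [400, 320, 240, 160]
```

Each subset would then be used to train a separate RF model, and the OOB errors of those models compared, as described in the text.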


Fig. 1. Workflow for the functional classification and annotation of a query sequence using Woods. A hybrid approach using Random Forest and RAPsearch2 is used by Woods for the functional assignment of a query sequence.


Fig. 2. Performance of different prediction-based methods using amino acid composition (AAC) and dipeptide composition (DPC) as feature input. Weka was used to evaluate the different prediction-based methods using the randomly selected test dataset containing 10% of the total sequences.




Fig. 3. OOB error using DPC as feature input. The OOB error was lowest at mtry = 240 (29.72%).
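The sensitivity, specificity, accuracy, precision and MCC values reported for the RF model are standard confusion-matrix measures. The following minimal sketch shows how they can be computed from the four counts for a single functional category. Treating each category in a one-vs-rest manner, returning 0.0 when a denominator is zero, and the function name itself are assumptions made for illustration.

```python
from math import sqrt


def classification_measures(tp, fp, tn, fn):
    """Compute standard measures from confusion-matrix counts."""

    def ratio(num, den):
        # Assumption: report 0.0 when the denominator is zero.
        return num / den if den else 0.0

    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Made-up counts for one category, for illustration only.
measures = classification_measures(tp=90, fp=10, tn=880, fn=20)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```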

The performance of the RF model was assessed using the following measures:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.

Fig. 4. Selection of input variables for the construction of the Random Forest model. (A) Comparison of OOB error for different numbers of input variables at mtry = 80. (B) Comparison of OOB error for 320 and 400 input variables at mtry values of 160 and 240.

Fig. 5. Comparison of OOB error of Random Forest on increasing the number of trees. A decrease in OOB error was observed on increasing the number of trees (ntree) from 100 to 1000 at mtry = 240.

2.3. Similarity-based approaches

To select a similarity-based method with high accuracy and speed, the performance of commonly used similarity-based tools, namely RAPsearch2, BLAT [16] and Usearch [17], was evaluated in comparison to BLAST (version 2.2.26, [12]) on the randomly chosen ~10% (231,255) of the sequences from the eggNOG v3.0 database, which were searched against the remaining sequences (2,080,850, 90%) in the database. Usearch suffered from memory limitations and could not handle the size of the eggNOG database. A threshold E-value of 1e−6 was used for RAPsearch2 and BLAT. RAPsearch2 took 972 min and displayed 86.66% precision, whereas BLAT took 1661 min and showed 76.22% precision. Therefore, RAPsearch2, which showed better speed and precision than BLAT, was integrated in Woods to perform the similarity-based step.

2.4. Performance of Woods on test datasets

The results of BLAST were used as the reference against which the predictions of Woods were compared, since BLAST is commonly used as a standard tool for the annotation of protein sequences. The homology-based alignment of the protein sequences of the genomic and metagenomic datasets was carried out using BLAST against the eggNOG v3.0 database. Each protein sequence in the datasets was assigned to one of the 22 functional categories based on the top BLAST hit, and this information was used as the reference for evaluating the results of Woods.

2.4.1. Performance of Woods on independent genomic dataset

To test the unbiased performance of Woods, an independent dataset was constructed using the protein sequences of 50 completed bacterial genomes belonging to different taxonomic lineages (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/). These sequences were analyzed using both the RF module of Woods and the complete Woods tool.


The RF module displayed an average precision of 89.04%. It displayed the maximum precision (95.83%) in the case of Dehalococcoides_GT_uid42115 and the minimum precision (77.63%) in the case of Enterobacter_asburiae_LF7a_uid72793. The integrated Woods tool performed better and showed an average precision of 98.79% on the independent dataset. It displayed the maximum precision (99.85%) in the case of Borrelia_garinii_NMJW1_uid177081 and the minimum precision (97.29%) in the case of Enterobacter_asburiae_LF7a_uid72793 (Table S4, Fig. 6). The classification of the genomic proteins of Dehalococcoides_GT_uid42115 into 22 functional categories is shown in Fig. S3. The average precision of ‘Woods’ was 9.75 percentage points higher than that of the RF module alone (98.79% vs. 89.04%), which shows that the combined approach performs better for the functional classification of protein sequences; however, it takes longer than using only the RF module. It should be noted that the RF module can only be used to predict the functional category, whereas the integrated approach of Woods can also assign a functional annotation to the protein in addition to predicting the functional category.

Fig. 6. Performance of Random Forest and the Woods tool on the independent genomic dataset. The RF module displayed an average precision of 89.04%, whereas Woods showed an average precision of 98.79% on the independent dataset.

2.4.2. Performance of Woods on metagenomic dataset

Metagenomic proteins are often partial and vary in length depending on the length of the read or the assembled contig. Therefore, to identify the optimal protein fragment length at which the RF model would show satisfactory performance, eight simulated metagenomic datasets with different fragment lengths (250–300, 300–350, 350–400, 400–450, 450–500, 500–550, 550–600 and 600–650 amino acids) were constructed using the protein sequences of the 50 bacterial genomes. The first dataset was constructed using a cut-off of 250 amino acids: proteins shorter than 250 amino acids were retained without fragmentation, and the sequences of larger proteins were randomly fragmented to generate fragments between 250 and 300 amino acids (aa) in length. The cut-off length of 250 amino acids was chosen because the average length of prokaryotic proteins is 267 amino acids [18]. The other seven simulated metagenomic datasets were constructed similarly. Woods and the RF model were evaluated on all eight datasets. The accuracy shown by Woods was consistent across all eight datasets, whereas the performance of the RF model was found to be satisfactory (79.97%) at fragment lengths ≥ 500 aa (Fig. 7). Therefore, in the case of metagenomic ORFs, where several predicted ORFs could be partial, only ORFs of length ≥ 500 amino acids are analyzed using RF, and ORFs shorter than 500 amino acids are analyzed directly through RAPsearch2 in Woods.

On the simulated independent metagenomic dataset (fragment length ≥ 500 aa), RF and Woods displayed precision of 79.97% and 96.66%, respectively (Fig. 7). The classification into 22 functional categories is shown in Fig. S4. For all functional categories, RF and Woods made nearly identical functional classifications. These results underscore the accuracy of RF on the independent genomic dataset consisting of complete protein sequences.

Fig. 7. Performance of Random Forest and Woods on simulated metagenomic datasets of different fragment lengths.

2.4.3. Performance of Woods on real metagenomic datasets

The performance of Woods was also evaluated using real metagenomic datasets. Assembled metagenomic data of two human gut samples, MH0006 and MH0012, were retrieved [19]. ORF prediction was carried out using MetaGeneMark. The output file of ORFs in nucleotide format (*.fna) generated by MetaGeneMark was used as the input file for Woods with the ‘Metagenomic’ option. This file was internally processed by Woods to differentiate between complete and partial ORFs. The numbers of predicted ORFs were 308,223 and 324,939 in the samples MH0006 and MH0012, respectively. In comparison to BLAST, Woods could classify 83.7% of the ORFs, with a precision of 97.53% and 97.08% for the two metagenomic datasets, respectively (Table 1). The class-wise performance of Woods on the two metagenomic datasets is shown in detail in Tables S5 and S6. The classification of ORFs into 22 functional categories for the two real metagenomic datasets is shown in Figs. S5 and S6. The total time taken by Woods for the analysis of the two datasets on a Linux-based PC with a single Intel Xeon 2.4 GHz CPU and 45 GB RAM was 7.75 and 8.81 CPU hours, respectively, whereas BLAST took 679.53 and 779.65 CPU hours, respectively, on the same datasets.

Table 1. Performance of Woods on real metagenomic datasets.

Sample | Contigs | ORFs | # predictions by BLAST | # predictions by Woods | # correct predictions by Woods (as compared to BLAST) | # incorrect predictions by Woods (as compared to BLAST) | Novel predictions by Woods (not predicted by BLAST) | Precision (%)
MH0006 | 156,752 | 308,223 | 211,948 | 177,395 | 172,571 | 4375 | 449 | 97.53
MH0012 | 140,991 | 324,939 | 223,062 | 186,794 | 180,880 | 5434 | 480 | 97.08

3. Conclusions

In the present study, we have developed Woods as an orthology-based functional classifier to carry out faster and accurate functional annotation and classification by using a combination of Random Forest and RAPsearch2. Woods provides the option to use either a prediction-based classification using the RF module for quicker analysis, or the combined approach. The RF module can only be used to predict the functional category, whereas the integrated Woods tool can also assign a functional annotation to the protein in addition to predicting the functional category. The precision displayed by Woods (>97% on real metagenomic datasets) was comparable to that of BLAST, and it performed >87 times faster than BLAST. The only limitation of Woods is that it could provide predictions on 83.7% of the data as compared to BLAST; however, this is apparently a favorable trade-off between speed and percentage prediction given the size of the metagenomic datasets which need to be analyzed. These results attest to the high efficiency, high precision and high-throughput capability of Woods in predicting the functional categories and eggNOG-based annotation of the proteins in large genomic and metagenomic datasets.

4. Materials and methods

4.1. Construction of datasets

The protein sequences and associated information of orthologous groups of genes were retrieved from eggNOG v3.0 to construct the training dataset. Version 3.0 of the eggNOG database consists of 721,801 non-supervised orthologous groups, comprising 4,396,591 genes (http://eggNOG.embl.de) [20]. From the retrieved data, a mini-database was constructed using the information only from the bacterial genomes. The resultant database consisted of 2,379,565 bacterial genes classified into 87,349 non-supervised orthologous groups, of which 86,266 groups, comprising 2,372,782 genes, had functional annotations. Sequences (60,669, 2.6%) which belonged to more than one functional category, and functional category Z, which contained only 8 sequences, were not considered for training. The resulting database of 2,312,105 genes belonging to the 22 functional categories was considered for this study (Table S1).

4.2. Feature extraction

Amino acid composition (AAC) and dipeptide composition (DPC) were evaluated as input features for training the different prediction-based methods evaluated in this study. The amino acid composition provides information on the relative composition of each amino acid in a protein, whereas the dipeptide composition provides information on the composition of amino acids along with their local order [21]. The amino acid and dipeptide composition of each protein was calculated using the following formulas:

$$\mathrm{AAC}(i) = \frac{\text{Total number of amino acid } (i)}{\text{Total number of all possible amino acids}} \times 100$$

where AAC(i) is the amino acid composition of amino acid (i), and amino acid (i) is one of the 20 amino acids, and

$$\mathrm{DPC}(i) = \frac{\text{Total number of dipeptides } (i)}{\text{Total number of all possible dipeptides}} \times 100$$

where DPC(i) is the dipeptide frequency of dipeptide (i), and dipeptide (i) is one of the 400 dipeptides.

4.3. Random Forest (RF)

RF, which is available in the R package randomForest (http://cran.r-project.org/), was used for the analysis and for the development of Woods. The classification algorithm, high prediction accuracy and variable importance information provided by RF make it appropriate for the analysis of large datasets [22]. RF uses an ensemble learning method for classification and regression, and is an implementation of the bagging approach in which each tree in the forest works as an independent model [23,24]. Bootstrapping was used to grow the classification trees in the forest from the training set. In each bootstrap training set, about two-thirds of the data were randomly selected to grow a classification tree and the remaining one-third was considered as out-of-bag (OOB) for prediction.

Out of the total input variables, a subset of variables (mtry) was randomly selected at each split node to identify the variables with the highest information gain, using measures such as the Gini index and permutation importance. The permutation importance value of a variable is directly related to its predictive ability, and it is the most commonly used importance measure [25,26]. The error depends on two things, the correlation between the trees and the strength of each tree in the forest, and is measured in terms of the out-of-bag (OOB) estimate. The RF with the lowest OOB error was used to classify a new object from an input vector.

4.4. Implementation

The web server for Woods is developed using the standalone Woods application. The web server can be used for the classification and annotation of complete and partial proteins. The flowchart of the methodology used by Woods is shown in Fig. 1. On the ‘Applications’ page, two options, namely ‘Genomic’ and ‘Metagenomic’, are provided to analyze complete (genomic) proteins or partial (metagenomic) proteins. Using the ‘Genomic’ option, the user should upload a file containing the protein sequences in FASTA format. Using the ‘Metagenomic’ option, the user should upload the file containing the metagenomic ORFs predicted by the MetaGeneMark gene-finding software in nucleotide format (*.fna file) [27]. On submission of the query, a ‘Job ID’ page is shown which provides the link to access the ‘Results’ page. The standalone version of Woods runs on Linux-based computers and requires the installation of the free packages R, Random Forest and RAPsearch2.

To achieve faster and accurate functional assessment of the query sequence as compared to either a prediction-based or a homology-based approach alone, a combined approach was adopted by integrating both RF and RAPsearch2 for the development of Woods. In the case of genomic proteins (complete), partial proteins (metagenomic) with length ≥ 500 aa, or complete ORFs, the query sequence is first analyzed using the RF and classified into one of the 22 functional categories. In the second step, which is a confirmatory step involving a similarity-based search using RAPsearch2, the query sequence is searched against the dataset of protein sequences from the predicted category only. If the resulting hit satisfies the E-value threshold (1e−6), the sequence is assigned to that category. Otherwise, the RAPsearch2 search is performed against the sequence dataset of each of the 22 categories, and the protein is annotated with the category in which it finds the best hit (Fig. 1).

In the case of partial proteins (metagenomic) with length less than 500 aa, functional annotation is performed directly by using RAPsearch2 against all 22 functional categories of the eggNOG database. On completion of the process, the ‘Results’ page provides the summary of the results and links to download the result files. The results can be retrieved by using the ‘Job ID’ provided on the Results page within a month of the submission.

To provide the user with an option to carry out a quick analysis, the standalone version of Woods offers the choice of running either the RF (prediction-based) module alone or the complete Woods pipeline, which integrates both the RF (prediction-based) and RAPsearch2 (similarity-based) modules. The RF module may be used if the objective is only to carry out the functional classification, though the accuracy would be lower than that of Woods (Figs. 6 and 7). In the case of the metagenomic MH0006 dataset, the time taken by the RF module was 6 min and 16 s, whereas the time taken by Woods was 465 min and 26 s.
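The two-step assignment logic of this section can be summarized in a short sketch. It is not the Woods source code: the function and parameter names are hypothetical, the RF prediction and the RAPsearch2 search are passed in as plain callables rather than invoked as real tools, and two readings of the text are assumed, namely that a hit passes when its E-value is at most 1e−6 and that the same threshold applies to the search against all categories.

```python
from typing import Callable, Optional, Sequence

E_VALUE_THRESHOLD = 1e-6  # threshold stated in the text
MIN_RF_LENGTH = 500       # partial ORFs shorter than this skip the RF step


def assign_category(
    seq: str,
    is_partial: bool,
    categories: Sequence[str],
    rf_predict: Callable[[str], str],
    best_evalue: Callable[[str, str], Optional[float]],
) -> Optional[str]:
    """Return a functional category for `seq`, or None if nothing passes.

    rf_predict(seq) stands in for the RF module; best_evalue(seq, category)
    stands in for a similarity search restricted to one category and
    returns the best E-value found, or None when there is no hit.
    """

    def passes(evalue: Optional[float]) -> bool:
        return evalue is not None and evalue <= E_VALUE_THRESHOLD

    def search_all_categories() -> Optional[str]:
        hits = {}
        for category in categories:
            evalue = best_evalue(seq, category)
            if passes(evalue):
                hits[category] = evalue
        # The best hit is taken to be the one with the lowest E-value.
        return min(hits, key=hits.get) if hits else None

    # Short partial proteins go straight to the similarity search.
    if is_partial and len(seq) < MIN_RF_LENGTH:
        return search_all_categories()

    # Step 1: RF prediction. Step 2: confirmation within that category.
    predicted = rf_predict(seq)
    if passes(best_evalue(seq, predicted)):
        return predicted
    return search_all_categories()
```

Restricting the confirmatory search to the predicted category is what saves time in this design: the search against all 22 category datasets is only needed when the RF prediction is not confirmed.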

In terms of runtime, the RF module takes much less time to carry out the analysis than the integrated approach. Therefore, to provide the user with a comparison between using only the prediction-based approach and using the integrated approach, the results are shown for both options on the datasets used to test the performance in this analysis.

The Woods web server is freely accessible at http://metagenomics.iiserb.ac.in/woods/index.php and http://metabiosys.iiserb.ac.in/woods/index.php. The standalone version of Woods can be downloaded from the above web servers, and usage instructions are provided in Text S1 and in the Tutorial section of the web server.
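For reference, the composition features defined in Section 4.2 can be computed as in the following sketch. It assumes that the denominators are the number of residues in the protein (for AAC) and the number of overlapping adjacent residue pairs, L − 1 (for DPC), and that non-standard residue codes count towards the denominators but not towards any feature. The function names and the fixed alphabetical ordering of the features are illustrative choices, not taken from Woods.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400


def amino_acid_composition(seq):
    """Percentage of each of the 20 amino acids in `seq` (20 values)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return [100.0 * counts[aa] / len(seq) for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Percentage of each of the 400 dipeptides in `seq` (400 values)."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    total = len(seq) - 1  # number of overlapping adjacent pairs
    counts = Counter(seq[i:i + 2] for i in range(total))
    return [100.0 * counts[dp] / total for dp in DIPEPTIDES]


# Toy sequence: residues A, C, A, C give the pairs AC, CA, AC.
aac = amino_acid_composition("ACAC")
dpc = dipeptide_composition("ACAC")
print(aac[:2])
print(dpc[DIPEPTIDES.index("AC")], dpc[DIPEPTIDES.index("CA")])
```

The 400-value DPC vector corresponds to the feature set that was selected as RF input in Section 2.2.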


Conflict of interest


The authors declare that they have no conflict of interest.


Funding

We acknowledge the intramural funding received from IISER Bhopal.

Acknowledgments

We thank the MHRD, Govt. of India, funded Centre for Research on Environment and Sustainable Technologies (CREST) at IISER Bhopal for its support. However, the views expressed in this manuscript are those of the authors alone, and no approval of the same, explicit or implicit, by MHRD should be assumed.

Appendix A. Supplementary data

Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.ygeno.2015.04.001.

References

[1] M. Kim, K.-H. Lee, S.-W. Yoon, B.-S. Kim, J. Chun, H. Yi, Analytical tools and databases for metagenomics in the next-generation sequencing era, Genome Inform. 11 (2013) 102–113.
[2] D.A. Natale, U.T. Shankavaram, M.Y. Galperin, Y.I. Wolf, L. Aravind, E.V. Koonin, Towards understanding the first genome sequence of a crenarchaeon by genome annotation using clusters of orthologous groups of proteins (COGs), Genome Biol. 1 (2000) RESEARCH0009.
[3] J.R. White, C. Arze, K. Galens, M. Matalka, S. Mekosh, D.R. Riley, M. Vangala, O. White, S.V. Angiuoli, W.F. Fricke, CloVR-Metagenomics (orfs): Microbial community functional and taxonomic characterization from metagenomic shotgun sequences – standard operating procedure v. 1.0.
[4] R. Carr, E. Borenstein, Comparative analysis of functional metagenomic annotation and the mappability of short reads, PLoS ONE 9 (2014) e105776.
[5] R.L. Tatusov, N.D. Fedorova, J.D. Jackson, A.R. Jacobs, B. Kiryutin, E.V. Koonin, D.M. Krylov, R. Mazumder, S.L. Mekhedov, A.N. Nikolskaya, The COG database: an updated version includes eukaryotes, BMC Bioinforma. 4 (2003) 41.
[6] K.D. Pruitt, T. Tatusova, D.R. Maglott, NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins, Nucleic Acids Res. 35 (2007) D61–D65.
[7] P. Flicek, M.R. Amode, D. Barrell, K. Beal, S. Brent, D. Carvalho-Silva, P. Clapham, G. Coates, S. Fairley, S. Fitzgerald, Ensembl 2012, Nucleic Acids Res. 40 (2012) D84–D90.
[8] C. UniProt, Ongoing and future developments at the Universal Protein Resource, Nucleic Acids Res. 39 D214–D219.
[9] C. Aurrecoechea, J. Brestelli, B.P. Brunk, J.M. Carlton, J. Dommer, S. Fischer, B. Gajria, X. Gao, A. Gingle, G. Grant, GiardiaDB and TrichDB: integrated genomic resources for the eukaryotic protist pathogens Giardia lamblia and Trichomonas vaginalis, Nucleic Acids Res. 37 (2009) D526–D530.
[10] D. Swarbreck, C. Wilks, P. Lamesch, T.Z. Berardini, M. Garcia-Hernandez, H. Foerster, D. Li, T. Meyer, R. Muller, L. Ploetz, The Arabidopsis Information Resource (TAIR): gene structure and function annotation, Nucleic Acids Res. 36 (2008) D1009–D1014.
[11] S. Powell, K. Forslund, D. Szklarczyk, K. Trachana, A. Roth, J. Huerta-Cepas, T. Gabaldón, T. Rattei, C. Creevey, M. Kuhn, eggNOG v4.0: nested orthology inference across 3686 organisms, Nucleic Acids Res. 42 (2014) D231–D239.
[12] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, Basic local alignment search tool, J. Mol. Biol. 215 (1990) 403–410.
[13] A. Liaw, M. Wiener, Classification and regression by randomForest, R News 2 (2002) 18–22.
[14] Y. Zhao, H. Tang, Y. Ye, RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data, Bioinformatics 28 (2012) 125–126.
[15] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update, ACM SIGKDD Explorations Newsletter 11 (2009) 10–18.
[16] W.J. Kent, BLAT--the BLAST-like alignment tool, Genome Res. 12 (2002) 656–664.
[17] R.C. Edgar, Search and clustering orders of magnitude faster than BLAST, Bioinformatics 26 (2010) 2460–2461.
[18] L. Brocchieri, S. Karlin, Protein length in eukaryotic and prokaryotic proteomes, Nucleic Acids Res. 33 (2005) 3390–3400.
[19] J. Qin, R. Li, J. Raes, M. Arumugam, K.S. Burgdorf, C. Manichanh, T. Nielsen, N. Pons, F. Levenez, T. Yamada, A human gut microbial gene catalogue established by metagenomic sequencing, Nature 464 (2010) 59–65.
[20] S. Powell, D. Szklarczyk, K. Trachana, A. Roth, M. Kuhn, J. Muller, R. Arnold, T. Rattei, I. Letunic, T. Doerks, eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges, Nucleic Acids Res. 40 (2012) D284–D289.
[21] A. Gupta, R. Kapil, D.B. Dhakan, V.K. Sharma, MP3: a software tool for the prediction of pathogenic proteins in genomic and metagenomic data, PLoS ONE 9 (2014) e93907.
[22] W.G. Touw, J.R. Bayjanov, L. Overmars, L. Backus, J. Boekhorst, M. Wels, S.A. van Hijum, Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle? Brief. Bioinform. 14 (2013) 315–326.
[23] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[24] J.J. Rodriguez, L.I. Kuncheva, C.J. Alonso, Rotation forest: a new classifier ensemble method, IEEE Trans. Pattern Anal. Mach. Intell. 28 (2006) 1619–1630.
[25] R. Genuer, J.-M. Poggi, C. Tuleau-Malot, Variable selection using random forests, Pattern Recogn. Lett. 31 (2010) 2225–2236.
[26] C. Strobl, A.-L. Boulesteix, A. Zeileis, T. Hothorn, Bias in random forest variable importance measures: illustrations, sources and a solution, BMC Bioinformatics 8 (2007) 25.
[27] W. Zhu, A. Lomsadze, M. Borodovsky, Ab initio gene identification in metagenomic sequences, Nucleic Acids Res. 38 (2010) e132.
