Analytica Chimica Acta 804 (2013) 70–75


Analytica Chimica Acta journal homepage: www.elsevier.com/locate/aca

Using random forest to classify T-cell epitopes based on amino acid properties and molecular features

Jian-Hua Huang a,1, Hua-Lin Xie a,c,1, Jun Yan a, Hong-Mei Lu a, Qing-Song Xu b, Yi-Zeng Liang a,∗

a Research Center of Modernization of Traditional Chinese Medicines, Central South University, Changsha 410083, PR China
b School of Mathematical Sciences and Computing Technology, Central South University, Changsha 410083, PR China
c School of Chemistry and Chemical Engineering, Yangtze Normal University, Fuling 408100, PR China

Highlights

• An effective approach has been developed for T-cell epitope prediction.
• A combined feature set is shown to give a significant improvement in accuracy.
• Random forest provides useful tools to select informative features and perform classification simultaneously.
• A freely available web server for predicting peptide immunogenicity has been established.

Article info

Article history:
Received 27 August 2013
Received in revised form 28 September 2013
Accepted 2 October 2013
Available online 12 October 2013

Keywords:
Amino acid properties
Chemical molecular features
Random forest (RF)
T-cell epitopes

Abstract

T-lymphocytes (T-cells) are a very important component of the human immune system. T-cell epitopes can be used for accurately monitoring the immune responses activated by the major histocompatibility complex (MHC) and for rationally designing vaccines. Therefore, accurate prediction of T-cell epitopes is crucial for vaccine development and clinical immunology. In the current study, two types of peptide features, i.e., amino acid properties and chemical molecular features, were used to represent T-cell epitope peptides. Based on these features, the random forest (RF) algorithm, a powerful machine learning algorithm, was used to classify T-cell epitopes and non-T-cell epitopes. The classification accuracy, sensitivity, specificity, Matthews correlation coefficient (MCC), and area under the curve (AUC) values for the proposed method are 97.54%, 97.22%, 97.60%, 0.9193, and 0.9868, respectively. These results indicate that the current method, based on the combined features and RF, is effective for T-cell epitope prediction.

© 2013 Published by Elsevier B.V.

Abbreviations: RF, random forest; MHC, major histocompatibility complex; TCRs, T cell receptors.
∗ Corresponding author at: Department of Chemistry and Chemical Engineering, Central South University, Changsha 410083, PR China. Tel.: +86 731 88830831; fax: +86 731 88830831. E-mail address: yizeng [email protected] (Y.-Z. Liang).
1 The first two authors have equal contribution to this article.
0003-2670/$ – see front matter © 2013 Published by Elsevier B.V. http://dx.doi.org/10.1016/j.aca.2013.10.003

1. Introduction

Immunogenicity is the ability to induce an immune response, which can protect the body from future infections and help fight chronic diseases. T-cells and major histocompatibility complex (MHC) molecules play central roles in controlling acquired immune responses. MHC molecules bind peptides derived from intracellular and extracellular proteins and present them on the cell surface for surveillance by the immune system [1]. T-cells recognize these MHC complexes via T-cell receptors (TCRs), and recognition of an MHC complex by a T-cell induces an immune response. A peptide capable of inducing a T-cell-mediated immune response is called an epitope [2,3]. Hence, identification of T-cell epitopes is of vital importance for vaccine design and for understanding the immune system.

However, experimentally screening all possible peptides for each MHC allele is time-consuming, expensive and inefficient [4–6]. Computational algorithms are therefore very helpful in epitope identification: they yield a set of predicted peptides that are more likely to be epitopes, and these candidates can then be screened experimentally. In this way, computational methods improve the efficiency of epitope identification.

Various computer-based algorithms have been developed to predict T-cell epitopes [7–11]. Initially, methods for direct T-cell epitope prediction were based on amino acid sequence analysis and epitope motif alignment [12–14]. The first prediction method, which used a simple sequence motif search to identify MHC class I binding peptides, was developed by Sette et al. [15]. These direct methods have since been refined into position-specific scoring matrix (PSSM) approaches [15–17]. One drawback of these methods is that they assume an independent contribution of each amino acid in the peptide to the overall binding affinity, neglecting the effects of neighboring residues. In addition, these direct epitope prediction methods generally have relatively low accuracies [18].

Motivated by the limits of the direct methods, several more advanced machine learning algorithms have been applied to T-cell epitope prediction, such as artificial neural networks (ANNs) [19,20], hidden Markov models (HMMs) [21,22], support vector machines (SVMs) [7,18,23], and biosupport vector machines (BioSVMs) [24]. In the early stage, ANNs were the main tool for T-cell epitope prediction. Nielsen et al.
used a combination of several neural networks with a number of different sequence encoding strategies, and achieved accurate prediction of T-cell epitopes [2]. With the rapid development of support vector machines, SVM-based methods were applied to T-cell epitope prediction. In 2002, Donnes et al. [23] developed the SVMHC method, based on SVM, to predict T-cell epitopes. Then, Zhao et al. [7] developed an SVM for T-cell epitope prediction with an MHC class I restricted T-cell clone, in which each peptide was encoded by ten factors obtained from 188 physical properties of the 20 amino acids via multivariate statistical analysis. Yang et al. [24] used the BioSVM combined with a distributed encoding method to predict T-cell epitopes. Tung and Ho [1,25] designed an SVM-based system (named POPI) for the prediction of peptide immunogenicity and selected a feature set of informative physicochemical properties from MHC class I binding peptides.

As amino acids in peptides are non-numerical attributes, an encoding process is necessary for modeling. In the above-mentioned methods, amino acid physicochemical properties were extensively and successfully used for sequence-based prediction [25]. These conventional feature encodings depend on the amino acid sequence information of the peptide, but ignore its chemical molecular information. Therefore, we propose a novel feature combination based on two kinds of peptide encoding strategies, i.e., amino acid sequence properties and chemical molecular features. Based on these combined features, a powerful machine learning algorithm, random forest (RF), was adopted as the classifier to distinguish T-cell epitopes from non-T-cell epitopes. By taking both amino acid physicochemical properties and chemical molecular properties into account, a better result was obtained than with previous methods.

2. Datasets and methods

2.1. Dataset

All tests were conducted on the same dataset used in previous studies [7,26]. Peptides were synthesized by the simultaneous multiple peptide synthesis method and characterized using HPLC and mass spectrometry. LAU203-1.5 is an A*0201-restricted T-cell clone from tumor-infiltrated lymph node cells of a melanoma patient. A total of 203 synthetic peptides were selected based on results using single- and multiple-amino acid substitutions and combinatorial peptide library experiments with a chromium release antigen recognition assay. These peptides were tested against the LAU203-1.5 clone using the same assay. A peptide with percentage-specific lysis higher than 10% was considered positive. The dataset was composed of 167 non-epitopes and 36 epitopes.

2.2. Feature encoding

Each peptide is encoded based on two kinds of features.

2.2.1. Amino acid descriptors generation

In this study, amino acid properties were chosen for residue representation. A peptide can be represented as a series of amino acids by their single-character codes A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y, formulated as

$$R_1 R_2 R_3 R_4 R_5 R_6 R_7 R_8 \ldots R_L \qquad (1)$$

Suppose $H(R_1)$ is the hydrophobic value of the 1st residue $R_1$, $H(R_2)$ that of the 2nd residue $R_2$, and so forth. In this study, the peptide length L is equal to 10. In terms of these hydrophobic values, the peptide sequence of Eq. (1) can be converted to a digital signal. Each amino acid in a peptide was encoded by 14 properties: hydrophobicity, volume of side chains, polarity, polarizability, solvent accessibility, net charge index of side chains, molecular weight, PK-N, PK-C, melting point, optical rotation, entropy of formation, heat capacity, and absolute entropy. These amino acid properties are available from AAindex [27]. Therefore, the total number of input amino acid properties for a peptide is 140 (10 × 14).

2.2.2. Molecular descriptors generation

In quantitative structure–activity relationship (QSAR) studies, molecular descriptors of chemical structures are widely used for establishing various models [9]. Various structural attributes of the molecule are used as descriptors. In the current study, peptide molecular characterizations were obtained with the Dragon software (version 5.4, TALETE srl, Italy). The molecular structures of the peptides were first constructed in HyperChem (Hypercube Inc.) and further optimized using the AMBER force field [28]. The output files were transferred into Dragon to calculate several kinds of molecular descriptors, such as constitutional, topological, geometrical, and quantum chemical descriptors. Descriptor generation comprised the following steps:

(1) 927 molecular descriptors, which represent molecular structural information, were calculated using the Dragon software.
(2) All descriptors were pre-selected by eliminating: (i) descriptors that are not available for every compound; (ii) descriptors with only a small variation in magnitude across all structures; and (iii) descriptors whose value is zero for more than 80% of the compounds.

Finally, 217 molecular descriptors remained.
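The two feature-generation steps of Section 2.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the property table holds just two of the 14 properties (Kyte-Doolittle hydropathy and the molecular weight of the free amino acid, which need not be the AAindex scales used in the study), the 10-mer is a hypothetical example peptide, the descriptor values are made up, and the variance cut-off `min_std` is an assumed parameter because the text does not state one. Only the 80% zero-value rule is taken directly from the text.

```python
import statistics

# Assumed placeholder table: (hydropathy, molecular weight) for a few residues.
PROPERTIES = {
    "A": (1.8, 89.09),
    "E": (-3.5, 147.13),
    "G": (-0.4, 75.07),
    "I": (4.5, 131.17),
    "L": (3.8, 131.17),
    "T": (-0.7, 119.12),
    "V": (4.2, 117.15),
}


def encode_peptide(sequence):
    """Concatenate the property values of R1..RL into one flat feature vector."""
    features = []
    for residue in sequence:
        features.extend(PROPERTIES[residue])
    return features


def preselect(descriptors, min_std=1e-6, max_zero_fraction=0.8):
    """Keep descriptors that pass rules (i)-(iii); None marks a missing value."""
    kept = {}
    for name, values in descriptors.items():
        if any(v is None for v in values):
            continue  # (i) not available for every compound
        if statistics.pstdev(values) <= min_std:
            continue  # (ii) (almost) no variation across compounds
        if sum(v == 0 for v in values) / len(values) > max_zero_fraction:
            continue  # (iii) zero for more than 80% of compounds
        kept[name] = values
    return kept


vector = encode_peptide("ELAGIGILTV")  # hypothetical 10-residue peptide
print(len(vector))  # 10 residues x 2 properties here; 10 x 14 = 140 in the paper

# Made-up descriptor values for six peptides.
toy_descriptors = {
    "MW": [1100.2, 985.1, 1120.4, 1043.0, 1101.7, 998.3],
    "ALOGP": [-1.2, None, 0.4, -0.7, 0.1, -2.0],  # missing for one peptide
    "nAtoms": [150, 150, 150, 150, 150, 150],     # constant
    "nS": [0, 0, 0, 0, 0, 1],                     # zero for 5 of 6 peptides
    "MLOGP": [0.3, -0.8, 1.1, -0.2, 0.6, -1.4],
}
kept = preselect(toy_descriptors)
print(list(kept))
```

In this toy run only the two descriptors that are complete, variable and mostly non-zero survive the filter.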
Before combining the two kinds of features, both sets of peptide features were standardized to zero mean and unit variance.

2.3. Random forest

Random forest (RF) is a classifier consisting of an ensemble of classification and regression tree-structured classifiers [29]. All trees in the forest are unpruned. RF takes advantage of two powerful machine learning techniques: bagging and random feature selection. In bagging, each tree is trained on a bootstrap sample of the training data, and predictions are made by majority vote of the trees. RF is a further development of bagging: instead of using all features, it randomly selects a subset of features to split on at each node when growing a tree. To assess its prediction performance, RF performs a type of cross-validation in parallel with the training step by using the so-called out-of-bag (OOB) samples. Specifically, during training, each tree is grown on a particular bootstrap sample. Since bootstrapping samples with replacement from the training data, some of the data are not chosen for the training set ("left out"), while other samples are chosen several times. The left-out data are called the OOB samples. On average, each tree is grown using about 2/3 of the training data, leaving about 1/3 of the samples as OOB samples. Since OOB data have not been used in the tree construction, they can be used to estimate the prediction performance.

The RF algorithm implemented in the R package randomForest was used in this study [30]. The algorithm (for both classification and regression) can be stated as follows:

1. Draw ntree bootstrap samples from the original data, where ntree is the number of trees in the ensemble.
2. For each bootstrap sample, grow an unpruned classification or regression tree, with the following modification: at each node, rather than choosing the best split among all variables, randomly select mtry variables and choose the best split among those variables (bagging can be thought of as the special case of random forest in which mtry = p, the number of variables). In general, mtry is a positive integer between 1 and p [29]. In the current study, the value of mtry is 18.
3. Predict new data by aggregating the predictions of the ntree trees (i.e., majority vote for classification, average for regression).

2.3.1. Variable importance

RF, as an ensemble of trees, inherits the ability to estimate feature importance. A measure of how each feature contributes to the prediction performance of RF can be calculated in the course of training. The importance scores can be used to identify biomarkers or as a filter to remove non-informative variables. The most frequently used RF measure of feature importance is the mean decrease in classification accuracy based on permutation. For each tree, the classification accuracy on the OOB samples is determined both with and without random permutation of the values of each variable, one by one. The prediction accuracy after permutation is subtracted from the prediction accuracy before permutation, and the difference is averaged over all trees in the forest to give the permutation importance value. In the current research, the mean decrease in classification accuracy was adopted to measure variable importance. The importance of each variable j can be calculated as in Eq. (2):

$$\text{Importance}_j = \text{Accuracy}_{j,\text{normal}} - \text{Accuracy}_{j,\text{permuted}} \qquad (2)$$
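The bootstrap/OOB mechanism and Eq. (2) can be illustrated with a deliberately small sketch. It is not the randomForest implementation used in the study: each "tree" is reduced to a one-split decision stump (so the mtry random features are drawn once per stump rather than at every node of a full unpruned tree), the data are synthetic, and all function names and parameter values are assumptions made for the example. Feature 0 separates the two classes by construction, while features 1 and 2 are pure noise.

```python
import random


def train_stump(X, y, features):
    """Pick the (feature, threshold, label_above) with the fewest training errors."""
    best = None
    for f in features:
        values = sorted({row[f] for row in X})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            for label_above in (0, 1):
                errors = sum(
                    (label_above if row[f] > threshold else 1 - label_above) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, label_above)
    return best[1:]


def accuracy(stump, X, y):
    """Fraction of samples the stump classifies correctly."""
    f, threshold, label_above = stump
    hits = sum(
        (label_above if row[f] > threshold else 1 - label_above) == label
        for row, label in zip(X, y)
    )
    return hits / len(y)


def permutation_importance(X, y, ntree=25, mtry=2, seed=0):
    """Average over trees of (OOB accuracy) - (OOB accuracy after permuting feature j)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    totals = [0.0] * p
    for _ in range(ntree):
        in_bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        drawn = set(in_bag)
        oob = [i for i in range(n) if i not in drawn]  # left-out (OOB) samples
        stump = train_stump([X[i] for i in in_bag], [y[i] for i in in_bag],
                            rng.sample(range(p), mtry))
        X_oob = [X[i] for i in oob]
        y_oob = [y[i] for i in oob]
        normal = accuracy(stump, X_oob, y_oob)
        for j in range(p):
            column = [row[j] for row in X_oob]
            rng.shuffle(column)  # permute the values of feature j only
            X_perm = [row[:j] + [value] + row[j + 1:]
                      for row, value in zip(X_oob, column)]
            totals[j] += normal - accuracy(stump, X_perm, y_oob)
    return [total / ntree for total in totals]


# Synthetic data: feature 0 is informative, features 1 and 2 are noise.
data_rng = random.Random(1)
X, y = [], []
for i in range(60):
    label = i % 2
    x0 = data_rng.uniform(0.6, 1.0) if label else data_rng.uniform(0.0, 0.4)
    X.append([x0, data_rng.random(), data_rng.random()])
    y.append(label)

importance = permutation_importance(X, y)
print(importance)
```

Permuting the informative feature destroys the OOB accuracy of every stump that uses it, so its averaged score is clearly positive, whereas the noise features score close to zero.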

2.4. Model evaluation

The performance of the classifier was evaluated by the following measures [31]:

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (3)$$

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (4)$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (5)$$

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (6)$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN) \times (TP + FP) \times (TN + FN) \times (TN + FP)}} \qquad (7)$$

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives. Recall and precision are used to assess the performance of the classifier; here, recall is identical to sensitivity. Accuracy is the classification accuracy of the model over both the positive and the negative class. MCC is the Matthews correlation coefficient [31], which reflects both the sensitivity and the specificity of the prediction algorithm. Thus MCC, which is a weighted measure, is increasingly being used to measure the predictive capability of classifier models.

3. Results and discussion

3.1. Selection of the number of trees

The first step in RF modeling is to select the number of trees used in training. In the current study, a model was established with 2000 trees to examine the trend of the prediction error. The OOB classification error was plotted against the number of trees grown (Fig. 1). As can be seen from Fig. 1, the OOB error does not keep decreasing as the number of trees grows, no matter how many trees are built. The optimal number of trees is taken to be the one at which the lowest OOB error is reached with a relatively stable trend. When

Fig. 1. Selection of tree number used in random forest.


Fig. 2. Feature importance scores for all the variables.

the number of grown trees is equal to 1200, the lowest error is obtained. Therefore, in model training and testing, we used the 357 combined features to grow 1200 trees for all the models.

3.2. Calculating the feature importance by RF

Identification of informative physicochemical properties of peptides provides a better understanding of T-cell epitopes. The variable importance calculated by RF for the 357 combined features is shown in Fig. 2. Estimating the importance of each property is useful for understanding peptide immunogenicity. If the model accuracy shows a large decrease after the permutation of a variable, that variable can be regarded as informative. Some features make large contributions to the accuracy, such as amino acid hydrophobicity in peptide positions 1 and 9 (the 1st variable and 9th variable in Fig. 2), polarity in positions 4 and 9 (the 24th variable in Fig. 2), and some chemical molecular features: AlogP and topological charge index (the 278th variable and 312th variable in Fig. 2). Others make negative contributions to the accuracy, such as amino acid hydrophilicity in position 4 (the 14th variable), volume of side chains (the 50th and 51st variables), and the numbers of chemical elements (O and N) in the compounds (the 150th variable and 152nd variable). Some make no contribution at all, as the accuracy of the model changes little after permutation. We found that almost all features make positive contributions to the classification. To get a better result, the more important features have to be selected to establish the classification models. In the current study, the features with scores larger than 0.05 were chosen as the informative features. In this way, 188 features, including amino acid properties and chemical molecular features, were selected as the input variables for classification. Among these selected features, amino acid hydrophobicity is important for T-cell epitope classification. As we know, T-cell epitopes are typically located in the hydrophobic residue regions of peptides. In addition, some chemical molecular features related to hydrophobicity, such as AlogP2 (squared Ghose-Crippen octanol–water partition coefficient), MlogP (Moriguchi octanol–water partition coefficient), BEHp4 (highest eigenvalue no. 4 of Burden matrix/weighted by atomic polarizabilities), and JGI3 (mean topological charge index of order 3), also make large contributions to the classification. These results are consistent with those for the amino acid features.

3.3. Receiver operating characteristic curves

In this section, our aim is to assess the benefit of incorporating amino acid and chemical molecular information into T-cell epitope prediction. Therefore, we constructed three models based on amino acid properties, chemical molecular features, and combined features, respectively. For the prediction based on amino acid information only, each peptide sequence was encoded using the 14 kinds of physicochemical properties; the molecular model used only the 217 molecular features; and the combined model used the 188 selected features obtained by RF in Section 3.2. The results for the three models are listed in Table 1. To further investigate the effects of the different features, the receiver operating characteristic (ROC) curves for the three models were plotted in Fig. 3. The area under the ROC curve, denoted AUC, is often used as an additional performance index. A model with no predictive ability would yield the diagonal line. The closer the AUC is to 1, the greater the predictive ability of the model. As can be seen from Fig. 3, the AUC values for the three models were 0.9775, 0.9546, and 0.9868, respectively. These values indicate that the models we developed perform well. Peptide amino acid physicochemical properties play important roles in determining T-cell epitopes. Furthermore, the chemical molecular features also contribute greatly to the classification ability of the model, and the combined features give the best results. This indicates that a set of properties should be considered simultaneously, rather than a single property at a time, because of the strong correlation among different properties. The molecular information is a valuable complement for current T-cell epitope studies.

3.4. Comparison with other methods

As the dataset used in the current research was also adopted in previous studies, we compared the performance of the current approach with other methods in Table 2.

Table 1 Prediction results of three models based on different features (sensitivity, specificity and accuracy in %).

Features                 Sensitivity   Specificity   Accuracy   MCC      AUC
Amino acid properties    83.33         98.80         96.06      0.8609   0.9775
Molecular properties     80.56         97.01         94.09      0.7909   0.9546
Combined features        97.22         97.60         97.54      0.9194   0.9868
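As a worked check of Eqs. (3)–(7), the combined-features row of Table 1 can be reproduced from a confusion matrix. The counts below are not reported in the paper; they are inferred here from the dataset composition (36 epitopes, 167 non-epitopes) and the reported sensitivity and specificity, and should be read as an assumption. The function name is likewise only illustrative.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Evaluate Eqs. (3)-(7) from the four confusion-matrix counts."""
    return {
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp)),
    }


# Inferred counts (assumption): 35 of 36 epitopes and 163 of 167 non-epitopes correct.
combined = classification_metrics(tp=35, tn=163, fp=4, fn=1)
for name, value in combined.items():
    print(f"{name}: {value:.4f}")
```

With these counts the sensitivity, specificity and accuracy round to 97.22%, 97.60% and 97.54%, and the MCC to about 0.919, in agreement with the table.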


Fig. 3. ROC curves obtained with random forest using the three different feature sets.
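For readers who want to see how an AUC value such as those quoted for the ROC curves is obtained, the sketch below uses the rank-based (Mann-Whitney) formulation: the AUC equals the fraction of (epitope, non-epitope) pairs in which the epitope receives the higher score, with ties counted as one half. The labels and scores are made up; in an RF setting the score would typically be the fraction of trees voting for the epitope class, which is an assumption about how the curves were produced rather than a detail given in the text.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0   # correctly ranked pair
            elif pos == neg:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Made-up example: one of the four (positive, negative) pairs is ranked wrongly.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc(labels, scores))
```

A classifier that ranks every epitope above every non-epitope reaches 1.0, while uninformative scores give 0.5, which corresponds to the diagonal line of the ROC plot.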

In the current study, we compared our RF-based method with three SVM-based methods for predicting T-cell epitopes on the same data. The difference between these methods and ours lies in the peptide descriptors. The SVM-based methods did not use molecular properties; they all used sequence-based and amino acid properties. In more detail, in the SVM method each amino acid was encoded by ten factors, which were obtained from 188 physical properties of the 20 amino acids via multivariate statistical analyses; in the BioSVM method, the homology alignment scores between template peptides and submitted peptides were used as features for epitope identification; and the LSSVM method used 14 amino acid properties to encode the peptide. The parameters of all the SVM methods can be found in the corresponding papers. As can be seen from Table 2, the prediction results of our method are better than those of most of the SVM-based methods. The accuracy of our method is 97.54%, which is about 10% and 7% higher than that of the SVM method [7] and the BioSVM method [24], respectively. The sensitivity of the current method is 97.22%, which is about 20% and 14% higher than that of the SVM method and the BioSVM method, respectively. We further compared our method with the LSSVM method [26], which also used 14 amino acid properties to represent T-cell epitopes but did not include molecular properties. The accuracy of the current method is the same as that of the LSSVM method, namely 97.54%. LSSVM has a higher specificity, whereas our method has a better sensitivity. We therefore also compared the AUC values of the two methods, as the AUC is regarded as a more stable measure. The AUC value of the current method is 0.9868, which is higher than that of the LSSVM method (0.9630). These satisfactory results indicate that the combined features reveal more comprehensive information for representing T-cell epitopes. To further demonstrate the reliability of our model, we established an independent test dataset. For this independent test, we chose 10% of the data from the current dataset; the remaining 90% were used as the training set. The independent dataset therefore contains 6 positive samples and 27 negative samples. As

Table 2 Comparison of the current method with other methods (sensitivity, specificity and accuracy in %).

Method                 Sensitivity   Specificity   Accuracy   AUC
SVM                    76.21         92.24         87.86      0.9190
BioSVM                 83.29         93.06         90.31      0.9310
LSSVM                  94.44         98.20         97.54      0.9630
RF                     97.22         97.60         97.54      0.9868
Independent dataset    83.33         96.30         93.94      0.9691

can be seen from Table 2, the prediction accuracy, sensitivity, specificity, and area under the curve (AUC) values of the proposed method on the independent dataset are 93.94%, 83.33%, 96.30%, and 0.9691, respectively. Based on these results, we consider the presented method useful and reliable.

3.5. Web server

An effective prediction server is available at http://sysbio.yznu.cn/Research/Epitopesprediction.aspx; it is hosted on an ASP.NET 4.0 web server in a Windows 2003 server environment. Most of the web pages were written in HTML and JavaScript, and the backend programs were written in C# (.NET 4.0). In the web server, the model based on amino acid features and RF is used to predict sites in submitted sequences. Users can submit their uncharacterized sequences and select the specific residues whose sites are to be predicted. The system returns the prediction results, including T-cell epitopes with class labels and the corresponding prediction probabilities.

4. Conclusion

Accurate prediction of T-cell epitopes is crucial for vaccine development and clinical immunology. In this paper, we developed an accurate RF-based model to predict T-cell epitopes based on combined features, i.e., amino acid physicochemical properties and chemical molecular properties. The advantages of the current method are as follows. Firstly, RF-based learning is shown to be effective for T-cell epitope prediction, and RF provides useful tools for estimating feature importance; this information is helpful in determining the best set of features for an accurate prediction system, as well as for further understanding immune responses from the informative physicochemical properties. As a result, the RF method can select informative features and perform classification simultaneously. These advantages should make RF widely used in bioinformatics research. Secondly, peptide molecular properties were used to represent the peptides, and a better result was obtained with the combined features. Peptide molecular information has been widely used in QSAR and QSRR research, but in bioinformatics such information is usually ignored. In the current research, we made only a first attempt to use such information for T-cell epitope prediction, and more in-depth research on how to better represent the molecular properties is required. Finally,


a freely available web server for predicting peptide immunogenicity is established.

Acknowledgements

This work has been financially supported by National Nature Foundation Committee of P.R. China (grants no. 20875104, no. 11271374, no. 21175157), Chongqing Municipal Commission of Education (grant no. KJ131323), Hunan Postdoctoral Scientific Program, and Science and Technology Project of Hunan Province (2013FJ3093, 2013SK3268).

References

[1] C.-W. Tung, M. Ziehm, A. Kaemper, O. Kohlbacher, S.-Y. Ho, BMC Bioinform. 12 (2011).
[2] M. Nielsen, C. Lundegaard, P. Worning, S.L. Lauemoller, K. Lamberth, S. Buus, S. Brunak, O. Lund, Protein Sci. 12 (2003) 1007–1017.
[3] M. Nielsen, C. Lundegaard, O. Lund, C. Kesmir, Immunogenetics 57 (2005) 33–41.
[4] A.O. Weinzierl, D. Maurer, F. Altenberend, N. Schneiderhan-Marra, K. Klingel, O. Schoor, D. Wernet, T. Joos, H.-G. Rammensee, S. Stevanovic, Cancer Res. 68 (2008) 2447–2454.
[5] K.T. Hogan, M.A. Coppola, C.L. Gatlin, L.W. Thompson, J. Shabanowitz, D.F. Hunt, V.H. Engelhard, C.L. Slingluff, M.M. Ross, Immunol. Lett. 90 (2003) 131–135.
[6] C. Lemmel, S. Stevanovic, Methods 29 (2003) 248–259.
[7] Y.D. Zhao, C. Pinilla, D. Valmori, R. Martin, R. Simon, Bioinformatics 19 (2003) 1978–1984.
[8] Y. Zhao, M.-H. Sung, R. Simon, Methods Mol. Biol. (2007) 217–225.
[9] Y. Ren, X. Chen, M. Feng, Q. Wang, P. Zhou, Protein Peptide Lett. 18 (2011) 670–678.
[10] M. Bhasin, G.P.S. Raghava, Vaccine 22 (2004) 3195–3204.
[11] H. Tsurui, T. Takahashi, J. Pharm. Sci. 105 (2007) 299–316.
[12] R.R. Mallios, Bioinformatics 15 (1999) 432–439.
[13] M. Rajapakse, B. Schmidt, L. Feng, V. Brusic, BMC Bioinform. 8 (2007).
[14] H.G. Rammensee, J. Bachmann, N.P.N. Emmerich, O.A. Bachor, S. Stevanovic, Immunogenetics 50 (1999) 213–219.
[15] A. Sette, S. Buus, E. Appella, J.A. Smith, R. Chesnut, C. Miles, S.M. Colon, H.M. Grey, Proc. Natl. Acad. Sci. U. S. A. 86 (1989) 3296–3300.
[16] M.L. Hoffmann, L.M. Jablonski, K.K. Crum, S.P. Hackett, Y.I. Chi, C.V. Stauffacher, D.L. Stevens, G.A. Bohach, Infect. Immun. 62 (1994) 3396–3407.
[17] M. Halling-Brown, R. Quartey-Papafio, P.J. Travers, D.S. Moss, Int. J. Immunogenet. 33 (2006) 289–295.
[18] P. Donnes, O. Kohlbacher, Nucleic Acids Res. 34 (2006) W194–W197.
[19] M.C. Honeyman, V. Brusic, N.L. Stone, L.C. Harrison, Nat. Biotechnol. 16 (1998) 966–969.
[20] M. Filter, M. Eichler-Mertens, A. Bredenbeck, F.O. Losch, T. Sharav, A. Givehchi, P. Walden, P. Wrede, QSAR Comb. Sci. 25 (2006) 350–358.
[21] H. Noguchi, R. Kato, T. Hanai, Y. Matsubara, H. Honda, V. Brusic, T. Kobayashi, J. Biosci. Bioeng. 94 (2002) 264–270.
[22] H. Mamitsuka, Proteins 33 (1998) 460–474.
[23] P. Donnes, A. Elofsson, BMC Bioinform. 3 (2002).
[24] Z.R. Yang, F.C. Johnson, J. Chem. Inf. Model. 45 (2005) 1424–1428.
[25] C.-W. Tung, S.-Y. Ho, Bioinformatics 23 (2007) 942–949.
[26] S. Li, X. Yao, H. Liu, J. Li, B. Fan, Anal. Chim. Acta 584 (2007) 37–42.
[27] S. Kawashima, P. Pokarowski, M. Pokarowska, A. Kolinski, T. Katayama, M. Kanehisa, Nucleic Acids Res. 36 (2008) D202–D205.
[28] A.W. Goetz, M.J. Williamson, D. Xu, D. Poole, S. Le Grand, R.C. Walker, J. Chem. Theory Comput. 8 (2012) 1542–1555.
[29] L. Breiman, Mach. Learn. 45 (2001) 5–32.
[30] A. Liaw, M. Wiener, R News 2 (2002) 18–22.
[31] B.W. Matthews, BBA 405 (1975) 442–451.
