Journal of Theoretical Biology ∎ (∎∎∎∎) ∎∎∎–∎∎∎


Journal of Theoretical Biology journal homepage: www.elsevier.com/locate/yjtbi

Prediction of posttranslational modification sites from amino acid sequences with kernel methods

Yan Xu a, Xiaobo Wang b, Yongcui Wang b, Yingjie Tian c, Xiaojian Shao d, Ling-Yun Wu e, Naiyang Deng b,*

a Department of Information and Computer Science, University of Science and Technology Beijing, Beijing 100083, China
b Department of Applied Mathematics, College of Science, China Agricultural University, Beijing 10083, China
c Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing 100190, China
d Department of Mathematics and Information Science, BinZhou University, BinZhou 256603, China
e Institute of Applied Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China

HIGHLIGHTS

• In this paper, a novel encoding method PSPM (position-specific propensity matrices) is developed.
• Then a support vector machine (SVM) with the kernel matrix computed by PSPM is applied to predict the PTM sites.
• The prediction software can be freely downloaded from http://www.aporc.org/doc/wiki/PTMPred.

ARTICLE INFO

Article history: Received 28 June 2013; received in revised form 13 September 2013; accepted 16 November 2013

Keywords: Kinase-specific; O-glycosylation; Phosphorylation; Support vector machine

ABSTRACT

Post-translational modification (PTM) is the chemical modification of a protein after its translation and, for many proteins, one of the later steps in protein biosynthesis. It plays an important role in modifying the end product of gene expression and contributes to biological processes and disease conditions. However, the experimental methods for identifying PTM sites are both costly and time-consuming. Hence computational methods are highly desired. In this work, a novel encoding method PSPM (position-specific propensity matrices) is developed. Then a support vector machine (SVM) with the kernel matrix computed by PSPM is applied to predict the PTM sites. The experimental results indicate that the performance of the new method is better than or comparable with that of the existing methods. Therefore, the new method is a useful computational resource for the identification of PTM sites. A unified standalone software package, PTMPred, is developed. It can be used to predict all types of PTM sites if the user provides the training datasets. The software can be freely downloaded from http://www.aporc.org/doc/wiki/PTMPred. © 2013 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel.: +86 1062736265. E-mail address: [email protected] (N. Deng).

0022-5193/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.jtbi.2013.11.012

Please cite this article as: Xu, Y., et al., Prediction of posttranslational modification sites from amino acid sequences with kernel methods. J. Theor. Biol. (2013), http://dx.doi.org/10.1016/j.jtbi.2013.11.012

1. Introduction

The posttranslational modification (PTM) of proteins is a common biological mechanism that regulates protein functions. PTMs are covalent processing events that change the properties of a protein by proteolytic cleavage or by adding a modifying group to one or more amino acids (Mann and Jensen, 2003), and they occur on almost all proteins (Blom et al., 2004). Protein PTM sites can determine a protein's activity state, localization, turnover, and interactions with other proteins. For example, kinase cascades are turned on or off by the reversible addition or removal of phosphate groups in signaling. Ubiquitination marks cyclins for destruction at defined time points in the cell cycle. The study of PTM sites has been restricted due to the lack of suitable methods. Some high-throughput experimental technologies, including mass spectrometry (Kraft et al., 2003), peptide microarrays (Rychlewski et al., 2004), and phospho-specific proteolysis (Knight et al., 2003), have been applied to study PTM sites. However, such methods are usually expensive and time-consuming. A further vital problem is that the functions of proteins may be hampered or altered outside living organisms (Wang et al., 2008). For these reasons, computational methods to predict PTM sites are urgently needed.

Many studies have indicated that sequence-based prediction approaches, such as protein subcellular location prediction (Chou and Elrod, 1999; Chou and Cai, 2002), identification of membrane proteins and their types (Cai et al., 2003; Chou and Shen, 2007a), identification of enzymes and their functional classes (Cai and Chou, 2005; Chou and Cai, 2004), identification of GPCRs (G-protein-coupled receptors) and their types (Xiao et al., 2009; Lin et al., 2009), protein cleavage site prediction (Chou and Shen, 2009; Chou, 1993, 1996), signal peptide prediction (Chou and Shen, 2009, 2007b), and protein 3D structure prediction based on sequence alignment (Chou, 2004), can provide timely and very useful information and insights for both basic research and drug design.

The support vector machine (SVM) has been efficiently applied to pattern recognition problems in computational biology and bioinformatics, including the prediction of protein subcellular location (Chou and Cai, 2002), membrane protein type (Cai et al., 2003), posttranslational modification (Xu et al., 2013), GalNAc-transferase (Chou, 1995) and so on. SVM also performs well in the prediction of PTM sites. Kim et al. (2004) first used SVM with the standard binary encoding scheme to predict phosphorylation sites and obtained better outcomes than all previous methods. Wong et al. (2007) applied SVM based on coupling patterns to predict kinase-specific phosphorylation sites. Shao et al. (2009a) used SVM with the bi-profile Bayes feature to predict methylation sites, and Chang et al. (2009) applied SVM combined with structural features to predict protein tyrosine sulfation sites. Gao et al. (2010) used SVM to predict general and kinase-specific phosphorylation sites. Lately, Zhao et al. (2012) predicted protein phosphorylation sites via SVM by using the composition of k-spaced amino acid pairs. Although many advanced methods have been developed for the prediction of PTM sites, there is still room to improve the accuracy. When machine learning methods are used to predict PTM sites, the encoding scheme (i.e. the construction of input feature vectors) is very important.
In this work, two encoding schemes are introduced for predicting PTM sites: PSPM (position-specific propensity matrices), which is developed by us, and PSAAP (position-specific amino acid propensity), which was constructed by Tang et al. (2007). The proposed PSPM encoding scheme characterizes the position-specific propensity of amino acid pairs surrounding the potential PTM sites, whereas PSAAP reflects the amino acids at different positions surrounding the corresponding PTM sites. In addition, PSPM considers the sequence-order effects that can influence PTM sites, whereas PSAAP ignores them. The new algorithms are mainly applied to two reversible PTM types, phosphorylation and O-glycosylation. The results show that the performance of the new algorithms is better than or comparable with that of previous methods. Comparing the two encoding schemes, PSPM performs better than PSAAP. The dimensions of PSPM and PSAAP are much lower than those of other encoding schemes such as conventional binary encoding and coupling patterns. To facilitate the biological community, an executable software package, "PTMPred", is developed, which can be used for proteome-wide PTM site prediction. As more and more PTM sites are confirmed experimentally, PTMPred allows users to submit their own training datasets to obtain different predictors. To examine the behavior of PTMPred for general PTM sites, it is also used to predict sulfated proteins, where it obtains good performance.
As pointed out by a comprehensive review (Chou, 2011a) and demonstrated by a series of recent publications (Chen et al., 2013, 2012b; Xu et al., 2013), to develop a useful statistical predictor for a protein or peptide system, we need to engage in the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the protein or peptide samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (v) establish a user-friendly web-server for the predictor that is accessible to the public. Below, let us describe how to deal with these steps one by one. Fig. 1 is the flow chart of the PTM site prediction.

[Fig. 1. System flow chart of the PTM site prediction: input query proteins → encode peptide P through PSPM matrix Z → output Q(P) by SVM engine → if Q(P) > cutoff, P is a PTM site; otherwise P is a non-PTM site.]

2. Methods

2.1. PSPM encoding scheme

To successfully use SVM as a powerful classifier, the key is how to effectively define a feature vector to formulate the statistical samples concerned. According to Eq. (6) of Xu et al. (2013), the feature vector for any protein, peptide, or biological sequence is nothing but a general form of pseudo amino acid composition (Chou, 2011a, 2005), which can be formulated as

\[ P = [\psi_1 \ \psi_2 \ \cdots \ \psi_u \ \cdots \ \psi_\Omega]^{T} \tag{1} \]

where $T$ is the transpose operator, the components $\psi_1, \psi_2, \ldots$ depend on how the desired information is extracted from the statistical samples concerned, and the subscript $\Omega$ is an integer representing the dimension of the feature vector $P$. Below, let us describe how to extract useful information from the training dataset to define the peptide samples concerned via Eq. (1). A new feature encoding scheme called position-specific propensity matrix (PSPM), which uses position-specific amino acid pairs to construct input features, is proposed in this section. The feature vector $P$ is based on a propensity matrix $Z$, which is constructed as follows:

• In the first step, considering 21 types of amino acids (20 native and one dummy amino acid X), there are 441 kinds of dipeptide. Suppose that the positive training dataset $M_{pos}$ consists of $l$ positive sequence fragments and that the length of every sequence fragment is $m$. The value of $m$ is determined empirically. In this work, we set $m$ to 13 for phosphorylation site prediction and to 41 for mucin-type O-glycosylation prediction after some trials (see Tables 1 and 5). A position-specific dipeptide composition matrix $A^{+}$ is constructed based on $M_{pos}$. The j-th column of $A^{+}$ is defined as follows:

\[ A_j^{+} = (a_{1,j}^{+}, a_{2,j}^{+}, \ldots, a_{441,j}^{+})^{T}, \quad j = 1, 2, \ldots, m-1, \tag{2} \]

where $a_{i,j}^{+}$ represents the frequency of the i-th dipeptide at the j-th position of the sequence fragments in $M_{pos}$. For example, $a_{1,j}^{+}$ denotes the frequency of the first dipeptide (AA) at the j-th position. Obviously, the size of the matrix $A^{+}$ is $441 \times (m-1)$.

• In the second step, the position-specific dipeptide composition matrix $A^{-}$ is computed similarly for the negative training dataset $M_{neg}$. However, there are many more sequence fragments in the negative dataset than in the positive dataset. We randomly select $l$ sequence fragments from $M_{neg}$ 20 times, and use the average frequencies over the 20 sampled datasets to construct the position-specific dipeptide composition matrix $A^{-}$ with size $441 \times (m-1)$. The j-th column of $A^{-}$ is denoted as follows:

\[ A_j^{-} = (a_{1,j}^{-}, a_{2,j}^{-}, \ldots, a_{441,j}^{-})^{T}, \quad j = 1, 2, \ldots, m-1. \tag{3} \]

In addition, the standard deviations of these frequencies are also calculated and recorded:

\[ S_j^{-} = (s_{1,j}^{-}, s_{2,j}^{-}, \ldots, s_{441,j}^{-})^{T}, \quad j = 1, 2, \ldots, m-1. \tag{4} \]

• In the third step, the propensity matrix $Z^{k=0}$ with size $441 \times (m-1)$ is constructed by

\[ Z^{k=0} = \begin{pmatrix} z_{1,1}^{k=0} & z_{1,2}^{k=0} & \cdots & z_{1,m-1}^{k=0} \\ z_{2,1}^{k=0} & z_{2,2}^{k=0} & \cdots & z_{2,m-1}^{k=0} \\ \vdots & \vdots & & \vdots \\ z_{441,1}^{k=0} & z_{441,2}^{k=0} & \cdots & z_{441,m-1}^{k=0} \end{pmatrix}_{441 \times (m-1)} \]

where $z_{i,j}^{k=0}$ denotes the propensity of the i-th dipeptide appearing at the j-th position in the fragments, and $k = 0$ denotes the nearest amino acid pairs. The mathematical definition of $z_{i,j}^{k=0}$ is as follows:

\[ z_{i,j}^{k=0} = (a_{i,j}^{+} - a_{i,j}^{-}) / s_{i,j}^{-}. \tag{5} \]

• In the fourth step, we expand the dipeptide to amino acid pairs that are separated by $k$ other amino acids, with $k = 1, 2, \ldots, m-2$. Repeating steps 1–3 above, we obtain the matrices $Z^{k=1}, Z^{k=2}, \ldots, Z^{k=m-2}$ with sizes $441 \times (m-2), 441 \times (m-3), \ldots, 441 \times 1$, respectively. At last we merge these matrices into one big matrix, and the total propensity matrix $Z$ is obtained by

\[ Z = Z^{k=0} \oplus Z^{k=1} \oplus \cdots \oplus Z^{k=m-2} = \begin{pmatrix} z_{1,1}^{k=0} & z_{1,2}^{k=0} & \cdots & z_{1,m-1}^{k=0} & z_{1,1}^{k=1} & z_{1,2}^{k=1} & \cdots & z_{1,m-2}^{k=1} & \cdots & z_{1,1}^{k=m-2} \\ z_{2,1}^{k=0} & z_{2,2}^{k=0} & \cdots & z_{2,m-1}^{k=0} & z_{2,1}^{k=1} & z_{2,2}^{k=1} & \cdots & z_{2,m-2}^{k=1} & \cdots & z_{2,1}^{k=m-2} \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots & & \vdots \\ z_{441,1}^{k=0} & z_{441,2}^{k=0} & \cdots & z_{441,m-1}^{k=0} & z_{441,1}^{k=1} & z_{441,2}^{k=1} & \cdots & z_{441,m-2}^{k=1} & \cdots & z_{441,1}^{k=m-2} \end{pmatrix}_{441 \times n} \tag{6} \]

where the symbol $\oplus$ represents the orthogonal sum and $n = m \times (m-1)/2$. Obviously, a large value of $z_{i,j}^{k}$ implies that the i-th amino acid pair separated by $k$ other amino acids $(k = 0, 1, 2, \ldots, m-2)$ is enriched at the j-th position. In practice, we use either the whole matrix $Z$ or a part of it. When the whole matrix is used, we need to consider $Z^{k=0}, Z^{k=1}, \ldots, Z^{k=m-2}$. To encode one potential site (i.e. a fragment of $m$ amino acids), the elements $\psi_1, \psi_2, \ldots$ are constructed as follows. The first $m-1$ elements come from the matrix $Z^{k=0}$, and the j-th element is $z_{a,j}^{k=0}$ $(j = 1, 2, \ldots, m-1)$, where $a$ is the index of the dipeptide type at the j-th position. Similarly, the next $m-2$ elements, $m-3$ elements, …, 1 element come from $Z^{k=1}, Z^{k=2}, \ldots, Z^{k=m-2}$, respectively. When a part of the matrix $Z$ is used, we only consider a part of $Z^{k=0}, Z^{k=1}, \ldots, Z^{k=m-2}$. For instance, if only $Z^{k=0}$ and $Z^{k=1}$ are considered, a feature vector $x$ of dimension $(m-1) + (m-2)$ is constructed in a similar way.

For example, given a sequence fragment ACDESEFGH with nine amino acids and jointly considering $Z^{k=0}$ and $Z^{k=1}$, the dipeptides at positions 1–8 are AC, CD, DE, ES, SE, EF, FG, GH, and the amino acid pairs separated by one other amino acid $(k = 1)$ at positions 1–7 are AD, CE, DS, EE, SF, EG, FH, respectively. With the 21 residues ordered alphabetically by one-letter code (X last) and the pairs indexed row by row, the indices of AC, CD, GH, AD, CE and FH are 2, 24, 112, 3, 25 and 91, respectively. Therefore the corresponding feature vector $x$ is

\[ x = (x_1, x_2, \ldots, x_8, x_9, x_{10}, \ldots, x_{15}) = (z_{2,1}^{k=0}, z_{24,2}^{k=0}, \ldots, z_{112,8}^{k=0}, z_{3,1}^{k=1}, z_{25,2}^{k=1}, \ldots, z_{91,7}^{k=1}). \]

This is called the PSPM feature of the fragment ACDESEFGH with $Z^{k=0}, Z^{k=1}$. In this work, we jointly consider $Z^{k=0}, Z^{k=1}, \ldots, Z^{k=8}$ for PTM site prediction.

Table 1 The size of positive and negative datasets and optimal parameters for the well-studied kinases CDK, CK2, PKA and PKC and the other selected kinases. Phosphorylation datasets are obtained from Phospho.ELM (version 9.0).

Protein kinase   Positive data size   Negative data size   Optimal kernel   Training C   Training γ   Optimal window size
CDK group        93                   2295                 Sigmoid          0.0875       0.125        −6 to +6
CK2 group        215                  4569                 Sigmoid          0.0875       0.125        −6 to +6
PKA group        321                  11,175               Sigmoid          0.0875       0.125        −6 to +6
PKC group        225                  5931                 Sigmoid          1            0.125        −6 to +6
ATM              58                   2350                 Sigmoid          100          0.125        −6 to +6
CDK1             144                  4947                 Sigmoid          0.5          0.725        −6 to +6
CDK2             67                   1729                 Sigmoid          500          0.375        −6 to +6
CK2 alpha        113                  3298                 Sigmoid          100          0.175        −6 to +6
PKB group        73                   4434                 RBF              100          0.0975       −6 to +6
PKC alpha        133                  4174                 Sigmoid          100          0.125        −6 to +6
MAPK1            166                  5839                 Sigmoid          100          0.975        −6 to +6
MAPK3            86                   3504                 Sigmoid          100          0.0875       −6 to +6

2.2. Classification

The positive and negative datasets are generated from local peptides flanking a potential PTM residue, an approach that is widely used in the literature. First, all sequence fragments of length $m$ with a potential PTM residue at the center position are extracted from the experimentally identified PTM protein sequences. If the center residue is annotated as a PTM site, the fragment is placed in the positive dataset, otherwise in the negative dataset. The prediction problem is formulated as a two-class classification problem and solved by the SVM algorithm. In order to discriminate between PTM sites and non-PTM sites, the standard SVM learns a decision function from the sequence fragments in the training datasets $M_{pos}$ and $M_{neg}$. The decision function takes the form:

\[ g(x) = \sum_{i:\, x_i \in M_{pos}} \xi_i K(x, x_i) - \sum_{i:\, x_i \in M_{neg}} \xi_i K(x, x_i) + b, \tag{7} \]

where $K(\cdot, \cdot)$ is a kernel function, and the non-negative weights $\xi_i$ and the real number $b$ are computed during SVM training by solving a convex quadratic programming problem. More details about the theory of SVM can be found in Vapnik (1995, 1998) and Deng et al. (2012). However, the value of $g(x)$ is a real number. In this work, in order to obtain a probability output from the SVM, i.e. the probability $P(y|x)$ that an unlabeled input $x$ belongs to a certain class, we build a logistic model to map the outputs of the SVM into estimated probabilities (Platt, 1999):

\[ \Pr(y = 1 \mid x) = P_{A,B}(g(x)) \equiv 1 / [1 + \exp(A \, g(x) + B)]. \tag{8} \]
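The input $x$ of Eqs. (7) and (8) is the PSPM feature vector of Section 2.1. As an illustration only, that encoding can be sketched in Python as follows. This is not the PTMPred implementation: the residue ordering (which reproduces the indices AC = 2, CD = 24 and GH = 112 quoted in the text), the function names `pair_frequencies`, `build_pspm` and `encode`, the use of the population standard deviation, the small constant `eps` added to the denominator, and the value 0 assigned to pairs never observed at a position are all assumptions of this sketch.

```python
from itertools import product
from statistics import mean, pstdev
import random

# Assumed residue order: 20 native amino acids alphabetically by one-letter
# code, then the dummy residue X (hypothetical, inferred from the text).
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
PAIR_INDEX = {a + b: i
              for i, (a, b) in enumerate(product(ALPHABET, repeat=2), start=1)}


def pair_frequencies(fragments, k):
    """Frequencies of residue pairs separated by k residues, per position.

    Returns one dict {pair: frequency} for each of the m - 1 - k positions.
    """
    m = len(fragments[0])
    counts = [dict() for _ in range(m - 1 - k)]
    for frag in fragments:
        for j in range(m - 1 - k):
            pair = frag[j] + frag[j + k + 1]
            counts[j][pair] = counts[j].get(pair, 0) + 1
    n = len(fragments)
    return [{p: c / n for p, c in col.items()} for col in counts]


def build_pspm(pos, neg, k, n_rounds=20, eps=1e-6, seed=0):
    """Propensity matrix Z^k, stored sparsely as per-position dicts {pair: z}."""
    rng = random.Random(seed)
    a_pos = pair_frequencies(pos, k)
    # Second step: draw len(pos) negative fragments n_rounds times.
    samples = [pair_frequencies(rng.sample(neg, len(pos)), k)
               for _ in range(n_rounds)]
    z = []
    for j, col in enumerate(a_pos):
        pairs = set(col) | {p for s in samples for p in s[j]}
        zj = {}
        for p in pairs:
            freqs = [s[j].get(p, 0.0) for s in samples]
            # Eq. (5); eps guards against a zero standard deviation (assumption).
            zj[p] = (col.get(p, 0.0) - mean(freqs)) / (pstdev(freqs) + eps)
        z.append(zj)
    return z


def encode(fragment, pspm_by_k):
    """Concatenate the z-values for k = 0, 1, ... (Z^{k=0} first)."""
    x = []
    for k, z in enumerate(pspm_by_k):
        for j, zj in enumerate(z):
            x.append(zj.get(fragment[j] + fragment[j + k + 1], 0.0))
    return x


# Toy fragments for illustration only; they are not taken from Phospho.ELM.
pos = ["ACDESEFGH", "ACDESEFGK", "GCDESAFGH"]
neg = ["KLMNSPQRT", "VWYXSACDE", "HIKLSMNPQ", "ACDESKLMN", "PQRTSVWYA"]
pspm = [build_pspm(pos, neg, k) for k in (0, 1)]
x = encode("ACDESEFGH", pspm)
print(len(x))  # 8 + 7 = 15 features
```

Storing each column of $Z^{k}$ as a dictionary keeps the sketch short; a dense $441 \times (m-1-k)$ array indexed through `PAIR_INDEX` would be an equivalent layout.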

Parameters $A$ and $B$ can be obtained by solving the following model:

\[ \min_{z = (A,B)} F(z) = -\sum_{i=1}^{l} \bigl( t_i \log(p_i) + (1 - t_i) \log(1 - p_i) \bigr), \tag{9} \]

\[ \text{s.t.} \quad p_i = P_{A,B}(g(x_i)), \tag{10} \]

\[ t_i = \begin{cases} (N_{+} + 1)/(N_{+} + 2) & \text{if } y_i = +1, \\ 1/(N_{-} + 2) & \text{if } y_i = -1, \end{cases} \quad i = 1, \ldots, l, \tag{11} \]

where $N_{+}$ and $N_{-}$ represent the numbers of fragments in $M_{pos}$ and $M_{neg}$ used during the training process, respectively. A new sequence fragment $x$ is then predicted to be positive or negative depending on whether the posterior class probability of $x$ is greater or less than a predetermined cutoff value:

\[ f(x) = \begin{cases} 1 & \text{if } \Pr(y = 1 \mid x) > \text{cutoff value}, \\ -1 & \text{otherwise}, \end{cases} \tag{12} \]

where 1 represents a PTM site and −1 represents a non-PTM site. The cutoff value is 0.55, chosen to balance the true positive and true negative rates, unless otherwise stated. The SVM algorithm is implemented with LIBSVM (Chang and Lin, 2001), a public and widely used SVM library.

2.3. Performance evaluation criteria

To evaluate the performance of our methods we adopt five commonly used measures: sensitivity (Sen), specificity (Spe), accuracy (Acc), precision (Pre) and the Matthews correlation coefficient (MCC). They are defined as follows:

\[ \mathrm{Sen} = TP / (TP + FN), \tag{13} \]

\[ \mathrm{Spe} = TN / (TN + FP), \tag{14} \]

\[ \mathrm{Acc} = (TP + TN) / (TP + FP + TN + FN), \tag{15} \]

\[ \mathrm{Pre} = TP / (TP + FP), \tag{16} \]

\[ \mathrm{MCC} = \frac{TP \times TN - FN \times FP}{\sqrt{(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)}}, \tag{17} \]

where TP, FN, TN and FP denote the numbers of true positives, false negatives, true negatives and false positives, respectively. Apart from the above criteria, the area under the Receiver Operating Characteristic curve (AUC) (Gribskov and Robinson, 1996) is also used as an indicator. Readers may also refer to another formulation of these metrics given by Eqs. 10–14 in Chen et al. (2013). This set of five metrics is valid only for single-label systems. For the multi-label systems whose existence has become more frequent in system biology (Chou et al., 2011, 2012) and system medicine (Chen et al., 2012a; Xiao et al., 2013), a different set of metrics as defined in Chou (2013) is needed.
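For concreteness, the decision rule of Eq. (12) and the measures of Eqs. (13)–(17) can be sketched in Python as below. The function names `classify` and `evaluate`, the toy labels and probabilities, and the convention of returning an MCC of 0 when its denominator vanishes are assumptions of this sketch; the ratios are undefined (division by zero) if a class is entirely absent, which the sketch does not handle.

```python
from math import sqrt


def classify(probabilities, cutoff=0.55):
    """Eq. (12): +1 if the posterior probability exceeds the cutoff, else -1."""
    return [1 if p > cutoff else -1 for p in probabilities]


def evaluate(y_true, y_pred):
    """Sen, Spe, Acc, Pre and MCC of Eqs. (13)-(17) for labels in {+1, -1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == -1 and p == -1 for t, p in pairs)
    fp = sum(t == -1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == -1 for t, p in pairs)
    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return {
        "Sen": tp / (tp + fn),
        "Spe": tn / (tn + fp),
        "Acc": (tp + tn) / (tp + fp + tn + fn),
        "Pre": tp / (tp + fp),
        # Returning 0 for a vanishing denominator is a convention assumed here.
        "MCC": (tp * tn - fn * fp) / denom if denom else 0.0,
    }


# Toy example: 4 true sites and 4 non-sites with made-up probabilities.
y_true = [1, 1, 1, 1, -1, -1, -1, -1]
probs = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
scores = evaluate(y_true, classify(probs))
print(scores)
```

In this toy case TP = TN = 3 and FP = FN = 1, so Sen, Spe, Acc and Pre all equal 0.75 and MCC equals 0.5.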

3. Results

Based on the proposed PSPM and PSAAP encoding schemes, four algorithms are implemented and applied to two reversible PTM types, phosphorylation and glycosylation. The predictor based on the PSPM encoding scheme is named PSPM; similarly, the predictor PSAAP is based on the PSAAP encoding scheme. To examine the performance of statistical prediction methods, k-fold cross-validation tests (with k = 6, 8 and 10) and the jackknife test are often used in computational experiments. The jackknife test is deemed the most objective, as it always yields a unique result for a given benchmark dataset (Chou and Zhang, 1995; Chou, 2011b). Therefore, the jackknife test has been increasingly and widely adopted by investigators to test the power of various prediction methods (Chou, 2013; Chen et al., 2009; Zeng et al., 2009; Ding et al., 2009; Khosravian et al., 2013; Mohabatkar et al., 2013; Chen and Li, 2013; Mei, 2012; Fan and Li, 2012). However, to reduce the computational time, we adopt 10-fold cross-validation, as done in previous studies that used SVM as the prediction engine for PTM site prediction (Kim et al., 2004; Wong et al., 2007; Shao et al., 2009a; Chang et al., 2009). The parameters and kernel functions are selected by trials according to the AUC.

3.1. Kinase-specific phosphorylation sites

3.1.1. Data preprocessing

The Phospho.ELM database (version 9.0, released in September 2010) (Diella et al., 2011) is used in this work. This dataset has been used as a benchmark to test the performance of many previously published computational phosphorylation prediction models. It contains a collection of experimentally verified serine (S), threonine (T) and tyrosine (Y) phosphorylation sites in eukaryotic proteins. The entries are manually annotated based on the scientific literature, and provide information about the phosphorylated proteins and the exact positions of the known phosphorylated residues that are catalyzed by a given kinase. Phospho.ELM (version 9.0) contains 8718 substrate proteins from different species. From Phospho.ELM 9.0 we selected 12 kinases, each of which has at least 50 experimentally identified phosphorylation sites. The threshold of 50 was chosen because fewer sites may not be convincing for testing the efficacy of a machine learning method. The peptide fragments $P$ can be generally formulated by

\[ P = R_{-\xi} R_{-(\xi-1)} \cdots R_{-2} R_{-1} \; \mathrm{S/T/Y} \; R_{+1} R_{+2} \cdots R_{+(\xi-1)} R_{+\xi} \tag{18} \]

where the subscript $\xi$ is an integer, $R_{-\xi}$ is the $\xi$-th downstream amino acid residue from S/T/Y, $R_{+\xi}$ the $\xi$-th upstream amino acid residue, and so forth. We define a peptide as a phosphorylation or non-phosphorylation peptide if its center is a phosphorylation or non-phosphorylation site, respectively. If the number of upstream or downstream residues in a peptide is less than $\xi$, the lacking residues are filled with the dummy code X. To reduce the bias due to homology, which may result in overestimation of the cross-validation accuracy, we discard highly homologous sequences (over 70% identity to the fragment) in the training datasets, as in Wang et al. (2008) and Kim et al. (2004). The final numbers of positive and negative sequence fragments are listed in Table 1.

3.1.2. Comparison of proposed methods with the existing tools

One particular challenge in training SVM comes from the fact that the available dataset is highly unbalanced: the number of PTM sites (positive dataset) is much smaller than the number of non-PTM sites (negative dataset). The standard SVM algorithm does not consider class imbalance. Following the approach used in the literature (Kim et al., 2004; Wong et al., 2007), we balance the positive and negative datasets during the cross-validation by randomly selecting negative sequence fragments from the whole negative dataset. The cross-validation is performed 50 times and the average performance is shown.

To date, many computational methods have been developed to predict kinase-specific phosphorylation sites, such as that of Dang et al. (2008). Table 2 shows the comparison of our proposed models PSPM and PSAAP with other phosphorylation site predictors on the kinase datasets. The performance scores reported in the corresponding literature are shown: AMS 2.0 (Plewczynski et al., 2008), PPSP (Xue et al., 2006), GPS (Zhou et al., 2004; Xue et al., 2005), GPS 2.0 (Xue et al., 2008), PredPhospho (Kim et al., 2004), KinasePhos 2.0 (Wong et al., 2007), KinasePhos 1.0 (Huang et al., 2005a, 2005b), NetPhosK (Blom et al., 2004), Scansite (Obenauer et al., 2003), IEPP (Wang et al., 2008), PostMod (Inkyung et al., 2010), Phos3D (Durek et al., 2009), BAE (Yu et al., 2010) and PAAS (Sobolev et al., 2010) (neither web accessible), Musite (Gao et al., 2010), GPS 2.1 (Xue et al., 2011), PPRED (Biswas et al., 2010) and CKSAAP-PhSite (Zhao et al., 2012) (general S, T and Y residues, not kinase-specific), and PHOSFER (Trost and Kusalik, 2013) (plants only). Considering the four well-studied kinases CDK, CK2, PKA and PKC, PSPM outperforms all other methods for kinase CDK. KinasePhos 1.0 and PredPhospho are the best of all for kinase CK2. For kinases PKA and PKC, the existing tools obtain almost the same results, except AMS 2.0, Consensus, Scansite and PostMod. For the other kinases, the performance of our methods PSPM and PSAAP is much better than that of AMS 2.0 and GPS 2.0.

Table 2 The performance of our new methods for most main kinases, using 10-fold cross validation. For comparison, corresponding performance scores reported in the literature are shown in the table: AMS 2.0 (Plewczynski et al., 2008), PPSP (Xue et al., 2006), GPS (Zhou et al., 2004; Xue et al., 2005), GPS 2.0 (Xue et al., 2008), PredPhospho (Kim et al., 2004), KinasePhos 2.0 (Wong et al., 2007), KinasePhos 1.0 (Huang et al., 2005a, 2005b), NetPhosK (Blom et al., 2004), Scansite (Obenauer et al., 2003), IEPP (Wang et al., 2008), PostMod (Inkyung et al., 2010), Phos3D (Durek et al., 2009), Musite (Gao et al., 2010), GPS2.1 (Xue et al., 2011). LOO means leave-one-out cross-validation. A dash (–) marks a value that is not reported.

Kinase          Method           Pre (%)   Sen (%)   Spe (%)   MCC     AUC
CDK group       PSPM             92.95     97.84     92.52     0.905   0.947
                PSAAP            90.93     92.79     90.64     0.835   0.937
                AMS 2.0          79.66     59.80     –         –       –
                PPSP             –         91.21     93.35     –       –
                GPS              –         85.00     90.00     –       –
                GPS 2.0          –         85.66     89.87     0.760   –
                PredPhospho      –         95.02     95.10     –       –
                Consensus        –         45.45     99.16     –       –
                KinasePhos 2.0   85.00     95.00     83.00     –       –
                KinasePhos 1.0   83.00     87.00     82.00     –       –
                PostMod          60.00     88.00     –         –       –
CK2 group       PSPM             81.35     87.39     79.93     0.675   0.858
                PSAAP            69.04     94.11     57.74     0.556   0.806
                AMS 2.0          82.76     43.33     –         –       –
                PPSP             –         81.84     88.22     –       –
                GPS              –         83.00     88.00     –       –
                GPS 2.0          –         81.32     88.29     0.698   –
                PredPhospho      –         83.90     96.43     –       –
                Consensus        –         76.51     88.78     –       –
                KinasePhos 2.0   88.00     84.00     88.00     –       –
                KinasePhos 1.0   95.00     79.00     96.00     –       –
                Scansite         –         33.24     100.00    0.48    –
                PostMod          71.00     58.00     –         –       –
                Phos3D           –         88.00     61.00     –       –
                Musite LOO       –         81.42     89.99     –       –
PKA group       PSPM             84.75     89.42     83.83     0.734   0.862
                PSAAP            81.50     88.80     80.44     0.689   0.902
                AMS 2.0          90.43     58.15     –         –       –
                PPSP             –         90.13     90.28     –       –
                GPS              –         89.00     90.00     –       –
                GPS 2.0          –         75.12     91.51     –       –
                PredPhospho      –         88.32     91.11     –       –
                Consensus        –         56.09     98.70     –       –
                KinasePhos 2.0   89.00     91.00     89.00     –       –
                KinasePhos 1.0   85.00     91.00     84.00     –       –
                Scansite         –         46.99     99.25     0.570   –
                NetPhosK         –         –         –         0.610   –
                IEPP             –         88.70     91.60     0.350   –
                PostMod          81.00     59.00     –         –       –
                Phos3D           –         86.00     80.00     –       –
                Musite LOO       –         81.62     90.03     –       –
PKC group       PSPM             80.71     74.82     82.04     0.571   0.846
                PSAAP            69.10     90.23     59.63     0.523   0.788
                AMS 2.0          84.85     23.53     –         –       –
                PPSP             –         80.99     83.40     –       –
                GPS              –         83.00     82.00     –       –
                GPS 2.0          –         58.56     94.18     0.564   –
                PredPhospho      –         78.71     85.79     –       –
                Consensus        –         47.54     89.78     –       –
                KinasePhos 2.0   86.00     81.00     87.00     –       –
                KinasePhos 1.0   87.00     77.00     88.00     –       –
                NetPhosK         –         –         –         0.480   –
                PostMod          76.00     30.00     –         –       –
                Phos3D           –         80.00     81.00     –       –
ATM             PSPM             89.2      96.00     88.20     0.845   0.936
                PSAAP            93.39     85.39     93.98     0.797   0.943
                AMS 2.0          91.00     75.00     –         –       –
                PPSP             –         86.05     91.39     –       –
                GPS 2.0          –         97.14     95.36     0.925   –
                KinasePhos 2.0   96.00     100.00    96.00     –       –
                KinasePhos 1.0   92.00     87.00     92.00     –       –
                PostMod          85.00     60.00     –         –       –
CaM–KII group   PSPM             78.25     66.80     81.20     0.486   0.796
                PSAAP            86.29     65.67     89.47     0.568   0.826
                AMS 2.0          89.00     15.00     –         –       –
                PPSP             –         72.77     91.71     –       –
                GPS 2.0          –         65.00     88.75     0.553   –
                KinasePhos 2.0   93.00     94.00     93.00     –       –
                KinasePhos 1.0   95.00     79.00     96.00     –       –
                PostMod          63.00     18.00     –         –       –
CDK1            PSPM             87.85     98.40     86.35     0.854   0.938
                PSAAP            92.41     91.30     92.49     0.838   0.950
                AMS 2.0          65.00     28.00     –         –       –
                GPS 2.0          –         83.18     93.64     0.772   –
                PostMod          63.00     84.00     –         –       –
CDK2            PSPM             81.45     97.91     77.38     0.771   0.919
                PSAAP            90.25     81.78     91.25     0.734   0.924
                AMS 2.0          45.00     7.00      –         –       –
                GPS 2.0          –         85.00     92.81     0.781   –
                PostMod          61.00     78.00     –         –       –
CK2 alpha       PSPM             76.25     93.14     71.32     0.661   0.825
                PSAAP            69.49     95.14     58.13     0.573   0.824
                AMS 2.0          73.00     32.00     –         –       –
                GPS 2.0          –         79.44     87.22     0.669   –
                PostMod          77.00     52.00     –         –       –
PKB group       PSPM             90.55     90.00     90.54     0.805   0.938
                PSAAP            89.69     86.63     90.02     0.767   0.937
                AMS 2.0          87.00     57.00     –         –       –
                PPSP             –         87.13     92.57     –       –
                GPS 2.0          –         57.50     95.00     0.566   –
                KinasePhos 1.0   88.00     76.00     89.00     –       –
                PostMod          87.00     46.00     –         –       –
PKC alpha       PSPM             76.79     75.45     77.14     0.526   0.802
                PSAAP            72.00     86.69     66.24     0.541   0.802
                AMS 2.0          78.00     11.00     –         –       –
                GPS 2.0          –         56.77     92.74     0.531   –
                PostMod          75.00     27.00     –         –       –
MAPK1           PSPM             87.87     90.42     87.47     0.779   0.896
                PSAAP            88.85     86.63     89.12     0.758   0.900
                AMS 2.0          67.00     42.00     –         –       –
                GPS 2.0          –         85.95     88.81     0.748   –
                PostMod          67.00     76.00     –         –       –
MAPK3           PSPM             90.96     93.54     90.64     0.842   0.932
                PSAAP            89.32     88.41     89.41     0.779   0.914
                AMS 2.0          88.00     55.00     –         –       –
                GPS 2.0          –         83.44     90.62     0.743   –
                PostMod          68.00     69.00     –         –       –

Table 3 The details of testing dataset MetaPS06 and optimal parameters for our new method.

Protein kinase   Positive data size   Negative data size   Training kernel   Training C   Training γ   Training window size
CDK group        294                  441                  Sigmoid           0.0875       0.125        −6 to +6
CK2 group        229                  343                  Sigmoid           0.0875       0.125        −6 to +6
PKA group        360                  540                  Sigmoid           0.0875       0.125        −6 to +6
PKC group        348                  522                  Sigmoid           1            0.125        −6 to +6
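The fragment extraction of Section 3.1.1 (Eq. (18), with the dummy residue X filling missing flanks) and the 1:1 balancing of positive and negative fragments described in Section 3.1.2 can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names `extract_fragments` and `balance`, the 0-based site indices and the toy sequence are assumptions, and the homology filtering (over 70% identity) is deliberately left out.

```python
import random


def extract_fragments(sequence, annotated_sites, xi=6, residues="STY", pad="X"):
    """Return (positive, negative) windows of length 2*xi + 1 centred on S/T/Y.

    annotated_sites holds 0-based indices of experimentally verified sites.
    """
    padded = pad * xi + sequence + pad * xi
    positive, negative = [], []
    for i, residue in enumerate(sequence):
        if residue not in residues:
            continue
        window = padded[i:i + 2 * xi + 1]  # centre of the window is sequence[i]
        (positive if i in annotated_sites else negative).append(window)
    return positive, negative


def balance(positive, negative, seed=0):
    """Randomly draw as many negatives as there are positives (1:1 ratio)."""
    rng = random.Random(seed)
    return positive, rng.sample(negative, min(len(positive), len(negative)))


seq = "MSTAYKLSPRT"  # toy sequence, not taken from Phospho.ELM
pos, neg = extract_fragments(seq, annotated_sites={1, 7})
pos, neg = balance(pos, neg)
print(pos[0])  # XXXXXMSTAYKLS
```

With `xi=6` the windows have length 13, matching the window of −6 to +6 listed in Table 1; repeating `balance` with different seeds corresponds to the repeated random selection of negative fragments.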

3.1.3. Comparison of our proposed methods with the existing tools on the independent test

Obviously, the different versions of databases used in the literature introduce more or less bias into the comparison. A reasonable way to perform an unbiased comparison is to run cross-validation on the same dataset. This is hard to do since trainable versions of most models are not available. An alternative is to test all methods on the same independent testing dataset. However, there is a high chance of getting a biased result if some testing data have already been used as training data. Wan et al. (2008) constructed the database MetaPS06, which does not contain any training data used by the above prediction tools. MetaPS06 comes from Swiss-Prot and its details are shown in Table 3. From MetaPS06 we discard the data that exist in our training dataset in order to obtain unbiased results. The results are shown in Table 4. PSPM obtains a small but consistent improvement compared to PPSP, GPS, PredPhospho, KinasePhos, Scansite and NetPhosK. In particular, when compared to PredPhospho, which is also an SVM method but uses conventional binary encoding as input features, an apparent performance improvement is achieved by PSPM and PSAAP. On MetaPS06 the overall performance of PSPM is slightly better than that of PSAAP. As the experimental identification of phosphorylation sites is both expensive and time-consuming, the two methods PSPM and PSAAP introduced in this study can help the scientific community to identify phosphorylation sites.

Table 4
The performance of our new methods on the testing dataset MetaPS06. The performance measures of the other existing methods are as reported by Wan et al. (2008).

Kinase      Method       Acc (%)  Sen (%)  Spe (%)  Mcc    AUC
CDK group   PSPM         86.06    91.13    82.69    0.724  0.868
            PSAAP        81.81    74.56    86.63    0.619  0.880
            PPSP         83.90    90.50    79.60    0.687  0.872
            GPS          84.40    90.80    80.00    0.695  0.876
            PredPhospho  85.30    89.80    82.30    0.708  0.867
            KinasePhos   82.20    79.90    83.70    0.632  0.871
            Scansite     74.40    40.50    97.10    0.479  0.758
            NetPhosK     70.50    63.90    74.80    0.387  0.777
CK2 group   PSPM         87.13    88.71    86.08    0.738  0.908
            PSAAP        78.02    91.01    69.34    0.600  0.888
            PPSP         85.70    74.20    93.30    0.700  0.877
            GPS          81.60    69.90    89.50    0.613  0.813
            PredPhospho  81.30    59.40    95.90    0.616  0.779
            KinasePhos   76.00    47.60    95.00    0.504  0.751
            Scansite     75.00    38.00    99.70    0.512  0.773
            NetPhosK     87.10    75.50    94.80    0.730  0.931
PKA group   PSPM         79.77    89.40    73.35    0.615  0.787
            PSAAP        74.75    87.38    66.34    0.530  0.848
            PPSP         82.30    85.00    80.60    0.645  0.886
            GPS          81.20    81.70    80.90    0.618  0.845
            PredPhospho  82.70    80.80    83.90    0.642  0.854
            KinasePhos   79.20    65.00    88.70    0.560  0.823
            Scansite     75.80    42.20    98.10    0.515  0.766
            NetPhosK     80.20    69.40    87.40    0.583  0.875
PKC group   PSPM         70.59    76.05    66.96    0.421  0.759
            PSAAP        65.60    88.06    50.63    0.397  0.756
            PPSP         74.30    74.10    74.30    0.477  0.799
            GPS          73.90    71.80    75.30    0.466  0.757
            PredPhospho  72.20    59.80    80.50    0.412  0.715
            KinasePhos   71.10    48.00    86.40    0.378  0.744
            Scansite     63.60    17.10    94.60    0.189  0.640
            NetPhosK     70.10    49.10    84.10    0.358  0.758

3.2. Mucin-type O-glycosylation sites in mammalian proteins
3.2.1. Data preprocessing
The experimentally validated mucin-type O-glycosylation sites in mammalian proteins come from Chen et al. (2008). The sequences are extracted from the Swiss-Prot database (Release 52.4)

(O'Donovan et al., 2002). They retrieved the proteins annotated with mucin-type O-glycosylation linked to S or T in mammals, excluding proteins annotated as ‘potential’, ‘predicted’, ‘similarity’ or ‘probably’. In total, 103 proteins covering 367 mucin-type O-glycosylation S/T sites were obtained. As with the phosphorylation sites, they also discarded highly homologous sequences to avoid overestimating the accuracy in cross-validation. The numbers of positive and negative sequence fragments of mucin-type O-glycosylation sites are shown in Table 5.

3.2.2. Comparison of proposed methods with the existing tools
Following the approach of Chen et al. (2008), we also set the ratio of positive to negative data to 1:1. The cross-validation is also performed 50 times. The average performance of the different methods is shown in Table 6. The results show that PSPM performs much better than PSAAP and the conventional binary encoding, and slightly better than CKSAAP-OGlySite. This indicates that the PSPM encoding scheme is more suitable than PSAAP and the conventional binary encoding scheme for identifying O-glycosylation sites in mammalian proteins.

3.3. Application of PTMPred: a case study for annotating sulfated proteins
To verify whether the proposed encoding scheme PSPM generalizes to the prediction of other PTM sites, we apply the corresponding software PTMPred to annotating sulfated proteins. Tyrosine sulfation is one of the most common PTMs in secreted and transmembrane proteins, and has been experimentally demonstrated to be essential to extracellular protein–protein interactions (Chang et al., 2009). Sulfated tyrosines play an essential role in immune-related processes such as leukocyte rolling and the promotion of HIV infection of T-helper lymphocytes. Recently, Chang et al. 
(2009) constructed a tyrosine sulfation predictor called SulfoSite, which mainly uses structural information and a sequence positional weighted matrix (PWM) as the input features of an SVM. On a new independent testing dataset, SulfoSite performs much better than Sulfinator (Monigatti et al., 2002), the first and main tyrosine sulfation predictor. In this work, we use the same training dataset and independent testing dataset as SulfoSite to test our method. The training dataset, which covers 59 sulfation sites, is extracted from Swiss-Prot release 53.0. The 11 sulfated proteins in Swiss-Prot release 55.0 that are not available in Swiss-Prot release 53.0 form the independent testing dataset, which consists of 31 experimental sulfation sites and 98 non-sulfated tyrosines. The details of the independent dataset and the prediction results are shown in Table 7. Table 7 shows that PTMPred, which is based on PSPM, outperforms the other two predictors, SulfoSite and Sulfinator, in identifying protein sulfation sites. Therefore, PTMPred based on PSPM has good generalization ability in predicting PTM sites.

Table 5
The size of the positive and negative datasets used in the paper (Chen et al., 2008) and the optimal parameters for the proposed method.

Mucin-type O-glycosylation  Positive data size  Negative data size  Optimal kernel  Optimal C  Optimal γ  Optimal window size
S site                      116                 1153                Sigmoid         500        0.175      -20 to +20
T site                      212                 1702                Sigmoid         500        0.225      -20 to +20

Table 6
The performance of our new methods for mucin-type O-glycosylation site prediction, using 10-fold cross-validation. The results for CKSAAP-OGlySite and Binary are from Chen et al. (2008).

Site type  Method           Acc (%)  Sen (%)  Spe (%)  Mcc    AUC
S site     PSPM             83.32    83.10    83.53    0.671  0.899
           PSAAP            73.66    58.97    88.36    0.495  0.857
           CKSAAP-OGlySite  82.20    77.90    86.50    0.655  0.895
           Binary           78.00    74.20    81.90    0.567  0.854
T site     PSPM             83.10    84.79    81.23    0.663  0.910
           PSAAP            74.83    62.59    87.08    0.513  0.858
           CKSAAP-OGlySite  81.30    80.40    82.30    0.631  0.878
           Binary           76.60    74.80    78.30    0.536  0.829

Table 7
Comparison among PTMPred, SulfoSite (Chang et al., 2009) and Sulfinator (Monigatti et al., 2002) on the independent dataset.

SWISS-PROT ID  Real sulfotyrosine sites      Sulfinator predicted sites     SulfoSite predicted sites  PTMPred predicted sites
CCKN_CANFA     Y52                           Y52                            Y52                        Y52
FMOD_BOVIN     Y20, Y38, Y53, Y55, Y63, Y65  Y38, Y42, Y45, Y47, Y62, Y64   Y38, Y55                   Y42, Y65
FMOD_HUMAN     Y20, Y53, Y55                 Y38, Y39, Y42, Y45, Y47        Y20, Y39                   Y20, Y50, Y53
HIS1_HUMAN     Y46, Y49, Y53, Y55            Y46, Y49                       Y46, Y49                   Y49
LUM_MOUSE      Y20, Y21, Y23, Y30            –                              Y23                        Y23
OMD_HUMAN      Y39, Y416, Y417               –                              Y31, Y39, Y51              Y8, Y31, Y39, Y417
PSK1_ORYSI     Y80, Y82                      Y80                            Y80, Y82                   Y80, Y82
PSK1_ORYSI     Y80, Y82                      Y80                            Y80, Y82                   Y80, Y82
PSK2_ORYSI     Y110, Y112                    –                              Y110, Y112                 Y110, Y112
PSK2_ORYSI     Y110, Y112                    –                              Y110, Y112                 Y110, Y112
SIAL_HUMAN     Y313, Y314                    Y259, Y263, Y265, Y271, Y275,  Y265, Y271, Y305           Y314
                                             Y278, Y290, Y293, Y297, Y299
Pre (%)        –                             23.00                          73.00                      80.95
Sen (%)        –                             19.00                          52.00                      54.83
Spe (%)        –                             80.00                          94.00                      95.91
Acc (%)        –                             65.00                          84.00                      86.04
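As a consistency check on the summary rows of Table 7, the sketch below recomputes precision, sensitivity, specificity and accuracy for PTMPred from confusion counts. The counts are a reading of Table 7 rather than numbers stated in the text: the PTMPred column lists 21 predicted sites, 17 of which are among the 31 real sulfotyrosines, which leaves 4 false positives among the 98 non-sulfated tyrosines. The function name `classification_metrics` is illustrative only.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return precision, sensitivity, specificity and accuracy in percent."""
    return {
        "precision": 100.0 * tp / (tp + fp),      # correct among predicted sites
        "sensitivity": 100.0 * tp / (tp + fn),    # recovered real sites
        "specificity": 100.0 * tn / (tn + fp),    # correctly rejected non-sites
        "accuracy": 100.0 * (tp + tn) / (tp + fp + tn + fn),
    }


# Counts read from the PTMPred column of Table 7 (an assumption of this sketch):
# 31 real sites, 98 non-sulfated tyrosines, 21 predictions, 17 of them correct.
tp, fp = 17, 4
fn = 31 - tp   # real sites that were missed
tn = 98 - fp   # non-sulfated tyrosines that were not predicted

ptmpred = classification_metrics(tp, fp, tn, fn)
for name, value in ptmpred.items():
    print(f"{name}: {value:.2f}")
```

Under these counts the four values agree with the PTMPred row of Table 7 to within 0.01; the small differences in sensitivity, specificity and accuracy are consistent with truncation instead of rounding in the reported figures.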

4. Conclusion
In this paper, a novel encoding scheme, PSPM, is designed to construct the input features for PTM site prediction. PSPM considers the position-specific propensity of amino acid pairs surrounding PTM sites. With the SVM as the classifier, several algorithms based on PSPM and PSAAP are developed to predict two widely studied types of PTM sites: kinase-specific phosphorylation sites and mucin-type O-glycosylation sites in mammalian proteins. The computational experiments are conducted on new datasets derived from Phospho.ELM and Swiss-Prot and on the existing benchmark datasets from published papers (Wan et al., 2008; Chen et al., 2008). PSPM achieves satisfactory predictive performance. In addition, the proposed encoding scheme PSPM maps the protein sequence into a low-dimensional Euclidean space. Because more and more biological data are emerging, low-dimensional input is critical for prediction problems on large datasets. These results show that the PSPM-based methods might play a complementary role to the existing predictive methods and can provide helpful insights for future experimental design and verification. The feature construction method is crucial to the performance of a predictive model. In future work we will focus on developing simple and efficient feature constructions (such as the conjoint triad feature; Shen et al., 2007; Shao et al., 2009b) and on constructing more efficient classifiers (for example, the twin support vector machines of Jayadeva et al. (2007), which perform well on imbalanced classification problems). It should be pointed out that, unlike most of the existing methods for PTM site prediction, which only provide a web-based server, our models are implemented as a unified standalone software named PTMPred. Because the performance of a prediction method depends on the available, up-to-date biological data, PTMPred allows users to train the models on their own training datasets. PTMPred also allows users to select all parameters, such as C. Most importantly, PTMPred can predict all types of PTM sites if the user provides the training datasets. The software can be freely downloaded from http://www.aporc.org/doc/wiki/PTMPred. Since user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful models, simulated methods, or predictors (Cai et al., 2010), we shall make efforts in our future work to provide a web-server for the method presented in this paper.
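Because the parameters such as C are left to the user, one simple way to choose them is an exhaustive search over a small grid. The sketch below only illustrates that idea and is not PTMPred's interface: `grid_search` and `toy_evaluate` are hypothetical names, and `toy_evaluate` is a stand-in for "train a model with the given C and γ and return its cross-validated accuracy". The grid values are arbitrary, chosen to include the S-site optimum reported in Table 5.

```python
from itertools import product


def grid_search(evaluate, c_values, gamma_values):
    """Return (best_score, best_params) over all (C, gamma) combinations.

    `evaluate(c, gamma)` must return a score where higher is better.
    """
    best_score, best_params = float("-inf"), None
    for c, gamma in product(c_values, gamma_values):
        score = evaluate(c, gamma)
        if score > best_score:
            best_score, best_params = score, {"C": c, "gamma": gamma}
    return best_score, best_params


def toy_evaluate(c, gamma):
    """Hypothetical score that peaks at C = 500, gamma = 0.175 (no model is trained)."""
    return -abs(c - 500) / 1000 - abs(gamma - 0.175)


best_score, best_params = grid_search(
    toy_evaluate,
    c_values=[1, 10, 100, 500, 1000],
    gamma_values=[0.125, 0.175, 0.225],
)
print(best_params)
```

In practice the scoring function would wrap the actual training and cross-validation run, and the same loop could be extended to the kernel type and window size.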

Acknowledgments
We thank Professor Zhang Ziding at China Agricultural University for providing the mucin-type O-glycosylation dataset for mammalian proteins, and we also thank Professor Huang Hsien-Da at National Chiao Tung University for making the training and independent testing datasets of sulfated proteins publicly available. Funding: This work is supported by the National Natural Science Foundation (nos. 11301024, 11371365, 11101029, 11071013, 31201002, 11271361, 11131009 and 91330114), and the Fundamental Research Funds for the Central Universities.

References
Biswas, A.K., Noman, N., Sikder, A.R., 2010. Machine learning approach to predict protein phosphorylation sites by incorporating evolutionary information. BMC Bioinforma. 11, 273.
Blom, N., Sicheritz-Ponten, T., Gupta, R., Gammeltoft, S., Brunak, S., 2004. Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics 4 (6), 1633–1649.
Cai, Y.D., Chou, K.C., 2005. Using functional domain composition to predict enzyme family classes. J. Prot. Res. 4, 109–111.
Cai, Y.D., Zhou, G.P., Chou, K.C., 2003. Support vector machines for predicting membrane protein types by using functional domain composition. Biophys. J. 84, 3257–3263.
Cai, Y., He, J., Li, X., Feng, K., Lu, L., Feng, K., Kong, X., Lu, W., 2010. Predicting protein subcellular locations with feature selection and analysis. Prot. Pept. Lett. 17 (April (4)), 464–472.
Chang, C.C., Lin, C.J., 2001. LIBSVM: a library for support vector machines. Software available at: 〈http://www.csie.ntu.edu.tw/~cjlin/libsvm〉.
Chang, Wen-Chi, Lee, Tzong-Yi, Shien, Dray-Ming, Hsu, Bo-Kai.J., Horng, Jorng-Tzong, Hsu, Po-Chiang, Wang, Ting-Yuan, Huang, Hsien-Da, Pan, Rong-Long, 2009. Incorporating support vector machine for identifying protein tyrosine sulfation sites. J. Comput. Chem. 30, 2526–2537.
Chen, Y.K., Li, K.B., 2013. Predicting membrane protein types by incorporating protein topology, domains, signal peptides, and physicochemical properties into the general form of Chou's pseudo amino acid composition. J. Theor. Biol. 318, 1–12.
Chen, Y.Z., Tang, Y.R., Sheng, Z.Y., Zhang, Z.D., 2008. Prediction of mucin-type O-glycosylation sites in mammalian proteins using the composition of k-spaced amino acid pairs. BMC Bioinforma. 9, 101.
Chen, C., Chen, L., Zou, X., Cai, P., 2009. Prediction of protein secondary structure content by using the concept of Chou's pseudo amino acid composition and support vector machine. Prot. Pept. Lett. 16, 27–31.
Chen, L., Zeng, W.M., Cai, Y.D., Feng, K.Y., et al., 2012a. Predicting Anatomical Therapeutic Chemical (ATC) classification of drugs by integrating chemical–chemical interactions and similarities. PLoS ONE 7, e35254.
Chen, W., Lin, H., Feng, P.M., Ding, C., et al., 2012b. iNuc-PhysChem: a sequence-based predictor for identifying nucleosomes via physicochemical properties. PLoS ONE 7, e47843.
Chen, W., Feng, P.M., Lin, H., et al., 2013. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res. 41, e68.
Chou, K.C., 1993. A vectorized sequence-coupling model for predicting HIV protease cleavage sites in proteins. J. Biol. Chem. 268, 16938–16948.
Chou, K.C., 1995. A sequence-coupled vector-projection model for predicting the specificity of GalNAc-transferase. Prot. Sci. 4, 1365–1383.
Chou, K.C., 1996. Review: prediction of HIV protease cleavage sites in proteins. Anal. Biochem. 233, 1–14.
Chou, K.C., 2004. Review: structural bioinformatics and its impact to biomedical science. Curr. Med. Chem. 11, 2105–2134.
Chou, K.C., 2005. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21, 10–19.
Chou, K.C., 2011a. Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review). J. Theor. Biol. 273, 236–247.
Chou, K.C., 2011b. Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. 273, 236–247.
Chou, K.C., 2013. Some remarks on predicting multi-label attributes in molecular biosystems. Mol. Biosyst. 9, 1092–1100.
Chou, K.C., Cai, Y.D., 2002. Using functional domain composition and support vector machines for prediction of protein subcellular location. J. Biol. Chem. 277, 45765–45769.
Chou, K.C., Cai, Y.D., 2004. Predicting enzyme family class in a hybridization space. Prot. Sci. 13, 2857–2863.
Chou, K.C., Elrod, D.W., 1999. Protein subcellular location prediction. Prot. Eng. 12, 107–118.
Chou, K.C., Shen, H.B., 2007a. MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochem. Biophys. Res. Commun. 360, 339–345.
Chou, K.C., Shen, H.B., 2007b. Signal-CF: a subsite-coupled and window-fusing approach for predicting signal peptides. Biochem. Biophys. Res. Commun. 357, 633–640.
Chou, K.C., Shen, H.B., 2009. Review: recent advances in developing web-servers for predicting protein attributes. Nat. Sci. 2, 63–92.
Chou, K.C., Zhang, C.T., 1995. Prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol. 30, 275–349.
Chou, K.C., Wu, Z.C., Xiao, X., 2011. iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins. PLoS One 6, e18258.
Chou, K.C., Wu, Z.C., Xiao, X., 2012. iLoc-Hum: Using accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites. Mol. Biosyst. 8, 629–641.
Dang, T.H., Leemput, K.V., Verschoren, A., Laukers, K., 2008. Prediction of kinase-specific phosphorylation sites using conditional random field. Bioinformatics 24, 2857–2864.
Deng, N.Y., Tian, Y.J., Zhang, C.H., 2012. Support Vector Machines: Optimization Based Theory, Algorithms and Extensions. Chapman & Hall/CRC, Boca Raton, FL.
Diella, F., Gould, C.M., Chica, C., Via, A., Gibson, T.J., 2011. Phospho.ELM: a database of phosphorylation sites – update 2011. Nucleic Acids Res. 39 (January), D261–D267.
Ding, H., Luo, L., Lin, H., 2009. Prediction of cell wall lytic enzymes using Chou's amphiphilic pseudo amino acid composition. Prot. Pept. Lett. 16, 351–355.
Durek, P., Schudoma, C., Weckwerth, W., Selbig, J., Walther, D., 2009. Detection and characterization of 3D-signature phosphorylation site motifs and their contribution towards improved phosphorylation site prediction in proteins. BMC Bioinforma. 10, 117.
Fan, G.L., Li, Q.Z., 2012. Predict mycobacterial proteins subcellular locations by incorporating pseudo-average chemical shift into the general form of Chou's pseudo amino acid composition. J. Theor. Biol. 304, 88–95.
Gao, J., Thelen, J.J., Dunker, A.K., Xu, D., 2010. Musite, a tool for global prediction of general and kinase-specific phosphorylation sites. Mol. Cell. Prot. 9, 2586–2600.
Gribskov, M., Robinson, N.L., 1996. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. J. Comput. Chem. 20, 25–33.


Huang, H.D., Lee, T.Y., Tzeng, S.W., Wu, L.C., Horng, J.T., Tsou, A.P., Huang, K.T., 2005a. Incorporating hidden Markov models for identifying protein kinase-specific phosphorylation sites. J. Comput. Chem. 26, 1032–1041.
Huang, H.D., Lee, T.Y., Tzeng, S.W., Horng, J.T., 2005b. KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites. Nucleic Acids Res. 33, 226–229.
Inkyung, J., Akihisa, M., Minoru, Y., Dongsup, K., 2010. PostMod: sequence based prediction of kinase-specific phosphorylation sites with indirect relationship. BMC Bioinforma. 11, S10.
Jayadeva, Khemchandani, R., Chandra, S., 2007. Twin support vector machines for pattern classification. IEEE Trans. Pattern Anal. Mach. Intell. 29, 905–910.
Khosravian, M., Faramarzi, F.K., Beigi, M.M., Behbahani, M., Mohabatkar, H., 2013. Predicting antibacterial peptides by the concept of Chou's pseudo-amino acid composition and machine learning methods. Prot. Pept. Lett. 20, 180–186.
Kim, J.H., Lee, J., Oh, B., Kimm, K., Koh, I., 2004. Prediction of phosphorylation sites using SVMs. Bioinformatics 20, 3179–3184.
Knight, Z.A., Schilling, B., Row, R.H., Kenski, D.M., Gibson, B.W., Shokat, K.M., 2003. Phosphospecific proteolysis for mapping sites of protein phosphorylation. Nat. Biotechnol. 21, 1047–1054.
Kraft, C., Herzog, F., Gieffers, C., Mechtler, K., Hagting, A., Pines, J., Peters, J., 2003. Mitotic regulation of the human anaphase-promoting complex by phosphorylation. EMBO J. 22, 6598–6609.
Lin, W.Z., Xiao, X., Chou, K.C., 2009. GPCR-GIA: a web-server for identifying G-protein coupled receptors and their families with grey incidence analysis. Prot. Eng. Des. Sel. 22, 699–705.
Mann, M., Jensen, O.N., 2003. Proteomic analysis of post-translational modifications. Nat. Biotechnol. 21, 255–261.
Mei, S., 2012. Predicting plant protein subcellular multi-localization by Chou's PseAAC formulation based multi-label homolog knowledge transfer learning. J. Theor. Biol. 310, 80–87.
Mohabatkar, H., Beigi, M.M., Abdolahi, K., Mohsenzadeh, S., 2013. Prediction of allergenic proteins by means of the concept of Chou's pseudo amino acid composition and a machine learning approach. Med. Chem. 9, 133–137.
Monigatti, F., Gasteiger, E., Bairoch, A., Jung, E., 2002. The Sulfinator: predicting tyrosine sulfation sites in protein sequences. Bioinformatics 18, 769–770.
O'Donovan, C., Martin, M.J., Gattiker, A., Gasteiger, E., Bairoch, A., Apweiler, R., 2002. High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Brief. Bioinforma. 3 (September (3)), 275–284.
Obenauer, J.C., Cantley, L.C., Yaffe, M.B., 2003. Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res. 31, 3635–3641.
Platt, J.C., 1999. Probabilistic output for support vector machines and comparisons to regularized likelihood methods. Adv. Larg. Marg. Classif., 61–74.
Plewczynski, D., Tkacz, A., Wyrwicz, L., Rychlewski, L., Ginalski, K., 2008. AutoMotif Server for prediction of phosphorylation sites in proteins using support vector machine: 2007 update. J. Mol. Model. 14, 69–76.
Rychlewski, L., Kschischo, M., Dong, L., Schutkowski, M., Reimer, U., 2004. Target specificity analysis of the Abl kinase using peptide microarray data. J. Mol. Biol. 336, 307–311.
Shao, J.L., Xu, D., Tsai, Sau-Na, Wang, Y.F., Ngai, Sai-Ming, 2009a. Computational identification of protein methylation sites through bi-profile Bayes feature extraction. PLoS ONE 4, e4920.
Shao, X.J., Tian, Y.J., Wu, L.Y., Wang, Y., Jing, L., Deng, N.Y., 2009b. Predicting DNA- and RNA-binding proteins from sequences with kernel methods. J. Theor. Biol. 258, 289–293.
Shen, J., Zhang, J., Luo, X.M., Zhu, W.L., Yu, K.Q., Chen, K.X., Li, Y.X., 2007. Predicting protein–protein interactions based only on sequences information. Proc. Natl. Acad. Sci. 104, 4337–4341.
Sobolev, B., Filimonov, D., Lagunin, A., Zakharov, A., Koborova, O., et al., 2010. Functional classification of proteins based on projection of amino acid sequences: application for prediction of protein kinase substrates. BMC Bioinforma. 11, 313.
Tang, Y.R., Chen, Y.Z., Canchaya, C.A., Zhang, Z.D., 2007. GANNPhos: a new phosphorylation site predictor based on a genetic algorithm integrated neural network. Prot. Eng. Des. Sel. 20 (8), 405–412.
Trost, B., Kusalik, A., 2013. Computational phosphorylation site prediction in plants using random forests and organism-specific instance weights. Bioinformatics 29, 686–694.
Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer.
Vapnik, V., 1998. Statistical Learning Theory. Wiley.
Wan, J., Kang, S., Tang, C.N., Yan, J.H., Ren, Y.L., Liu, J., Gao, X.L., Banerjee, A., Ellis, L.B.M., Li, T.B., 2008. Meta-prediction of phosphorylation sites with weighted voting and restricted grid search parameter selection. Nucleic Acids Res. 36, e22.
Wang, M.H., Li, C.H., Chen, W.Z., Wang, C.X., 2008. Prediction of PK-specific phosphorylation site based on information entropy. Sci. China Ser. C: Life Sci. 51, 12–20.
Wong, Y.H., Lee, T.Y., Liang, H.K., Huang, C.M., Wang, T.Y., Yang, Y.H., Chu, C.H., Huang, H.D., Ko, M.T., Hwang, J.K., 2007. KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns. Nucleic Acids Res. 35, W588–W594.
Xiao, X., Wang, P., Chou, K.C., 2009. GPCR-CA: a cellular automaton image approach for predicting G-protein-coupled receptor functional classes. J. Comput. Chem. 30, 1414–1423.
Xiao, X., Wang, P., Lin, W.Z., Jia, J.H., et al., 2013. iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Anal. Biochem. 436, 168–177.


Xu, Y., Shao, X.J., Wu, L.Y., Deng, N.Y., Chou, K.C., 2013. iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins. PeerJ 1, e171.
Xue, Y., Zhou, F.F., Zhu, M., Ahmed, K., Chen, G.L., Yao, X.B., 2005. GPS: a comprehensive www server for phosphorylation sites prediction. Nucleic Acids Res. 33, W184–W187.
Xue, Y., Li, A., Wang, L.R., Feng, H.Q., Yao, X.B., 2006. PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory. BMC Bioinforma. 7, 163.
Xue, Y., Ren, J., Gao, X.J., Jin, C.J., Wen, L.P., Yao, X.B., 2008. GPS 2.0, a tool to predict kinase-specific phosphorylation sites in hierarchy. Mol. Cell. Prot. 7, 1598–1608.
Xue, Y., Liu, Z., Cao, J., Ma, Q., Gao, X., et al., 2011. GPS 2.1: enhanced prediction of kinase-specific phosphorylation sites with an algorithm of motif length selection. Prot. Eng. Des. Sel. 24, 255–260.
Xu, Y., Ding, J., Wu, L.Y., et al., 2013. iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. PLoS ONE 8, e55844.
Yu, Z., Deng, Z., Wong, H.S., Tan, L., 2010. Identifying protein-kinase-specific phosphorylation sites based on the Bagging-AdaBoost ensemble approach. IEEE Trans. Nanobiosci. 9, 132–143.
Zeng, Y.H., Guo, Y.Z., Xiao, R.Q., Yang, L., Yu, L.Z., Li, M.L., 2009. Using the augmented Chou's pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach. J. Theor. Biol. 259, 366–372.
Zhao, X., Zhang, W., Xu, X., Ma, Z., Yin, M., 2012. Prediction of protein phosphorylation sites by using the composition of k-spaced amino acid pairs. PLoS One 7, e46302.
Zhou, F.F., Xue, Y., Chen, G.L., Yao, X.B., 2004. GPS: a novel group-based phosphorylation predicting and scoring method. Biochem. Biophys. Res. Commun. 325, 1443–1448.
