Chapter 17

Phosphorylation Site Prediction in Plants

Qiuming Yao, Waltraud X. Schulze, and Dong Xu

Abstract

Protein phosphorylation events on serine, threonine, and tyrosine residues are the most pervasive covalent protein modifications in plant signaling. Both low- and high-throughput studies reveal the importance of phosphorylation in plant molecular biology. Although increasingly common, experimental proteome-wide screening of phosphorylation remains time consuming and costly. In silico prediction methods have therefore been proposed as a complementary analysis tool to enhance phosphorylation site identification, develop biological hypotheses, or help with experimental design. These methods build statistical models from experimental data and are free of some technique-specific biases, which can be an advantage in proteome-wide analysis. More importantly, computational methods are fast and cheap to run, which makes large-scale phosphorylation site identification practical for many types of biological study. As a result, phosphorylation prediction tools are becoming increasingly popular. In this chapter, we focus on plant-specific phosphorylation site prediction tools, illustrating the essential technical details and giving application guidelines. We use Musite, PhosPhAt, and PlantPhos as representative tools. We present results on the prediction of Arabidopsis protein phosphorylation events to give users a general idea of the performance range of the three tools, together with their strengths and limitations. We believe these prediction tools will contribute increasingly to the plant phosphorylation research community.

Key words Phosphorylation site prediction, PhosPhAt, Musite, Support vector machines, Machine learning

Waltraud X. Schulze (ed.), Plant Phosphoproteomics: Methods and Protocols, Methods in Molecular Biology, vol. 1306, DOI 10.1007/978-1-4939-2648-0_17, © Springer Science+Business Media New York 2015

1 Introduction

Protein phosphorylation is the best-known protein posttranslational modification. It plays an important role in plant growth, cell death, and the innate immune response by altering signaling pathways or protein function [1–3]. High-throughput techniques, especially mass spectrometry, have made proteomics studies in plant sciences more systematic and more practical. With the growing number of proteome-wide studies in plants, phosphoproteome-wide studies [4] have also become increasingly popular over the past decade for dissecting plant signaling events as a whole through high-resolution screening of phosphorylated peptides and sites. These studies have driven a dramatic increase in phosphoproteomics data in just a few years, which has led to the establishment of several high-quality web resources for plant phosphoproteomics, e.g., P3DB [5] and PhosPhAt [6].

On the other hand, although high-throughput mass spectrometry experiments can identify large numbers of modified peptides and sites at one time, they remain expensive in terms of cost and running time. The experimental approach is not problem-free either. Concerns have increasingly been raised about the reliability of database searches and about technical variation, such as the enrichment step, which is needed in most approaches but can introduce bias in peptide identification. Resolving ambiguity in phosphosite localization is even more problematic. Furthermore, while some model organisms such as Arabidopsis and rice are relatively well studied, most other plant species still lack experimental phosphorylation data. Researchers can infer phosphorylation sites from homologous proteins in species with experimental data, but this approach is limited, especially because plants often have multiple paralogs whose phosphorylation sites may vary. In addition, the extent to which phosphorylation events are conserved across homologs in different species is still an open question [7].

Therefore, in silico prediction of phosphorylation sites is an attractive alternative for single-protein prediction or even proteome-wide annotation. To predict plant phosphorylation sites more accurately and with more confidence, it is important to use the rapidly growing body of experimental data effectively. Tools for phosphorylation prediction in plants build comprehensive statistical models to draw inferences from these data. Most existing tools use not only sequence similarity to known phosphorylated peptides but also other intrinsic characteristics of the protein sequence, including evolutionary patterns across species [7]. Because statistical and machine learning methods tend to approach the true classification when given sufficient data, the rapid growth in experimental plant phosphorylation data is expected to bring higher predictive power and confidence in both intra- and cross-species prediction. Because these methods bridge information from the known to the unknown, and are fast, cheap, and scalable, in silico prediction is becoming an active research topic in the plant phosphorylation community.

This chapter provides a general review of, and guidance on, the currently popular plant phosphorylation prediction tools, including Musite [7, 8], PhosPhAt [6], and PlantPhos [9]. PhosPhAt is trained specifically on data from Arabidopsis, based on its own comprehensive Arabidopsis phosphorylation database. PlantPhos is a plant-specific phosphorylation prediction tool that uses maximal dependence decomposition (MDD) to resolve feature dependencies. Musite is a machine learning-based tool that applies feature selection and support vector machines for training and prediction. Although Musite was not designed specifically for plants, it can be paired with a plant-specific database such as P3DB to obtain plant-specific models after training on the corresponding datasets. The results of Arabidopsis phosphorylation prediction are discussed to give the user a general idea of how to apply these three algorithms based on their individual strengths.

2 Overview of the Prediction Methods

In this section, an overview of the methodology of Musite and PhosPhAt is provided to give the reader a brief idea of the prediction procedure.

2.1 Methodology in Musite

The method for building prediction models in Musite consists of a data preprocessing step and a machine learning step with feature extraction.

2.1.1 Data Processing and Sampling

The plant-specific Musite predictor is trained on plant phosphorylation data sets from P3DB [5] and UniProtKB [10]. The data from both resources were merged into a single dataset: if a phosphorylated site is observed in either source, the corresponding peptide is treated as positive data. Proteins of the corresponding organism without any known phosphorylation are used as negative data. Phosphorylated proteins may also contain non-phosphorylated serine, threonine, or tyrosine residues, and the peptides centered on these non-phosphorylated sites are treated as negative data too. In general, there are more negative data than positive data, and this imbalance exists not only at the protein level but is amplified by several orders of magnitude at the site/peptide level. Sequence redundancy was removed to avoid potential bias in the machine learning training process, because some proteins or protein families, as well as their homologs, are over-studied. CD-HIT [11] was used for this step, and proteins with more than 50 % sequence identity (or another threshold set by the user) were removed. The imbalance between phosphorylated and non-phosphorylated sites persists after redundancy removal. To address this problem, balanced sets of positive and negative data are drawn by a bootstrapping procedure, which is repeated thousands of times to explore the whole space of the original data set. These are the strategies for generating the training data in Musite before any machine learning method is applied.
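The balanced resampling described above can be illustrated with a minimal Python sketch. It is not Musite's own (Java) implementation: the function name `balanced_bootstrap`, the toy peptides, and the choice to draw negatives without replacement within each round are assumptions made for illustration only.

```python
import random


def balanced_bootstrap(positives, negatives, n_rounds, seed=0):
    """Yield n_rounds balanced training sets.

    Each set holds every positive sample plus an equally sized random
    subset of the negatives, so repeated rounds gradually cover the
    much larger negative pool.
    """
    if len(negatives) < len(positives):
        raise ValueError("expected at least as many negatives as positives")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    for _ in range(n_rounds):
        # Assumption: negatives are drawn without replacement within a round.
        sampled_negatives = rng.sample(negatives, len(positives))
        yield list(positives), sampled_negatives


# Toy data (made up): 3 positive peptides and 12 negative peptide identifiers.
positives = ["AAAAAASAAAAAA", "GGGGGGTGGGGGG", "PPPPPPSPPPPPP"]
negatives = [f"negative_peptide_{i}" for i in range(12)]
rounds = list(balanced_bootstrap(positives, negatives, n_rounds=5))
```

Each element of `rounds` is one balanced training set; in the procedure described in this chapter, one classifier would be trained per round.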


2.1.2 Feature Selection and Machine Learning

Phosphorylation prediction is formulated as a binary classification problem, i.e., classifying each serine-, threonine-, or tyrosine-centered peptide as either “can be phosphorylated” or “cannot be phosphorylated”. Musite models and solves this problem in a machine learning framework. K-nearest neighbor (KNN) scores, disorder scores, and amino acid frequencies are used as the main features for training. The sequences flanking each serine, threonine, or tyrosine are used as peptide samples from which these features are extracted. In practice, the length of the flanking sequence is not fixed and can be set by the user. Flanking peptides of different sizes capture local information at different scales, which can be informative in identifying intrinsic properties.

The KNN score is the ratio between the numbers of positive and negative peptides around the candidate peptide. Intuitively, if the neighborhood contains more hits from the positive data, the peptide is more likely to be phosphorylated. The size of the neighborhood used for the KNN score must be predefined and is set as a percentage of the total number of positive and negative peptides (the training population). The neighbors of the candidate peptide are obtained by computing a similarity score between the candidate peptide and every other peptide in the training population and ranking the peptides by that score; the similarity takes into account amino acid similarity as described by BLOSUM matrices (BLOSUM62 is the usual default). The default neighborhood sizes are 0.25 %, 0.5 %, 1 %, 2 %, and 4 % of the overall population, so five different KNN scores are obtained for each candidate peptide. The peptide length used for the KNN score is 13, with 6 amino acids on each side of the central residue.

The disorder score is a feature that measures the stability of the local structure.
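As an illustration of the KNN score defined in this section, the following Python sketch ranks training peptides by similarity to a candidate and scores several neighborhood sizes. Several points are assumptions rather than Musite's actual code: the similarity is a simple count of identical residues standing in for a BLOSUM62-based score, the score is taken as the fraction of positive peptides among the neighbors (one reading of the "ratio" in the text), and the names `similarity` and `knn_scores` are hypothetical.

```python
import math


def similarity(a, b):
    """Count identical residues at matching positions.

    Assumption: this is a stand-in for a BLOSUM62-based similarity,
    which is omitted to keep the sketch short.
    """
    return sum(1 for x, y in zip(a, b) if x == y)


def knn_scores(candidate, training, fractions=(0.0025, 0.005, 0.01, 0.02, 0.04)):
    """Return one KNN score per neighborhood size.

    training: list of (peptide, is_positive) pairs.
    Each score is the fraction of positive peptides among the k most
    similar training peptides, where k is the given fraction of the
    training population (at least 1).
    """
    if not training:
        raise ValueError("training population is empty")
    ranked = sorted(training, key=lambda item: similarity(candidate, item[0]), reverse=True)
    scores = []
    for fraction in fractions:
        k = min(len(ranked), max(1, math.ceil(fraction * len(ranked))))
        neighbors = ranked[:k]
        n_positive = sum(1 for _, is_positive in neighbors if is_positive)
        scores.append(n_positive / len(neighbors))
    return scores


# Toy 13-residue peptides (made up), each centered on S or T.
training = [
    ("R" * 6 + "S" + "P" * 6, True),
    ("R" * 5 + "K" + "S" + "P" * 6, True),
    ("R" * 6 + "S" + "P" * 5 + "A", True),
    ("L" * 6 + "S" + "L" * 6, False),
    ("L" * 6 + "S" + "L" * 5 + "V", False),
    ("G" * 6 + "S" + "G" * 6, False),
    ("L" * 6 + "T" + "L" * 6, False),
    ("V" * 6 + "S" + "V" * 6, False),
]
candidate = "R" * 6 + "S" + "P" * 5 + "G"

# Large fractions are used here only because the toy population is tiny.
toy_scores = knn_scores(candidate, training, fractions=(0.25, 0.5, 1.0))
print(toy_scores)  # [1.0, 0.75, 0.375]
```

With the default fractions quoted in the text, the same function returns five scores per candidate peptide.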
The disorder score is precalculated with VSL2B [12], a predictor of protein disorder from sequence alone. The disorder feature of a given peptide is calculated as the average disorder score over all of its amino acids, which smooths the information from nearby residues. Musite typically computes the disorder feature over window lengths of 1, 5, and 13, giving three disorder scores for each candidate peptide.

Amino acid frequency reflects the amino acid composition or preference of phosphorylated peptides [13] compared with non-phosphorylated ones. It is represented as a vector of length 20, one entry per amino acid, containing the normalized count of each. The peptide length used for calculating amino acid frequency is usually 13 in Musite.

As mentioned above, bootstrapping is applied to address the imbalance between positive and negative sites. Because negative sites dominate the dataset, each round randomly samples only as many negative sites as there are positive sites, forming a balanced training set. A support vector machine (SVM) is then trained on each sampled dataset. The final prediction score is obtained by averaging the outputs of all the SVM classifiers.

2.2 Methodology in PhosPhAt

The prediction tool in PhosPhAt is based on a training data set of experimentally identified phosphorylation sites. Non-phosphorylated peptides were used as a negative control. To avoid abundance bias toward the negative class (unphosphorylated serine sites), the raw set of 49,314 true-negative serine sites was reduced by randomly eliminating sites until the negative set was no more than twice as large as the positive set. This final dataset served as the true-negative set. We used the svm-light package developed by Joachims and coworkers [14]. The feature vector (FV) used for the support vector machines consisted of the amino acid sequence and the physicochemical properties of the amino acids. The sequence information was represented by a vector of 240 elements (12 × 20, with 6 residues on either side of the central serine and 20 amino acid types). Each component of the vector was set to 1 if the particular amino acid type occurred at the respective position. For the amino acid property part of the FV, we used the collection of 530 commonly used indices provided by the AAindex database [15], including hydrophobicity, solvent accessibility preferences, secondary and tertiary structure preferences, polarity, and volume, as well as structural disorder indices. The resulting vector consisted of 530 × 12 elements representing every index at every position around the central serine. Optimal parameters for the kernel decision function, as judged by the highest CC value obtained, were determined using the built-in leave-one-out (LOO) test for all parameter combinations of a polynomial kernel with degrees ranging from 2 to 4 and error weighting values (cost factor) ranging from 1 to 2.5 in 0.25 increments (21 possible parameter combinations) [16].
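The sequence part of the feature vector and the size of the parameter grid can be checked with a short Python sketch. This is an illustration under stated assumptions, not the PhosPhAt code: the ordering of amino acids in `AMINO_ACIDS`, the handling of unknown residues, and the function names `one_hot_window` and `parameter_grid` are hypothetical, and the example window is made up.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids (order is an assumption)
FLANK = 6  # residues on each side of the central serine


def one_hot_window(window):
    """Encode a 13-residue window as a 240-element 0/1 vector.

    The central residue is skipped, leaving 12 flanking positions with
    20 slots each. Assumption: unknown characters (for example padding
    'X') leave all 20 slots of their position at 0.
    """
    if len(window) != 2 * FLANK + 1:
        raise ValueError("window must contain 13 residues")
    flanks = window[:FLANK] + window[FLANK + 1:]
    vector = []
    for residue in flanks:
        slots = [0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:
            slots[index] = 1
        vector.extend(slots)
    return vector


def parameter_grid():
    """List (degree, cost factor) pairs: degrees 2 to 4, cost 1 to 2.5 in steps of 0.25."""
    degrees = [2, 3, 4]
    costs = [1 + 0.25 * step for step in range(7)]
    return [(degree, cost) for degree in degrees for cost in costs]


vector = one_hot_window("MKRLSDSPEEKLA")  # made-up window centered on a serine
grid = parameter_grid()
print(len(vector), sum(vector), len(grid))  # 240 12 21
```

The grid size follows directly from 3 polynomial degrees times 7 cost factors, which matches the 21 combinations quoted above.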

3 Prediction Tools and Function Modules

In this section, we describe the usage and functionalities of three current prediction tools: Musite, PhosPhAt, and PlantPhos.

3.1 Musite

3.1.1 Desktop Version

The Musite desktop version is written in Java and requires a Java Runtime Environment (JRE). It runs on both Mac and PC. It provides standalone prediction, which does not need an Internet connection. The pre-trained models are included in the package. The user can also train a prediction model on his/her own dataset. The installation package and the source code are available at http://musite.sourceforge.net/. The following steps show how to run Musite:

1. Go through “Tool → Feature Extraction → Disorder Prediction” to obtain the disorder scores for the samples (see Note 1).
2. Train your model by opening the training panel through “Tool → Prediction Model Training”. The data can be uploaded in Musite XML or FASTA format (see Note 2). The PTM type and the related amino acids need to be set (see Note 3).
3. Click the “Advanced Options” button to customize the features to be used and the training settings for the bootstrap (see Note 4).
4. After the model is trained, the user should be able to see and select it in the “Select a Model File” drop-down list and run predictions for query sequences.
5. The user can type or paste FASTA sequences into the edit box, or upload a FASTA file or a Musite XML file (see Note 5).
6. After the prediction is done, a new panel showing the results pops up (see Note 6). The results can be saved in Musite XML format, which contains the prediction score at every candidate position.

In the desktop version, Musite provides several tools that help preprocess the data and obtain statistics for the analysis. Musite can convert multiple file formats to or from the standard Musite XML format. It can also collect all the accession numbers from the training dataset and pile up the flanking peptides in the query sequences. Musite can be used to filter a protein data set by organism, accession, PTM type, and other annotations. Furthermore, Musite provides statistical tools to count sites or the overlap of sites among samples. Recall that data preprocessing usually requires a nonredundant dataset with less than 50 % sequence identity, which can also be generated with Musite. When multiple data resources are available, Musite can merge them based on sequence comparison.

Musite is an open-source tool that can be changed or customized for a specific purpose. Users can customize the training data set, typically from their own wet-lab experiments, to build a dedicated prediction model for a specific species or application. In addition, users who are familiar with the Java programming language can download the source code and refine the statistical methods or improve the feature extraction. The main modules in the code include the multiple data module (which defines the data structures for prediction and prediction results), the input/output (IO) module (which supports reading from or writing to files of various formats including FASTA, Musite XML, and UniProt XML), the feature extraction module (which defines the feature types and the extraction procedure), the classifier module (which defines the binary classifier), the user interface (UI) module (which provides a friendly graphical user interface (GUI) to the various functionalities), and the utility module (which defines commonly used functions). The open-source framework allows researchers in the bioinformatics or plant phosphorylation community to contribute to and improve Musite.

3.1.2 Web Interface

A Web application for Musite is available at http://musite.net. This web implementation does not provide customized training; it is for prediction only. Specific models for plant phosphorylation are provided on the website; they are pre-trained systematically on the P3DB database and are therefore expected to be of high quality. This plant-specific phosphorylation prediction tool can also be accessed through the P3DB toolkits (http://p3db.org/prediction.php). The procedure for using the Musite website is as follows:

1. Submit the protein list by providing the UniProt accessions directly, or paste a set of FASTA-formatted sequences into the edit box.
2. Select a prediction model and click “submit”.
3. The result panel will appear as a new tab. Multiple sequences are separated by stacked bars in the panel. Click one of the bars to obtain the detailed results (see Note 7).
4. Download the results for all the predicted sites or with a filter setting.

The website provides an API (Application Programming Interface) through which other websites can access our prediction services. An example of the API link format is http://musite.net/?seq=%s&model=overall.sixplant.ser.thr.model (where %s is the protein sequence; “model=overall.sixplant.tyr.model” selects the tyrosine phosphorylation prediction model). Our in-house P3DB database links to Musite in this way.
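The API link format quoted above can be filled in programmatically. The following Python sketch only builds the link; it does not send a request, since the chapter does not describe the response format and the service may no longer be reachable. The helper name `build_musite_url` and the input cleaning are assumptions; only the `seq` and `model` fields and the two model names come from the text.

```python
from urllib.parse import quote

BASE_URL = "http://musite.net/"
SER_THR_MODEL = "overall.sixplant.ser.thr.model"
TYR_MODEL = "overall.sixplant.tyr.model"


def build_musite_url(sequence, model=SER_THR_MODEL):
    """Fill the seq and model fields of the link pattern quoted in the text."""
    cleaned = "".join(sequence.split()).upper()  # drop whitespace and line breaks
    if not cleaned.isalpha():
        raise ValueError("sequence must contain only letters")
    return f"{BASE_URL}?seq={quote(cleaned)}&model={quote(model)}"


# Made-up peptide used only to show the resulting links.
ser_thr_url = build_musite_url("mkrlsd speekla")
tyr_url = build_musite_url("MKRLSDSPEEKLA", model=TYR_MODEL)
print(ser_thr_url)
```

Keeping the model name as a parameter makes it easy to request the serine/threonine and tyrosine models for the same sequence.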

3.2 PhosPhAt

1. Go to the “Prediction” field in the left tab.
2. Paste either an AGI code or a protein sequence into the respective query field and submit.
3. The protein prediction tab will open and show the protein sequence with highlighted phosphorylation sites. Mouse-over will give the respective prediction scores. A score >0 indicates a positive prediction.
4. Predictions can be exported using the “export” function in the upper right.

3.3 PlantPhos

PlantPhos is available at http://csb.cse.yzu.edu.tw/PlantPhos/Predict.html. A prediction can be run with the following procedure:

1. Paste the FASTA sequence or upload a FASTA file with sequences.
2. Select the target amino acid type for the prediction.
3. The results are listed as a table showing the amino acids at different positions with their phosphorylation scores. A negative score indicates a non-phosphorylated site; a positive score indicates a potential phosphorylation site. The closer the score is to zero, the more uncertain the prediction. The prediction model is precalculated with an HMM (Hidden Markov Model), and the output shows the matched motifs.
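The sign-based reading of the scores can be written down as a small Python sketch. It is only an illustration: the function name `interpret_score`, the `uncertainty_margin` cutoff of 0.1, and the example rows are assumptions, since the text says only that scores close to zero are less certain and gives no numeric threshold.

```python
def interpret_score(score, uncertainty_margin=0.1):
    """Map a signed prediction score to a label.

    uncertainty_margin is a hypothetical cutoff: scores whose absolute
    value falls below it are flagged as uncertain rather than called.
    """
    if abs(score) < uncertainty_margin:
        return "uncertain"
    if score > 0:
        return "potential phosphorylation site"
    return "non-phosphorylated site"


# Made-up (position, residue, score) rows standing in for a results table.
rows = [(12, "S", 1.42), (57, "T", -0.03), (88, "S", -0.87)]
calls = [(position, residue, interpret_score(score)) for position, residue, score in rows]
for call in calls:
    print(call)
```

Setting `uncertainty_margin=0` reduces the function to the plain positive/negative rule.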

4 Performance of the Different Prediction Tools

We compared the performance of the three software tools described above (Musite, PhosPhAt, and PlantPhos) using a benchmark dataset that we created. The dataset combines Arabidopsis data from P3DB (version 3.5) and UniProtKB (June 2014). Redundant proteins were removed at 50 % sequence identity by CD-HIT clustering. The data were divided into training and testing sets. The training set contains 4,875 phosphorylated proteins out of 16,668 proteins, and we created a balanced testing set with 484 phosphorylated proteins out of 968 total proteins. The numbers of sites (S, T, Y) are even more imbalanced, so we used 2,000 bootstraps in our Musite training. The same testing set was used to evaluate the performance of Musite, PhosPhAt, and PlantPhos based on ROC curves. A curve located closer to the top-left corner indicates better performance, with higher specificity and sensitivity. Figure 1 shows that, overall, Musite outperforms PhosPhAt, and PhosPhAt outperforms PlantPhos. In some cases, however, such as tyrosine prediction at low specificity, the sensitivity of Musite is lower than that of PhosPhAt and PlantPhos. This comparison assesses only specificity and sensitivity; other factors should be considered too. For example, PhosPhAt focuses specifically on Arabidopsis and integrates extensive information (e.g., domain identification) with the prediction tool. PlantPhos provides HMM motif matching as additional output. These tools therefore complement each other, and users can choose among them depending on the particular application and need.
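For readers who want to reproduce this kind of comparison on their own predictions, the following Python sketch shows how ROC points (sensitivity against 1 − specificity) can be computed from prediction scores and known labels. It is a generic illustration, not the evaluation code used for Fig. 1; the function names and the six toy scores are made up.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per score threshold.

    scores: prediction scores, higher meaning more likely phosphorylated.
    labels: True for phosphorylated sites, False otherwise.
    A site is called positive when its score is >= the threshold.
    """
    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative site")
    points = [(0.0, 0.0)]  # threshold above every score: nothing is called positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, label in zip(scores, labels) if s >= threshold and label)
        fp = sum(1 for s, label in zip(scores, labels) if s >= threshold and not label)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points (0.5 is chance level, 1.0 is perfect)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy predictions (made up): three phosphorylated and three non-phosphorylated sites.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False, False]
points = roc_points(scores, labels)
auc = area_under_curve(points)
print(round(auc, 3))  # 0.889
```

A curve that hugs the top-left corner yields an area close to 1, which is the numerical counterpart of the visual comparison described above.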

[Figure 1: two ROC plots. Panel (a) is titled “Arabidopsis Phosphorylation S/T” and panel (b) “Arabidopsis Phosphorylation Y”. In both panels the x-axis is 1-Specificity (0 to 1), the y-axis is Sensitivity (0 to 1), and the curves are Musite, PhosPhAt, and PlantPhos.]

Fig. 1 Receiver operating characteristic curves of the predictions of (a) serine and threonine phosphorylation and (b) tyrosine phosphorylation by the plant-specific prediction engines Musite, PhosPhAt, and PlantPhos

5 Conclusions

With the improvement of large-scale phosphorylation identification, in silico prediction tools are also becoming increasingly popular. Several plant phosphorylation prediction tools (Musite, PhosPhAt, and PlantPhos) are illustrated in this chapter, and their procedures, results, and strengths are described. Although the tools differ in performance and features, their general methodology and user experience are similar. By applying these tools, phosphorylation site identification can be enhanced, and proteome-wide studies become more practical at little cost in money or time. For example, proteome-wide prediction of phosphorylation sites was applied in a study of the effect of single nucleotide polymorphisms on protein phosphorylation sites [17]. A similar study was conducted for human tissues affected by different diseases [18]. These examples indicate the value of prediction data in large-scale biology.

6 Notes

1. If the disorder scores are already included in the Musite XML file, there is no need to run the disorder prediction beforehand. The disorder results are written automatically into the standard Musite XML format and can then be used directly for training.
2. When training from a FASTA file, the disorder scores cannot be included in the FASTA format, so the user can upload a disorder prediction result file separately.
3. The user can select multiple amino acid types together, and they will be predicted together in a single model. Usually serine and threonine are selected together and trained into a single model for phosphorylation prediction, while tyrosine is trained in a separate model. This is because serine and threonine phosphorylation events share major similarities in sequence-based information, kinase characteristics, and functionalities.
4. This provides considerable flexibility in training without changing the code. The parameters control the peptide length, the KNN calculation, and other properties. Parameters can also be tuned for the SVM training.
5. We recommend using the Musite XML format in this tool because it makes it more convenient to view the results and features and much easier to perform other operations.
6. The results panel shows the prediction results with a color-coded background on each residue that is to be predicted. The color gradient shows the prediction under a certain specificity. The scroll bar at the bottom can be adjusted, and different numbers of sites will be shown as phosphorylated under different specificity thresholds. A common setting is 95 % specificity.
7. This is an interactive interface. The potential phosphorylation sites are highlighted with different colors coded for specificity values under a specificity threshold. The user can choose to hide or show the sequence and the specificity scroll bar. Clicking “graph” shows the specificity levels of the candidate amino acids at different positions, with colors coded for specificity. Clicking “table” shows the results in tabular format.

References

1. Pawson T (2004) Specificity in signal transduction: from phosphotyrosine-SH2 domain interactions to complex cellular systems. Cell 116(2):191–203
2. Pawson T, Gish GD (1992) SH2 and SH3 domains: from structure to function. Cell 71:359–362
3. Wang H, Chevalier D, Larue C, Ki Cho S, Walker JC (2007) The protein phosphatases and protein kinases of Arabidopsis thaliana. Arabidopsis Book 5:e0106. doi:10.1199/tab.0106
4. Grimsrud PA, den Os D, Wenger CD, Swaney DL, Schwartz D, Sussman MR, Ane JM, Coon JJ (2010) Large-scale phosphoprotein analysis in Medicago truncatula roots provides insight into in vivo kinase activity in legumes. Plant Physiol 152(1):19–28
5. Yao Q, Ge H, Wu S, Zhang N, Chen W, Xu C, Gao J, Thelen JJ, Xu D (2014) P3DB 3.0: from plant phosphorylation sites to protein networks. Nucleic Acids Res 42:D1206–D1213
6. Zulawski M, Braginets R, Schulze WX (2013) PhosPhAt goes kinases - searchable protein kinase target information in the plant phosphorylation site database PhosPhAt. Nucleic Acids Res 41(D1):D1176–D1184
7. Yao Q, Gao J, Bollinger C, Thelen JJ, Xu D (2012) Predicting and analyzing protein phosphorylation sites in plants using musite. Front Plant Sci 3:186. doi:10.3389/fpls.2012.00186
8. Gao J, Thelen JJ, Dunker AK, Xu D (2010) Musite, a tool for global prediction of general and kinase-specific phosphorylation sites. Mol Cell Proteomics 9(12):2586–2600
9. Lee TY, Bretana NA, Lu CT (2011) PlantPhos: using maximal dependence decomposition to identify plant phosphorylation sites with substrate site specificity. BMC Bioinformatics 12:261. doi:10.1186/1471-2105-12-261
10. UniProt: a hub for protein information (2014) Nucleic Acids Res. doi:10.1093/nar/gku989
11. Fu L, Niu B, Zhu Z, Wu S, Li W (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28(23):3150–3152. doi:10.1093/bioinformatics/bts565
12. Obradovic Z, Peng K, Vucetic S, Radivojac P, Dunker AK (2005) Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins 61(Suppl 7):176–182. doi:10.1002/prot.20735
13. Iakoucheva LM, Radivojac P, Brown CJ, O'Connor TR, Sikes JG, Obradovic Z, Dunker AK (2004) The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Res 32(3):1037–1049. doi:10.1093/nar/gkh253
14. Joachims T (1999) Making large-scale SVM learning practical. In: Advances in kernel methods - support vector learning. MIT Press, Boston
15. Kawashima S, Kanehisa M (2000) AAindex: amino acid index database. Nucleic Acids Res 28(1):374
16. Durek P, Schmidt R, Heazlewood JL, Jones A, MacLean D, Nagel A, Kersten B, Schulze WX (2010) PhosPhAt: the Arabidopsis thaliana phosphorylation site database. An update. Nucleic Acids Res 38:D828–D834
17. Riano-Pachon DM, Kleessen S, Neigenfind J, Durek P, Weber E, Engelsberger WR, Walther D, Selbig J, Schulze WX, Kersten B (2010) Proteome-wide survey of phosphorylation patterns affected by nuclear DNA polymorphisms in Arabidopsis thaliana. BMC Genomics 11(1):411
18. Ren J, Jiang C, Gao X, Liu Z, Yuan Z, Jin C, Wen L, Zhang Z, Xue Y, Yao X (2010) PhosSNP for systematic analysis of genetic polymorphisms that influence protein phosphorylation. Mol Cell Proteomics 9(4):623–634
