Journal of Theoretical Biology ∎ (∎∎∎∎) ∎∎∎–∎∎∎

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Q1 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66

Contents lists available at ScienceDirect

Journal of Theoretical Biology journal homepage: www.elsevier.com/locate/yjtbi

Chou's pseudo amino acid composition improves sequence-based antifreeze protein prediction Sukanta Mondal n, Priyadarshini P. Pai Department of Biological Sciences, Birla Institute of Technology and Science – Pilani, K.K. Birla Goa Campus, Zuarinagar 403726, Goa, India

H I G H L I G H T S

 Antifreeze protein (AFP) prediction is challenging because of their diversity.  Support vector machines using sequence order explored for AFP prediction.  Sequence order information encoded using pseudo amino acid composition.  Introduction of sequence order effect enhances AFP prediction.

G R A P H I C A L

A B S T R A C T

Antifreeze Proteins Pseudo Amino Acid Composition

QUERY Sequence(s)

AND

OR Non-Antifreeze Proteins

Support Vector Machines

art ic l e i nf o

a b s t r a c t

Article history: Received 25 February 2014 Received in revised form 28 March 2014 Accepted 2 April 2014

Antifreeze proteins (AFP) in living organisms play a key role in their tolerance to extremely cold temperatures and have a wide range of biotechnological applications. But on account of diversity, their identification has been challenging to biologists. Earlier work explored in this area has yet to cover introduction of sequence order information which is known to represent important properties of various proteins and protein systems for prediction purposes. In this study, the effect of Chou's pseudo amino acid composition that presents sequence order of proteins was systematically explored using support vector machines for AFP prediction. Our findings suggest that introduction of sequence order information helps identify AFPs with an accuracy of 84.75% on independent test dataset, outperforming approaches such as AFP-Pred and iAFP. The relative performance calculated using Youden's Index (SensitivityþSpecificity  1) was found to be 0.71 for our predictor (AFP-PseAAC), 0.48 for AFP-Pred and 0.05 for iAFP. We hope this novel prediction approach will aid in AFP based research for biotechnological applications. & 2014 Published by Elsevier Ltd.

Keywords: Convergent evolution Support vector machines Sequence order effect 10-Fold cross-validation Youden's Index

1. Introduction Antifreeze proteins (AFPs) are a diverse group of polypeptides found in animals, plants, microbes, especially in fish inhabitants of ice-laden sea water, which provide protection from freezing in extremely cold environments (Se-Kwon, 2013). This defense is bestowed upon the organisms at cellular levels as AFPs have

n

Corresponding author. Tel.: þ 91 832 258 0149; fax: þ91 832 255 7031. E-mail addresses: [email protected], [email protected] (S. Mondal).

a unique ability to adsorb onto the surface of ice (Davies and Hew, 1990). Depending upon the surrounding, organisms adopt two strategies namely freeze tolerance and freeze avoidance to survive at low and subzero temperatures (Lewitt, 1980; Sformo et al., 2009; Chou, 1992), which may account for the diversity observed among various species. Analyses of AFPs from fish, insects and plants have shown that there is no consensus sequence or structure for an ice-binding domain. Such an insight, at sequence or structural level, is important for understanding protein–ice interactions and freeze tolerance mechanisms of AFPs. Since these proteins hold a promising scope for wide range of biotechnological applications in

http://dx.doi.org/10.1016/j.jtbi.2014.04.006 0022-5193/& 2014 Published by Elsevier Ltd.

Please cite this article as: Mondal, S., Pai, P.P., Chou's pseudo amino acid composition improves sequence-based antifreeze protein prediction. J. Theor. Biol. (2014), http://dx.doi.org/10.1016/j.jtbi.2014.04.006i

S. Mondal, P.P. Pai / Journal of Theoretical Biology ∎ (∎∎∎∎) ∎∎∎–∎∎∎

2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66

industry, medicine, food technology, cell lines and organ preservation, cryosurgery and transgenics, gaining knowledge into their functional mechanisms has become increasingly essential (Kandaswamy et al., 2011). With the enormous amount of genomic data available today, a rapid, specific and highly precise automated approach is desirable for identification and annotations of AFPs. Researchers, encouraged by the overwhelming success of machine learning methods in protein classification and function prediction (Kandaswamy et al., 2011; Anand et al., 2008; Cai et al., 2004; Chou, 2001, 2005; Chou and Cai, 2005; Chou and Shen, 2009a; Huang et al., 2009; Xiaowei et al., 2012; Yu and Lu, 2011), developed sequence based solutions such as AFP-Pred (Kandaswamy et al., 2011), AFP_PSSM (Xiaowei et al., 2012) and iAFP approach (Yu and Lu, 2011). These methods explored physicochemical properties (Kandaswamy et al., 2011), evolutionary information (Xiaowei et al., 2012), n-peptide composition based coding schemes (Yu and Lu, 2011) for prediction. However, the scope for truly reflecting the intrinsic correlation of sequence representatives with the object to be predicted by machine learning remained. In this regard, sequential and discrete models were formulated to represent various proteins. But using straightforward sequential models was not so fruitful in preserving the sequence order information. So, to address this issue, the concept of pseudo amino acid composition (pseAAC) was proposed (Chou, 2009). In pseAAC, protein sequences are represented as discrete models yet without completely losing the sequence order information. Ever since this idea of pseAAC was proposed, many models for addressing various kinds of problems in proteins and protein related systems have been put forth. Further, different modes of optimal pseAAC composition are known to correspond to different protein attributes. Subsequently, the process of key components selection from the trivial ones for obtaining its pseAAC has become challenging. However, as pseAAC gives important direction for further improvement of the quality of protein attribute prediction, it has captured the interest of biologists (Chou, 2009). In this study, we have explored the effect of introducing sequence order information in prediction of AFPs by using Chou's pseudo amino acid composition based protein features extracted from AFP dataset (Kandaswamy et al., 2011) followed by classification using support vector machines (Joachims, 1999). The prediction approach has been designed on the basis of previously reported successful studies (Chen et al., 2013, 2012b; Min et al., 2013; Xiao et al., 2013; Qiu et al., 2014; Chou, 2011; Mondal et al., 2006; Dhole et al., 2014). Aspects of example selection, influence of large numbers of negative examples and pseAAC modes were searched for development of the AFP predictor. Training was done using 10-fold cross-validation (Schaffer, 1993) for optimization of parameters followed by rigorous check using jackknife test (Mondal et al., 2006; Dhole et al., 2014). Consequently, testing on independent test dataset was done to analyze if sequence order information improved the overall prediction accuracy of AFPs. The findings of our study can facilitate AFP based studies and applications.

2. Materials and methods 2.1. Dataset We obtained AFP dataset used for the development of AFP-Pred (Kandaswamy et al., 2011) constructed as follows: antifreeze protein sequences were collected from seed proteins of the Pfam database (Sonnhammer et al., 1997), enriched by performing Position Specific Iteration-Basic Local Alignment Search Tool (PSI-BLAST) (Altschul et al., 1997) with a string threshold (E-value)

of 0.001, and followed by manual inspection for presence of AFPs. Finally non-redundant datasets comprising of 481 AFPs (positive examples) and 9493 non-AFPs (negative examples) were obtained. 2.2. Pseudo amino acid composition Representing protein sequences with sequence order information herein is done using pseudo amino acid composition (Chou, 2009, 2011). To develop an effective predictor, a powerful prediction algorithm with an effective mathematical expression, truly representing the protein sequence in correlation with the object to be predicted, is required. Ever since the concept of pseAAC was proposed in 2001 (Chou, 2001), it has penetrated into various areas of computational proteomics (Chou, 2011; Mondal et al., 2006; Xu et al., 2013; Ding et al., 2009; Lin et al., 2009, 2013b; Du and Li, 2006; Chen et al., 2012a; Feng et al., 2013a). Owing to its diverse potential applications, beginning with software such as PseAAC web server (Shen and Chou, 2008) to advanced powerful open access packages (Du et al., 2012, 2014; Cao et al., 2013; Guo et al., 2014; Feng et al., 2013b) have been devised. These have demonstrated effectiveness in relevant fields, altogether, suggesting broad-range successful applications of both traditional and advanced forms of pseAAC. Different properties of amino acids in the proteins correspond to different modes of pseAAC composition. Thus, to derive the discrete models from the proteins for our study, the following mathematical functions were used (Chou, 2005, 2009). The composition of a given protein P containing amino acid residues of length L is depicted as P ¼ R1 R2 R3 R4 R5 …RL

ð1Þ

where R1 represents the 1st residue, R2 represents the 2nd residue, …, and RL represents the L-th residue, and they each belong to one of the 20 native amino acids. In the classical mode (Type 1), amino acid composition is expressed as P ¼ ½p1 ; p2 ; …; p20 ; p20 þ 1 ; …; p20 þ λ T ;

ðλ oLÞ

where the 20 þλ components are given by 8 fu > ð1 ru r 20Þ < ∑20 f þ ω∑ λ τ ; i ¼ 1 i k ¼ 1 k pu ¼ ωτu  20 > ; ð20 þ 1 ru r 20 þ λÞ : ∑20 f þ ω∑λk ¼ 1 τk i ¼ 1 i

ð2Þ

ð3Þ

where ω is the weight factor and τk is the k-th tier correlation factor that reflects the sequence order correlation between all the k-th most contiguous residues as formulated by τk ¼

2  2 1 L  k 1 n H 1 ðRi þ k Þ  H 1 ðRi Þ þ H 2 ðRi þ k Þ  H 2 ðRi Þ ∑ Lk i ¼ 1 3  2 o þ MðRi þ k Þ  MðRi Þ

ð4Þ

where H1(Ri), H2(Ri) and M(Ri) are respectively the normalized hydrophobicity value, hydrophilicity value and the side chain mass for amino acid residue Ri; while H1(Ri þ k), H2(Ri þ k) and M(Ri þ k) are those for amino acid residue Ri þ k. In the amphiphilic mode (Type 2), the given protein P can be represented as: P ¼ ½p1 ; p2 ; …; p20 ; p20 þ 1 ; …; p20 þ 2λ T ;

ðλ o LÞ

where the 20 þ2λ components are given by 8 f > < ∑20 f i þvω∑2λ τj ; ð1 rv r 20Þ i ¼ 1 j ¼ 1 pv ¼ ωτv > ; ð20 þ 1 rv r 20 þ 2λÞ 2λ : ∑20 f þ ω∑ τ i ¼ 1 i j ¼ 1 j

ð5Þ

ð6Þ

Please cite this article as: Mondal, S., Pai, P.P., Chou's pseudo amino acid composition improves sequence-based antifreeze protein prediction. J. Theor. Biol. (2014), http://dx.doi.org/10.1016/j.jtbi.2014.04.006i

S. Mondal, P.P. Pai / Journal of Theoretical Biology ∎ (∎∎∎∎) ∎∎∎–∎∎∎

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66

and τj can be represented as follows: Ll

τm ¼

1 1 1 ∑ h ðRi Þh ðRi þ l Þ L l i ¼ 1

ð7Þ

τn ¼

1 Ll 2 2 ∑ h ðRi Þh ðRi þ l Þ Ll i ¼ 1

ð8Þ

where m ¼1, 3, 5, …, (2λ  1); n¼ 2, 4, 6, …, 2λ and l ¼1, 2, 3, 4, …λ (λ oL). The values of h1(Ri) and h2(Ri) represent the normalized hydrophobicity and hydrophilicity properties of amino acid residue Ri; while h1(Ri þ l) and h2(Ri þ l) are those for amino acid residue Ri þ l. Multiple combination of λ in [5, 10, 15, 20, 30, 40] and ω in [0.05, 0.10, 0.30, 0.50, 0.70] were explored using the PseAAC web server (Shen and Chou, 2008) available at http://www.csbio.sjtu. edu.cn/bioinf/PseAAC/ for prediction. 2.3. Support vector machines To discriminate AFPs from non-AFPs using pseAAC features, a classifier popular for solving biological challenges related to prediction, classification and regression, the support vector machines (SVMs) was used. SVM is a supervised machinelearning tool based on the structural risk minimization principle of statistics learning theory. It looks for an optimal hyperplane which maximizes the distance between the hyperplane and the nearest samples from each of the two classes. Mathematically, a training vector xi A Rn, and class values yi A {  1, 1}, i¼1, …, N are used to solve the problems using the following equations: N 1 Minimize wT w þ C ∑ ξi 2 i¼1

Subject to yi ðwT xi þ bÞ Z1  ξi and ξi Z0

ð9Þ ð10Þ

where w is the normal vector perpendicular to the hyperplane and ξi are slake variables for permitting misclassifications. Balancing the trade-off between the margin and the training error is done using C (40), the penalty parameter (Joachims, 1999). The user can choose and optimize number of parameters and kernels (e.g., linear, polynomial, radial basis function and sigmoidal) or any user-defined kernel. In this study, we selected radial basis function for AFP prediction with grid searching of influencing parameters, C in [5–50] and γ in [0.00001–0.1] and developed models using SVMlight Version 6.02 package available at http://svmlight.joachims.org/. 2.4. Performance evaluation In statistical prediction, often three cross-validation methods are used to examine a predictor for its effectiveness in practical application: independent dataset test; subsampling test; and jackknife test. However of the three test methods, the jackknife test is deemed the least arbitrary that can always yield a unique test result for a given benchmark dataset as elaborated by Chou (2011) and demonstrated by Eqs. (28)–(30) therein. Accordingly the jackknife test has been increasingly used and widely recognized by investigators to examine the quality of various predictors (Xiao et al., 2013; Qiu et al., 2014; Mondal et al., 2006; Dhole et al., 2014; Guo et al., 2014; Feng et al., 2013c; Lin et al., 2013a; Mohabatkar, 2010; Sahu and Panda, 2010; Esmaeili et al., 2010; Fan et al., 2014; Liu et al., 2014). However, to reduce the computational time, 10-fold cross-validation has been adopted in this study for addressing parameter optimization. Once the optimal values of the parameters (ω, λ and pseAAC type) are determined, the rigorous jackknife test would be performed to finally evaluate the anticipated success rate of the predictor. In the jackknife test, each sequence in the training dataset is in turn

3

singled out as an independent test sample and all the ruleparameters are calculated without including the one being identified. Models of AFPs and non-AFPs were generated in various combinations of pseAAC and SVM parameters on randomly selected 300 positive and 300 negative examples (training dataset) followed by 10-fold cross-validation (Schaffer, 1993) and performance analysis using the following mathematical formulas for sensitivity (recall), specificity, accuracy and Matthew's correlation coefficient (MCC). Sensitivity ¼ TP=ðTP þ FNÞ

ð11Þ

Specificity ¼ TN=ðTN þ FPÞ

ð12Þ

Accuracy ¼ ðTP þ TNÞ=ðTP þFN þ TN þFPÞ

ð13Þ

MCC ¼ ððTP  TNÞ  ðFP  FNÞÞ=√ððTP þ FPÞ  ðTP þFNÞ ðTN þ FPÞ  ðTN þFNÞÞ

ð14Þ

true positives (TP): proteins correctly predicted as AFPs, false positives (FP): proteins incorrectly predicted as AFPs, true negatives (TN): proteins correctly predicted as non-AFPs and false negatives (FN): proteins incorrectly predicted as non-AFPs. Since the number of AFPs i.e., 481 positive examples were much lesser than 9423 negative examples, we investigated for bias in identification due to selection process, by keeping the number of positive examples constant to 300. Briefly, we selected 300 negative examples, three times randomly for model generation and evaluated the prediction using the above mentioned mathematical formulae after 10-fold cross-validation. Then, we explored the influence of number of negative examples on the AFP prediction also. We did this by including 900 negative examples instead of the original ratio 1:1 for development of models followed by performance assessment using same evaluation parameters as mentioned above after 10-fold cross-validation. Once we gained insights into the selection bias and influence of negative examples on prediction, the best performing models were selected for prediction on independent test dataset. This comprised of examples in Section 2.1 not included in training, i.e., 181 AFPs and 8293 non-AFPs. Subsequently, their performance was evaluated and compared with existing AFP prediction approaches.

3. Results and discussion 3.1. Optimal parameters for feature extraction Prediction of protein attributes requires selection of key representative parameters for improved quality. In this study, various combination of λ, ω and modes of pseudo amino acid composition and SVMs were generated for predictor development by grid searching on 300 positive and 300 negative examples. Fig. 1 shows the performance evaluation (MCC and accuracy) for various λ, ω and modes of pseAAC. MCC and accuracy were highest for ω¼0.05. Further, it can be seen that with increase in ω, the quality of prediction is compromised. The best performing model was achieved for ω¼ 0.05 and λ¼5; in the amphiphilic mode, closely followed by performance with ω¼0.05 and λ¼10; in the amphiphilic mode. Optimal features i.e., ω¼0.05 and lower values of λ in the amphiphilic mode (Type 2) were reserved for development of the predictor. 3.2. Bias during selection of negative examples To understand if selecting 300 examples from a 9493 proteins biased the prediction, we performed random selection of negative examples three times on non-overlapping datasets. Since the number of positive examples was limited, we considered the same set of 300 antifreeze proteins as positive examples for all training

Please cite this article as: Mondal, S., Pai, P.P., Chou's pseudo amino acid composition improves sequence-based antifreeze protein prediction. J. Theor. Biol. (2014), http://dx.doi.org/10.1016/j.jtbi.2014.04.006i

4

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66

S. Mondal, P.P. Pai / Journal of Theoretical Biology ∎ (∎∎∎∎) ∎∎∎–∎∎∎

Fig. 1. Exploring optimal pseudo amino acid composition parameters (Types, λ and ω) on the training dataset for the development of the AFP predictor.

purposes. As shown in Table 1 from the prediction performance, the average of mathematical parameters of evaluation mentioned in Section 2.4 for all the generated models suggested the best values for λ ¼10. The highest prediction accuracy was 89.69 (0.706)%, MCC ¼0.800 (0.0095), sensitivity ¼ 88.89 (1.835)% and specificity ¼91.00 (0.330)%. The standard deviation observed in the performance evaluation parameters over prediction performed three times was negligible. Further, upon conducting the jackknife test, our model obtained 88.56 (1.019)%, MCC ¼0.771 (0.0203), sensitivity ¼87.89 (1.261)% and specificity ¼89.22 (0.840)%. Therefore, it can be said that the bias associated with selection of negative examples was minimally influential, if at all.

included 900 non-AFPs (three times higher than the number of positive examples ¼ 300) during the prediction. This showed increase in accuracy from 89.69% to 91.25%; specificity from 91.00% to 96.78%; decrease in MCC from 0.800 to 0.762; and compromised sensitivity from 88.89% to 77.67%. However, if analyzed in relation with positive and negative examples used in the ratio 1:1, the performance of the predictor models was better with balanced dataset, as can be seen in Section 3.2. Therefore, 1:1 ratio of examples in dataset was maintained throughout the predictor development process.

3.3. Bias on account of number of negative examples

After selection of optimal parameters and search for possible bias as mentioned in above sections, the best model was selected for AFP predictor development and named as AFP-PseAAC. Precisely, this included pseAAC parameters ω¼0.05 and λ¼ 10 in

After exploring the selection bias, we investigated into the influence of the number of negative examples on prediction. We

3.4. Performance evaluation and comparison with AFP-Pred and iAFP

Please cite this article as: Mondal, S., Pai, P.P., Chou's pseudo amino acid composition improves sequence-based antifreeze protein prediction. J. Theor. Biol. (2014), http://dx.doi.org/10.1016/j.jtbi.2014.04.006i

S. Mondal, P.P. Pai / Journal of Theoretical Biology ∎ (∎∎∎∎) ∎∎∎–∎∎∎

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66

5

Table 1 Influence of negative examples during AFP prediction. Dataset

λ

MCC

Accuracy (%)

Sensitivity (%)

Specificity (%)

300 AFPs and 300 non-AFPsa

05 10 15 20

0.797 (0.0035) 0.800 (0.0095) 0.796 (0.0148) 0.786 (0.0078)

89.61 (0.344) 89.69 (0.706) 89.61 (0.919) 89.17 (0.440)

88.45 88.89 89.22 88.89

91.00 (0.670) 91.00 (0.330) 90.52 (0.790) 90.33 (0.665)

300 AFPs and 900 non-AFPs

05 10 15 20

0.755 0.762 0.775 0.773

90.92 91.25 91.83 91.83

78.67 77.67 76.67 77.33

a

96.0 96.78 97.11 97.22

Format: average evaluation parameter (standard deviation) upon random selection of negative examples three times.

Table 2 Performance of AFP-PseAAC compared with AFP-Pred (Kandaswamy et al., 2011) and iAFP (Yu and Lu, 2011) on independent test dataset. Predictor

Accuracy (%)

Sensitivity (%)

Specificity (%)

Youden's Indexa

AFP-PseAAC AFP-Pred iAFP

84.75 69.86 95.46

86.19 78.45 7.18

84.72 69.67 97.38

0.71 0.48 0.05

a

(1.072) (1.835) (1.575) (1.018)

Youden's Index (Youden, 1950) ¼ Sensitivity þSpecificity  1.

amphiphilic mode; SVM parameters C¼ 25 and γ ¼0.0005 and 1:1 ratio of training dataset. Upon screening of protein examples excluded from training of the predictor (independent test dataset) an accuracy of 84.75%, sensitivity of 86.19% and specificity of 84.72% was obtained as shown in Table 2. These findings encouraged us to compare the performance of our method with that of other previously published methods. AFP-PseAAC was assessed in relation with AFP-Pred and iAFP to gain insights into its relative prediction power on the independent test dataset. While AFP-PseAAC achieved an accuracy and specificity of (84.75% and 84.72%); these values seen for AFP-Pred and iAFP were (69.86%, 69.67%) and (95.46%, 97.38%) respectively. The high accuracy and specificity achieved for iAFP, i.e., above 95%, was notably accompanied with extremely low sensitivity, i.e., 7.18%. On the contrary, AFP-PseAAC showed sensitivity of 86.19%, followed by AFP-Pred which reached close to 78.45%. Clearly, AFP-PseAAC outperformed AFP-Pred and iAFP. Additionally, to quantify the relative performance of AFP-PseAAC we applied the Youden's Index (J¼SensitivityþSpecificity 1) (Youden, 1950) used for gaining insights into the relative performance of tests in general. Youden's Index gives the probability of an informed decision and is advantageous as it offers comparison of AFP prediction quality by means of a single informative parameter. AFPPseAAC showed J¼ 0.71, followed by AFP-Pred with J¼0.48 and then iAFP with J¼0.05, as shown in Table 2. Since J value for AFP-PseAAC is much higher than the two approaches, it can be suggested that introduction of sequence order effect using Chou's pseAAC enhances AFP identification by SVMs.

3.5. Standalone package Since user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful models, simulated methods, or predictors (Chou and Shen, 2009b; Lin and Lapointe, 2013), we shall make efforts in our future work to provide a web-server for the method presented in this paper. Currently, we have user-friendly and freely-available software for interested users at https://sites.google.com/site/sukantamondal/ software.

4. Conclusion Biological diversity makes accurate identification of antifreeze proteins difficult. Earlier work reported sequence based solutions but sequence order effect which is known to improve prediction of protein attributes remained to be explored. Since Chou's pseudo amino acid composition features represents diverse proteins as discrete models yet without entirely losing the sequence order information, we hoped to develop a novel effective approach AFPPseAAC for prediction of antifreeze proteins using pseAAC and SVMs. The overall performance shown by AFP-PseAAC was better than previous AFP predictors and has been provided for interested users as a freely-available software package. We anticipate this predictor to facilitate faster and broader applications of AFPs in biotechnology.

Acknowledgment The authors thank BITS – Pilani, K.K. Birla Goa Campus, for providing the necessary support towards conducting this research and the anonymous reviewers for their valuable suggestions. References

Q2

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402. Anand, A., Pugalenthi, G., Suganthan, P.N., 2008. Predicting protein structural class by SVM with class-wise optimized features and decision probabilities. J. Theor. Biol. 253, 375–380. Cai, Y.D., Ricardo, P.W., Jen, C.H., Chou, K.C., 2004. Application of SVM to predict membrane protein types. J. Theor. Biol. 226, 373–376. Cao, D.S., Xu, Q.S., Liang, Y.Z., 2013. Propy: a tool to generate various modes of Chou's PseAAC. Bioinformatics 29, 960–962. Chen, W., Feng, P., Lin, H., 2012a. Prediction of ketoacyl synthase family using reduced amino acid alphabets. J. Ind. Microbiol. 39, 579–584. Chen, W., Lin, H., Feng, P.M., Ding, C., Zuo, Y.C., Chou, K.C., 2012b. iNuc-PhyChem: a sequence-based predictor for identifying nucleosomes via physicochemical properties. PLoS One 7, e47843. Chen, W., Feng, P.M., Chou, K.C., 2013. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res. 41, e68. Chou, K.C., 1992. Energy-optimized structure of antifreeze protein and its binding Q3 mechanism. J. Mol. Biol., 509–517. Chou, K.C., 2001. Prediction of protein cellular attributes using pseudo amino acid composition. Proteins: Struct. Funct. Genet. 43, 246–255. Chou, K.C., 2005. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21, 10–19. Chou, K.C., 2009. Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Curr. Proteomics 6, 262–274. Chou, K.C., 2011. Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. 273, 236–247. Chou, K.C., Cai, Y.D., 2005. Prediction of membrane protein types by incorporating amphipathic effects. J. Chem. Inf. Model. 45, 407–413. Chou, K.C., Shen, H.B., 2009a. Recent advances in developing web-servers for predicting protein attributes. Nat. Sci. 1, 63–92. Chou, K.C., Shen, H.B., 2009b. Review: recent advances in developing web-servers for predicting protein attributes. Nat. Sci. 2, 63–92. Davies, P.L., Hew, C.L., 1990. Biochemistry of fish antifreeze proteins. FASEB J. 4, 2460–2468.

Please cite this article as: Mondal, S., Pai, P.P., Chou's pseudo amino acid composition improves sequence-based antifreeze protein prediction. J. Theor. Biol. (2014), http://dx.doi.org/10.1016/j.jtbi.2014.04.006i

6

1 2 3 4 5 6 7 Q4 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 Q5 29 30 31 32 33 34 35 36 37 38 39

S. Mondal, P.P. Pai / Journal of Theoretical Biology ∎ (∎∎∎∎) ∎∎∎–∎∎∎

Dhole, K., Singh, G., Pai, P.P., Mondal, S., 2014. Sequence-based prediction of protein–protein interaction sites with L1-logreg classifier. J. Theor. Biol. 348, 47–54. Ding, H., Luo, L., Lin, H., 2009. Prediction of cell wall lytic enzymes using Chou's amphiphilic pseudo amino acid composition. Protein Pept. Lett. 16, 351–355. Du, P., Li, Y., 2006. Prediction of protein submitochondria locations by hybridizing pseudo-amino acid composition with various physicochemical features of segmented sequence. BMC Bioinform. 7, 518. Du, P., Wang, X., Xu, C., Gao, Y., 2012. PseAAC-Builder: a cross-platform stand-alone program for generating various special Chou's pseudo-amino acid compositions. Anal. Biochem. 425, 117–119. Du, P., Gu, S., Jiao, Y., 2014. PseAAC-General: fast building various modes of general form of Chou's pseudo-amino acid composition for large-scale protein datasets. Int. J. Mol. Sci. 15, 3495–3506. Esmaeili, M., Mohabatkar, H., Mohsenzadeh, S., 2010. Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses. J. Theor. Biol. 263, 203–209. Fan, Y.N., Xiao, X., Min, J.L., Chou, K.C., 2014. iNR-Drug: predicting the interaction of drugs with nuclear receptors in cellular networking. Int. J. Mol. Sci. 15, 4915–4937. Feng, P.M., Chen, W., Lin, H., Chou, K.C., 2013a. iHSP-PseRAAC: Identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Anal. Biochem. 442, 118–125. Feng, P.M., Ding, H., Chen, W., Lin, H., 2013b. Naïve Bayes classifier with feature selection to identify phage virion proteins. Comput. Math. Methods Med. 2013, 530696. Feng, P.M., Lin, H., Chen, W., 2013c. Identification of antioxidants from sequence information using naïve Bayes. Comput. Math. Methods Med. 2013, 567529. Guo, S.H., Deng, E.Z., Xu, L.Q., Ding, H., Lin, H., Chen, W., Chou, K.C., 2014. iNucPseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition. Bioinformatics , http: //dx.doi.org/10.1093/bioinformatics/btu083. Huang, R.B., Du, Q.S., Wei, Y.T., Pang, Z.W., Wei, H., Chou, K.C., 2009. Physics and chemistry-driven artificial neural network for predicting bioactivity of peptides and proteins and their design. J. Theor. Biol. 256, 428–435. Joachims, T., 1999. Making large-scale SVM learning practical. In: Schö lkopf, B., Burges, C., Smola, A. (Eds.), Advances in Kernel Methods – Support Vector Learning. MIT-Press. Kandaswamy, K.K., Chou, K.C., Martinez, T., Mӧller, S., Suganthan, P.N., Sridharan, S., Pugalenthi, G., 2011. AFP-Pred: a random forest approach for predicting antifreeze proteins from sequence-derived properties. J. Theor. Biol. 270, 56–62. Lewitt, J., 1980. Responses of Plants to Environmental Stresses, vol. 1. Academic Press, New York. Lin, H., Wang, H., Ding, H., Chen, Y.L., Li, Q.Z., 2009. Prediction of subcellular localization of apoptosis protein using Chou's pseudo amino acid composition. Acta Biotheor. 57, 321–330. Lin, H., Chen, W., Yuan, L.F., Li, Z.Q., Ding, H., 2013a. Using over-represented tetrapeptides to predict protein submitochondria locations. Acta Biotheor. 61, 259–268.

Lin, H., Chen, W., Ding, H., 2013b. AcalPred: a sequence-based tool for discriminating between acidic and alkaline enzymes. PLoS One 8, e75726. Lin, S.X., Lapointe, J., 2013. Theoretical and experimental biology in one. J. Biomed. Eng. 6, 435–442. Liu, B., Zhang, D., Xu, R., Xu, J., Wang, X., Chen, Q., Dong, Q., Chou, K.C., 2014. Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection. Bioinformatics 30, 472–479. Min, J.L., Xiao, X., Chou, K.C., 2013. iEzy-drug: a web-server for identifying the interaction between enzymes and drugs in cellular networking. BioMed Res. Int., 701317. Mohabatkar, H., 2010. Prediction of cyclin proteins using Chou's pseudo amino acid composition for protein structural class prediction. Comput. Biol. Chem. 34, 320–327. Mondal, S., Bhavna, R., Mohan, B.R., Ramakumar, S., 2006. Pseudo amino acid composition and multi-class support vector machines approach for conotoxin superfamily classification. J. Theor. Biol. 243, 252–260. Qiu, W.R., Xiao, X., Chou, K.C., 2014. iRSpot-TNCPseAAC: identify recombination spots with trinucleotide composition and pseudo amino acid components. Int. J. Mol. Sci. 15, 1746–1766. Sahu, S.S., Panda, G., 2010. A novel feature representation method based on Chou's pseudo amino acid composition for protein structural class prediction. Comput. Biol. Chem. 34, 320–327. Schaffer, C., 1993. Selecting a classification method by cross-validation. Mach. Learn. 13, 135–143. Se-Kwon, K., 2013. Marine Proteins and Peptides: Biological Activities and Applications, first ed. John Wiley & Sons, United Kingdom. Sformo, T., Kohl, F., McIntyre, J., Kerr, P., Duman, J.G., Barnes, B.M., 2009. Simultaneous freeze tolerance and avoidance in individual fungus gnats, Exechia nugatoria. J. Comp. Physiol. B 179, 897–902. Shen, H.B., Chou, K.C., 2008. PseAAC: a flexible web-server for generating various kinds of protein pseudo amino acid composition. Anal. Biochem. 373, 386–388. Sonnhammer, E.L., Eddy, S.R., Durbin, R., 1997. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 28, 405–420. Xiao, X., Min, J.L., Wang, P., Chou, K.C., 2013. iCDI-PseFpt: identify the channel–drug interaction in cellular networking with PseAAC and molecular fingerprints. J. Theor. Biol. 337, 71–79. Xiaowei, Z., Zhiqiang, M., Minghao, Y., 2012. Using support vector machine and evolutionary profiles to predict antifreeze protein sequences. Int. J. Mol. Sci. 13, 2196–2207. Xu, Y., Ding, J., Wu, L.Y., Chou, K.C., 2013. iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. PLoS One 8, e55844. Youden, Y.W., 1950. Index for rating diagnostic tests. Cancer 3, 32–35. Yu, C.S., Lu, C.H., 2011. Identification of antifreeze proteins and their functional residues by support vector machine and genetic algorithms based on n-peptide compositions. PLoS One 6, e20445.

Please cite this article as: Mondal, S., Pai, P.P., Chou's pseudo amino acid composition improves sequence-based antifreeze protein prediction. J. Theor. Biol. (2014), http://dx.doi.org/10.1016/j.jtbi.2014.04.006i

Chou's pseudo amino acid composition improves sequence-based antifreeze protein prediction.

Antifreeze proteins (AFP) in living organisms play a key role in their tolerance to extremely cold temperatures and have a wide range of biotechnologi...
899KB Sizes 0 Downloads 3 Views