This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TNB.2014.2352454, IEEE Transactions on NanoBioscience


An Improved Protein Structural Prediction Method by Incorporating Both Sequence and Structure Information

Leyi Wei1, Minghong Liao1*, Xing Gao1, and Quan Zou2, Member, IEEE

1 School of Software, Xiamen University, Xiamen, China
2 School of Information Science and Technology, Xiamen University, Xiamen, China
* Corresponding author

Abstract—Protein structural class information is beneficial for secondary and tertiary structure prediction, protein fold prediction, and protein function analysis. Thus, predicting protein structural classes is of vital importance. In recent years, several computational methods have been developed for protein structural class prediction on low-sequence-similarity (25%-40%) datasets. However, the reported prediction accuracies are still not satisfactory. Aiming to further improve the prediction accuracies, we propose three different feature extraction methods and construct a comprehensive feature set that captures both sequence and structure information. By applying a random forest (RF) classifier to this feature set, we develop a novel method for structural class prediction. We test the proposed method on three benchmark datasets (25PDB, 640, and 1189) with low sequence similarity, and obtain overall prediction accuracies of 93.5%, 92.6%, and 93.4%, respectively. Compared with six competing methods, the accuracies we achieve are 3.4%, 6.2%, and 8.7% higher than those achieved by the best-performing methods, showing the superiority of our method. Moreover, because of the limited sizes of the three benchmark datasets, we further test the proposed method on three updated large-scale datasets with different sequence similarities (40%, 30%, and 25%). The proposed method achieves above 90% accuracy on all three datasets, consistent with the accuracies on the three benchmark datasets. Experimental results suggest that our method is an effective and promising tool for structural class prediction. A webserver that implements the proposed method is available at http://121.192.180.204:8080/RF_PSCP/Index.html.

Index Terms—Protein structural classes, Feature extraction, Random forest

I. INTRODUCTION

Protein structural class information is one of the most important features characterizing protein structures, and it plays an important role in several areas, including secondary and tertiary structure prediction, protein fold prediction, and protein function analysis [1]. In 1976, the concept of protein structural classes was originally proposed by Levitt et al. [2], based on a visual inspection of polypeptide chain topologies in a dataset of 31 globular proteins, which were categorized into four structural classes: all-α, all-β, α/β, and α+β. Although the

latest SCOP (Structural Classification of Proteins) database [3, 4] has further categorized proteins into seven structural classes, about 90% of proteins still belong to the four major classes. Given the importance of structural classes, it is of great value to develop structural class prediction methods. Earlier studies focused on experimental determination, which is, however, time-consuming and costly, and thus incompatible with the need for fast prediction. To overcome this limitation, a number of computational methods have been developed over the last 30 years [5-33]. To our knowledge, the majority of these are machine learning-based methods that treat protein structural class prediction as a four-class classification problem. For a machine learning method, feature extraction and classifier selection are two important procedures that greatly influence the performance of a predictor. Computational efforts for structural class prediction therefore mainly focus on these two aspects to develop high-accuracy methods.

Feature extraction is the procedure of representing amino acid sequences as fixed-length feature vectors. Existing feature extraction methods are broadly categorized into two classes: (1) sequence-based and (2) structure-based. The former class mainly focuses on the content and distribution of amino acids in sequences. Commonly used sequence-based features include amino acid composition, pseudo amino acid composition (PseAAC), and functional domain composition. In contrast to features extracted directly from amino acid sequences, some researchers have recently attempted to extract features from the PSI-BLAST profile, which contains evolutionary information [13, 23]. Liu et al.
[23] extract a group of features that incorporate sequence order information and evolutionary information derived from the PSI-BLAST profile, by combining the position-specific scoring matrix (PSSM) with an auto covariance (AC) transformation, while Dehzangi et al. [13] develop a set of local discriminatory features embedded in the PSSM, employing a segmented feature extraction technique based on the distribution of amino acids and the auto covariance transformation. The latter class of methods mainly focuses on the content, spatial arrangement, and distribution of the secondary structure

1536-1241 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


elements. Dai et al. [12] propose several position-based features of predicted secondary structural elements. With the help of these features, they achieve improved accuracies (80% to 86%) on some benchmark datasets. In fact, several previous studies [12, 14, 24, 29, 31] have revealed that structure information provides a promising way to improve protein structural class prediction.

Meanwhile, a number of classification algorithms have been applied to structural class prediction, such as the Support Vector Machine (SVM) [6-9, 12-14, 18, 23-25, 29-31], Neural Networks (NN) [5, 26], Bayesian classification [28], fuzzy clustering [27], logistic regression [17, 19], and ensembles of classifiers [7, 9, 15, 16]. Of these, the SVM is the most popular and also the most powerful classifier, and is therefore widely used in the literature.

Recent efforts to extract informative feature sets and select effective classification algorithms have contributed to a significant improvement in prediction accuracy for structural class prediction. In particular, Dehzangi et al. [13] recently proposed an SVM-based method, namely PSSM-S, which improved the prediction accuracy to 90.1% on the benchmark dataset 25PDB, the highest accuracy reported so far. However, with the exponential growth of the number of proteins being discovered, such an accuracy is still not satisfactory. The main goal of our research is therefore to further improve the prediction accuracy of structural class prediction. The contributions of this study are summarized as follows:

1) We propose three different feature extraction methods to construct a comprehensive feature set comprising sequence-based features, structure-based features, and sequence-structure-based features. In particular, we introduce the concept of sequence-structure features, which capture sequence and local structure information simultaneously.

2)
We are the first to employ the random forest (RF) algorithm, a powerful classification algorithm, for protein structural class prediction. The RF algorithm has an intrinsic advantage in handling the proposed high-dimensional feature set compared with other classifiers.

3) We propose a high-accuracy and reliable method for protein structural class prediction based on the proposed feature set and the RF classifier. Jackknife tests on three benchmark datasets (25PDB, 640, and 1189) show that our method significantly outperforms competing methods for low-similarity protein structural class prediction. Moreover, 10-fold cross validation tests on three updated large-scale datasets with varying sequence similarities (40%, 30%, and 25%) further confirm that our method is a promising tool for predicting low-similarity protein structural classes. A webserver that implements the proposed method is available at [34].

II. RESULTS AND DISCUSSION

A. Comparison with existing methods on three benchmark datasets

To evaluate the effectiveness of the proposed method, six competing methods [12-14, 18, 29, 31] are compared with it on the three benchmark datasets 25PDB, 640, and 1189. The comparison results for all three datasets are listed in Table I; all results are evaluated by jackknife tests. It is worth noting that the prediction accuracies of the six compared methods are taken directly from their corresponding references. As shown in Table I, the overall accuracies of the proposed method on the three benchmark datasets are all above 92.5%. Specifically, the proposed method achieves overall accuracies of 93.5%, 92.6%, and 93.4% for 25PDB, 640, and 1189, respectively. Compared with the six competing methods, the proposed method significantly outperforms the others, with accuracies 3.4%, 6.2%, and 8.7% higher than those of the best-performing methods on the three datasets, respectively. In particular, we observe a significant improvement made by the proposed method for the α+β class: the accuracy improves by 13.3% on the dataset 25PDB and by 14.1% on the dataset 1189. On the dataset 640, our accuracy for this class is slightly lower than that achieved by SCPRED, but our accuracies for the other classes are significantly higher, which contributes to the improved overall accuracy of our method. On the other hand, PSSM-S is the only method (besides the proposed method) that obtains over 90.1% overall accuracy on the 25PDB dataset; on the 1189 dataset, however, PSSM-S achieves a markedly lower accuracy of 80.2%, the lowest among the compared methods. This implies that PSSM-S may not be robust for structural class prediction. In contrast, our proposed method and the remaining methods maintain relatively reliable prediction performance on all three datasets.
Importantly, the overall accuracies we achieve are the highest among the reliable methods, suggesting that the proposed method is a promising and effective tool for structural class prediction. In summary, the experimental results above indicate that our proposed method is not only more effective but also more reliable than existing methods for protein structural class prediction.

B. Accuracy on updated large-scale datasets

To further evaluate the robustness of the proposed method, it is applied to three updated large-scale datasets: Z10172, Z8177, and Z7140. The three datasets are derived from the latest SCOP, with different sequence similarities (40%, 30%, and 25%, respectively). More details of these datasets are given in Section "Data sets". The performance of the proposed method is evaluated with 10-fold cross validation. Table II lists the detailed prediction results on the three datasets. From Table II, we observe that the proposed method achieves high SE, SP, and MCC values on all three datasets. For instance, we achieve an SE of 91.1%, an SP of 96.5%, and an MCC of 0.876 on the set Z10172, which comprises 10,172 protein sequences. The performance on such large-scale datasets is consistent with that on the three benchmark datasets of limited size, further suggesting the robustness of the proposed method for protein structural class prediction. Furthermore, we investigate how the prediction accuracies for the four classes are influenced by the sequence similarity of the datasets. As shown in Fig. 1, as the similarity decreases, the prediction accuracy for the all-α class remains almost unchanged, whereas the prediction accuracies for the other classes increase or decrease noticeably. Overall, the accuracy declines only slightly, by 0.9%, as the sequence similarity decreases from 40% to 25%. We note that the resulting overall accuracy is still above 90%, which is satisfactory for such large-scale datasets.

C. Feature contribution analysis

In this study, we propose three different feature extraction methods to represent protein sequences, extracting 8,400 sequence-based features (denoted "SF"), 9 structure-based features (denoted "STF"), and 540 sequence-structure-based features (denoted "SSTF"). In this section, we investigate the contributions of different feature type combinations to the prediction accuracies. Table III presents six different feature combinations and their corresponding prediction accuracies, evaluated with 10-fold cross validation on the three benchmark datasets. As shown in Table III, we observe two major points: (1) Among the three individual feature sets, the sequence-based feature set (SF) significantly outperforms the other two (SSTF and STF), achieving accuracies of 92.6%, 92.8%, and 92.1% on the datasets 25PDB, 640, and 1189, respectively. The higher accuracies achieved with SF demonstrate that it is more effective than the other two feature sets for protein structural class prediction.
To our knowledge, the accuracies achieved by the proposed SF are even higher than those achieved by most existing feature extraction methods using either evolutionary information from sequences or structure information. (2) When the other feature sets (SSTF and STF) are added one after another, the combined feature set achieves steadily increasing accuracies. In particular, when combined with both of the other feature sets, the sequence-based feature set gains 1.3%, reaching 93.9% accuracy on the set 25PDB; 1.0%, reaching 92.8% on the set 640; and 1.6%, reaching 93.7% on the set 1189. This demonstrates that the sequence-structure feature set and the structure feature set are complementary to the sequence-based feature set, and contribute to further improvement of the accuracies for structural class prediction. We may conclude that all three feature sets make their own positive contributions to the prediction of protein structural classes. Moreover, an additional experiment is conducted to investigate the importance of each feature in the proposed feature set for the prediction accuracy. We list the top 20 "important" features in Table IV. As shown in Table IV, there are 6 features from STF, 2 features from SSTF, and 12 from SF. This observation further confirms that the three

types of features all make positive contributions. Notably, the fact that more than half of the top 20 features come from SF implies that the prediction accuracy may be largely driven by SF. On the other hand, nearly all the SF features in the table contain the amino acid "L", suggesting that "L" is informative for structural class prediction. In summary, we expect that this importance analysis of the proposed feature set will help researchers select "important" features for their specific applications.

D. Distinguishing the α/β and α+β classes

As in previous studies [14, 18, 29], we further compare the ability of the proposed method and existing methods to distinguish proteins between the α/β and α+β classes only. The three benchmark datasets (25PDB, 640, and 1189) are again employed, but revised to contain only the protein sequences belonging to these two classes. Table V lists the prediction accuracies achieved by our proposed method and several existing methods. As the table shows, compared with the other methods, the proposed method achieves a significant improvement in overall accuracy for distinguishing these classes on all three datasets. For instance, on the dataset 25PDB, we obtain an improved overall accuracy of 91.2%, which is 5.4% higher than that achieved by the best-performing method, proposed by Ding et al. [14]. Moreover, considering the accuracies for the individual classes (α/β and α+β), those obtained by the proposed method are also the highest, at 87.9% and 93.9% for the two classes, respectively. These observations demonstrate that the proposed method is more effective than existing methods at distinguishing the α/β and α+β classes. The experimental results also show that our feature set precisely captures discriminatory information for these classes. Thus, we may conclude that the proposed feature set is valuable for the good performance of the proposed method.
E. Selection of the optimal sequence-based feature subset

As described in Section "Feature Extraction Method", an n-gram model is employed to extract our sequence-based features directly from amino acid sequences. We extracted three groups of sequence-based features: 20 1-gram features, 400 2-gram features, and 8,000 3-gram features, corresponding to n = 1, 2, and 3 in the n-gram model, respectively. To find an optimal sequence-based feature subset, we investigate the effect of all seven possible combinations (listed in Table VI) of these three groups of features. To eliminate the influence of the number of decision trees (t) used in the RF classifier, we employ two RF classifiers with t = 260 and t = 460, and train both of them on all seven feature combinations. 10-fold cross validation tests are performed on the three benchmark datasets 25PDB, 640, and 1189. The evaluation results are listed in Table VI. As the table shows, for the combination of 2-gram and 3-gram features, the RF classifier with t = 460 obtains accuracies of 92.6%, 92.8%, and 92.1% on the datasets 25PDB, 640, and 1189, respectively, whereas the RF classifier with t = 260 obtains accuracies of 92.3%, 93.0%, and 91.1%. The accuracies achieved by the two RF classifiers with this combination are the highest on all three datasets, demonstrating its effectiveness. Consequently, the combination of 2-gram and 3-gram features, comprising 8,400 features, is chosen as our sequence-based feature set. We also observe that, for corresponding feature combinations and datasets, the accuracies achieved by the RF classifier with t = 460 are generally higher than those achieved with t = 260. This further confirms that the RF classifier with t = 460 outperforms those with other parameter settings.

F. Optimization of the RF classifier

According to Breiman's work, the performance of an RF classifier is highly dependent on the number of decision trees [35]. To select an appropriate number of decision trees for the RF classifier used in the proposed method, 11 values in the range of 10 to 510 are tested. We evaluate the proposed method with different numbers of trees on the three benchmark datasets 25PDB, 640, and 1189. The evaluation results are measured with 10-fold cross validation tests and illustrated in Fig. 2; further details can be found in Supplement A. As Fig. 2 illustrates, the three curves show a steep upward trend as the number of decision trees increases to 60. As the number of trees increases further, the curves gradually flatten out. Notably, when the number of trees reaches 460, the curves for the three datasets reach their respective peaks. This suggests that the RF classifier performs best when the number of trees is set to 460, yielding the highest accuracies of 93.9%, 92.8%, and 93.7% on the datasets 25PDB, 640, and 1189, respectively (see Supplement A).
Thus, 460 is selected as the optimal number of decision trees for the RF classifier used in the proposed method.

G. Comparison with other widely used classifiers

Table VII lists the prediction results of the RF classifier and four other widely used classifiers on the three benchmark datasets. As Table VII shows, the RF classifiers (with default and optimized parameters, respectively) achieve significantly better performance than the other individual classifiers on all three datasets. For example, the RF classifiers with default and optimized parameters achieve 87.3% and 93.9% accuracy on the dataset 25PDB, which are 6.2% and 13.8% higher, respectively, than that achieved by the best-performing alternative (SVM). We further investigate the influence of ensemble strategies on the prediction accuracies, since ensemble strategies are reported to be more effective than individual classifiers in most cases. Three commonly used ensemble strategies (grading [36], majority vote [37], and stackingC [38]) are chosen to combine the five classifiers (RF, SMO, NB, J48, and SVM). The prediction accuracies of the three ensemble strategies are also presented in Table VII, where we find that the RF classifier performs better than

the ensemble classifiers under all three strategies. This may be because the very high-dimensional feature set limits the accuracies of the individual classifiers other than the RF classifier, and thus also limits the accuracies of the ensemble strategies. It suggests that an individual classifier such as the RF classifier may be more effective than ensemble classifiers when handling high-dimensional features.

H. Conclusion

In this study, we have proposed three different feature extraction methods (sequence-based, structure-based, and sequence-structure-based) to build a comprehensive feature set that thoroughly exploits the discriminatory sequence and structure information embedded in a protein sequence and its secondary structure. By applying the RF classifier to this feature set, we developed a high-accuracy method (RF-PSCP) for low-sequence-similarity protein structural class prediction. Experimental results showed that the proposed feature set helps our method achieve a significant and reliable improvement in prediction accuracy. The overall accuracies achieved by our method are 3.4%, 6.2%, and 8.7% higher than those achieved by the best-performing methods on the three benchmark datasets 25PDB, 640, and 1189, respectively. Our proposed method also outperforms the competing methods on each of the four individual classes. In particular, our method achieves improved accuracies for distinguishing only the α/β and α+β classes: the accuracies for these two classes are 4.4%, 10.9%, and 7.5% higher than those given by state-of-the-art methods on the datasets 25PDB, 640, and 1189, respectively. Through a feature contribution analysis, we found that the three types of feature sets are complementary to each other and make positive contributions to the improved accuracies. In summary, we believe that the proposed method is a promising tool for accurate protein structural class prediction.
In future work, we will attempt to incorporate evolutionary information into the proposed feature set, and investigate whether the prediction accuracy of protein structural classes can be further improved.
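As a rough sketch of the classifier setup discussed above, the following snippet trains a scikit-learn random forest with 460 trees (the optimum reported in Section II.F) and evaluates it with 10-fold cross validation. The synthetic features and reduced dimensionality are illustrative assumptions, not the paper's actual 8,949-dimensional feature set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real feature matrix (SF + STF + SSTF features)
# and the four structural class labels; dimensionality is reduced here
# purely for illustration.
rng = np.random.default_rng(0)
X = rng.random((200, 100))          # 200 hypothetical proteins, 100 features
y = rng.integers(0, 4, size=200)    # four structural classes (0..3)

# RF with t = 460 decision trees, as selected in Section II.F
clf = RandomForestClassifier(n_estimators=460, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross validation
print(round(scores.mean(), 3))
```

On the real feature set, the jackknife (leave-one-out) protocol of Section II.A would replace `cv=10` with `cv=LeaveOneOut()`.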

III. MATERIALS AND METHODS

A. Data sets

Three benchmark datasets. To enable fair comparisons with existing methods, we use three popular benchmark datasets, 25PDB, 1189, and 640, which have been widely used in previous studies, to evaluate our proposed method. The first dataset, 25PDB, was originally constructed by Kurgan et al. [19]. It contains 1,673 protein sequences with lower than 25% sequence similarity. Of these, 443 sequences belong to the all-α class, 443 to the all-β class, 346 to the α/β class, and 441 to the α+β class. The second dataset, 1189, was first created by Wang et al. [28]. In its current version, it contains 1,092 protein sequences: 223 from all-α, 294 from all-β, 334 from α/β, and 241 from α+β. Similar to 25PDB, the sequences have low sequence similarity as well, of about 40%. The third dataset, 640, was first presented by Chen et al. [10]. This set is reported to contain 640 protein sequences belonging to the four classes, with 25% sequence similarity. However, we found that one sequence in the set, with the identifier "1yua_1", is no longer classified into the four classes. The sequence "1yua_1" is therefore removed from further consideration, leaving a total of 639 sequences in the set 640. More details of these three datasets are summarized in Table VIII.

Three updated large-scale datasets. To evaluate the ability of the proposed method on newly discovered low-similarity protein sequences, we consider the following criteria when constructing our new datasets: (1) proteins in the datasets should be derived from the latest version of the Astral SCOP database (release 1.75B); and (2) proteins in the datasets should have low mutual similarity, since high sequence similarity within a dataset tends to positively bias the prediction results. Accordingly, we construct three new datasets, referred to as Z10172, Z8177, and Z7140, with sequence similarities of 40%, 30%, and 25%, respectively. For instance, the dataset Z10172 is constructed as follows. First, we download the Astral SCOP 1.75 database at 40% sequence similarity, which contains a total of 11,211 protein sequences. Second, sequences shorter than 30 residues are removed, since PSI-BLAST cannot be run on such short sequences; 11,136 sequences remain after this step. Of the remaining sequences, those from the four major structural classes are collected to yield the dataset, denoted Z10172, which finally contains a total of 10,172 protein sequences. The other two datasets (Z8177 and Z7140) are constructed by the same procedure. More details of these three sets are summarized in Table VIII.
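The length-filtering step in the dataset construction above can be sketched as follows. The function name and the toy sequences are hypothetical; a real pipeline would parse the downloaded Astral SCOP FASTA file instead of an in-memory dictionary.

```python
# Hypothetical filtering step used when building Z10172: drop sequences
# shorter than 30 residues, since PSI-BLAST cannot run on very short inputs.
def filter_short_sequences(seqs, min_len=30):
    """Keep only sequences with at least `min_len` residues."""
    return {sid: s for sid, s in seqs.items() if len(s) >= min_len}

astral = {"d1abca_": "MKV" * 20, "short1": "MKVLT"}  # toy stand-ins
kept = filter_short_sequences(astral)
print(sorted(kept))  # only the 60-residue entry survives
```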
B. Feature Extraction Method

Discriminative features are beneficial for achieving improved performance with computational methods. It is therefore important to extract informative features reflecting the differences between protein sequences of different structural classes. In this study, we propose three feature extraction methods, which consider the following three aspects, respectively: 1) the primary sequence, 2) the secondary structure, and 3) the combination of sequence and structure. For convenience, we represent a query protein sequence with L amino acid residues as

P = R1 R2 R3 ... RL,   (1)

where R1 represents the residue at position 1, R2 the residue at position 2, and so forth. Each residue in the sequence belongs to the set of 20 different amino acids, ordered alphabetically as {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. This representation of the sequence will be used in the following subsections.

1) Sequence-based features (SF)

n-gram frequency features. N-gram frequency features are computed directly from the amino acid composition of a query protein sequence. In this study, an n-gram model [39] is employed to convert protein sequences of different lengths into fixed-length feature vectors. In the n-gram model, an n-gram is defined as a subsequence of n spatially consecutive residues in P, denoted as X1X2...Xn, which results in 20^n possible combinations. For each possible combination, its frequency of appearance in the sequence P is computed, forming a 20^n-dimensional (20^n-D) n-gram frequency feature vector. For instance, if n = 2, a 2-gram is denoted as X1X2, where X1X2 is a subsequence of P. Since both X1 and X2 range over the 20 amino acids, there are 400 possible combinations for a 2-gram, and we thus initialize a 400-D feature vector. The scheme by which the n-gram model is applied to extract n-gram frequency features is as follows.

Step 1. Initialize an n-gram frequency feature vector for a query sequence P. We initialize a feature vector V for P:

V = (v1, v2, ..., v_{20^n}),   (2)

where the dimension of V is 20^n, each value vi is initialized to 0, and the index i enumerates all possible combinations of an n-gram.

Step 2. Extract n-grams from P. To extract n-grams from P, we slide a window of size n over P from position 1 to position L - n + 1. As the window slides over each position, the occurrence of each encountered n-gram is recorded at the corresponding position of the vector. In this way, P is encoded into a count feature vector, denoted as C:

C = (c1, c2, ..., c_{20^n}),   (3)

where ci represents the number of occurrences in P of the i-th combination, 1 <= i <= 20^n.

Step 3. Normalize the count feature vector C into the frequency feature vector, denoted as F. The normalization of C is done according to the following formula:

fi = ci / (L - n + 1),   (4)

where the divisor L - n + 1 is the number of sliding steps of the n-size window over P. F is the output of the n-gram model, comprising 20^n n-gram frequency features. In this study, only n = 1, 2, and 3 are considered, because for n > 3 the n-gram model generates extremely high-dimensional features (20^4 = 160,000 dimensions for n = 4), which would likely lead to decreased prediction accuracy. We thus extract 20-D features for 1-grams, 400-D features for 2-grams, and 8,000-D features for 3-grams, when n is set to 1, 2, and 3, respectively. In summary, the combination of the 2-gram and 3-gram features is chosen to yield our sequence-based feature set, referred to as SF. The selection of this optimal sequence-based feature subset is discussed in Section "Selection of the optimal sequence-based feature subset".
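The three steps above can be sketched in Python as follows. The function name is hypothetical; the real SF feature set concatenates the resulting 2-gram and 3-gram vectors.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids, alphabetical

def ngram_frequencies(seq, n):
    """Steps 1-3 of the n-gram model: initialize a 20**n-D vector,
    count each n-gram with a sliding n-size window, then normalize
    by the number of window positions, L - n + 1."""
    combos = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    counts = dict.fromkeys(combos, 0)            # Step 1: initialize V
    steps = len(seq) - n + 1
    for i in range(steps):                       # Step 2: count occurrences
        counts[seq[i:i + n]] += 1
    return [counts[c] / steps for c in combos]   # Step 3: normalize

f2 = ngram_frequencies("ACACA", 2)   # 400-D 2-gram vector
print(len(f2), round(sum(f2), 3))    # 400 features summing to 1.0
```

For "ACACA" the four 2-gram windows are AC, CA, AC, CA, so the entries for "AC" and "CA" are each 0.5 and all other entries are 0.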

2) Secondary structure-based features (STF)
Recently, several studies have shown that secondary structure-based features are helpful in structural class prediction. Here, we also propose several structure-based features. To predict the secondary structure of a protein sequence, we employ a popular secondary structure prediction tool, PSIPRED [40], which predicts protein secondary structure on the basis of PSI-BLAST [41]. Owing to its effectiveness, it has been extensively applied for structure prediction in previous studies. In a structure prediction, every amino acid residue in a query protein sequence is assigned one of the following three states: H (helix), E (strand), or C (coil). For simplicity, we denote the secondary structure of P as S, represented as

S = s_1 s_2 s_3 ... s_L, (5)

where s_i ∈ {H, E, C}, 1 <= i <= L. Structure-based features are extracted from the structure sequence S. The extracted features are described as follows:

a. Three features measure the occurrence frequencies of the three states (H, E, and C) in S, respectively. They are formulated as

f_H = N_H / L,  f_E = N_E / L,  f_C = N_C / L, (6)

where L is the length of S, and N_H, N_E, and N_C represent the numbers of occurrences of the states H, E, and C in S, respectively.

b. Three location-related features, first proposed by Kurgan et al. [18], measure the spatial arrangement of the corresponding three states in S. Denoting them u_H, u_E, and u_C, the feature for H is formulated as

u_H = (1 / (N_H * L)) * sum_{j=1}^{N_H} p_j^H, (7)

with u_E and u_C defined analogously, where p_j^H, p_j^E, and p_j^C are the position indices of the j-th H, E, and C in S, respectively, and N_H, N_E, and N_C are the total numbers of H, E, and C in S.

c. Two features, denoted MaxSeg_E and MaxSeg_H, measure the normalized maximum lengths of segments consisting of spatially consecutive E and H states in S, respectively. They are formulated as

MaxSeg_E = Lmax_E / L,  MaxSeg_H = Lmax_H / L, (8)

where Lmax_E and Lmax_H are the lengths of the longest runs of consecutive E and H in S.

d. Zhang et al. [31] introduced the definition of a segment sequence (denoted SegS), consisting of helix segments and strand segments (denoted α and β, respectively). By ignoring coil segments in S, they transform a structure sequence S into a segment sequence SegS. For instance, given a structure sequence S: CCCEEECCHHHHEECCCEEECCHHHH, the transformed segment representation of S is βαββα. Based on the observation that α and β segments are likely to be separated in α+β proteins but interspersed in α/β proteins [31], we design a new feature aiming to effectively differentiate these two classes (α/β and α+β). This feature measures the occurrence frequency of the interspersed segment pattern in the segment sequence SegS. It is formulated as

f_seg = N_pattern / N_SegS, (9)

where N_pattern represents the number of occurrences of the pattern in SegS and N_SegS is the number of segments in SegS.
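The nine structure-based features above can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the location feature uses one plausible reading of the Kurgan et al. arrangement measure (normalized mean position index), and the segment feature counts α/β alternations as a stand-in for the interspersed-pattern frequency.

```python
import re

def structure_features(ss):
    """9 structure-based features from an H/E/C secondary-structure string."""
    L = len(ss)
    feats = []
    # a. occurrence frequencies of H, E, and C
    for state in "HEC":
        feats.append(ss.count(state) / L)
    # b. location features: normalized mean position index per state
    #    (an assumed reading of the Kurgan et al. arrangement measure)
    for state in "HEC":
        pos = [i + 1 for i, s in enumerate(ss) if s == state]
        feats.append(sum(pos) / (len(pos) * L) if pos else 0.0)
    # c. normalized maximum lengths of consecutive E and H segments
    for state in "EH":
        runs = re.findall(state + "+", ss)
        feats.append(max((len(r) for r in runs), default=0) / L)
    # d. drop coils, collapse runs into alpha/beta segments, and use the
    #    alternation rate as a proxy for the interspersed-pattern frequency
    segments = re.findall("H+|E+", ss)          # maximal runs of H or of E
    seg = ["a" if s[0] == "H" else "b" for s in segments]
    alternations = sum(1 for x, y in zip(seg, seg[1:]) if x != y)
    feats.append(alternations / len(seg) if seg else 0.0)
    return feats
```

On the example structure CCCEEECCHHHHEECCCEEECCHHHH, the segment sequence is βαββα, so the alternation-based feature is 3/5.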

To this end, 9 secondary structure-based features are extracted, yielding our structure-based feature set, referred to as STF.

3) Sequence-structure-based features (SSTF)
Our investigations show that the distributions of local consecutive sub-structures are significantly distinct across the four classes. Based on these observations, we introduce local sub-structures combined with primary sequence information to characterize the proteins in the four classes. The proposed features focus on the information of every 3 consecutive states in the structure sequence and the corresponding residue in the primary sequence. Given a predicted secondary structure, the structure consists of only three states: "H", "E", and "C". For every 3 spatially consecutive states, there are 3^3 = 27 possible combinations (e.g., "HHH", "HEH", and "HEC"). We consider the residue in the protein sequence corresponding to the middle of the 3 states in its structure, thus forming 20 x 27 = 540 possible sequence-structure combinations, which are represented as "AHHH", "LHHH", etc. For simplicity, we denote these combinations as tri-sequence-structure (tri-ss) elements. The scheme for representing a protein sequence and its corresponding secondary structure as a feature vector is described in the following steps.




Step 1. Convert a query protein sequence P and its secondary structure sequence S into sequence-structure elements. These elements are represented as

e_i = R_i s_{i-1} s_i s_{i+1}, (10)

where R_i and s_i represent position i of the protein sequence and of its secondary structure sequence predicted by PSIPRED, respectively, and 2 <= i <= L - 1. Note that we ignore the state "C" on the two flanks of an element, since the other two states are more meaningful in secondary structures.

Step 2. Initialize a 540-D feature vector, represented as

V = (v_1, v_2, ..., v_540), (11)

where each v_i, initially set to 0, represents the value in the i-th dimension and corresponds to one of the 540 tri-ss elements.

Step 3. Compute the occurrence frequency of each tri-ss element in P. We count the number of occurrences of each element in P, denoted N_e, and calculate its occurrence frequency by dividing N_e by N, where N is the number of elements extracted from P. Finally, we obtain 540-D sequence-structure features, referred to as SSTF, containing local structure information combined with sequence information. For the reader's convenience, an example of generating SSTF is illustrated in Fig. 3.

C. Classification algorithm selection
Breiman [35] proposed the RF algorithm, which generates an ensemble of decision tree classifiers. A powerful ensemble strategy, referred to as bagging [42], is employed. In bagging, each base classifier is trained on a bootstrap sample of the training data, i.e., a sample drawn with replacement from the training data. After training, predictions are made by majority voting over the base classifiers. The RF algorithm develops bagging further by employing decision trees as base classifiers. Specifically, unlike traditional bagging, which uses all features to train each classifier, RF employs a random feature selection technique that selects a random subset of features to split on at each node when growing a tree. The number of features considered at each node is determined by examining the generalization error, classifier strength, and tree dependence. In this way, each tree (base classifier) is grown on a different feature subset, enhancing the diversity of the ensemble, which in turn tends to improve prediction accuracy. On the other hand, just as in traditional bagging, each tree is also grown on a bootstrap sample of the training data. Accordingly, the RF algorithm is considered a powerful tool for handling high-dimensional feature sets, large-scale input data, and imbalanced datasets. Here, we employ the RF

algorithm implemented in the data mining tool WEKA (Waikato Environment for Knowledge Analysis) [43], a package of machine learning algorithms. In this study, we set the number of decision trees in the RF algorithm to 460, because this setting helps the RF algorithm achieve its best performance. The details of how the optimal number of trees is determined are given in Section "Optimization of the RF classifier".

D. The proposed method
Fig. 4 illustrates the overall framework of our proposed method, referred to as RF-PSCP. The prediction process for a query protein sequence in RF-PSCP can be divided into two steps. In the first step, the query sequence is submitted to the feature extraction procedure. As a result, a comprehensive feature set, consisting of a total of 8,949 features (8,400 sequence-based, 540 sequence-structure-based, and 9 structure-based), is obtained. More details on feature extraction can be found in Section "Feature extraction method" above. In the second step, the resulting feature set is fed into the RF classifier to predict which protein structural class the query sequence belongs to. A webserver that implements the proposed method (RF-PSCP) is available at [34]. Fig. 5 illustrates an example of the output of RF_PSCP. As shown in Fig. 5, RF_PSCP provides not only the predicted structural class of the query sequence, but also its secondary structure predicted by PSIPRED and the prediction confidence. Moreover, unlike most existing servers [10, 29], which provide a prediction for only one query sequence at a time, our server (RF_PSCP) is able to predict the structural classes of an unlimited number of sequences simultaneously.

E. Measurement
The jackknife test is widely used in previous methods [12-14, 18, 29, 31]. Thus, to make fair comparisons with these competing methods, the jackknife test is also employed here.
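For illustration, the tri-ss (SSTF) extraction described in Steps 1-3 above can be sketched as follows. The element layout (a residue paired with the state triple centered on it) follows one reading of Eq. (10); the flank-coil filtering is omitted for brevity, and all names are ours.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STATES = "HEC"

# 540 tri-ss elements = 20 residues x 27 state triples, e.g. "AHHH", "LHEC"
TRI_SS = ["".join((a,) + t) for a in AMINO_ACIDS
          for t in product(STATES, repeat=3)]
INDEX = {e: i for i, e in enumerate(TRI_SS)}

def sstf_features(seq, ss):
    """540-D tri-ss frequency vector for a sequence and its H/E/C structure.

    Element i pairs residue seq[i] with the state triple ss[i-1:i+2];
    counts are divided by the number of elements extracted.
    """
    assert len(seq) == len(ss)
    counts = [0] * 540
    n = 0
    for i in range(1, len(seq) - 1):
        e = seq[i] + ss[i - 1:i + 2]
        if e in INDEX:                 # skip non-standard residues
            counts[INDEX[e]] += 1
            n += 1
    return [c / n for c in counts] if n else counts
```

Feeding a sequence and its PSIPRED-predicted structure through this function yields the SSTF part of the 8,949-D feature vector.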
In addition, 10-fold cross-validation is employed to evaluate the performance of the proposed method on the large-scale test sets. For a comprehensive evaluation, individual metrics for each of the four structural classes are also considered: Sensitivity (Sens), Specificity (Spec), and the Matthews correlation coefficient (MCC). In addition, the overall accuracy (OA) is employed to evaluate the performance of the proposed method on the entire dataset. These metrics are formulated as

Sens_i = TP_i / (TP_i + FN_i), (12)

Spec_i = TN_i / (TN_i + FP_i), (13)

MCC_i = (TP_i * TN_i - FP_i * FN_i) / sqrt((TP_i + FP_i)(TP_i + FN_i)(TN_i + FP_i)(TN_i + FN_i)), (14)

OA = (sum_i TP_i) / (sum_i N_i), (15)

where TP_i, TN_i, FP_i, FN_i, and N_i represent the numbers of true positives, true negatives, false positives, false negatives, and protein sequences in the corresponding structural class i, respectively.
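The per-class metrics and the overall accuracy can be computed directly from their definitions; a minimal Python sketch (function names are ours), treating one structural class as positive and the rest as negative:

```python
from math import sqrt

def per_class_metrics(y_true, y_pred, cls):
    """Sensitivity, specificity, and MCC for one structural class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    tn = sum(t != cls and p != cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, mcc

def overall_accuracy(y_true, y_pred):
    """Fraction of correctly classified sequences over the entire dataset."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```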

ACKNOWLEDGMENT The work is supported by the Natural Science Foundation of China (61370010), the Scientific Research Foundation of China Mobile (MCM20122081, MCM20130221), the Technology Project of Xiamen City (3502Z2013301), and the National Technology Support Project (2013BAH44F01).

REFERENCES
[1] K. C. Chou, "Prediction of protein structural classes and subcellular locations," Curr Protein Pept Sci, vol. 1, pp. 171-208, Sep 2000.
[2] M. Levitt and C. Chothia, "Structural patterns in globular proteins," Nature, vol. 261, pp. 552-8, Jun 17 1976.
[3] J. M. Chandonia, et al., "The ASTRAL Compendium in 2004," Nucleic Acids Res, vol. 32, pp. D189-92, Jan 1 2004.
[4] A. G. Murzin, et al., "SCOP: a structural classification of proteins database for the investigation of sequences and structures," J Mol Biol, vol. 247, pp. 536-40, Apr 7 1995.
[5] Y. Cai and G. Zhou, "Prediction of protein structural classes by neural network," Biochimie, vol. 82, pp. 783-5, Aug 2000.
[6] Y. D. Cai, et al., "Support vector machines for predicting protein structural class," BMC Bioinformatics, vol. 2, p. 3, 2001.
[7] C. Chen, et al., "Dual-layer wavelet SVM for predicting protein structural class via the general form of Chou's pseudo amino acid composition," Protein Pept Lett, vol. 19, pp. 422-9, Apr 2012.
[8] C. Chen, et al., "Using pseudo-amino acid composition and support vector machine to predict protein structural class," Journal of Theoretical Biology, vol. 243, pp. 444-448, 2006.
[9] C. Chen, et al., "Predicting protein structural class with pseudo-amino acid composition and support vector machine fusion network," Anal Biochem, vol. 357, pp. 116-21, Oct 1 2006.
[10] K. Chen, L. A. Kurgan, and J. Ruan, "Prediction of protein structural class using novel evolutionary collocation-based sequence representation," J Comput Chem, vol. 29, pp. 1596-1604, 2008.
[11] X.-Y. Cheng, et al., "A global characterization and identification of multifunctional enzymes," PLoS One, vol. 7, p. e38979, 2012.
[12] Q. Dai, et al., "Comparison study on statistical features of predicted secondary structures for protein structural class prediction: From content to position," BMC Bioinformatics, vol. 14, p. 152, 2013.
[13] A. Dehzangi, et al., "Exploring potential discriminatory information embedded in PSSM to enhance protein structural class prediction accuracy," in Pattern Recognition in Bioinformatics, vol. 7986, A. Ngom, et al., Eds. Springer Berlin Heidelberg, 2013, pp. 208-219.
[14] S. Ding, et al., "A novel protein structural classes prediction method based on predicted secondary structure," Biochimie, vol. 94, pp. 1166-71, May 2012.
[15] K. Y. Feng, et al., "Boosting classifier for predicting protein domain structural class," Biochem Biophys Res Commun, vol. 334, pp. 213-7, Aug 19 2005.
[16] K. D. Kedarisetti, et al., "Classifier ensembles for protein structural class prediction with varying homology," Biochem Biophys Res Commun, vol. 348, pp. 981-8, Sep 29 2006.
[17] L. Kurgan and K. Chen, "Prediction of protein structural class for the twilight zone sequences," Biochem Biophys Res Commun, vol. 357, pp. 453-60, Jun 1 2007.
[18] L. Kurgan, et al., "SCPRED: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences," BMC Bioinformatics, vol. 9, p. 226, 2008.
[19] L. A. Kurgan and L. Homaeian, "Prediction of structural classes for protein sequences and domains—Impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy," Pattern Recognition, vol. 39, pp. 2323-2343, 2006.
[20] C. Lin, et al., "Hierarchical classification of protein folds using a novel ensemble classifier," PLoS One, vol. 8, 2013.
[21] B. Liu, et al., "Using amino acid physicochemical distance transformation for fast protein remote homology detection," PLoS One, vol. 7, p. e46633, 2012.
[22] B. Liu, et al., "A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis," BMC Bioinformatics, vol. 9, p. 510, 2008.
[23] T. Liu, et al., "Accurate prediction of protein structural class using auto covariance transformation of PSI-BLAST profiles," Amino Acids, vol. 42, pp. 2243-2249, 2012.
[24] T. Liu and C. Jia, "A high-accuracy protein structural class prediction algorithm using predicted secondary structural information," Journal of Theoretical Biology, vol. 267, pp. 272-275, 2010.
[25] J. D. Qiu, et al., "Using support vector machines for prediction of protein structural classes based on discrete wavelet transform," Journal of Computational Chemistry, vol. 30, pp. 1344-1350, 2009.
[26] S. S. Sahu and G. Panda, "A novel feature representation method based on Chou's pseudo amino acid composition for protein structural class prediction," Comput Biol Chem, vol. 34, pp. 320-7, Dec 2010.
[27] H.-B. Shen, et al., "Using supervised fuzzy clustering to predict protein structural classes," Biochem Biophys Res Commun, vol. 334, pp. 577-581, 2005.
[28] Z.-X. Wang and Z. Yuan, "How good is prediction of protein structural class by the component-coupled method?," Proteins, vol. 38, pp. 165-175, 2000.
[29] J. Y. Yang, et al., "Prediction of protein structural classes for low-homology sequences based on predicted secondary structure," BMC Bioinformatics, vol. 11 Suppl 1, p. S9, 2010.
[30] J. Y. Yang, et al., "Prediction of protein structural classes by recurrence quantification analysis based on chaos game representation," J Theor Biol, vol. 257, pp. 618-26, Apr 21 2009.
[31] S. Zhang, et al., "High-accuracy prediction of protein structural class for low-similarity sequences based on predicted secondary structure," Biochimie, vol. 93, pp. 710-4, Apr 2011.
[32] Q. Zou, et al., "Identifying multi-functional enzyme by hierarchical multi-label classifier," Journal of Computational and Theoretical Nanoscience, vol. 10, pp. 1038-1043, 2013.
[33] Q. Zou, et al., "BinMemPredict: a web server and software for predicting membrane protein types," Current Proteomics, vol. 10, pp. 2-9, 2013.
[34] RF_PSCP: http://121.192.180.204:8080/RF_PSCP/Index.html
[35] L. Breiman, "Random forests," Machine Learning, vol. 45, pp. 5-32, 2001.
[36] A. K. Seewald and J. Fürnkranz, "An evaluation of grading classifiers," in Advances in Intelligent Data Analysis. Springer, 2001, pp. 115-124.
[37] J. Kittler, et al., "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, pp. 226-239, 1998.
[38] A. K. Seewald, "How to make stacking better and faster while also taking care of an unknown weakness," in Proceedings of the Nineteenth International Conference on Machine Learning, 2002, pp. 554-561.
[39] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[40] D. T. Jones, "Protein secondary structure prediction based on position-specific scoring matrices," J Mol Biol, vol. 292, pp. 195-202, Sep 17 1999.
[41] S. F. Altschul, et al., "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs," Nucleic Acids Res, vol. 25, pp. 3389-402, Sep 1 1997.
[42] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123-140, 1996.
[43] M. Hall, et al., "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, pp. 10-18, 2009.




Leyi Wei received the BSc degree in Computing Mathematics in 2010 and the MSc degree in Computer Science in 2013, both from Xiamen University, China. He is currently working toward the Ph.D. degree, also at Xiamen University. His research interests include non-coding RNA, microRNA mining, and protein structural class prediction.

Minghong Liao received the BSc degree in Computer Science in 1986 from Huaqiao University, China, and the MSc and Ph.D. degrees in Computer Science and Engineering from Harbin Institute of Technology, China, in 1988 and 1993, respectively. He is currently the Dean of the Software School of Xiamen University, China. His research interests include software engineering, database techniques, and embedded systems.

Xing Gao received the BSc degree in Computer Science in 2004 from China University of Mining and Technology, and the Ph.D. degree in 2009 from the School of Computer Science and Technology, Harbin Institute of Technology, China. He is currently an associate professor in the Software School of Xiamen University, China. His research interests include data mining and machine learning.

Quan Zou received Bachelor degrees in Computer Science and Economics simultaneously in 2004, and the Ph.D. degree in 2009 from the School of Computer Science and Technology, Harbin Institute of Technology, China. He is currently an assistant professor in the Department of Computer Science at Xiamen University.




TABLES

TABLE I
COMPREHENSIVE PERFORMANCE COMPARISON OF THE PROPOSED METHOD AND OTHER COMPETING METHODS ON THREE BENCHMARK DATASETS.

Dataset  Method                     Prediction Accuracy (%)
                                    all-α   all-β   α/β     α+β     Overall
25PDB    SCPRED (2008) [18]         92.6    80.1    74.0    71.0    79.7
         RKS-PPSC (2010) [29]       92.8    83.3    85.8    70.1    82.9
         Zhang et al. (2011) [31]   95.0    85.6    81.5    73.2    83.9
         Ding et al. (2012) [14]    95.03   81.26   83.24   77.55   84.34
         PSCP-PSSE (2013) [12]      98.65   85.78   79.19   79.82   86.25
         PSSM-S (2013) [13]         93.8    92.8    92.6    81.7    90.1
         This paper                 80.6    100     95.7    95      93.5
640      SCPRED (2008) [18]         90.6    81.8    66.7    85.9    80.8
         RKS-PPSC (2010) [29]       89.1    85.1    88.1    71.4    83.1
         Ding et al. (2012) [14]    94.93   76.62   89.27   74.27   83.44
         PSCP-PSSE (2013) [12]      97.1    81.17   89.27   79.53   86.41
         This paper                 83.5    97.8    92.9    97.2    92.6
1189     SCPRED (2008) [18]         89.1    86.7    89.6    53.8    80.6
         RKS-PPSC (2010) [29]       89.2    86.7    82.6    65.6    81.3
         Zhang et al. (2011) [31]   92.4    87.4    82.0    71.0    83.2
         Ding et al. (2012) [14]    93.72   84.01   83.53   66.39   81.96
         PSCP-PSSE (2013) [12]      97.76   86.39   84.73   70.54   84.71
         PSSM-S (2013) [13]         93.3    85.1    77.6    65.6    80.2
         This paper                 99.6    96.9    92.2    85.1    93.4

The accuracies are evaluated by jackknife tests. Note that the prediction accuracies, besides our proposed method, are taken directly from their corresponding references.

TABLE II
PERFORMANCE OF THE PROPOSED METHOD TESTED ON THREE UPDATED LARGE-SCALE DATASETS WITH 10-FOLD CROSS-VALIDATION.

Dataset                               Class     SE (%)   SP (%)   MCC
Z10172 (sequence similarity of 40%)   all-α     99.5     99.9     0.994
                                      all-β     96.8     99.7     0.973
                                      α/β       84.8     94.5     0.797
                                      α+β       86.7     93.5     0.793
                                      Overall   91.1     96.5     0.876
Z8177 (sequence similarity of 30%)    all-α     99.4     99.9     0.993
                                      all-β     96.3     99.7     0.969
                                      α/β       79.5     95.5     0.772
                                      α+β       89.9     92.0     0.795
                                      Overall   90.6     96.4     0.871
Z7140 (sequence similarity of 25%)    all-α     99.6     99.9     0.995
                                      all-β     93.8     99.4     0.946
                                      α/β       79.0     95.4     0.765
                                      α+β       90.4     91.9     0.798
                                      Overall   90.2     96.3     0.867

TABLE III
PREDICTION ACCURACIES OBTAINED BY DIFFERENT COMBINATIONS OF FEATURE SUBSETS.

Features          Prediction accuracy (%)
                  25PDB   640    1189   Weighted Average
SF                92.6    92.8   92.1   92.5
STF               80.6    81.5   81.9   81.2
SSTF              78.5    80.4   76.8   78.3
STF + SSTF        79.7    82.5   78.9   80.0
SF + SSTF         92.9    92.6   92.8   92.8
SF + SSTF + STF   93.9    92.8   93.7   93.6

Note that SF denotes the 8,400 sequence-based features; STF denotes the 9 structure-based features; SSTF denotes the 540 sequence-structure-based features.



TABLE IV
TOP 20 MOST IMPORTANT FEATURES OF THE PROPOSED FEATURE SET.

Rank   Feature   IG*       Rank   Feature   IG*
1      —         0.96961   11     AL        0.59531
2      —         0.95603   12     LA        0.59069
3      —         0.89372   13     LV        0.59021
4      —         0.87348   14     DL        0.58798
5      EL        0.64933   15     VEEE      0.58755
6      LP        0.64173   16     LD        0.58164
7      LHHH      0.62789   17     —         0.5808
8      VL        0.61061   18     —         0.57677
9      LL        0.61059   19     SL        0.57278
10     LE        0.59921   20     AE        0.57111

* IG denotes the information gain of the features.

TABLE V
ACCURACIES OF THE PROPOSED METHOD AND EXISTING METHODS FOR THE PREDICTION OF THE α/β AND α+β CLASSES.

Dataset  Method                   Prediction Accuracy (%)
                                  α/β    α+β    Overall
25PDB    SCPRED (2008) [18]       76.0   83.2   80.1
         RKS-PPSC (2010) [29]     86.4   82.8   84.4
         Ding et al. (2012) [14]  85.0   88.2   86.8
         This paper               87.9   93.9   91.2
640      SCPRED (2008) [18]       89.3   77.2   83.3
         RKS-PPSC (2010) [29]     88.1   83.6   85.9
         Ding et al. (2012) [14]  89.3   83.6   86.5
         This paper               97.7   97.1   97.4
1189     SCPRED (2008) [18]       88.6   63.1   77.9
         RKS-PPSC (2010) [29]     83.8   81.3   82.8
         Ding et al. (2012) [14]  85.9   73.9   80.9
         This paper               92.8   86.7   90.3

Note that the three datasets here contain only the protein sequences belonging to the α/β and α+β classes.

TABLE VI
PREDICTION ACCURACIES OBTAINED BY THE RF CLASSIFIER ON THE THREE BENCHMARK DATASETS WITH SEVEN COMBINATIONS OF SEQUENCE-BASED FEATURES.

Features                   Prediction accuracy (%)
                           t* = 260               t* = 460
                           25PDB   640    1189    25PDB   640    1189
1-gram                     76.9    67.6   73.1    77.2    70.7   73.6
2-gram                     92.0    92.0   90.7    92.2    92.2   90.9
3-gram                     91.0    83.1   87.2    91.3    83.9   86.7
1-gram + 2-gram            90.2    89.5   88.0    90.6    90.3   88.1
2-gram + 3-gram            92.3    93.0   91.1    92.6    92.8   92.1
1-gram + 3-gram            91.3    89.0   89.7    90.9    89.7   89.7
1-gram + 2-gram + 3-gram   91.3    91.9   89.7    92.1    92.3   91.1

* t denotes the number of decision trees.

TABLE VII
PREDICTION ACCURACIES OF THE RF CLASSIFIER AND OTHER WIDELY USED CLASSIFIERS ON THE THREE BENCHMARK DATASETS.

Dataset  Prediction accuracy (%)
         RF(a)  RF(b)  SMO   Naïve Bayes  J48   SVM(c)  Grading  Majority vote  StackingC
25PDB    87.3   93.9   79.1  70.7         74.6  81.1    89.4     88.9           93.8
640      80.4   92.8   71.2  78.6         72.1  82.9    89.2     87.6           91.9
1189     85.9   93.7   75.5  74.7         76    81.4    89.6     88.6           93.3

Remark: Grading, Majority vote, and StackingC are ensemble strategies. RF(a) denotes the RF classifier with the default parameter t = 10; RF(b) denotes the RF classifier with the optimized parameter t = 460; SVM(c) denotes the SVM classifier with optimized parameters (c = 8, g = 2) for the dataset 25PDB, (c = 128, g = 0.125) for the dataset 640, and (c = 8, g = 2) for the dataset 1189, respectively.



FIGURES

Fig. 1. Accuracies achieved by the proposed method on three updated large-scale datasets with different sequence similarity (40%, 30%, and 25%), per class and overall. Note that all-a denotes the all-α class, all-b denotes the all-β class, a/b denotes the α/β class, and a+b denotes the α+β class.

Fig. 2. Selection of the optimal tree number for the RF classifier: prediction accuracy on the 25PDB, 640, and 1189 datasets as the number of decision trees varies from 10 to 510.




Fig. 3. The flowchart of generating sequence-structure based features (SSTF). The secondary structure of a query protein is first predicted by PSIPRED. The protein sequence and its predicted structure are then transformed into sequence-structure elements. The final output (our sequence-structure based feature vector) is computed by calculating the occurrence frequency of each possible element.

Fig. 4. The framework of the proposed method RF-PSCP. The proposed feature set is enclosed in a green box. More details of the extracted features can be seen in Section "Feature Extraction Method".




Fig. 5. An example of the output of the web server RF_PSCP.

