Genomics 103 (2014) 292–297


Genomics journal homepage: www.elsevier.com/locate/ygeno

Novel structure-driven features for accurate prediction of protein structural class

Liang Kong a, Lichao Zhang b,⁎

a College of Mathematics and Information Technology, Hebei Normal University of Science and Technology, Qinhuangdao 066004, PR China
b College of Marine Life Science, Ocean University of China, Yushan Road, Qingdao 266003, PR China

Article info

Article history: Received 9 November 2013; Accepted 7 April 2014; Available online 18 April 2014

Keywords: Protein domains; Secondary protein structure; Protein sequence homology; Support vector machines

Abstract

Prediction of protein structural class plays an important role in inferring the tertiary structure and function of a protein. Extracting a good representation from the protein sequence is fundamental to this prediction task. In this paper, a novel computational method is proposed to predict protein structural class solely from predicted secondary structure information. A total of 27 features, rationally divided into 3 groups, are extracted to characterize the general contents and spatial arrangements of the predicted secondary structural elements. A multi-class nonlinear support vector machine classifier is then used for prediction. Prediction accuracies evaluated by the jackknife cross-validation test are reported on four widely used low-homology benchmark datasets. Compared with state-of-the-art methods for protein structural class prediction, the proposed method achieves the highest overall accuracies on all four datasets. The experimental results confirm that the proposed structure-driven features are very useful for accurate prediction of protein structural class.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

It is commonly believed that the biological function of a protein is essentially associated with its tertiary structure, which is determined by its amino acid sequence via the process of protein folding [1]. Since the structural class of a protein gives an intuitive description of its overall fold, predicting the structural class is an important step in identifying the tertiary structure. For example, knowledge of the protein structural class can significantly reduce the search space of possible conformations of the tertiary structure [2]. In addition, knowledge of the protein structural class plays an important role in protein function analysis, drug design and many other applications [3].

Based on the type, amount and arrangement of its secondary structural elements, a protein can be classified into one of several structural classes. Structural Classification of Proteins (SCOP) [4] is a manually annotated database and has been regarded as the most accurate classification of protein structural class. The current version of the SCOP database includes eleven structural classes, and approximately 90% of the protein domains belong to the four major classes (all-α, all-β, α/β, α + β). With the rapid development of genomics and proteomics, traditional experimental methods, which are complex and time-consuming, cannot keep up with the demand for rapid classification of protein structural class. Therefore, it is essential to develop automated and accurate computational methods to speed up this process.

⁎ Corresponding author. E-mail address: [email protected] (L. Zhang).

http://dx.doi.org/10.1016/j.ygeno.2014.04.002 0888-7543/© 2014 Elsevier Inc. All rights reserved.

Numerous computational methods have been developed for identifying protein structural class during the past three decades. These methods typically extract specific features to represent a protein and then perform classification with different types of machine learning algorithms. The most abundant information available for a protein is its amino acid sequence, and various features have been proposed that turn an amino acid sequence of varying length into a fixed-length feature vector. Such fixed-length vectors are known as sequence-driven features [5]; examples include amino acid composition (AAC) [6], pseudo amino acid composition (PseAA) [7], polypeptide composition [8], functional domain composition [9], and the PSI-BLAST profile [10]. One deficiency of methods based on sequence-driven features is their low accuracy on low-homology datasets, such as the widely used 25PDB and 1189 datasets, whose sequence similarities are lower than 25% and 40%, respectively.

Recognizing this limitation, there have been many attempts to derive features from secondary structure information in order to improve prediction accuracy on low-homology datasets [11–17]. Accordingly, we refer to these features as structure-driven features. The available structure-driven features can be categorized into 3 types: (1) content-related, (2) order-related, and (3) distance-related features. The contents of secondary structural elements and the normalized counts of secondary structural segments are widely used content-related features. The second order composition moment of secondary structural elements can be considered an order-related feature. The maximal and average lengths of secondary structural segments are important distance-related features. Computational prediction methods with structure-driven features have achieved favorable overall accuracies on several low-homology benchmark datasets.

L. Kong, L. Zhang / Genomics 103 (2014) 292–297

However, the predictions for the α/β and α + β classes are still of low quality, especially for the α + β class, when compared with the predictions for the all-α and all-β classes. This remains a deficiency of current protein structural class prediction methods.

In this paper, we focus on the challenging problem of identifying protein structural class solely from predicted secondary structure information. The main contribution is the extraction of comprehensive structure-driven features, especially distance-related features, that effectively reflect the general contents and spatial arrangements of the predicted secondary structural elements of a given protein sequence. A 27-dimensional feature vector is selected with a wrapper feature selection algorithm, and a multi-class nonlinear support vector machine (SVM) classifier is applied to predict protein structural class. The prediction performance is evaluated by a jackknife cross-validation test on four widely used low-homology datasets (25PDB, 1189, 640, FC699). The experimental results show that the proposed feature vector significantly improves the ability of the predictor to separate protein structural classes, and that our method provides better predictions than competing methods.

2. Materials and methods

According to recent research [18], to establish a useful statistical predictor for a protein system, the following procedures should be considered: (1) selection of valid benchmark datasets to train and test the predictor, (2) representation of protein samples to reflect their intrinsic correlation with the target to be predicted, (3) selection of the classification algorithm to perform the prediction, and (4) selection of the cross-validation tests to objectively evaluate the anticipated accuracy of the predictor. Below, we give concrete details about each of these steps.

2.1. Datasets

Sequence homology has a significant impact on the prediction accuracy of protein structural class. Datasets with sequence homology ranging between 20% and 40% tend to yield more reliable and robust results [13]. In this paper, four low-homology datasets (25PDB, 1189, 640, FC699) are used to design and assess the proposed method. All four datasets have been widely used as benchmark datasets in previous studies [11–17]. More details of these datasets are shown in Table 1.

Table 1
The number of proteins belonging to different structural classes and homology level of the datasets.

Dataset  All-α  All-β  α/β  α+β  Total  Homology level
25PDB    443    443    346  441  1673   25%
1189     223    294    334  241  1092   40%
640      138    154    177  171  640    25%
FC699    130    269    377  82   858    40%

2.2. Structure-driven features for protein representation

To be used in our method, every amino acid residue in a protein sequence must first be assigned to one of the following three secondary structural elements: H (helix), E (strand) and C (coil). The resulting string of secondary structural elements is known as the protein secondary structure sequence (SSS), which can be obtained from the protein structure prediction server PSIPRED [19]. In order to sufficiently reflect the general contents and spatial arrangements of the predicted secondary structural elements of a given protein sequence, especially for α-helix and β-strand, two further simplified sequences are derived from SSS. One is a segment sequence (SS), which is composed of helix segments and strand segments [13,15,17]. First, every H, E and C segment in SSS is replaced by the single letter H, E or C, respectively. Then, all of the letters C are removed and SS is obtained. The other sequence is obtained by removing all of the letters C from SSS, and is denoted by E-H [16]. For example, given the secondary structure sequence SSS: EECEEECCEECCCCHHHHCCHHHCCCEEECCHHHCEE, the corresponding SS and E-H are EEEHHEHE and EEEEEEEHHHHHHHEEEHHHEE, respectively.

Based on the above three sequences, several structure-driven features are constructed. The details of these features are given as follows:

1. The contents of secondary structure elements are the most widely used structure-driven features, and have been shown to be significantly helpful in improving the prediction accuracy of protein structural class [11]. They are formulated as:

\[ p(x) = \frac{N(x)}{N_1}, \quad x \in \{H, E, C\} \tag{1} \]

where N(x) is the number of occurrences of the secondary structural element H, E or C in SSS, and N_1 denotes the length of SSS. This type of feature has been extended to SS [13]. Here we further apply it to E-H.

2. Biosequence patterns usually reflect important functional or structural elements in biosequences, such as repeated patterns [20]. In SSS, the 2-symbol patterns HH, EE, HE and EH are considered here. Hence, the contents of these patterns are proposed as follows:

\[ p(xx) = \frac{N(xx)}{N_1}, \quad xx \in \{HH, EE, HE, EH\} \tag{2} \]

where N(xx) is the number of occurrences of the 2-symbol pattern HH, EE, HE or EH. Here we extend these features to SS and E-H.

3. The normalized counts of α-helices and β-strands in SSS [16], another important type of structure-driven feature, are given below:

\[ \mathrm{NCountSeg}(x) = \frac{\mathrm{CountSeg}(x)}{N_1}, \quad x \in \{H, E\} \tag{3} \]

where CountSeg(x) is the number of H or E segments. These features have been reused in E-H [16]. Here we further extend them to SS.

The 25 features described above characterize the contents of the predicted secondary structure from different aspects, and can be categorized as content-related structure-driven features. Below, we extract other types of structure-driven features, namely order-related and distance-related features.

4. The second order composition moments of H, E and C were proposed to reflect the spatial arrangement of secondary structural elements in SSS [14], and are formulated as:

\[ \mathrm{CMV}(x) = \frac{\sum_{j=1}^{N(x)} n_{xj}}{N_1 (N_1 - 1)}, \quad x \in \{H, E, C\} \tag{4} \]

where n_{xj} is the jth order (or position) of the corresponding secondary structural element in SSS. As these features reflect the order-related characteristics of the secondary structure, they can be categorized as order-related structure-driven features. This type of feature has been reused in E-H [16]. Here we further extend it to SS.

5. Classification of protein structures is based on the contents and spatial arrangements of secondary structural elements, which matters especially for the α/β and α + β classes. While proteins in both of these classes contain α-helices and β-strands, the helices and strands are usually interspersed in the α/β class but usually segregated in the α + β class. The distribution of secondary structure segments is therefore helpful for inferring the spatial arrangement of secondary structural elements. As distance information of secondary structural elements can reflect the distributions of α-helices and β-strands, we propose several distance-related structure-driven features. The length of an α-helix or β-strand can be considered a type of distance within the same secondary structural segment. Thus the normalized maximal, minimal and average lengths of secondary structural segments and the normalized variance of α-helix (β-strand) lengths are proposed as follows:

\[ \mathrm{NMaxSeg}(x) = \frac{\mathrm{MaxSeg}(x)}{N_1} \tag{5} \]

\[ \mathrm{NMinSeg}(x) = \frac{\mathrm{MinSeg}(x)}{N_1} \tag{6} \]

\[ \mathrm{NAvgSeg}(x) = \frac{\mathrm{AvgSeg}(x)}{N_1} \tag{7} \]

\[ \mathrm{NVarSeg}(x) = \frac{\mathrm{VarSeg}(x)}{N_1} \tag{8} \]

where x ∈ {H, E}, MaxSeg(x) and MinSeg(x) are the lengths of the longest and shortest α-helices (β-strands), and AvgSeg(x) and VarSeg(x) denote the mean and variance of the lengths of α-helices (β-strands), respectively. Similarly, we consider the distance between segments of the same secondary structure type, and the features are defined as:

\[ \mathrm{NMaxD}(x) = \frac{\mathrm{MaxD}(x)}{N_1} \tag{9} \]

\[ \mathrm{NMinD}(x) = \frac{\mathrm{MinD}(x)}{N_1} \tag{10} \]

\[ \mathrm{NAvgD}(x) = \frac{\mathrm{AvgD}(x)}{N_1} \tag{11} \]

\[ \mathrm{NVarD}(x) = \frac{\mathrm{VarD}(x)}{N_1} \tag{12} \]

where x ∈ {H, E}, MaxD(x) and MinD(x) are the maximal and minimal distances between adjacent α-helices (β-strands), and AvgD(x) and VarD(x) denote the mean and variance of the distances between adjacent α-helices (β-strands), respectively. In addition, the distance between different secondary structural segments is considered. The normalized maximum, minimum, average and variance of the distances between adjacent α-helices and β-strands are computed by the following formulas:

\[ \mathrm{NMaxD}(xx) = \frac{\mathrm{MaxD}(xx)}{N_1} \tag{13} \]

\[ \mathrm{NMinD}(xx) = \frac{\mathrm{MinD}(xx)}{N_1} \tag{14} \]

\[ \mathrm{NAvgD}(xx) = \frac{\mathrm{AvgD}(xx)}{N_1} \tag{15} \]

\[ \mathrm{NVarD}(xx) = \frac{\mathrm{VarD}(xx)}{N_1} \tag{16} \]

where xx ∈ {HE, EH}; HE denotes the segment from an α-helix to the adjacent β-strand, and EH denotes the segment from a β-strand to the adjacent α-helix. As there are only letters H and E in SS and E-H, the corresponding features NMaxD(xx), NMinD(xx), NAvgD(xx) and NVarD(xx) in SS and E-H are always 0. Hence, we only extend the other 8 distance-related structure-driven features (Eqs. (5)–(12)) to SS and E-H.

A total of 88 structure-driven features are proposed here. Among these features, 25 are content-related, 7 are order-related, and 56 are distance-related. In addition, 56 out of the 88 features are first proposed in this paper. The details are summarized in Table 2.

Table 2
Summary of the original features and the feature selection results.

Features  Novel                       Sequence  Reused                      Sequence
CR-all    p(x), x ∈ {H,E}             E-H       p(x), x ∈ {H,E,C}           SSS
          p(xx), xx ∈ {HH,EE,HE,EH}   SSS       p(x), x ∈ {H,E}             SS
          p(xx), xx ∈ {HE,EH}         SS        p(xx), xx ∈ {HH,EE}         SS
          p(xx), xx ∈ {HH,EE}         E-H       p(xx), xx ∈ {HE,EH}         E-H
          NCountSeg(x), x ∈ {H,E}     SS        NCountSeg(x), x ∈ {H,E}     SSS, E-H
CR        p(H), p(xx), xx ∈ {HH,EE}   E-H       p(C), NCountSeg(E)          SSS
          p(xx), xx ∈ {EE,HE,EH}      SSS       p(E)                        SS
                                                p(EH)                       E-H
OR-all    CMV(x), x ∈ {H,E}           SS        CMV(x), x ∈ {H,E,C}         SSS
                                                CMV(x), x ∈ {H,E}           E-H
OR        CMV(x), x ∈ {H,E}           SS        CMV(x), x ∈ {E,C}           SSS
                                                CMV(E)                      E-H
DR-all    NMaxSeg(x), x ∈ {H,E}       SS        NMaxSeg(x), x ∈ {H,E}       SSS, E-H
          NMinSeg(x), x ∈ {H,E}       ALL       NAvgSeg(x), x ∈ {H,E}       SSS, E-H
          NAvgSeg(x), x ∈ {H,E}       SS        NVarSeg(x), x ∈ {H,E}       SSS, E-H
          NVarSeg(x), x ∈ {H,E}       SS        NMaxD(xx), xx ∈ {HE,EH}     SSS
          NMaxD(x), x ∈ {H,E}         ALL
          NMinD(x), x ∈ {H,E}         ALL
          NAvgD(x), x ∈ {H,E}         ALL
          NVarD(x), x ∈ {H,E}         ALL
          NMinD(xx), xx ∈ {HE,EH}     SSS
          NAvgD(xx), xx ∈ {HE,EH}     SSS
          NVarD(xx), xx ∈ {HE,EH}     SSS
DR        NVarSeg(E), NAvgD(H)        SS        NAvgSeg(H)                  E-H
          NMinSeg(H), NMinD(E)        E-H       NMaxSeg(E)                  E-H
          NMaxD(E)                    SSS       NVarSeg(E)                  E-H
          NAvgD(xx), xx ∈ {HE,EH}     SSS       NMaxD(xx), xx ∈ {HE,EH}     SSS

The “CR-all” and “CR” in the first column respectively denote the content-related feature subsets of original and selected features, and “OR-all” and “OR”, “DR-all” and “DR” have similar meanings. The “Novel” and “Reused” columns respectively show features which are newly designed and reused in the feature subsets listed in the first column. The two “Sequence” columns show sequence types from which the corresponding features are extracted. See text for the notations of SSS, SS and E-H, and “ALL” denotes all the three types of sequences.

2.3. Feature selection

Feature selection is the process of identifying and removing as many irrelevant and redundant features as possible. This yields a more efficient prediction model and reduces computation time. Many feature selection methods have been used in a wide range of bioinformatics studies [21]. These methods fall into two main groups: filters and wrappers. Because a wrapper combines the feature selection method with a specific classifier, wrappers often achieve better results than filters. Thus, a wrapper approach based on the best first search algorithm is adopted in this paper to choose a subset of the original features. To reduce running time, feature selection is performed separately on each of the 3 types of structure-driven feature subsets. Moreover, 10-fold cross-validation on the 25PDB dataset with the SVM classifier described in Section 2.4 is adopted to avoid overfitting. As a result, 27 structure-driven features, 15 of which are first proposed here, are selected from the original 88 features. Among the selected features, 10 are content-related, 5 are order-related, and 12 are distance-related. For convenience, the 3 types of feature subsets are denoted by CR, OR and DR, respectively. The details of the selected features are listed in Table 2. Finally, a 27-dimensional structure-driven feature vector is extracted and then used to predict protein structural class.
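The construction of SS and E-H from SSS and several of the feature definitions in Section 2.2 can be illustrated with a minimal Python sketch. It is not the authors' program: the function names are hypothetical, N_1 is taken to be the length of whichever sequence a feature is computed on, the 2-symbol patterns of Eq. (2) are counted with overlapping windows, and a feature is set to 0 when the element is absent. The text does not specify these details, so all of them are assumptions.

```python
from itertools import groupby
from statistics import mean


def segment_sequence(sss):
    """SS: collapse every run of H, E or C to a single letter, then drop C."""
    collapsed = "".join(letter for letter, _ in groupby(sss))
    return collapsed.replace("C", "")


def eh_sequence(sss):
    """E-H: delete every C residue from SSS, keeping all H and E residues."""
    return sss.replace("C", "")


def content(seq, x):
    """Eq. (1): share of positions in seq that carry the element x."""
    return seq.count(x) / len(seq)


def pattern_content(seq, xx):
    """Eq. (2), assuming overlapping windows: occurrences of xx over len(seq)."""
    hits = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == xx)
    return hits / len(seq)


def segment_lengths(seq, x):
    """Lengths of the maximal runs of x in seq, in order of appearance."""
    return [len(list(run)) for letter, run in groupby(seq) if letter == x]


def normalized_segment_count(seq, x):
    """Eq. (3): number of x segments divided by len(seq)."""
    return len(segment_lengths(seq, x)) / len(seq)


def cmv(seq, x):
    """Eq. (4): sum of the 1-based positions of x over N1 * (N1 - 1)."""
    n = len(seq)
    return sum(i for i, s in enumerate(seq, start=1) if s == x) / (n * (n - 1))


def normalized_max_avg_segment(seq, x):
    """Eqs. (5) and (7): longest and mean x-segment length over len(seq)."""
    lengths = segment_lengths(seq, x)
    if not lengths:  # assumption: features default to 0 when x is absent
        return 0.0, 0.0
    return max(lengths) / len(seq), mean(lengths) / len(seq)


sss = "EECEEECCEECCCCHHHHCCHHHCCCEEECCHHHCEE"  # example from Section 2.2
ss = segment_sequence(sss)
eh = eh_sequence(sss)
print(ss, eh)
print(content(sss, "H"), normalized_segment_count(sss, "H"), cmv(sss, "H"))
```

Applied to the example sequence of Section 2.2, the two helper functions return the SS and E-H strings given in the text. The distance features of Eqs. (9)–(16) are omitted because the text does not state how the distance between two segments is measured.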

2.4. Classification algorithm construction

In machine learning and statistics, classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known. Although each type of classification algorithm has its advantages and disadvantages, the support vector machine (SVM) is one of the most widely used classification algorithms for identifying protein structural class [14–17]. SVM uses a kernel function to map the input data to a higher dimensional space, where it seeks a hyperplane that separates the training protein samples by their classes. Commonly used kernels include the linear function, polynomial function, sigmoid function and Gaussian radial basis function (RBF). Here, the publicly available LibSVM library [22] with the Gaussian RBF kernel is employed. The penalty parameter C and kernel parameter γ are found by 10-fold cross-validation with a grid search strategy for each dataset. The parameters C and γ are searched exponentially in the ranges [2^{-5}, 2^{15}] and [2^{-15}, 2^{5}], respectively, with a multiplicative step of 2^{1}, to find the highest classification rate.

2.5. Prediction assessment

In statistical prediction, prediction assessment is an important process for evaluating the classification accuracy and generalization ability of a predictor. Three cross-validation methods are often used to examine the effectiveness of a predictor in practical application: the independent dataset test, the subsampling test, and the jackknife test [2]. Of the three, the jackknife test is deemed the least arbitrary, since it always yields a unique result for a given benchmark dataset, as elucidated in [23]. Accordingly, the jackknife test has been widely recognized and increasingly used to examine the quality of various predictors [14–17]. In this paper, we perform the jackknife test to examine the power of our method. For comprehensive evaluation, the individual sensitivity (or accuracy, denoted by Sens), the individual specificity (Spec) and the Matthews correlation coefficient (MCC) for each of the four structural classes, as well as the overall accuracy (OA) over the entire dataset, are reported. These measures are defined as follows [17]:

\[ \mathrm{Sens}_j = \frac{TP_j}{TP_j + FN_j} = \frac{TP_j}{|C_j|} \tag{17} \]

\[ \mathrm{Spec}_j = \frac{TN_j}{FP_j + TN_j} = \frac{TN_j}{\sum_{k \neq j} |C_k|} \tag{18} \]

\[ \mathrm{MCC}_j = \frac{TP_j \, TN_j - FP_j \, FN_j}{\sqrt{(FP_j + TP_j)(TP_j + FN_j)(TN_j + FP_j)(TN_j + FN_j)}} \tag{19} \]

\[ \mathrm{OA} = \frac{\sum_j TP_j}{\sum_j |C_j|} \tag{20} \]

where TN_j, TP_j, FN_j, FP_j and |C_j| are the numbers of true negatives, true positives, false negatives, false positives and proteins in the structural class C_j, respectively. The MCC value ranges between −1 and 1, with 0 denoting random prediction and higher absolute values denoting more accurate predictions.

3. Results and discussion

3.1. Prediction accuracies for four benchmark datasets

The results of the jackknife test performed on the four low-homology benchmark datasets are listed in Table 3. From Table 3, it can be seen that the overall accuracies are above 83% for all four datasets, and three of them reach 85% or higher. To be specific, overall accuracies of 85.0%, 85.2%, 83.9% and 94.5% are achieved for the 25PDB, 1189, 640 and FC699 datasets, respectively. Comparing the prediction accuracies of the four structural classes with each other, the all-α class has the highest MCC values, and its Sens and Spec values are often the best (with accuracies above 90% for all the datasets). This indicates that the prediction for the all-α class is the most reliable. The results for the all-β and α/β classes are also satisfactory, with prediction accuracies mostly above 80% (the exception being 77.3% for the all-β class on the 640 dataset). In contrast, the prediction accuracies of the α + β class are inferior to those of the other three classes (roughly 74%–78% across the datasets). A similar trend can be observed for all protein structural class prediction methods [16]. This may be due to the non-negligible overlap with the other classes [18].

Table 3
The prediction quality of our method on four benchmark datasets.

Dataset  Class  Sens(%)  Spec(%)  MCC(%)
25PDB    All-α  94.1     97.0     90.4
         All-β  87.1     96.6     84.7
         α/β    84.1     94.6     77.3
         α+β    74.4     91.9     66.9
         OA     85.0
1189     All-α  93.7     96.9     88.8
         All-β  86.1     97.9     86.3
         α/β    86.8     93.9     80.6
         α+β    73.9     91.5     64.6
         OA     85.2
640      All-α  91.3     97.8     89.3
         All-β  77.3     99.0     82.5
         α/β    91.5     92.7     81.7
         α+β    76.0     88.7     63.4
         OA     83.9
FC699    All-α  96.9     99.2     95.5
         All-β  94.8     98.5     93.7
         α/β    97.1     96.0     92.9
         α+β    78.0     98.3     78.6
         OA     94.5

3.2. Analysis of the proposed feature vector

As mentioned earlier, among the selected 27 structure-driven features, 15 are newly designed to reflect the general content and spatial arrangement of the predicted secondary structural elements of a given protein sequence. To validate the contribution of the 15 novel features, predictions with only the remaining 12 reused features are performed on the four datasets. The comparison between the accuracies of our method with all 27 features and with only the 12 reused features is shown in Table 4. From Table 4, the overall accuracies are significantly improved by adding the novel features, with increments of 2.1%–5.7%. For the individual structural classes (all-α, all-β, α/β and α + β), the improvements in prediction accuracy range over 2.7%–4.6%, 1.3%–4.3%, 0.3%–5.7% and 1.6%–11.2%, respectively. Moreover, the two highest increments (11.2% for the 1189 dataset and 8.5% for the FC699 dataset) occur for the α + β class. Therefore, the combination of these newly designed features has a significant positive effect on protein structural class prediction, especially for proteins from the α + β class.
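The per-class measures and the overall accuracy reported in this section follow Eqs. (17)–(20). A minimal sketch of how they can be computed from a multi-class confusion matrix is given below. The function names and the toy matrix are hypothetical and are not taken from the paper; the convention that rows index the true class and columns the predicted class is an assumption of the sketch.

```python
import math


def per_class_metrics(confusion, j):
    """Sens, Spec and MCC of class j (Eqs. (17)-(19)).

    confusion[i][k] is the number of proteins of true class i that were
    predicted as class k (assumed convention).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[j][j]
    fn = sum(confusion[j]) - tp                        # class j, predicted elsewhere
    fp = sum(confusion[i][j] for i in range(n)) - tp   # other classes, predicted j
    tn = total - tp - fn - fp
    sens = tp / (tp + fn)
    spec = tn / (fp + tn)
    denom = math.sqrt((fp + tp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, mcc


def overall_accuracy(confusion):
    """Eq. (20): correctly predicted proteins over all proteins."""
    correct = sum(confusion[j][j] for j in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)


# Hypothetical 4-class confusion matrix (all-α, all-β, α/β, α+β); toy numbers only.
toy = [
    [18, 0, 1, 1],
    [0, 17, 1, 2],
    [1, 1, 16, 2],
    [2, 2, 2, 14],
]
for j, name in enumerate(["all-α", "all-β", "α/β", "α+β"]):
    print(name, per_class_metrics(toy, j))
print("OA", overall_accuracy(toy))
```

In a jackknife test, each protein is predicted by a classifier trained on all the other proteins, and the confusion matrix is accumulated over these single-protein predictions before the measures are evaluated.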

L. Kong, L. Zhang / Genomics 103 (2014) 292–297

Table 4 Comparison of the accuracies between our method that included 27 features and only 12 reused features. Dataset

25PDB 1189 640 FC699

Features

All features Reused features All features Reused features All features Reused features All features Reused features

Table 6 Performance comparison of different methods on four benchmark datasets. Dataset

Method

Reference

Accuracy (%) All-α

All-β

α/β

α+β

Overall

94.1 91.4 93.7 89.2 91.3 87.7 96.9 92.3

87.1 82.8 86.1 84.0 77.3 76.0 94.8 93.3

84.1 82.7 86.8 81.1 91.5 88.7 97.1 96.8

74.4 72.8 73.9 62.7 76.0 73.7 78.0 69.5

85.0 82.4 85.2 79.5 83.9 81.4 94.5 92.4

To represent a protein, 3 different types of structure-driven features are extracted. In order to further investigate how these features subsets contribute to the prediction performance, another experiment is performed. Table 5 lists the overall accuracies obtained with all the possible combinations of feature subsets. For every combination, the comparisons of the accuracies between our method that included all features and reused features are shown. From Table 5, it can be seen that when the feature subsets are used individually, the overall accuracies for CR and DR are higher than OR, and those of CR are often the best. As more features are involved in the prediction, the overall accuracies are shown to increase steadily. For instance, when 1189 dataset is tested, the prediction accuracy with the feature subset CR is 81.3%. If the feature subset OR is added, the accuracy increases to 83.1%. If the feature subset DR is further added, the accuracy increases by 2.1% up to 85.2%. Moreover, with the removal of newly designed features, the overall accuracies have significantly declined for most combinations of feature subsets (The only two exceptional cases occur when 640 dataset with feature subsets DR and OR + DR are tested, in which the accuracies increase by 0.1% and 1.5%, respectively). Take 25PDB dataset for example, the decrease of overall accuracies ranges from 0.6% to 6.1%. Therefore, we may conclude that the 3 types of structure-driven features (CR, OR and DR) can make complementary contributions to each other to the protein structural class prediction and the newly designed features are still effective to predict protein structural class with feature subset.

3.3. Comparison with other prediction methods In this section, we compare our method with the recently reported competing protein structural class prediction methods on the same datasets. Among the competing methods, RKS-PPSC [13], Liu and Jia [14], Zhang et al. [15], Ding et al. [16] and Zhang et al. [17] are prediction methods solely based on information of the predicted secondary structure. In addition, the renowned methods SCPRED [11] and MODAS [12] often used as baseline for comparison are also included. In SCPRED, 9 features are selected where 8 features are based on the predicted secondary structure. In MODAS, the predicted secondary structure information and evolutionary profiles are employed to perform the prediction. The comparison results are shown in Table 6. According to Table 6, our method outperforms all other methods with overall accuracies on four

25PDB

1189

640

FC699

SCPRED MODAS RKS-PPSC Liu and Jia Zhang et al. Ding et al. Zhang et al. Our method SCPRED MODAS RKS-PPSC Zhang et al. Ding et al. Zhang et al. Our method SCPRED RKS-PPSC Ding et al. Our method SCPRED Liu and Jia Our method

[11] [12] [13] [14] [15] [16] [17] This paper [11] [12] [13] [15] [16] [17] This paper [11] [13] [16] This paper [11] [14] This

Accuracy (%) All-α

All-β

α/β

α+β

Overall

92.6 92.3 92.8 92.6 95.0 95.0 95.7 94.1 89.1 92.3 89.2 92.4 93.7 92.4 93.7 90.6 89.1 94.9 91.3 – 97.7 96.9

80.1 83.7 83.3 81.3 85.6 81.3 80.8 87.1 86.7 87.1 86.7 87.4 84.0 84.4 86.1 81.8 85.1 76.6 77.3 – 88.0 94.8

74.0 81.2 85.8 81.5 81.5 83.2 82.4 84.1 89.6 87.9 82.6 82.0 83.5 84.4 86.8 85.9 88.1 89.3 91.5 – 89.1 97.1

71.0 68.3 70.1 76.0 73.2 77.6 75.5 74.4 53.8 65.4 65.6 71.0 66.4 73.4 73.9 66.7 71.4 74.3 76.0 – 84.2 78.1

79.7 81.4 82.9 82.9 83.9 84.3 83.7 85.0 80.6 83.5 81.3 83.2 82.0 83.6 85.2 80.8 83.1 83.4 83.9 87.5 89.6 94.5

The best results are highlighted in bold face.

datasets. Specifically, the overall accuracies on 25PDB, 1189, 640 and FC699 datasets are 0.7%, 1.6%, 0.5% and 4.9% higher than previous bestperforming results, respectively. With the all-α class, the proposed method performs best on 1189 dataset. As for proteins from the all-β class, our method achieves the highest prediction accuracies among half of the four test datasets (25PDB and FC699). Referring to the α/β and α + β classes, although the prediction accuracies are not always the highest, significant improvements can be seen from Table 6. For instance, when 640 dataset is tested, the proposed method obtained the highest prediction accuracies for the α/β and α + β classes, which are 2.2% and 1.7% higher than the second best Ding's method. Our main goal in this study is to improve the structural class prediction performance by mining spatial structure information implied in the protein secondary structure sequence. However, to improve accuracy further, combining secondary structure information with other protein sequence information for feature extraction will be an effective strategy which has been proved in many literatures [12,24,25]. In order to show the effectiveness of the proposed method in differentiating between the α/β and α + β classes, another experiment is performed and the results are listed in Table 7. Similar to methods RKS-PPSC and Ding et al., we generate a subset for each benchmark dataset by removing all the proteins in the all-α and all-β classes to avoid any potential outside effects. Then we predict the accuracies of the α/β and α + β classes on these reduced subsets instead of the whole dataset. To differentiate with 25PDB, FC699, 1189 and 640 datasets, 25PDBs, FC699s, 1189s and 640 s are used to denote the corresponding subsets. As can be seen from Table 7, our method obtains the highest

Table 5 Comparison of the overall accuracies between our method that included all features and only reused features with different combinations of features subsets. Dataset

Features

CR

OR

DR

CR + OR

CR + DR

OR + DR

CR + OR + DR

25PDB

All features Reused features All features Reused features All features Reused features All features Reused features

81.6 76.9 81.3 75.7 82.2 76.9 88.6 86.5

75.6 73.9 75.5 73.8 79.1 75.8 86.8 83.4

81.5 80.9 79.9 78.4 79.1 79.2 92.2 91.5

83.6 77.5 83.1 76.9 83.4 78.9 91.3 86.8

84.3 82.0 84.5 79.2 83.4 80.5 93.6 92.7

82.1 81.1 83.0 79.6 80.5 82.0 92.7 92.1

85.0 82.4 85.2 79.5 83.9 81.4 94.5 92.4

1189 640 FC699

L. Kong, L. Zhang / Genomics 103 (2014) 292–297

Acknowledgment

Table 7 The accuracies of differentiating between the α/β and α + β classes. Dataset

25PDBs

1189s

640 s

FC699s

Method

SCPRED RKS-PPSC Ding et al. Our method SCPRED RKS-PPSC Ding et al. Our method SCPRED RKS-PPSC Ding et al. Our method Our method

Reference

[11] [13] [16] This paper [11] [13] [16] This paper [11] [13] [16] This paper This paper

297

Accuracy (%) α/β

α+β

Overall

76.0 86.4 85.0 83.5 88.6 83.8 85.9 88.9 89.3 88.1 89.3 89.8 98.9

83.2 82.8 88.2 88.4 63.1 81.3 73.9 84.6 77.2 83.6 83.6 84.8 86.6

80.1 84.4 86.8 86.3 77.9 82.8 80.9 87.1 83.3 85.9 86.5 87.4 96.7

The datasets comprise only proteins in the α/β and α + β classes. The best results are highlighted in bold face.

overall accuracies and individual class accuracies on both 1189s and 640 s datasets. With 25PDBs dataset, the overall accuracy is 0.5% lower than the highest Ding's result, but still 6.2% and 1.9% higher than those of SCPERD and RKS-PPSC. In addition, prediction of the α + β class performs best in 25PDBs dataset. For FC699s dataset, the overall accuracy of 96.7% is achieved by our method. Table 7 clearly shows that our proposed method is essential to achieve good prediction performance for the α/β and α + β classes, especially for the α + β class. This may be due to the newlydesigned distance-related features which focus on the level of separation and aggregation about α-helices and β-strands in the predicted secondary structure sequence.

4. Conclusions

In this paper, we have introduced a novel computational method for predicting protein structural class using only predicted secondary structure information. The 27 structure-driven features, which are rationally divided into three groups (CR, OR and DR), are extracted to reflect the general contents and spatial arrangements of the predicted secondary structural elements of a given protein sequence. Based on a comprehensive comparison with other existing methods on four widely-used low-homology benchmark datasets, the proposed method is shown to be an effective computational tool for protein structural class prediction. Intrinsically disordered proteins, which contain regions with no stable structure and may have specific sequence characteristics, are likely to be more difficult to assign to a structural class. Therefore, investigating how the proposed method performs on datasets of low-similarity as well as disordered proteins will constitute an interesting subject for our future work. In addition, since user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful models, simulated methods, or predictors [11,13,26], we shall make efforts in our future work to provide a web-server for the method presented in this paper.
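To illustrate why features describing spatial arrangement can add information beyond general content, the following is a minimal sketch. It does not reproduce the 27 features of this paper: the functions `segments`, `content` and `alternation`, and the two toy secondary-structure strings, are hypothetical and chosen only for illustration. The strings use the usual three-state alphabet (H for helix, E for strand, C for coil).

```python
from itertools import groupby


def segments(ss):
    """Collapse a secondary-structure string into (state, length) runs."""
    return [(state, len(list(run))) for state, run in groupby(ss)]


def content(ss):
    """Fraction of residues in each of the three states H, E and C."""
    n = len(ss)
    return {state: ss.count(state) / n for state in "HEC"}


def alternation(ss):
    """Fraction of consecutive helix/strand segments (coil ignored)
    that switch type; illustrative only, not a feature from the paper."""
    order = [state for state, _ in segments(ss) if state in "HE"]
    if len(order) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(order, order[1:]))
    return switches / (len(order) - 1)


# Toy examples: identical composition, different arrangement.
interleaved = "CC".join(["EEEE", "HHHHHH", "EEEE", "HHHHHH", "EEEE"])
segregated = "CC".join(["HHHHHH", "HHHHHH", "EEEE", "EEEE", "EEEE"])

if __name__ == "__main__":
    for name, ss in (("interleaved", interleaved), ("segregated", segregated)):
        print(name, content(ss), alternation(ss))
```

Both toy strings have the same helix, strand and coil content, so content alone cannot separate them, whereas the alternation score differs. This mirrors, in a simplified way, the distinction between interleaved (α/β-like) and segregated (α + β-like) arrangements.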

The program file can be obtained by e-mail from the corresponding author.

Acknowledgment

We express our thanks to the anonymous referees for their many valuable suggestions that have improved this manuscript.

References

[1] C. Anfinsen, Principles that govern the folding of protein chains, Science 181 (1973) 552–558.
[2] K.C. Chou, C.T. Zhang, Prediction of protein structural classes, Crit. Rev. Biochem. Mol. Biol. 30 (1995) 275–349.
[3] G.P. Zhou, N. Assa-Munt, Some insights into protein structural class prediction, Proteins 44 (2001) 57–59.
[4] A. Murzin, S. Brenner, T. Hubbard, C. Chothia, SCOP: a structural classification of protein database for the investigation of sequence and structures, J. Mol. Biol. 357 (1995) 536–540.
[5] J.K. Kim, S.Y. Bang, S. Choi, Sequence-driven features for prediction of subcellular localization of proteins, Pattern Recognit. 39 (2006) 2301–2311.
[6] K.C. Chou, A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space, Proteins 21 (1995) 319–344.
[7] K.C. Chou, Prediction of protein cellular attributes using pseudo amino acid composition, Proteins 43 (2001) 246–255.
[8] X.D. Sun, R.B. Huang, Prediction of protein structural classes using support vector machines, Amino Acids 30 (2006) 469–475.
[9] K.C. Chou, Y.D. Cai, Predicting protein structural class by functional domain composition, Biochem. Biophys. Res. Commun. 321 (2004) 1007–1009.
[10] T. Liu, X. Zheng, J. Wang, Prediction of protein structural class for low-similarity sequences using support vector machine and PSI-BLAST profile, Biochimie 92 (2010) 1330–1334.
[11] L.A. Kurgan, K. Cios, K. Chen, SCPRED: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences, BMC Bioinforma. 9 (2008) 226.
[12] M.J. Mizianty, L. Kurgan, Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences, BMC Bioinforma. 10 (2009) 414.
[13] J. Yang, Z. Peng, X. Chen, Prediction of protein structural classes for low-homology sequences based on predicted secondary structure, BMC Bioinforma. 11 (2010) S9.
[14] T. Liu, C. Jia, A high-accuracy protein structural class prediction algorithm using predicted secondary structural information, J. Theor. Biol. 267 (2010) 272–275.
[15] S. Zhang, S. Ding, T. Wang, High-accuracy prediction of protein structural class for low-similarity sequences based on predicted secondary structure, Biochimie 93 (2011) 710–714.
[16] S. Ding, S. Zhang, Y. Li, T. Wang, A novel protein structural classes prediction method based on predicted secondary structure, Biochimie 94 (2012) 1166–1171.
[17] L. Zhang, X. Zhao, L. Kong, A protein structural class prediction method based on novel features, Biochimie 95 (2013) 1741–1744.
[18] L.A. Kurgan, L. Homaeian, Prediction of structural classes for protein sequences and domains-impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy, Pattern Recognit. 39 (2006) 2323–2343.
[19] D.T. Jones, Protein secondary structure prediction based on position-specific scoring matrices, J. Mol. Biol. 292 (1999) 195–202.
[20] L. Chen, W. Liu, Frequent patterns mining in multiple biological sequences, Comput. Biol. Med. 43 (2013) 1444–1452.
[21] Y. Saeys, I. Inza, P. Larranaga, A review of feature selection techniques in bioinformatics, Bioinformatics 23 (2007) 2507–2517.
[22] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (2011) 1–27 (software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm).
[23] K.C. Chou, Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review), J. Theor. Biol. 273 (2011) 236–247.
[24] A. Ahmadi Adl, A. Nowzari-Dalini, B. Xue, V.N. Uversky, X. Qian, Accurate prediction of protein structural classes using functional domains and predicted secondary structure sequences, J. Biomol. Struct. Dyn. 29 (2012) 1127–1137.
[25] S. Ding, Y. Li, Z. Shi, S. Yan, A protein structural classes prediction method based on predicted secondary structure and PSI-BLAST profile, Biochimie 97 (2014) 60–65.
[26] Y.K. Chen, K.B. Li, Predicting membrane protein types by incorporating protein topology, domains, signal peptides, and physicochemical properties into the general form of Chou's amino acid composition, J. Theor. Biol. 318 (2013) 1–12.
