Interdiscip Sci Comput Life Sci (2015) 7: 65–77 DOI: 10.1007/s12539-014-0257-2

Improved Feature-based Prediction of SNPs in Human Cytochrome P450 Enzymes

1

Li LI1 , Yi XIONG1∗ , Zhuo-Yu ZHANG1$ , Quan GUO1 , Qin XU1 , Hien-Haw LIOW2,3 , Yong-Hong ZHANG1,4∗ , Dong-Qing WEI1

(State Key Laboratory of Microbial Metabolism and College of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China) 2 (Center for Genome Sciences and Systems Biology, Washington University in St. Louis, St. Louis, MO 63130, USA) 3 (Department of Mathematics, Washington University in St. Louis, St. Louis, MO 63130, USA) 4 (Medicine Engineering Research Center and School of Pharmacy, Chongqing Medical University, Chongqing 400016, China)

Received 25 December 2014 / Revised 2 March 2015 / Accepted 6 March 2015

Abstract: Single nucleotide polymorphisms (SNPs) make up the most common form of mutations in human cytochrome P450 enzymes family, and have the potential to bring with different drug responses or specific diseases in individual patients. Here, based on machine learning technology, we aim to explore an effective set of sequencebased features for improving prediction of SNPs by using support vector machine algorithms. The features are derived from the target residues and flanking protein sequences, such as amino acid types, sequences composition, physicochemical properties, position-specific scoring matrix, phylogenetic entropy and the number of possible codons of target residues. In order to deal with the imbalance data with a majority of non-SNPs and a minority of SNPs, a preprocessing strategy based on fuzzy set theory was applied to the datasets. Our final model achieves the performance of 93.8% in sensitivity, 88.8% in specificity, 91.3% in accuracy and 0.971 of AUC value, which is significantly higher than the previous DNA sequence-based or protein sequence-based methods. Furthermore, our study also suggested the roles of individual features for prediction of SNPs. The most important features consist of the amino acid type, the number of available codons, position-specific scoring matrix and phylogenetic entropy. The improved model will be a promising tool for SNP predictions, and assist in the research of genome mutation and personalized prescriptions. Key words: single nucleotide polymorphisms, support vector machine, prediction.

1 Introduction The cytochrome P450 (CYP450) enzymes family are heme-containing proteins which play a key role in physical metabolism of more than 90% therapeutic drugs. In recent years, a number of studies have focused on mutations, or polymorphisms in human CYP450, in which the single nucleotide polymorphisms (SNPs) have been shown great importance, particularly in pharmacogenomics (McGraw and Waller 2012). SNPs are one single nucleotide (A, T, C, or G) variations, which must occur in at least 1% of population, in the genome sequences. The SNPs may change the gene expression profiles of CYP450 enzymes by alternating the coded ∗ Corresponding authors. E-mail: [email protected], [email protected] $ Zhuo-Yu ZHANG is a summer student from No 3 High School of Wuhan and Quan Guo is a summer student from Xiamen University

amino acids, transcription frame, or gene splicing, and further change the catalytic activities of CYP450s and finally lead to the individual variations in disease susceptibility and drug response (Hirschhorn and Daly 2005; McCarthy and Hilfiker 2000). Therefore, the identification of SNPs is important for understanding the structure and evolution of the human genome, and their linkage to the diseases. Due to the importance of SNPs, there is a great demand to computationally predict the SNP sites fast and accurately from the franking sequences. Over the past decades, most studies related to the prediction of SNPs focused on “the functions of SNPs”. For example, the Sorting Intolerant from Tolerant (SIFT) algorithm (Sim et al., 2012; Kumar et al., 2009) uses sequence homology to compute the likelihood to predict the potential impact of an amino acid substitution on protein function. The series of methods Plyphen (Adzhubei et al., 2010; Ramensky et al., 2002)

66

focus on the computational predictions of the impact of protein sequence variants. SNAP (Johnson et al., 2008) have been developed as a web server for querying SNPs to identify and annotate nearby SNPs in linkage disequilibrium (proxies) based on HapMap. The SilVA method (Buske et al., 2013) aims to the automated prioritization of harmful synonymous variants based on a variety of features, including sequence conservation, splice sites, splice-regulatory motifs, codon frequency, CpG content and RNA secondary structure energy. The SNPsnap web server (Pers et al., 2015) efficiently identifies sets of randomly drawn SNPs that are matched to a set of query SNPs based on allele frequency, distance to nearest gene, gene density, and the number of SNPs in linkage disequilibrium regions. However, few methods were designed to predict the occurrence of SNPs in the human genome sequences, that is, whether a site might possibly be a SNP. Previous studies mainly used DNA sequence as predictors because SNPs occur directly at the DNA level. Yan et al. (Yan et al., 2007) applied two major categories of approaches for identification of SNP sites from the flanking DNA sequences on human chromosome 21 and 22, and achieved the best prediction accuracy of 57.0%. The first type is based on pattern discovery algorithms using deterministic or probabilistic methods. The second type is based on machine learning techniques, such as random forests, and K-nearest neighbors. In addition to these flanking sequence features, the CG feature of target sites was also utilized in our earlier SVM training method which improved the accuracy to 75.6% (Li et al., 2012a). Due to the fact that there exists limited information derived from DNA sequences and that the mutations in protein sequences are more directly related to enzyme activities, we also took another way, that is, to develop prediction models based on features of protein sequences. Our previous SNP prediction model SCYPPred (Li et al., 2012b) utilized the features derived from the protein sequences instead of DNA sequences, and achieved the accuracy of 66.7%. Here, we propose an improved feature-based method to accurately predict SNPs in human CYP450 enzymes by using support vector machine (SVM), training not only on the conventional features of amino acids such as amino acid type, composition, physical and chemical properties, but also on evolutionary information including position-specific scoring matrix (PSSM) and phylogenetic entropy. Moreover, we adopted the number of available codons at target residues as a feature in our model, which is used for the first time in predicting SNPs, to our best knowledge. To eliminate the model bias caused by the imbalance datasets with a minority of positive samples (SNPs) and a majority of negative samples (non-SNPs), a new method based on fuzzy set theory was applied to balance the datasets (Li et al., 2010). Firstly, we analyze the predictive power of

Interdiscip Sci Comput Life Sci (2015) 7: 65–77

individual features derived from the single protein sequences, and compare the performance of the different combinations of the features. Next, we aim to evaluate the role of the evolutionary-based features in prediction of SNPs. Finally, we compare our best model to previously published methods, and conduct the case study on real biological data.

2 Results and Discussion 2.1

Features based on protein sequences

Based on the theory of central dogma, the effect of mutations in DNA sequences on enzyme activities is influenced by excessively many possible factors during the expression process. In comparison,the protein sequences are more directly related to the expressed enzymes. It is reasonable to assume information based on protein sequences to be quite informative in prediction of SNPs. This hypothesis is somewhat supported by the difference in amino acid preferences between SNPs and non-SNPs, as shown in Fig. 1. In Fig. 1(a), the probabilities of the twenty major amino acids in known SNPs and non-SNP sites are marked in blue and red, respectively. The differences between them were calculated and plotted in Fig. 1(b), in which 8 of the 20 amino acids have more than 1% difference in the probabilities for SNPs and non-SNPs. These discrepancies may suggest a different distribution of these amino acids in SNP sites and non-SNP sites. Therefore, in this work the models were developed based on features related to the protein sequences. In order to describe the characteristics of candidate sites as extensive as we could, six types of features were extracted from the protein sequences, including flanking sequence features, target site features and evolutionary information. These six features (S1, S2, T1, T2, E1 and E2) were hybridized into 16 combinations, as listed in Table 1, for evaluating the significance of different combinations of features and for optimizing the best prediction model. The 16 combinations can be summarized into four major categories. The four combinations in the first category, C1, C2, C3 and C4, include basic sequence information such as target site features (T1, T2) or flanking sequence features (S1, S2). Thus they are regarded as “base combinations”. In the other three categories, evolutionary information PSSM (E1) and phylogenetic entropy (E2) were added to the four base combinations individually and simultaneously in attempt to further improve the prediction performance. 2.2

Performance of combinations with only sequence features (base combinations)

Sequence features are conventional attributes used in bioinformatics applications, such as in the prediction of lysine methylation and lysine acetylation, mi-

Interdiscip Sci Comput Life Sci (2015) 7: 65–77 0.14

Amino acids in SNPs Amino acids in non-SNPs

0.12

4

0.10

3 Difference/%

Probability

67

0.08 0.06 0.04 0.02 0

Fig. 1

2 1 0 −1 −2

A C D E F G H I K L MN P Q R S T VWY 20 amino acids (a)

−3

A C D E F G H I K L MN P Q R S T VWY 20 amino acids (b)

The probabilities of amino acids in SNP sites and non-SNP sites. (a) The probabilities of the twenty major amino acids in SNPs sites and non-SNPs sites are marked in blue and red, respectively. (b) The differences between the probabilities of the amino acids in SNP sites and non-SNPs sites. Positive values mean higher preferences for the amino acids to be found in SNP sites. Table 1

The performance of models from 16 hybrid feature sets Hybrid features

Only Sequences features and Target site features

Incorporating with PSSM

Incorporating with phylogenetic entropy

Incorporating with both PSSM and phylogenetic entropy

Combinations

Sensitivity

Specificity

Accuracy

AUC

S1+S2

C1

0.825

0.675

0.750

0.830

S1+S2+T1

C2

0.925

0.675

0.800

0.925

S1+S2+T2

C3

0.813

0.938

0.875

0.920

S1+S2+T1+T2

C4

0.850

0.913

0.881

0.930

S1+S2+E1

C5

0.875

0.750

0.813

0.885

S1+S2+T1+E1

C6

0.888

0.95

0.919

0.956

S1+S2+T2+E1

C7

0.875

0.875

0.875

0.912

S1+S2+T1+T2+E1

C8

0.913

0.938

0.925

0.957

S1+S2+E2

C9

0.800

0.75

0.775

0.853

S1+S2+T1+E2

C10

0.875

0.800

0.838

0.865

S1+S2+T2+E2

C11

0.800

0.838

0.819

0.858

S1+S2+T1+T2+E2

C12

0.938

0.888

0.913

0.971

S1+S2+E1+E2

C13

0.913

0.713

0.813

0.899

S1+S2+T1+E1+E2

C14

0.875

0.713

0.794

0.891

S1+S2+T2+E1+E2

C15

0.900

0.900

0.900

0.948

S1+S2+T1+T2+E1+E2

C16

0.850

0.913

0.881

0.930

croRNA transcription starting site and mutual interaction of human transcription factors (Shi et al., 2012; Bhattacharyya et al., 2012; Schmeier et al., 2011). In our earlier studies on SNP prediction, an online tool SCYPPred was developed merely based on the flanking sequence features (S1 and S2), which improved the prediction accuracy to 66.7% (Li et al., 2010). Based on this experience, all the hybrid feature sets designed in this work have included these two features. SNP is a kind of point mutation on one base pair, so the characteristics of the target site residues should have higher weight than those of flanking sequences in the predictions. Hence, we put forward more features of the target site residues to describe them. Amino acid type (T1) classifies the side chains of target amino acids into charged, polar or hydrophobic ones, which may

reflect the strength of their electrostatic interactions with surrounding residues. We may expect stronger interactions to stabilize the local structures of target sites, and thus lower the probabilities for SNPs. This assumption may be partially supported by Fig. 1(b), where the highly charged amino acids tend to have lower values of difference between the probabilities in SNPs and in non-SNPs sites. This rule is not absolute, since other structural factors may play more important roles in some cases, for example, the specific structure of proline makes it generally a conserved residue. Compared with the simplest feature set S1+S2, the inclusion of feature T1 in the combination (S1+S2+T1) could improve the sensitivity from 82.5% to 92.5% and the accuracy from 75% to 80% accuracy, as shown in Fig. 2(a).

68

Interdiscip Sci Comput Life Sci (2015) 7: 65–77

80

70

60

Sensitivity

Specificity (a)

Accuracy

100

S1+S2 S1+S2+T2

90

Independent test performance/%

90

50

Fig. 2

100

S1+S2 S1+S2+T1

Independent test performance/%

Independent test performance/%

100

80

70

60

50

Sensitivity

Specificity (b)

Accuracy

S1+S2 S1+S2+T1+T2

90

80

70

60

50

Sensitivity

Specificity (c)

Accuracy

The performance comparison of models generated by only flanking features (S1+S2) and by both flanking features and target features (T1, T2).

A novel feature developed in this work is the number of available codons (T2). This feature describes the code degeneracy of a target site, where higher degeneracy means more tolerant point mutation, and gives rise to synonymous SNPs. For example, degeneracy at the core of bacterial stress regulation provides alternative solutions to a common evolutionary challenge in the research of Wang et al. (Wang et al., 2010). The latest study from Castle et al. (Castle 2011) reveals that within coding regions, SNP rates are highest and conservation is lowest at codon position three, and the fewest SNPs are found at codon position two, reflecting codon degeneracy for amino acid encoding. Consequently, code degeneracy should be a good feature to evaluate the possibility of a target site to be non-SNP. In Fig. 2(b), compared with S1+S2, the hybrid set S1+S2+T2 has a better performance with the specificity increased from 67.5% to 93.375% and the accuracy increased from 75% to 87.5%. Furthermore, the two features of target sites are simple and only one dimensional. It is not expensive to put both of them into the hybrid set. Fig. 2(c) indicates that the base combination of S1+S2+T1+T2 achieves 91.25% in sensitivity, 85% in specificity, and 88.13% in accuracy. All types of metric of the prediction performance are obviously improved. The result suggests that the target site features T1 and T2 generally have positive impact on predicting SNPs and should be included in the combination of prediction classifiers. Therefore, the base combination including all the four features above was determined as a baseline for further improving the prediction model, as discussed below.

2.3

Improving the performance of SNP prediction by features based on evolutionary information PSSM (E1) or phylogenetic entropy (E2)

PSSM profile, a form of encapsulated evolutionary information, provides more comprehensive information than single sequence features. The PSSM profile is a probability matrix with rows as the sites in the protein sequence and the columns as the 20 amino acids. Thus each element of this matrix represents the probability of a type of amino acid to occur at a specific site, from which the residue conservation in a given protein could be mapped in detail. PSSM shows great power in many prediction studies such as protein-DNA interface residue (Xiong et al., 2011a; Xiong et al., 2011b), and transcription factor binding sites (Pairo et al., 2012). In this work, PSSM was applied to improve the models trained from the four base combinations into those in the second category. As shown in Fig. 3, PSSM improved the accuracies of the S1+S2, S1+S2+T1 and S1+S2+T1+T2 models, but did not show significant improvement in S1+S2+T2. A possible reason is that T2 has already provided some information of mutation and evolution on target site. Adding the high dimensional feature E1 into the combination can supplement limited evolutionary information of the flanking sequences but at the same time greatly lower the weight of the information of target site, which should be more important in the situation of SNP prediction as discussed above. In summary, the best hybrid feature set in the second category is S1+S2+T1+T2+E1, the model trained on which achieves 91.25% sensitivity, 93.75% specificity, and 92.5% accuracy.

Interdiscip Sci Comput Life Sci (2015) 7: 65–77

80

70

60

100 Independent test performance/%

Independent test performance/%

90

50

Sensitivity

Specificity (a)

80

70

60

Sensitivity

80

70

60

100

S1+S2+T2 S1+S2+T2+E1

Specificity (c)

Accuracy

S1+S2+T1 S1+S2+T1+E1

90

50

Accuracy

90

50

Fig. 3

100

S1+S2 S1+S2+E1

Independent test performance/%

Independent test performance/%

100

69

Sensitivity

Specificity (b)

Accuracy

S1+S2+T1+T2 S1+S2+T1+T2+E1

90

80

70

60

50

Sensitivity

Specificity (d)

Accuracy

The performance comparison of models generated by base features and by base features plus PSSM (E1).

The results above exhibited the importance of evolutionary information for enhancing model performance. But at the same time, it also revealed the shortcoming of PSSM profile, which may be attributed to its large dimension, overwhelming the power of other features by taking a majority part of all the features. To solve this problem, one approach is to scale the dimensionality for equal partition between PSSM and other features, yet with the shortcoming that the underlying relation between all the features are ignored during the processes of mathematic equalizations. Another approach is to use other evolutionary feature with much less dimensions than those of PSSM, so as to maintain a performance similar to PSSM but avoiding the influences of the high dimensional feature. Consequently, phylogenetic entropy (E2) was designed by evaluating the residue conservation based on Shannon’s information theoretic entropy (Shannon’s entropy). In our work, phylogenetic entropy (feature E2) is constructed by combining classic Shannon’s entropy function and SIFT information (with details explained in the method

part). E2 has only one dimension and is much easier to compute compared with PSSM. Fig. 4 shows the performance of SVM models based on E2 plus base combinations, from which we can find that phylogenetic entropy enhance the accuracy of models using S1+S2+E2, S1+S2+T1+T2 and S1+S2+T1+T2. The model with the highest accuracy is based on the hybrid feature set of S1+S2+T1+T2+E2, achieving an accuracy of 91.3%, 93.8% in sensitivity and 88.8% in specificity. There are minor decreases in sensitivities or specificities in these three cases. And surprisingly, model using the combination S1+S2+T2+E2 demonstrates worse performance in all measurements than that using S1+S2+T2. These negative effects of E2 on the predictive power of the models are complicated and may be partially attributed to possible overlap between E2 and T2 in description of conservations at target site. The results above certified that PSSM (E1) and phylogenetic entropy (E2) may improve some of the prediction model separately, but the effects tend to be reduced when both of them were added to the four base com-

70

Interdiscip Sci Comput Life Sci (2015) 7: 65–77

80

70

60

100 Independent test performance/%

Independent test performance/%

90

50

Sensitivity

Specificity (a)

80

70

60

Sensitivity

Specificity (c)

Accuracy

70

60

Sensitivity

Specificity (b)

Accuracy

S1+S2+T1+T2 S1+S2+T1+T2+E2

90

80

70

60

50

Sensitivity

Specificity (d)

Accuracy

The performance comparison of models generated by base features and by base features plus phylogenetic entropy (E2).

binations. As shown in Fig. 5, it seems that E1 and E2 improved the performance when they were added to S1+S2 but did not have significant effect with other three base feature sets. The possible reason for the increase in Fig. 5(a) may be that the information of the classic features S1 and S2 are limited and basically not related to the evolutionary information E1 and E2. And in Fig. 5(b), E1+E2 did not show the summed effect of E1 and E2, and just resulted similar performance as the base combination S1+S2+T1+T2, which may suggests complicated relationship between the additional features T1, T2, E1 and E2, and the redundancy between features may be counter-productive to their effects on improving prediction performance. 2.4

80

100

S1+S2+T2 S1+S2+T2+E2

S1+S2+T1 S1+S2+T1+E2

90

50

Accuracy

90

50

Fig. 4

100

S1+S2 S1+S2+E2

Independent test performance/%

Independent test performance/%

100

Influences of the additional features and the selection of best model

The performances of SVM models with 16 hybrid feature combinations in this study are evaluated and summarized in Table 1. From the accuracies of feature com-

binations C1-C4, it could be observed that target features (T1, T2) would enhance prediction performance. The accuracy of C1-C8 suggest some positive effect on prediction performance of target features (T1, T2) and PSSM (E1), separately or together. Similar conclusion may be reached for T1, T2 and E2 from the results of C1-C4 & C9-C12, although the effects of E2 on improving base combinations does not seem significant. At last, according to the results of C5-C16, single evolutionary information (E1 or E2) achieves higher accuracy in SNP prediction than what they do together. As discussed above, the influences of the additional features T1, T2, E1 and E2 are complicated and may interfere with each other. From the performance of the models trained on the 16 hybrid feature sets, no simple conclusion about the importance of each feature could be reached. Here, our focus is to search for the best prediction model under this complicated situation. It is observed that the top three high accuracies are 92.5% by S1+S2+T1+T2+E1, 91.9% by

Interdiscip Sci Comput Life Sci (2015) 7: 65–77

80

70

60

100 Independent test performance/%

Independent test performance/%

90

50

Sensitivity

Specificity (a)

80

70

60

Sensitivity

Specificity (c)

80

70

60

100

S1+S2+T2 S1+S2+T2+E1+E2

Accuracy

S1+S2+T1 S1+S2+T1+E1+E2

90

50

Accuracy

90

50

Fig. 5

100

S1+S2 S1+S2+E1+E2

Independent test performance/%

Independent test performance/%

100

71

Sensitivity

Specificity (b)

Accuracy

S1+S2+T1+T2 S1+S2+T1+T2+E1+E2

90

80

70

60

50

Sensitivity

Specificity (d)

Accuracy

The performance comparison of models generated by base features and by base features plus PSSM (E1) + phylogenetic entropy (E2).

S1+S2+T1+E1, and 91.3% by S1+S2+T1+T2+E2. Although the first model has a little bit higher accuracy as 92.5%, the difference between them is too small to make any definite comparative conclusion about their performance. At the same time, the third model has the highest sensitivity as 93.8%, while the second model has the highest specificity as 95.0%. All the three models have different advantages in the static measurements. In addition to the three statistic parameters above, AUC values of 16 models were calculated and ROC curves of the top three models were diagramed (Fig. 6) to evaluate the overall performance. In these comparisons, top three models still exhibit similar prediction ability, with the third model a little higher in the AUC value, as 0.957. At last, considering certain extent of negative effect by the large dimensions of E1 on the computation expense and influence on other features, the model with S1+S2+T1+T2+E2 features was selected as the best one for SNP prediction.

2.5

Comparison with other methods

The accuracy of improved models in this work was compared with the results using other methods for SNP predictions, including six pattern discovery algorithms (Align ACE, ANN, Motif Sampler, YMF, Weeder and Projection) and two machine learning techniques (Random Forests and KNNs) (Yan et al., 2007) on flanking features of DNA sequences, one SVM method with CG features at target DNA sites, as well as two unsupervised machine learning techniques (self organizing maps and kmeans) and one supervised method (SVM) based on protein sequences (summarized in Table 2). The accuracies of earlier DNA sequences-based predictions were reported to be around fifty percent, except the result of the SVM method with additional feature as 75.6% (Li et al., 2012a). For the methods utilizing protein sequences, the two unsupervised machine learning techniques generated models with 52.8% and 62.1% in accuracy while the earlier SVM method

72

Interdiscip Sci Comput Life Sci (2015) 7: 65–77

SCYPPred achieved 66.7% in SNP prediction (Li et al., 2012b). In this work, prediction models were trained by SVM with additional protein sequence based features, amino acid type (T1), the number of available codons (T2), PSSM (E1) and phylogenetic entropy (E2). In the results, accuracies range from about 80% to over 90%, with the top 3 being 91.3%, 91.9% and 92.5%, prominently improved compared to those from other methods. In summary, the comparison above may Table 2

indicate two advantages of our method: the supervised SVM training method and the elaborately selected additional features based on protein sequences. Together with the fuzzy set learning method to balance the size of datasets, improved models could predict SNPs in human CYP450 with the highest accuracy by far as we know, which may bring useful information to the polymorphisms studies and personalized drugs.

Comparison with previous SNP prediction approaches

Approaches DNA sequence based

Protein sequence based

Accuracy

Basic sequence features [34]

Sensitivity Fig. 6

2.6

YMF

51.0%

Weeder

50.0%

Projection

49.7%

Random Forests

57.0% 54.0% 75.6%

Sequence features

SOM

52.8%

Sequence features

Kmeans

62.1%

Sequence features [17]

SVM (SCYPPred)

66.7%

Sequence features and evolutionary information

SVM (Improved models here)

91.3%

0.4

S1+S2+T1+E1 S1+S2+T1+T2+E1 S1+S2+T1+T2+E2 0.8

50.1%

KNNs

0.6

1.0

50.2%

MotifSampler

SVM

0.8

0

50.4%

ANN

Sequence features and CG feature [16]

1.0

0.2

Align ACE

0.6 0.4 Specificity

0.2

0

The ROC curves of top 3 models. The blue line is the roc plot of the model from S1+S2+T1+E1 features, the green line is from S1+S2+T1+T2+E1 features and the pink line is from S1+S2+T1+T2+E2 features.

Case study

In order to confirm our improved model, we did SNP prediction against real biological data produced in two

latest published papers by Dai et al. (Dai et al., 2014) and Dodgen et al. (Dodgen et al., 2013). Dai et al. (Dai et al., 2014) detected 20 novel SNPs in CYP2C9 by providing polymorphisms data in Han Chinese populations, and Dodgen et al. (Dodgen et al., 2013) identified 9 SNPs in CYP2D6 by the AmpliChip CYP450 Test to a South African cohort. In Table 3, the 29 new SNP sites reported in these two papers are summarized, and the corresponding prediction results using our selected model (with S1+S2+T1+T2+E2 features). It was encouraging that our predictions are correct on 25 sites among the 29 new sites. More importantly, all the 9 new SNPs sites in the second paper were successfully predicted. These results may be a good evidence of the high accuracy of SNP prediction using our improved model.

3 Materials and Methods 3.1

Datasets

The human CYP2 subfamily (2A6, 2A13, 2B6, 2C8, 2C9, 2C19, 2D6, 2E1, 2F1, 2J2, 2R1, 2S1, 2W1) is associated with most drug metabolism and show a large degree of variations related to the polymorphic drug metabolism. Besides, proteins sequences in CYP subfamilies have high homology, which implies that prediction model built in one subfamily could be applied

Interdiscip Sci Comput Life Sci (2015) 7: 65–77 Table 3

Prediction results on real biological data

Methods

SNP sites

Prediction results

Dai et al. [8]

D49G

SNP

G96A

SNP

G98V

SNP

F110S

SNP

Dodgen et al. [9]

K119R

SNP

R124Q/W

SNP

T130M

SNP

R132W

SNP

A149T

SNP

P163L

SNP

I207T

Non-SNP

I222V

SNP

P227S

Non-SNP

I284V

SNP

T299R

Non-SNP

P317S

SNP

S343R

SNP

L361I

SNP

I387V

SNP

N418T

Non-SNP

P34S

SNP

H94R

SNP

P267H

SNP

P268S

SNP

E278K

SNP

R296C

SNP

H478Q

SNP

S486T

SNP

L91M

SNP

to others. Thus, our research is focused on CYP2 subfamily, with protein sequences downloaded from NCBI database (http://www.ncbi.nlm.nih.gov). From these sequences, we collected 243 SNPs sites according to The Human Cytochrome P450 (CYP) Allele Nomenclature Database (http://www.cypalleles.ki.se) (May 2008), and the rest of 6184 sites were taken as nonSNPs. Then, the original data set (243 SNPs and 6184 non-SNPs) was divided into the training data set (163 SNPs and 6104 non-SNPs) and the test data set (the rest 80 SNPs and 80 non-SNPs). The training dataset was used to optimize the SVM parameters and to train the prediction models based on designed hybrid feature sets, while the test data set was used to evaluate the performance of the models. 3.2

Feature extraction

Protein function is basically determined by its three dimensional structure, implying that neighboring amino acids have impacts on the target site residues.

73

Therefore, the classic amino acid composition and physicochemical properties of flanking sequence (11mers, 5 residues upstream and downstream to the target site each) are used for prediction (S1 and S2, respectively). S1 encodes the amino acid composition of the 11-mer sequence; values 1 to 20 are corresponding to the 20 types of amino acids. S2 encodes seven physicochemical properties of the flanking sequence, including hydrophobicity, charge, polarity, volume, flexibility, isoelectric point and refractivity. For each property, the sum of the values 5 at each site is calculated by function S2 (m) = −5 Pi and the sum of correlation values are computed by func 5  tion S2 (m) = −4 Pi2 − Pi−1 × Pi+1 , where i is the ith site on sub-sequence, Pi is the property value at site i, m is the target site for SNP prediction. Together with neighborhood effect, we adopted amino acid type of target site (T1), and firstly put forward the number of available codons of target site (T2) as additional features to predict SNPs. T1 encodes types of target residues, with H, R, K, as positively, E, D as negatively charged, Q, T, S, N, C, Y, W as polar and G, F, L, M, A, I, P, V as hydrophobic. T2 encodes the number of available codons of target residues based on the theory of degeneracy. Code degeneracy refers to the fact that more than one codon can be used to match the same amino acid during gene translation, due to the larger number of codons than that of amino acids. Only 21 codes are required for 20 amino acids and the stop codon, but four types of nucleotides A, U, G and C gives 43 =64 possible codons in the triplet code. There are 3 amino acids (serine, leucine, arginine) encoded by six different codons, 5 amino acids (glycine, alanine, proline, threonine, valine) encoded by four codons, 1 amino acid (isoleucine) encoded by three codons, 9 amino acids (cysteine, asparagine, aspartic acid, lysine, glutamine, glutamic acid, histidine, phenylalanine, tyrosine) encoded by two codons, two amino acids (methionine, tryptophane) encoded by one codon, as well as 3 stop condons. In feature T2, the number of available codons (1, 2, 3, 4 and 6) is used to describe the degeneracy of target amino acids. Considering the close relationship between SNPs and evolution, we exploited evolutionary information (E1 and E2) to identify SNP sites. E1 are the values from position-specific scoring matrix (PSSM), obtained by conducting PSI-BLAST search against the nonredundant database of protein sequences compiled by NCBI (Altschul et al., 1997). For each sequence, the original PSSM profile is a matrix of n × 20 elements, where n is the length of protein sequence and 20 is 20 types of amino acids (including the one already in the protein sequence). Each element on position (i, j) in this matrix is the score of the amino acid type j at site i, which represents the evolutionary information, the log-likelihoods between 20 different amino acids at

74

Interdiscip Sci Comput Life Sci (2015) 7: 65–77

each site in a protein sequence. In addition, to incorporate the dependency and correlation of the neighbor residues to the target one, smoothed encoding scheme was adopted by adding evolutionary information of surrounding residues within a sliding window. That is, for a residue at position i on protein sequence, a feature matrix Vi = w × 20 was obtained. When the size of sliding window w equals 11, Vi includes 5 rows upstream and 5 rows downstream to the ith row (in the bigger red rectangle of the original PSSM profile). For the N-terminal and C-terminal of a protein, ZERO vectors, which consist of 20 zero elements, are appended to the head or tail of a PSSM profile (Cheng et al., 2008). Finally, the smoothed PSSM was normalized by the sigmoid function f (x) = 1/(1 + e−x ). E2 is the phylogenetic entropy (or information content), a feature evaluates possibility of mutations at target sites, which is derived from Shannon’s entropy and termed as Ci , Ci = f (i)lnf (i), where f (i) is the SIFT value at position i. Here, SIFT (Sorting Intolerant From Tolerant) value represents the degree of conservation of amino acid residues in sequence alignments (PSI-BLAST) (Ng and Henikoff 2001). Based on the amino acids appearing at each position in the alignment, SIFT gives scores between 0 and 1 to show the weighted average of the observed amino acid frequencies in the alignment and estimated unobserved frequencies (Komar 2007). 3.3

Datasets balancing by fuzzy set method

Our datasets have 243 SNPs and 6184 non-SNPs (details are in dataset part). Hence, the SNPs and nonSNPs datasets are extremely unbalanced between positive and negative examples with a ratio of about 1:26. In SVM training, the unbalance problem usually leads to a learning bias to the majority because algorithms tend to be overwhelmed by the large datasets and ignore the small ones. As a result, the classifier canImbalance Non-SNPs(A) majority class

Step 1: Build up Gaussian membership function α-cut

not give the correct partition between the two classes. Under-sampling is a common approach to reduce the large dataset by choosing some of its data to match the small one, at the risk of losing the full characteristics of majority dataset. There are many methods based on this idea, such as self organizing map, neural network and modified boosting procedure called DataBoost (Ma et al., 2011). Another widely used method is over-sampling, which duplicates small dataset several times to get a large number of data. But the duplicated data may lead to over-fitting (Schierz 2009). Group training is an under-sampling method, in which the large dataset is divided into several small ones, and each divided dataset was used for the model training together with the small dataset. Some other approaches generate sample according to distribution of datasets (Philip K. Chan 2001; Rong Yan 2003). This type of method is more powerful and meaningful than under-sampling and over-sampling, but it is limited by the pre-knowledge of the underlying distribution of datasets. In this work, “fuzzy set method” was carried out with the workflow illustrated in Fig. 7. Instead of assuming the distributions of majority and minority of the dataset, this learning method introduces fuzzy set theory (Li et al., 2010) to build up the possible fuzzy membership functions: for the larger dataset, based on the central limit theory in statistics Gaussian function is applied to reduce the size of the dataset using α-cut; for the smaller dataset, mega-trend diffusion (MTD) function proposed by Li et al. (Li et al., 2007) is introduced to amplify the size of the dataset. After balancing the datasets, a step called attribute extension was used to collect the omitted information during the reduction of the majority dataset and to create new attributes computed by the corresponding fuzzy membership functions of majority and minority dataset. Balance Reducing data size from A to C

Step 3: Merging

Data sets SNPs(B) minority class

Fig. 7

Step 2: Build up MTD membership function Virtual sample generation

Attribute extension

SVM model building

Increasing data size from B to C

The overflow of the fuzzy set method for balancing datasets.

The performance of fuzzy set learning method was compared with that of group training method using the same datasets with the simplest feature combination

only including flanking sequence features, S1 and S2. The prediction using group training method achieves sensitivity, specificity and accuracy at 65.57%, 66.32%

Interdiscip Sci Comput Life Sci (2015) 7: 65–77

and 66.31%, respectively. Whereas using fuzzy set balancing method, they are improved to 82.5%, 67.5%, and 75%. As shown in Fig. 8, it is obvious that fuzzy set balancing method is superior to group training, especially in sensitivity and accuracy.

Independent test performance/%

100

Accuracy =

80

70

60

Sensitivity

Specificity

Accuracy

The comparison of various prediction models using different strategies of balancing datasets.

The over-fitting problem was also taken into consideration in our predictions. This problem is mainly caused by two reasons: small dataset or over-learning leaded by too many features. In this work, the original datasets have 243 SNPs (positive samples) and 6184 non-SNPs (negative samples), and the datasets after fuzzy set balancing have almost 800 positive samples and 800 negative samples. In either situation, the dataset is large enough in general. The second possible reason might be a problem in models using feature E1, which introduces 220 additional dimensions. But the selected best prediction model using hybrid feature set S1+S2+T1+T2+E2 only have 28 dimensions, not inclined to lead to over-fitting problem. 3.4

TP represents the true positive, TN is the true negative, FP is the false positive, and FN is the false negative. Besides the threshold dependent measurements above, ROC (Receiver Operating Characteristic Curve) diagram was also used to provide a clear picture of the performances of top models. In this two-dimensional graph, the true positives rate is plotted on the y-axis and the false positives rate is plotted on the x-axis, so as to describe the relative tradeoffs between benefits (true positive) and costs (false positive). At last, the AUC (area under the ROC curve) is calculated to directly compare the overall performances of the models.

Acknowledgments We thank Michael Brent from Center for Genome Sciences and Systems Biology, Washington University in St. Louis for supporting this work and providing computational cluster. We thank Zeke Maier, Brian Haynes, Holly Brown and other members in Brent lab for helpful discussion and assistance. We thank Mengcheng Pu, Jingfang Wang and Tao Zhang for the help on cytochrome P450 background information.

SVM implementation

SVM was an effective method in non-liner regressions and binary classifications. It put data into a high-dimensional space by kernel functions, and then seeks a hyperplane that differentiates the two classes with maximal margin and minimal error. A public SVM package, LIBSVM (Chang and Lin 2001) (available at http://www.csie.ntu.edu.tw/∼cjlin/) was used to carried out all experiment analyses in this study. Radial basis function (RBF) kernel was selected as the kernel function and weight is modulated to 25. Parameters c and g were optimized by the program grid.py. 3.5

5-fold cross-validation was adopted by dividing the training dataset into five equal subgroups, and then each subgroup was regarded as the validation set in turn, with the other four subgroups as the training set in turn. To evaluate the classification performance, the measures of accuracy, sensitivity and specificity could be calculated as follows: TP + TN TP + FN + TN + FP TP Sensitivity = TP + FN TN Specif icity = TN + FP

Learning method 90

50

Fig. 8

Group training

75

Performance Measurements

Multiple methods were utilized to evaluate the predictive performance of the constructed models. Firstly,

References [1] Adzhubei, I.A., Schmidt, S., Peshkin, L., Ramensky, V.E., Gerasimova, A., Bork, P., Kondrashov, A.S., Sunyaev, S.R. 2010. A method and server for predicting damaging missense mutations. Nat Methods, 7 (4): 248-249. [2] Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25 (17): 3389-3402. [3] Bhattacharyya, M., Feuerbach, L., Bhadra, T., Lengauer, T., Bandyopadhyay, S. 2012. MicroRNA transcription start site prediction with multi-objective feature selection. Stat Appl Genet Mol Biol, 11 (1): Article 6.

76 [4] Buske, O.J., Manickaraj, A., Mital, S., Ray, P.N., Brudno, M. 2013. Identification of deleterious synonymous variants in human genomes. Bioinformatics, 29 (15): 1843-1850. [5] Castle, J.C. 2011. SNPs occur in regions with less genomic sequence conservation. PLoS One, 6 (6): e20660. [6] Chang, C., Lin, C. 2001. LIBSVM: a library for support vector machines. LIBSVM software website. Available: http://www.csie.ntu.edu.tw/∼cjlin/ libsvm/. Accessed 2011 May 2. [7] Cheng, C.W., Su, E.C., Hwang, J.K., Sung, T.Y., Hsu, W.L. 2008. Predicting RNA-binding sites of proteins using support vector machines and evolutionary information. BMC Bioinformatics, 9 Suppl 12: S6. [8] Dai, D.P., Xu, R.A., Hu, L.M., Wang, S.H., Geng, P.W., Yang, J.F., Yang, L.P., Qian, J.C., Wang, Z.S., Zhu, G.H., Zhang, X.H., Ge, R.S., Hu, G.X., Cai, J.P. 2014. CYP2C9 polymorphism analysis in Han Chinese populations: building the largest allele frequency database. The pharmacogenomics journal, 14 (1): 8592. [9] Dodgen, T.M., Hochfeld, W.E., Fickl, H., Asfaha, S.M., Durandt, C., Rheeder, P., Drogemoller, B.I., Wright, G.E., Warnich, L., Labuschagne, C., van Schalkwyk, A., Gaedigk, A., Pepper, M.S. 2013. Introduction of the AmpliChip CYP450 Test to a South African cohort: a platform comparative prospective cohort study. BMC Med Genet, 14: 20.

Interdiscip Sci Comput Life Sci (2015) 7: 65–77 [17] Li, L., Wei, D.Q., Wang, J.F., Chou, K.C. 2012b. SCYPPred: a web-based predictor of SNPs for human cytochrome P450. Protein Pept Lett, 19 (1): 57-61. [18] Ma, C., Wang, L., Xie, X.Q. 2011. Ligand Classifier of Adaptively Boosting Ensemble Decision Stumps (LiCABEDS) and its application on modeling ligand functionality for 5HT-subtype GPCR families. J Chem Inf Model, 51 (3): 521-531. [19] McCarthy, J.J., Hilfiker, R. 2000. The use of singlenucleotide polymorphism maps in pharmacogenomics. Nat Biotechnol, 18 (5): 505-508. [20] McGraw, J., Waller, D. 2012. Cytochrome P450 variations in different ethnic populations. Expert Opin Drug Metab Toxicol, 8 (3): 371-382. [21] Ng, P.C., Henikoff, S. 2001. Predicting deleterious amino acid substitutions. Genome Res, 11 (5): 863874. [22] Pairo, E., Maynou, J., Marco, S., Perera, A. 2012. A subspace method for the detection of transcription factor binding sites. Bioinformatics, 28 (10): 1328-1335. [23] Pers, T.H., Timshel, P., Hirschhorn, J.N. 2015. SNPsnap: a Web-based tool for identification and annotation of matched SNPs. Bioinformatics, 31 (3): 418-420. [24] Philip K. Chan , S.J.S. 2001. Toward Scalable Learning with Non-uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining: 164-168.

[10] Hirschhorn, J.N., Daly, M.J. 2005. Genome-wide association studies for common diseases and complex traits. Nat Rev Genet, 6 (2): 95-108.

[25] Ramensky, V., Bork, P., Sunyaev, S. 2002. Human non-synonymous SNPs: server and survey. Nucleic Acids Res, 30 (17): 3894-3900.

[11] Johnson, A.D., Handsaker, R.E., Pulit, S.L., Nizzari, M.M., O’Donnell, C.J., de Bakker, P.I. 2008. SNAP: a web-based tool for identification and annotation of proxy SNPs using HapMap. Bioinformatics, 24 (24): 2938-2939.

[26] Rong Yan, Y.L., Rong Jin 2003. On predicting rare classes with SVM ensembles in scene classification. in: 2003 IEEE International Conference, 3: III21- III24.

[12] Komar, A.A. 2007. Silent SNPs: impact on gene function and phenotype. Pharmacogenomics, 8 (8): 10751080. [13] Kumar, P., Henikoff, S., Ng, P.C. 2009. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc, 4 (7): 1073-1081. [14] Li, D.C., Liu, C.W., Hu, S.C. 2010. A learning method for the class imbalance problem with medical data sets. Comput Biol Med, 40 (5): 509-518. [15] Li, D.C., Wu, C.S., Tsai, T.I., Lina, Y.S. 2007. Using mega-trend-diffusion and artificial samples in small data set learning for early flexible manufacturing system scheduling knowledge. Computers and Operations Research, 34: 966-982. [16] Li, L., Chen, Q., Wei, D.Q. 2012a. Prediction and functional analysis of single nucleotide polymorphisms. Curr Drug Metab, 13 (7): 1012-1023.

[27] Schierz, A.C. 2009. Virtual screening of bioassay data. J Cheminform, 1: 21. [28] Schmeier, S., Jankovic, B., Bajic, V.B. 2011. Simplified method to predict mutual interactions of human transcription factors based on their primary structure. PLoS One, 6 (7): e21887. [29] Shi, S.P., Qiu, J.D., Sun, X.Y., Suo, S.B., Huang, S.Y., Liang, R.P. 2012. PLMLA: prediction of lysine methylation and lysine acetylation by combining multiple features. Mol Biosyst, 8 (5): 1520-1527. [30] Sim, N.L., Kumar, P., Hu, J., Henikoff, S., Schneider, G., Ng, P.C. 2012. SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res, 40 (Web Server issue): W452-457. [31] Wang, L., Spira, B., Zhou, Z., Feng, L., Maharjan, R.P., Li, X., Li, F., McKenzie, C., Reeves, P.R., Ferenci, T. 2010. Divergence involving global regulatory gene mutations in an Escherichia coli population evolving under phosphate limitation. Genome Biol Evol, 2: 478-487.

Interdiscip Sci Comput Life Sci (2015) 7: 65–77 [32] Xiong, Y., Liu, J., Wei, D.Q. 2011a. An accurate feature-based method for identifying DNA-binding residues on protein surfaces. Proteins, 79 (2): 509-517. [33] Xiong, Y., Xia, J., Zhang, W., Liu, J. 2011b. Exploiting a Reduced Set of Weighted Average Features to Improve Prediction of DNA-Binding Residues from 3D Structures. PLoS One, 6 (12): e28440.

77 [34] Yan, R., Boutros, P.C., Jurisica, I., Penn, L.Z. 2007. Comparison of machine learning and pattern discovery algorithms for the prediction of human single nucleotide polymorphisms. Grc: 2007 IEEE International Conference on Granular Computing, Proceedings: 452-457.

Improved feature-based prediction of SNPs in human cytochrome P450 enzymes.

Single nucleotide polymorphisms (SNPs) make up the most common form of mutations in human cytochrome P450 enzymes family, and have the potential to br...
643KB Sizes 3 Downloads 8 Views