Interdiscip Sci Comput Life Sci (2013) 5: 241–246 DOI: 10.1007/s12539-013-0151-3

Identifying Novel Oncogenes: A Machine Learning Approach

Ambuj KUMAR1†, Vidya RAJENDRAN1†, Rao SETHUMADHAVAN1, Rituraj PUROHIT1,2∗

1 (Bioinformatics Division, School of Bio Sciences and Technology, Vellore Institute of Technology University, Vellore 632014, Tamil Nadu, India)
2 (Human Genetics Foundation, Torino, via Nizza 52, I-10126 Torino, Italy)

Received 4 November 2012 / Revised 17 December 2012 / Accepted 7 January 2013

Abstract: Genome sequencing has flooded the databases with a huge amount of SNP data. Although the number of detected single nucleotide polymorphisms (SNPs) is rising exponentially, characterization techniques still lag behind. Implementing computational platforms to determine the pathogenicity associated with SNPs can provide a probable solution to this problem. To improve the prediction quality of SNP characterization methods, we implemented a support vector machine classification method. A total of 557 non-synonymous amino acid variants were collected from CENP family proteins, excluding CENPE. A multivariate simulation of the associated changes in biological phenomena for each SNP was computed through available SNP analysis platforms. A support vector model was designed using a training dataset, and the raw classification data were subjected to the classification hyperplane. We observed multiple lines of evidence for cancer-associated genetic mutations in the CENPI, CENPJ, CENPK, CENPL and CENPX proteins. The former four proteins showed positive hits in the COSMIC database for mutations in tumour samples, but CENPX has not previously been reported for cancer-associated outcomes. Since CENPX has only recently been classified and not much functional and pathological insight has been gained, the results obtained in this study will serve as a starting point for future cancer research on the CENPX protein.

Key words: cancer, centromere, machine learning, single nucleotide polymorphism, CENPX.

† These authors contributed equally to the paper. ∗ Corresponding author. E-mail: [email protected]

1 Introduction

Aneuploidy and chromosomal instability (CIN) are now recognized as hallmarks of most solid tumours (Kumar and Purohit, 2012a). Abnormal chromosomal segregation caused by centromere instability, errors in mitotic checkpoint control and defective localization of checkpoint proteins at the kinetochore is the leading cause of CIN. Centromere instability is characterized by the dynamic formation of centromere breaks, deletions, isochromosomes and translocations, which are commonly observed in cancer (Kumar and Purohit, 2012b). Mutations in centromere protein coding genes have been observed to promote aneuploidy and tumorigenesis (Kops et al., 2005; Baker et al., 2005). Significant progress has been made in recent years in identifying centromere proteins and elucidating their roles in cancer. Deregulation of the activity of centromere proteins such as CENPH and CENPF has previously been studied for its association with cancers (Guo et al., 2008; Cao et al., 2010). Our previous work has also contributed towards detecting cancer-associated mutations in the centromere protein CENP-E through computational investigations. Other studies have also suggested a role for centromere proteins in inducing cancer-associated phenotypes. Over time, the growing number of cancer-associated genetic mutations reported in centromere-associated proteins will further strengthen the understanding of their association with cancer.

Proteins involved in spindle checkpoint maintenance, spindle-kinetochore assembly, chromosomal movements and other cell cycle progression pathways have been progressively reported in various cancer cases (Kumar and Purohit, 2012b). They form the most central component of cell cycle progression and genome stability. Further, they serve as an epigenetic mark that propagates centromere identity through replication and cell division, organize arrays of centromere satellite DNA into a higher order structure, and couple chromosome position to microtubule depolymerizing activity (Sekulic et al., 2010; Hu et al., 2011). After the completion of the Human Genome Project, several novel centromere-associated protein coding genes have been reported, many of which are yet to be classified in terms of their pathological outcomes. Given the significance of centromere-associated proteins in regulating cell cycle progression and the growing evidence of their involvement in multiple cancer cases, it is sensible to classify each of them for its corresponding pathological outcomes. CENP family proteins make up a major share of the centromere proteins. Proteins such as CENP-E, CENP-H, CENP-F and CENP-J have significant positive hits in the COSMIC database for cancer-associated mutations (Bamford et al., 2004). Classifying these genes for all possible disease-associated outcomes would require a huge amount of time, effort and cost. Thus, on the basis of previously obtained clinical and wet-lab data, it is feasible to detect their disease-associated potential using statistical and machine learning approaches.

Advances in high-throughput genotyping and next generation sequencing have generated a vast amount of human genetic variation data. Single nucleotide polymorphisms (SNPs) within protein coding regions are of particular importance owing to their potential to give rise to amino acid substitutions that affect protein structure and function, which may ultimately lead to a disease state (Kumar and Purohit, 2012a, 2012b and 2012c; Kumar et al., 2012 and 2013). An SNP is a variation at a reference site from one nucleotide base to another. Non-synonymous SNPs occurring in coding regions result in single amino acid polymorphisms (SAPs) that may affect protein function and lead to pathogenic phenotypes. They have the potential to alter the function of their corresponding protein, either directly or via disruption of structure (Purohit et al., 2008; Purohit and Sethumadhavan, 2009; Purohit et al., 2011a and 2011b; Rajendran et al., 2012; Kamaraj and Purohit, 2013a and 2013b; Gulati et al., 2012; Pandey et al., 2013; Rajendran and Sethumadhavan, 2014). Hence they are of particular interest as candidates for further experimental assessment.
About 41,744,328 human SNPs (rs entries) have been validated and submitted to the NCBI dbSNP database (Sherry et al., 2001). Most of them are still uncharacterized in terms of their disease-causing potential. The future of SNP analysis lies largely in the development of personalized medicines that can facilitate the treatment of disorders induced by genomic variations. Currently, most molecular studies focus on SNPs located in coding and regulatory regions, yet many of these studies have been unable to detect significant associations between SNPs and disease susceptibility. To develop a coherent approach for prioritizing SNP selection for genotyping in molecular studies, we applied a machine learning approach to SNP screening. Our hypothesis was that predicting pathogenic SNPs by compiling the multiple data patterns obtained from various SNP characterization platforms can provide a significant level of prediction accuracy. Thus, a machine learning platform can be an effective way to predict pathological outcomes using a range of the best available data patterns. This study reports an effective method to classify cancer-associated amino acid variants in proteins. Further, it shows how novel oncogenes can be detected using the model proposed in this work.

2 Materials and methods

2.1 SIFT evolutionary conservation score

SIFT is a sequence homology-based program that uses evolutionary conservation scores to predict the effect of amino acid substitutions in the gene coding region (Kumar et al., 2009). It identifies highly conserved amino acids and calculates the tolerance index of a particular substitution. The prediction is based on the degree of conservation of amino acid residues in sequence alignments derived from closely related sequences collected through PSI-BLAST (Altschul et al., 1997). Furthermore, it scans individual positions of the sequence and calculates the conservation probability of a particular residue for all possible substitutions, which is recorded in a scaled probability matrix. Generally, a highly conserved position is intolerant to most substitutions and yields a SIFT score ≤ 0.05, whereas a poorly conserved position can tolerate most substitutions and yields a SIFT score > 0.05. The prediction accuracy of the SIFT program is 88.3-90.6% (Dai and Cogswell, 2003) when tested with different datasets of human variants. A total of 36 naturally occurring non-synonymous exonic polymorphisms, and the substitutions showing a SIFT score ≥ 0.05, were further analyzed using structure-based evaluation programs.
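As an illustration of the cut-off described above, the following minimal Python sketch labels a substitution from its SIFT score. The function name `classify_sift` and the example scores are hypothetical, and treating a score of exactly 0.05 as intolerant is an assumption rather than something this section states.

```python
SIFT_THRESHOLD = 0.05  # cut-off quoted in the text


def classify_sift(score, threshold=SIFT_THRESHOLD):
    """Label a substitution from its SIFT score (assumed to lie in [0, 1])."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores are expected to lie in [0, 1]")
    # Assumption: a score equal to the threshold is treated as intolerant.
    return "deleterious" if score <= threshold else "tolerated"


# Hypothetical scores, for illustration only.
for example_score in (0.00, 0.02, 0.05, 0.40):
    print(example_score, classify_sift(example_score))
```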

2.2 Prediction of damaging nsSNPs using Polyphen server

The Polyphen program (Ramensky et al., 2002) was used to analyze the damaging probability of the nsSNPs. Sequence annotations are taken from the SWALL database. Polyphen uses the Coils2 program (Lupas et al., 1991) to predict coiled-coil regions and the SignalP program to predict signal peptide regions of the protein sequences (Ramensky et al., 2002). It computes the absolute value of the difference between the substitution scores of both allele variants at the polymorphic position. Polyphen uses the BLAST algorithm to derive the conservation score of the contig nucleotides and their possible functional roles (Calabrese et al., 2009). The predictions are based on four structure and sequence annotation based parameters: sequence annotation, sequence prediction, multiple alignment and structure. The structural analysis was carried out by buried site prediction and by examining covalent and non-covalent bond formation. The Polyphen program determines the functional impact by analyzing four key features: signal peptide, trans-membrane, ligand binding and protein interaction.

2.3 Support vector machine based PhD-SNP tool to detect disease-associated SNPs

PhD-SNP is a support vector machine based classifier that uses a supervised learning approach to classify disease-causing point mutations from the given datasets (Capriotti et al., 2006). For a given mutation, the substitution from the wild-type residue to the mutant residue is encoded in a 20-element vector that has -1 in the position corresponding to the wild-type residue, 1 in the position corresponding to the mutant residue and 0 in the remaining 18 positions. A second 20-element vector encoding the sequence environment is built by reporting the occurrence of the residues in a window of 19 residues around the mutated residue (Capriotti et al., 2006). A total of 8 nsSNPs that were reported to be deleterious by both the SIFT and Polyphen programs were used for the analysis. Two nsSNPs were reported by the PhD-SNP program to be associated with disease.
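The two input vectors described above can be sketched in Python as follows. This is an illustrative reading of the encoding, not the PhD-SNP source code: the residue ordering in `AMINO_ACIDS`, the function names, the toy sequence, the centring of the 19-residue window on the mutated position (mutated residue included) and its truncation at the sequence ends are all assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def encode_substitution(wild, mutant):
    """20-element vector: -1 at the wild-type residue, +1 at the mutant, 0 elsewhere."""
    if wild == mutant:
        raise ValueError("wild-type and mutant residues must differ")
    vector = [0] * len(AMINO_ACIDS)
    vector[AMINO_ACIDS.index(wild)] = -1
    vector[AMINO_ACIDS.index(mutant)] = 1
    return vector


def encode_environment(sequence, position, window=19):
    """20-element vector counting residues in a window centred on `position`.

    `position` is a 0-based index; the window is truncated at the sequence ends.
    """
    half = window // 2
    start = max(0, position - half)
    end = min(len(sequence), position + half + 1)
    counts = [0] * len(AMINO_ACIDS)
    for residue in sequence[start:end]:
        counts[AMINO_ACIDS.index(residue)] += 1
    return counts


# Hypothetical 22-residue toy sequence and mutation site.
toy_sequence = "MKTAYIAKQRQISFVKSHFSRQ"
substitution = encode_substitution("A", "V")
environment = encode_environment(toy_sequence, position=10)
print(substitution)
print(environment)
```

The two vectors would then be concatenated into a 40-element input for the classifier.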

2.4 Predicting disease associated mutation using Pmut

Pmut is a software tool for the annotation and prediction of pathological mutations. It uses different kinds of sequence information to label mutations, and neural networks to process this information (Ferrer-Costa et al., 2005). It provides a very simple output: a yes/no answer and a reliability index. The reported cross-validated performance of the method is an 84% overall success rate and a 67% improvement over random.

2.5 Predicting cancer associated mutations

Dr Cancer is a disease-specific machine learning approach to predict whether a non-synonymous SNP is related to cancer. The implemented support vector machine (SVM) method was trained on a set of 3174 disease-related mutations from 183 cancer-related proteins and an equal number of neutral polymorphisms. The cancer-specific method, tested with a 20-fold cross-validation procedure, achieves 75% overall accuracy, a correlation coefficient of 0.50 and an area under the ROC curve of 0.82.
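To make the quoted performance figures concrete, the sketch below computes overall accuracy and a correlation coefficient from a confusion matrix. It assumes that the correlation coefficient is the Matthews correlation coefficient, which the text does not state, and the confusion counts are hypothetical values for a balanced test set.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_correlation(tp, tn, fp, fn):
    """Matthews correlation coefficient of a binary confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts: 100 cancer and 100 neutral variants, symmetric errors.
tp, tn, fp, fn = 75, 75, 25, 25
print(accuracy(tp, tn, fp, fn), matthews_correlation(tp, tn, fp, fn))
```

With these hypothetical counts the functions return 0.75 and 0.5, which shows that an accuracy of 75% and a correlation of 0.50 are mutually consistent on a balanced set with symmetric errors.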

2.6 Machine learning algorithm

Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as sensor data or databases. A learner can take advantage of examples (data) to capture characteristics of interest of their unknown underlying probability distribution. Data can be seen as examples that illustrate relations between observed variables. Here, a support vector machine classification method was implemented to design the classification system and then to compile the SNP datasets and examine their cancer association probability.

A support vector machine (SVM) is a concept in statistics and computer science for a set of related supervised learning methods that analyze data and recognize patterns, used for classification and regression analysis. The standard SVM takes a set of input data and predicts, for each given input, which of two possible classes the input belongs to, making the SVM a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.

More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression or other tasks. The hyperplanes in the higher-dimensional space are defined as the set of points whose inner product with a vector in that space is constant. The vectors defining the hyperplanes can be chosen to be linear combinations, with parameters $\alpha_i$, of images of feature vectors that occur in the database. Here, consider the training data $D = \{(x_i, y_i) \mid x_i \in \mathbb{R}^p,\ y_i \in \{1, 0\}\}$, $i = 1, \dots, n$, where $y_i$ is either 1 or 0, indicating the class to which the point $x_i$ belongs, and each $x_i$ is a $p$-dimensional real vector. On the basis of these values the separating hyperplane is constructed, which is then used for classification.
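
The hyperplane description above can be written out explicitly. The following is a sketch under the usual kernel-SVM conventions rather than a formula given in this paper: $\phi$ denotes an assumed feature map into the higher-dimensional space, $K(x, x') = \langle \phi(x), \phi(x') \rangle$ the corresponding kernel, $w$ the vector defining the hyperplane and $c$ the constant. Which side of the hyperplane is coded as class 1 is also an assumption.

```latex
\begin{aligned}
w &= \sum_{i=1}^{n} \alpha_i \, \phi(x_i), \\
\langle w, \phi(x) \rangle &= \sum_{i=1}^{n} \alpha_i \, K(x_i, x) = c
  \quad \text{(points on the hyperplane)}, \\
\hat{y}(x) &=
  \begin{cases}
    1, & \sum_{i=1}^{n} \alpha_i \, K(x_i, x) > c, \\
    0, & \text{otherwise.}
  \end{cases}
\end{aligned}
```
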
The SVM classification system was applied according to the methodology provided in the LIBSVM package (Chang and Lin, 2011). The selected SVM kernel is a radial basis function (RBF) kernel, $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$, and the parameters $\gamma$ and $C$ were optimized by performing a grid search.
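
A minimal Python sketch of the kernel and of a parameter grid is given below. It does not call LIBSVM; the function names, the base-2 exponent ranges and the two example vectors are assumptions chosen for illustration, since the grid actually searched is not reported here.

```python
import itertools
import math


def rbf_kernel(x_i, x_j, gamma):
    """RBF kernel K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


def parameter_grid(c_exponents, gamma_exponents):
    """All (C, gamma) pairs on a base-2 exponential grid."""
    return [(2.0 ** c, 2.0 ** g)
            for c, g in itertools.product(c_exponents, gamma_exponents)]


# Hypothetical exponent ranges, chosen only for illustration.
grid = parameter_grid(range(-5, 16, 2), range(-15, 4, 2))

# Two hypothetical 5-dimensional feature vectors.
a = [0.00, 0.90, 1.0, 0.80, 0.60]
b = [0.40, 0.10, 0.0, 0.20, 0.30]
print(len(grid), rbf_kernel(a, a, gamma=0.5), rbf_kernel(a, b, gamma=0.5))
```

In a grid search, each (C, γ) pair would be scored by cross-validation on the training set and the best-scoring pair retained.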

3 Results and discussion

Predicting cancer-associated SNPs using computational platforms requires coverage of a wide range of biological phenomena, especially those accounting for pathological outcomes. In our previous studies we used molecular dynamics simulation to examine the phenotypic changes induced by computationally predicted deleterious and damaging SNPs (Kumar and Purohit, 2012a, 2012b and 2012c; Kumar et al., 2012). Although our previous work demonstrated a good level of accuracy, further improvements are still required. Molecular dynamics simulation is a viable way to determine the effect of computationally detected disease-associated nsSNPs on protein phenotype, but when structural coordinates are absent, elucidating the conformational changes becomes challenging. Furthermore, examining the phenotypic effects of a large number of computationally detected disease-associated nsSNPs on their corresponding protein structures is very time consuming. Thus, to keep prediction accuracy high, we applied a wide range of computational approaches and finally optimized the predictions with a machine learning algorithm.

First, we collected 557 non-synonymous SNPs corresponding to the CENP family proteins from the dbSNP database. CENP-E was excluded because our previous work had already identified one point mutation in it that showed a highly damaging effect on the protein phenotype and was reported as a highly damaging cancer-associated mutation. The SIFT and Polyphen2 tools were then used to examine the damaging and deleterious properties of the SNP dataset, and the PhD-SNP and Pmut servers were used to examine the disease-association probabilities. Together these tools cover a wide range of biological phenomena and thus raise the overall prediction accuracy. Since these tools do not provide calculations specific to cancer-associated outcomes, we further used the Dr Cancer tool, which helped to separate the non-cancer mutations from those that showed positive hits for cancer. Dr Cancer is based on a machine learning algorithm, and its prediction model has been trained on cancer-associated and neutral mutations, which adds reliability to the predictions.
By applying the SIFT, Polyphen2, PhD-SNP, Pmut and Dr Cancer tools together, we obtained a set of nsSNPs that were consistently predicted to be deleterious, damaging, disease associated and cancer associated. Each of these computational platforms bases its predictions on its own calculations, which in turn depend on the changes observed in a diverse range of biological phenomena. Hence, applying the machine learning algorithm to the datasets obtained from all these tools provides final predictions that take the scores from all of them into account. Initially we collected the scores from SIFT, Polyphen2, PhD-SNP, Pmut and Dr Cancer for 500 cancer-associated mutations obtained from the COSMIC database and an equal number of neutral mutations from the UniProt and SwissVar databases. The prediction obtained from PhD-SNP was coded as +1 for a disease inference and 0 for neutral. Altogether, 5 variables were used for the binary classification of cancer outcomes for the input SNP data. A support vector classification model was well suited to this purpose, since predicting cancer-related SNPs requires a binary output of true or false.

The results obtained from the developed classification system pointed towards the cancer-causing potential of 5 CENP family genes that had rarely been reported to cause cancer (Table 1). In particular, the A49M mutation in CENPX showed a high cancer-association probability. In terms of the frequency of predicted cancer-associated mutations, CENPJ was the most significant. CENPJ had previously been reported in several cases of microcephaly, and a few positive hits for cancer mutations were also available in the COSMIC database, reporting its presence in tumour samples of adenocarcinoma of the large intestine (P1174S, M1334I, T1246M, F948F, E602E, R378Q, K368K, L342F, K92N, R45Q), malignant melanoma (P1103L, R1039W), serous carcinoma of the ovary (L575S, E212D, N123Y) and prostate carcinoma (T208T). This showed that the CENPJ mutations had a positive association with cancer cases.

Table 1 Cancer associated nsSNPs in CENP protein family

Gene    Rs ID         Mutation   SIFT   Polyphen   PhD-SNP   Pmut     SEQPROOF
CENPI   rs184049547   S442R      0.00   0.996      Disease   0.8668   0.544
CENPJ   rs149855336   T1218M     0.00   1.912      Disease   0.5049   0.662
        rs113104867   T1269I     0.01   1.912      Disease   0.5793   0.651
        rs144251950   T1307I     0.02   1.912      Disease   0.6238   0.664
CENPK   rs147863579   R248H      0.01   0.589      Disease   0.8449   0.846
CENPL   rs147875545   A263D      0.05   0.817      Disease   0.9104   0.747
        rs191005165   S289L      0.01   0.951      Disease   0.6455   0.564
CENPX   rs11555480    A49M       0.00   0.989      Disease   0.8326   0.682

Moreover, the E1235V mutation in CENPJ had shown a significant loss in STIL protein binding activity and disrupted the centriole biogenesis process, which is in turn a major cause of aneuploidy and cancer progression. In this work we report three mutations, T1218M, T1269I and T1307I, which showed a very high probability of causing cancer in humans. All these mutations are located in the C-terminal region of CENPJ, which is essential for its binding to the CEP152 and STIL proteins and thus for regulating the centriole biogenesis process.

Furthermore, CENPI, CENPK and CENPL showed positive hits in the COSMIC database for their corresponding mutations in tumour samples. Mutations in CENPI had previously been reported in tumour samples of breast carcinoma (S710F, T546P, L550V, T546P, L550V), adenocarcinoma of the large intestine (S143Y), lymphoid neoplasm (K625N), malignant melanoma (L474F) and squamous cell carcinoma (K210N). A few positive hits were also observed for CENPK and CENPL in tumour samples of large intestine adenocarcinoma tissue. No positive hits for cancer-associated mutations were obtained for the CENPX protein in the COSMIC database. In this work we present, for the first time, the A49M mutation as the most likely to induce a cancer-associated phenotype. SNPs in other CENP family genes were filtered out and did not show any positive correlation with cancer-associated outcomes.
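
The five-variable input described in this section can be sketched in Python as follows. The function name, the order of the variables, the choice of 1 for the cancer class and the mapping of the last column of Table 1 to the Dr Cancer score are assumptions made for illustration; only the +1/0 coding of the PhD-SNP prediction and the 500 + 500 training mutations come from the text.

```python
def to_feature_vector(sift, polyphen, phd_snp, pmut, dr_cancer):
    """Build the 5-variable input for one nsSNP.

    `phd_snp` is the textual PhD-SNP call; it is coded as 1.0 for "Disease"
    and 0.0 otherwise, following the coding described in the text.
    `dr_cancer` is assumed to be the value in the last column of Table 1.
    """
    phd_flag = 1.0 if phd_snp.strip().lower() == "disease" else 0.0
    return [float(sift), float(polyphen), phd_flag, float(pmut), float(dr_cancer)]


# Values copied from the CENPX A49M row of Table 1.
features = to_feature_vector(0.00, 0.989, "Disease", 0.8326, 0.682)

# Class labels for the training set: 500 cancer-associated mutations
# (assumed to be coded 1) followed by 500 neutral mutations (coded 0).
labels = [1] * 500 + [0] * 500

print(features, len(labels))
```

Each of the 557 collected nsSNPs would be converted in the same way before being passed to the trained classifier.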

4 Conclusion

The CENPX protein forms the DNA-binding component of the FA core complex involved in DNA damage repair and genome maintenance. It is one of the essential components of the heterotetrameric CENP-T-W-S-X complex that binds and supercoils DNA, and plays an important role in kinetochore assembly. It is involved in regulating a wide range of cellular activities, such as DNA repair, cell cycle progression, positive regulation of protein ubiquitination, replication fork processing and resolution of meiotic recombination intermediates. It is also a component of the Fanconi anaemia nuclear complex and the FANCM-MHF complex. All these functional activities indicate that deregulation of its structure or damage to its expression mechanism will significantly affect cell cycle progression. Although not much insight has been gained into its activities and pathological outcomes, the results obtained in this study strongly suggest that it can act as a vital component in the cancer progression pathway. The results obtained in this study will serve as a starting point for future cancer research on the CENPX protein. Future research on its role in cancer progression is required to gain clear insight into these predictions.

Our work focuses on the prediction accuracy of pathological outcomes for the predicted SNPs; we did not compile the large pool of SNPs that were filtered out as non-significant on the basis of support vector classification. Moreover, we did not present the accuracy of negative likelihood outcomes associated with the remaining SNPs. Computational methodologies that can address both the positive and negative prediction likelihoods still need to be developed.

Acknowledgements

We gratefully acknowledge the management of Vellore Institute of Technology University for providing the facilities to carry out this work.

References

[1] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucl Acid Res 25, 3389-3402.
[2] Baker, D.J., Chen, J., van Deursen, J.M. 2005. The mitotic checkpoint in cancer and aging: What have mice taught us? Curr Opin Cell Biol 17, 583-589.
[3] Bamford, S., Dawson, E., Forbes, S., Clements, J., Pettett, R., Dogan, A., Flanagan, A., Teague, J., Futreal, P.A., Stratton, M.R., Wooster, R. 2004. The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website. Br J Cancer 91, 355-358.
[4] Calabrese, R., Capriotti, E., Fariselli, P., Martelli, P.L., Casadio, R. 2009. Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum Mutat 30, 1237-1244.
[5] Cao, J.Y. 2010. Prognostic significance and therapeutic implications of centromere protein F expression in human nasopharyngeal carcinoma. Mol Cancer 9, 237.
[6] Capriotti, E., Altman, R.B. 2011. A new disease-specific machine learning approach for the prediction of cancer-causing missense variants. Genomics 98, 310-317.
[7] Capriotti, E., Calabrese, R., Casadio, R. 2006. Predicting the insurgence of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information. Bioinformatics 22, 2729-2734.
[8] Chang, C.-C., Lin, C.-J. 2011. LIBSVM: A library for support vector machines. ACM TIST 2, 27.
[9] Dai, W., Cogswell, J.P. 2003. Polo-like kinases and the microtubule organization center: Targets for cancer therapies. Prog Cell Cycle Res 5, 327-334.
[10] Ferrer-Costa, C., Gelpí, J.L., Zamakola, L., Parraga, I., de la Cruz, X., Orozco, M. 2005. PMUT: A web-based tool for the annotation of pathological mutations on proteins. Bioinformatics 21, 3176-3178.
[11] Guo, X.Z., Zhang, G., Wang, J.Y., Liu, W.L., Wang, F., Dong, J.Q., Xu, L.H., Cao, J.Y., Song, L.B., Zeng, M.S. 2008. Prognostic relevance of centromere protein H expression in esophageal carcinoma. BMC Cancer 8, 233.
[12] Hu, H., Liu, Y., Wang, M., Fang, J., Huang, H., Yang, N., Li, Y., Wang, J., Yao, X., Shi, Y., Li, G., Xu, R.M. 2011. Structure of a CENP-A-histone H4 heterodimer in complex with chaperone HJURP. Genes Dev 25, 901-906.
[13] Kamaraj, B., Purohit, R. 2013a. Mutational analysis of TYR gene and its structural consequences in OCA1A. Gene 513, 184-195.
[14] Kamaraj, B., Purohit, R. 2013b. In-silico analysis of betaine aldehyde dehydrogenase2 of Oryza sativa and significant mutations responsible for fragrance. J Plant Interact 8, 321-333.
[15] Kops, G.J., Weaver, B.A., Cleveland, D.W. 2005. On the road to cancer: aneuploidy and the mitotic checkpoint. Nat Rev Cancer 5, 773-785.
[16] Kumar, P., Henikoff, S., Ng, P.C. 2009. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc 4, 1073-1081.
[17] Kumar, A., Purohit, R. 2012a. Computational investigation of pathogenic nsSNPs in CEP63 protein. Gene 503, 75-82.
[18] Kumar, A., Purohit, R. 2012b. Computational screening and molecular dynamics simulation of disease associated nsSNPs in CENP-E. Mutat Res 738-739, 28-37.
[19] Kumar, A., Purohit, R. 2012c. Computational centrosomics: An approach to understand the dynamic behaviour of centrosome. Gene 511, 125-126.
[20] Kumar, A., Rajendran, V., Sethumadhavan, R., Purohit, R. 2012. In silico prediction of a disease-associated STIL mutant and its affect on the recruitment of centromere protein J (CENPJ). FEBS Open Bio 2, 285-293.
[21] Kumar, A., Rajendran, V., Sethumadhavan, R., Purohit, R. 2013. Insight into Nek2A activity regulation and its pharmacological prospects. Egyp J Med Hum Genet 14, 213-219.
[22] Lupas, A., Van Dyke, M., Stock, J. 1991. Predicting coiled coils from protein sequences. Science 252, 1162-1164.
[23] Pandey, A., Kumar, A., Purohit, R. 2013. Sequencing Closterium moniliferum: Future prospects in nuclear waste disposal. Egyp J Med Hum Genet 14, 113-115.
[24] Purohit, R., Rajasekaran, R., Sudandiradoss, C., George Priya Doss, C., Ramanathan, K., Sethumadhavan, R. 2008. Studies on flexibility and binding affinity of Asp25 of HIV-1 protease mutants. Int J Biol Macromol 42, 386-391.
[25] Purohit, R., Sethumadhavan, R. 2009. Structural basis for the resilience of Darunavir (TMC114) resistance major flap mutations of HIV-1 protease. Interdiscip Sci Comput Life Sci 1, 320-328.
[26] Purohit, R., Rajendran, V., Sethumadhavan, R. 2011a. Relationship between mutation of serine residue at 315th position in M. tuberculosis catalase-peroxidase enzyme and isoniazid susceptibility: An in silico analysis. J Mol Model 17, 869-877.
[27] Purohit, R., Rajendran, V., Sethumadhavan, R. 2011b. Studies on adaptability of binding residues and flap region of TMC-114 resistance HIV-1 protease mutants. J Biomol Struct Dyn 29, 137-152.
[28] Rajendran, V., Purohit, R., Sethumadhavan, R. 2012. In silico investigation of molecular mechanism of laminopathy cause by a point mutation (R482W) in lamin A/C protein. Amino Acids 43, 603-615.
[29] Rajendran, V., Sethumadhavan, R. 2014. Drug resistance mechanism of PncA in Mycobacterium tuberculosis. J Biomol Struct Dyn 32, 209-221.
[30] Ramensky, V., Bork, P., Sunyaev, S. 2002. Human non-synonymous SNPs: Server and survey. Nucl Acid Res 30, 3894-3900.
[31] Sekulic, N., Bassett, E.A., Rogers, D.J., Black, B.E. 2010. The structure of (CENP-A-H4)2 reveals physical features that mark centromeres. Nature 467, 347-351.
[32] Sherry, S.T., Ward, M.H., Kholodov, M., Baker, J., Phan, L., Smigielski, E.M., Sirotkin, K. 2001. dbSNP: The NCBI database of genetic variation. Nucl Acid Res 29, 308-311.
