Algorithmic approach for removing the redundancy in diabetic gene categories based on semantic similarity and gene expression data.

Interdiscip Sci Comput Life Sci (2015) 7: 1–6 DOI: 10.1007/s12539-014-0248-3

Algorithmic Approach for Removing the Redundancy in Diabetic Gene Categories Based on Semantic Similarity and Gene Expression Data

Atul Kumar1∗, D. Jeya Sundara Sharmila2

1 (Department of Bioinformatics, Karunya University, Coimbatore, Tamil Nadu, India)
2 (Department of Nanosciences and Technology, Tamil Nadu Agriculture University, Coimbatore, Tamil Nadu, India)

Received 14 October 2014 / Revised 27 November 2014 / Accepted 21 January 2015

Abstract: Despite considerable advances in gene expression microarray technology, the main hindrance in analysing microarray data is the limited number of samples compared with the number of genes, which is a major impediment to revealing actual gene functionality and valuable information from the data. Analysing gene expression data can indicate the factors which are differentially expressed in the diseased tissue. Most of these genes play no part in causing the disease of interest; thus, identification of disease-causing genes can reveal not just the cause of the disease, but also its pathogenic mechanism. Many gene selection methods are available that can remove irrelevant genes, but most of them are inadequate at removing redundant genes from microarray data, and this redundancy increases the computational cost and decreases the classification accuracy. Combining the gene expression data with Gene Ontology information can help in determining the redundancy, which can then be removed using the algorithm described in this work. The gene list obtained after the sequential steps of the algorithm can be analysed further to obtain the most deterministic genes responsible for Type II diabetes.

Key words: Microarray Technology, Gene Expression, Diabetes, Greedy Algorithm, Gene Ontology, Semantic Similarity, Pearson Correlation, GEO Database.

∗ Corresponding author. E-mail: [email protected]

1 Introduction

Rapid advancement in gene expression microarray technology has enabled simultaneous measurement of the expression levels of tens of thousands of genes in a single experiment (Schena et al., 1995). Analyzing gene expression data can show the factors which are differentially expressed in the diseased tissue (Zhang, 2006). The main hindrance in analyzing microarray data is the limited number of samples compared with the number of genes. Most of these genes have no role in causing the disease of interest; thus, identification of disease-causing genes can reveal not just the cause of the disease, but also its pathogenic mechanism (Mohammadi et al., 2011). Available gene selection methods can remove irrelevant genes, but most of them are inadequate at removing redundant genes from microarray data, and this redundancy increases the computational cost and decreases the classification accuracy. Due to the presence of noise and the low number of samples in microarray data, the actual gene functionality and valuable information cannot be easily revealed from the data.

Gene expression data in combination with Gene Ontology information can help in determining the redundancy, which can then be removed using the algorithm described in this work. Pearson correlation and a semantic similarity measure have been combined to find the similarity between any two genes (Mohammadi et al., 2011). Because of the low sample number, Pearson correlation alone cannot be relied on for finding the similarity, and because of incomplete information in the Gene Ontology, the semantic similarity measure alone is insufficient to determine the similarity between genes. Therefore the average of the Pearson correlation score and the semantic similarity score is used to determine the similarity between two genes:

$$R(g_i, g_j) = \frac{R_{exp}(g_i, g_j) + R_{sem}(g_i, g_j)}{2}$$

where $R_{sem}(g_i, g_j)$ and $R_{exp}(g_i, g_j)$ represent the semantic similarity and the expression similarity of genes $g_i$ and $g_j$, respectively. Semantic similarity measures can be used to calculate the similarity of two concepts organized in an ontology.

The ontology structure defines the function parents(c) that, given a concept c, returns the set of more generic concepts directly linked to c (Couto et al., 2007). Based on this, Resnik, Jiang and Conrath, and Lin each proposed a different way of calculating semantic similarity. The Pearson correlation coefficient is used to find the expression similarity of two genes, where $g_i^{av}$ denotes the average expression value of gene $g_i$ and $g_{ik}$ represents the expression value of gene $g_i$ in the $k$-th sample. In the current work, an algorithm based on the previously mentioned concepts has been designed which helps in reducing the redundancy in microarray data samples.
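As a minimal sketch of the combined score described above, the following Python computes the standard Pearson correlation between two expression profiles and averages it with a semantic similarity value that is assumed to be already available (for example, from a GO-based tool). The function names, the toy profiles and the handling of constant profiles are illustrative assumptions, not part of the original work.

```python
import math


def pearson(x, y):
    """Standard Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n          # average expression of the first gene
    mean_y = sum(y) / n          # average expression of the second gene
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant profile is treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def combined_similarity(expr_i, expr_j, sem_sim):
    """Average of expression similarity (Pearson) and semantic similarity."""
    return (pearson(expr_i, expr_j) + sem_sim) / 2


# Toy profiles over four samples (hypothetical values).
g_i = [1.0, 2.0, 3.0, 4.0]
g_j = [2.0, 4.0, 6.0, 8.0]
r_exp = pearson(g_i, g_j)                          # perfectly correlated profiles
score = combined_similarity(g_i, g_j, sem_sim=0.5)
```

With these toy values the expression similarity is 1.0, so a semantic similarity of 0.5 gives a combined score of 0.75.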

2 Materials and Method

Seventy-one samples from different tissues of Homo sapiens (diabetic and normal) were collected from the GEO database (Edgar et al., 2002) and the Diabetes Genome Anatomy Project (DGAP). Of these, 37 samples are from normal individuals and 34 are from diabetic individuals (Table 1) (Kumar et al., 2014). Using the gene ontology information of all the genes in the given data sets, semantic similarity was calculated for all combinations of the GO terms present in a particular data set based on the three methods given by Lin, Resnik, and Jiang and Conrath, and on a combination of Resnik’s and Lin’s similarity measures (simRel) (Schlicker and Albrecht, 2008). Expression similarity for all combinations of the GO terms present in a particular data set was calculated through the Pearson correlation coefficient. The semantic similarity values (simRel) and Pearson values were averaged, and based on the average value a greedy algorithm was followed to obtain the genes which have a value less than the threshold of 0.8. The threshold of 0.8 was chosen after several experimental trials, which showed that a threshold greater than 0.8 left a number of similar genes in the output file, whereas a threshold less than 0.8 resulted in the loss of many important genes.

Table 1 Data set samples taken for studies (Kumar et al., 2014)

| Accession | Data | No. of samples (Normal) | No. of samples (Diabetic) | No. of genes | Country |
|---|---|---|---|---|---|
| GSE7146 | Effect of insulin infusion on human skeletal muscle (Parikh et al. 2007) | 6 | 6 | 22215 | Sweden |
| DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (A) (Gunton et al. 2005) | 7 | 5 | 22191 | Caucasian and Asian |
| DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (B) (Gunton et al. 2005) | 7 | 5 | 22550 | Caucasian and Asian |
| DGAP | Human skeletal muscle-type 2 diabetes (Mootha et al. 2003) | 17 | 18 | 22177 | Sweden |

3 Results and Discussion

3.1 Removing gene duplicity by eliminating genes having the same GO id

Data sets were submitted to the SOURCE server (http://smd.princeton.edu/cgi-bin/source/sourceBatchSearch) to obtain the gene ontology information for all the genes present in the data sets. The server returned an output file with genes and their corresponding GO ids and categorized the genes based on three hierarchies that define functional attributes of gene products: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). In the data set “Effect of insulin infusion on human skeletal muscle”, out of 22172 genes, 16167 (73%) genes were reported to be involved in molecular function, 857 (4%) in biological process and 633 (3%) in cellular component. Among these 22172 genes there were 4515 (20%) genes for which no information was present in the gene ontology database (Fig. 1). Except for “Human pancreatic islets from normal and Type 2 diabetic subjects (B)”, all the data sets taken for study showed an almost identical distribution of genes across the hierarchies. This is because the gene chip array chosen for that study was HG-U133 B (Gunton et al. 2005), unlike the others, where it was HG-U133 A (Table 2). The use of a different gene chip array caused a major change in the distribution of genes across the hierarchies for the gene set “Human pancreatic islets from normal and Type 2 diabetic subjects (B)” (Fig. 2). Out of 22550 genes, 8664 (38%) genes were reported to be involved in molecular function, 834 (4%) in biological process and 1016 (5%) in cellular component, whereas there were 12036 (53%) genes for which no information was present in the gene ontology database.

Table 2 Distribution of genes in different hierarchy for each data set under study

| Accession | Data | Molecular Function | Biological Process | Cellular Component | No Gene Information |
|---|---|---|---|---|---|
| GSE7146 | Effect of insulin infusion on human skeletal muscle | 16167 | 857 | 633 | 4515 |
| DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (A) | 16176 | 860 | 633 | 4522 |
| DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (B) | 8664 | 834 | 1016 | 12036 |
| DGAP | Human skeletal muscle - type 2 diabetes | 16165 | 859 | 631 | 4522 |

Fig. 1 Distribution of genes in different hierarchy for all data sets except “Human pancreatic islets from normal and Type 2 diabetic subjects (B)” (molecular function 73%, biological process 4%, cellular component 3%, no gene information 20%)

Fig. 2 Distribution of genes in different hierarchy for “Human pancreatic islets from normal and Type 2 diabetic subjects (B)” (molecular function 38%, biological process 4%, cellular component 5%, no gene information 53%)

The result file obtained through the SOURCE server showed high redundancy in the GO ids. Thus a plugin from Ablebits, a commercial software package (free trial version) (http://www.ablebits.com/), was used to generate a status column marking as duplicate the GO ids that were repeated. Based on the Fisher score (Kumar et al., 2014), all genes in each duplicate gene set except the top-scored gene were removed using the algorithm given below.

Input: GO file with duplicate status column
Output: Non redundant GO terms file
Initialize:
  Read file
  push each line into an array
  count till end of the file
% repeat until i > count
  split each line and put in another array
  If status column is Duplicate
    eliminate the line
End

The results obtained after executing the algorithm showed a drastic decrease in gene number once the redundant genes were removed from the data sets. Table 3 summarizes the result of the first step of redundancy reduction.
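The duplicate-elimination step can be sketched in Python as follows. This is an illustrative reading, not the authors' script: instead of relying on a precomputed "Duplicate" status column, it groups rows by GO id directly and keeps the gene with the highest Fisher score in each group. The row layout (gene, GO id, Fisher score) and the sample values are hypothetical.

```python
def remove_duplicate_go_ids(rows):
    """Keep, for each GO id, only the row with the highest Fisher score.

    rows: iterable of (gene, go_id, fisher_score) tuples (assumed layout).
    Returns the surviving rows in order of first appearance of each GO id.
    """
    best = {}
    for gene, go_id, score in rows:
        # Replace the stored row only if this gene scores strictly higher.
        if go_id not in best or score > best[go_id][2]:
            best[go_id] = (gene, go_id, score)
    return list(best.values())


# Hypothetical input: geneA and geneB share a GO id.
rows = [
    ("geneA", "GO:0005524", 2.1),
    ("geneB", "GO:0005524", 3.4),
    ("geneC", "GO:0003677", 1.2),
]
unique_rows = remove_duplicate_go_ids(rows)
```

Here geneB survives for GO:0005524 because it has the higher score, and geneC is kept because its GO id occurs only once.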

3.2 Semantic similarity between genes in data sets

From the gene sets obtained above, the genes categorized under molecular function were taken to identify the semantic similarity among them using FunSimMat (Schlicker and Albrecht, 2008) (http://funsimmat.bioinf.mpi-inf.mpg.de/). Molecular function represents the ability or job performed by a gene product, whereas biological process and cellular component represent, respectively, recognized series of events or molecular functions, and locations at the level of subcellular structures and macromolecular complexes; therefore, biological process genes and cellular component genes were not considered for finding semantic similarity. The semantic similarity was determined using the Resnik, Jiang and Conrath, and Lin methods and a combination of the Resnik and Lin methods (Couto et al., 2007). All possible combinations of genes in each data set were generated, and a semantic value for each combination was generated using the above-mentioned methods. The number of combinations generated in each data set is shown in Table 4. Of these, the semantic values of the Resnik-Lin combination (simRel) were used for calculation, as this method takes into account the relevance information and provides a high relevance in generic terms for the comparison of the exact function of different gene products (Schlicker et al., 2006).
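To make the pair-generation step concrete, the sketch below enumerates unordered gene pairs with the Python standard library and counts them with the binomial coefficient. It assumes that "combinations" means unordered pairs of distinct genes; the gene labels are placeholders.

```python
from itertools import combinations
from math import comb


def gene_pairs(genes):
    """Return every unordered pair of distinct genes."""
    return list(combinations(genes, 2))


# Four placeholder genes give 4 * 3 / 2 = 6 pairs.
pairs = gene_pairs(["g1", "g2", "g3", "g4"])

# Number of unordered pairs among 1297 molecular-function genes.
n_pairs_1297 = comb(1297, 2)
```

Under this assumption, 1297 genes yield 840456 pairs, which agrees with the count reported for GSE7146 in Table 4.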

Table 3 Number of genes in different hierarchy after first step of redundancy reduction

| Accession | Data | Molecular Function | Biological Process | Cellular Component |
|---|---|---|---|---|
| GSE7146 | Effect of insulin infusion on human skeletal muscle | 1297 | 257 | 38 |
| DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (A) | 1297 | 258 | 38 |
| DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (B) | 878 | 216 | 45 |
| DGAP | Human skeletal muscle-type 2 diabetes | 1296 | 256 | 38 |

Table 4 Number of different combinations of gene in data sets

| Accession | Data | Total Number of Combinations |
|---|---|---|
| GSE7146 | Effect of insulin infusion on human skeletal muscle | 840456 |
| DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (A) | 840456 |
| DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (B) | 385881 |
| DGAP | Human skeletal muscle-type 2 diabetes | 840456 |

3.3 Pearson correlation coefficient for expression similarity between genes in dataset

To find the expression similarity between two genes, the Pearson correlation coefficient was used. It was necessary to build the same set of gene combinations as was returned by the server for semantic similarity, so that the average of the semantic and expression similarity values could be calculated. To obtain the same set of gene combinations as in the semantic similarity output, the following algorithm was executed:

Input 1: Semantic similarity file
Input 2: Non redundant GO terms file
Output: GO terms with same set of combination
Initialize:
  read input 1 and input 2
  push each line into an array
  count till end of the file
% repeat until i > count
  split each line and put in another array
  if GO ids in columns of both the array is same
    print the line
End
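The matching procedure of Section 3.3 can be sketched as a dictionary lookup: only GO-id pairs present in both the semantic similarity results and the expression similarity results are kept, and their two scores are averaged. The data layout (dictionaries keyed by GO-id pairs), the order-insensitive key and the sample values are assumptions made for illustration.

```python
def pair_key(a, b):
    """Order-insensitive key so that (a, b) and (b, a) match."""
    return tuple(sorted((a, b)))


def average_matched_pairs(semantic, expression):
    """Average semantic and expression scores for pairs found in both inputs.

    semantic, expression: dicts mapping (go_id_1, go_id_2) -> similarity.
    """
    expr = {pair_key(a, b): value for (a, b), value in expression.items()}
    averaged = {}
    for (a, b), r_sem in semantic.items():
        key = pair_key(a, b)
        if key in expr:  # same GO ids present in both files
            averaged[key] = (r_sem + expr[key]) / 2
    return averaged


# Hypothetical scores; only the first semantic pair also has an expression score.
semantic = {
    ("GO:0003677", "GO:0005524"): 0.5,
    ("GO:0003677", "GO:0004672"): 0.25,
}
expression = {
    ("GO:0005524", "GO:0003677"): 0.75,
    ("GO:0004672", "GO:0016301"): 0.5,
}
averaged = average_matched_pairs(semantic, expression)
```

Sorting the two GO ids inside the key is a design choice that avoids missing a pair merely because the two files list its members in a different order.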

3.4 The Greedy algorithm

The expression and semantic (Resnik-Lin) values for all the different gene combinations were averaged, and the resulting score was used to remove the genes whose average value was more than 0.8 (Mohammadi et al., 2011). Thus, using the threshold value of 0.8, only those genes were retained which were highly dissimilar and which mainly contribute to causing a disease. A greedy algorithm approach was used to obtain the unique set of genes:

Input: File with average score
Output: File with unique genes
Initialize:
  read input
  push each line into an array
  count till end of the file
  Assign cutoff 0.8
% repeat until i > count
  split each line and put in another array
  splice each gene with different combination in separate arrays
  if any of the scores in combination greater than cutoff
    reject the gene
  else
    accept
End

The output file contained all the unique genes, which are dissimilar to each other with a similarity score of less than 0.8. The number of unique genes in each of the data sets is summarized in Table 5.

Table 5 Number of unique genes in each data set

| Accession | Data | Number of unique genes |
|---|---|---|
| GSE7146 | Effect of insulin infusion on human skeletal muscle | 1223 |
| DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (A) | 1210 |
| DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (B) | 803 |
| DGAP | Human skeletal muscle-type 2 diabetes | 1238 |
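One possible reading of the greedy selection in Section 3.4 is sketched below in Python: genes are visited in a fixed order, and a gene is accepted only if its averaged similarity to every gene already accepted does not exceed the cutoff. The visiting order, the treatment of missing pairs as similarity 0, and the toy scores are assumptions; the published pseudocode does not specify them.

```python
def greedy_unique_genes(genes, similarity, cutoff=0.8):
    """Greedily select genes that are not too similar to any accepted gene.

    genes: list of gene identifiers, visited in the given order.
    similarity: dict mapping frozenset({gene_a, gene_b}) -> averaged score.
    """
    selected = []
    for gene in genes:
        # Assumption: a pair with no recorded score counts as dissimilar (0.0).
        too_similar = any(
            similarity.get(frozenset((gene, kept)), 0.0) > cutoff
            for kept in selected
        )
        if not too_similar:
            selected.append(gene)
    return selected


# Hypothetical averaged scores for three genes.
similarity = {
    frozenset(("g1", "g2")): 0.9,
    frozenset(("g1", "g3")): 0.3,
    frozenset(("g2", "g3")): 0.85,
}
unique_genes = greedy_unique_genes(["g1", "g2", "g3"], similarity)
```

In this toy case g2 is rejected because it is too similar to the already accepted g1, while g3 is accepted because g2 is no longer in the selected set. A strictly literal reading of the pseudocode, which rejects any gene having some pair score above the cutoff, would discard all three genes here.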

The unique genes found through this approach were compared with the top 10 genes obtained through Fisher discriminant analysis (Kumar et al., 2014) to check whether the genes which obtained a high ranking in Fisher score are unique or repeated. In the dataset “Effect of insulin infusion on human skeletal muscle”, out of the top 10 genes only 5 genes were found to be unique. In the same way, for “Human pancreatic islets from normal and Type 2 diabetic subjects (A)” 6 of the top 10 genes, for “Human pancreatic islets from normal and Type 2 diabetic subjects (B)” 4 genes, and for “Human skeletal muscle-type 2 diabetes” 5 genes were found to be unique (Tables 6-9). Most of the unique genes identified in the dataset and reported to have a high Fisher score are involved in some pathway which is reported

Table 6 Unique genes based on the Fisher score (Kumar et al., 2014) for “Effect of insulin infusion on human skeletal muscle”

Top 10 genes based on Fisher score

Table 8

Unique or repeated

Unique genes based on the Fisher score (Kumar et al., 2014) for “Human pancreatic islets from normal and Type 2 diabetic subjects (B)”

G0S2: G0/G1switch 2

Unique

SLC22A6: solute carrier family 22 (organic anion transporter), member 6

Unique

CDC6: CDC6 cell division cycle 6 homolog (S. cerevisiae)

Repeated

THRAP6: thyroid hormone receptor associated protein 6

Unique

SCNN1G: sodium channel, nonvoltage-gated 1, gamma

Unique

LPHN3: Latrophilin 3

Repeated

VPS36: vacuolar protein sorting 36 (yeast)

Unique

LOC441601 /// LOC652471: septin 7 pseudogene /// similar to septin 7

Repeated

DNAJC1: DnaJ (Hsp40) homolog, subfamily C, member 1

Unique

KIAA0692: KIAA0692

Repeated

TLE1: transducin-like enhancer of split 1 (E (sp1) homolog, Drosophila)

Unique

UBXD8: UBX domain containing 8

Repeated

ANKRD15: Ankyrin repeat domain 15

Repeated

Table 7

Unique genes based on the Fisher score (Kumar et al., 2014) for “Human pancreatic islets from normal and Type 2 diabetic subjects (A)”


Unique or


repeated

CAPS: calcyphosine

Unique

Transcribed locus

Unique

Transcribed locus, moderately similar to PREDICTED: similar to XP 517655.1 KIAA0825 protein [Pan troglodytes]

Repeated

MAP2K5: Mitogen-activated protein kinase kinase 5

Repeated

ZNF559: Zinc finger protein 559

Repeated

ZNF638: Zinc finger protein 638

Repeated

ZNF605: zinc finger protein 605

Repeated

Table 9

Unique genes based on the Fisher score (Kumar et al., 2014) for “Human skeletal muscle-type 2 diabetes”

Unique or repeated

TMEM111: Transmembrane protein 111

Repeated

CYP7A1: cytochrome P450, family 7, subfamily A, polypeptide 1

Unique

Hs.247983.0

Repeated

NSUN5B: NOL1/NOP2/Sun domain family, member 5B

Repeated

Unique or


repeated

ZAK: sterile alpha motif and leucine zipper containing kinase AZK

Repeated

ANKHD1 /// MASK-BP3: ankyrin repeat and KH domain containing 1 /// MASK-4EBP3 alternate reading frame gene

Repeated

ProSAPiP1: ProSAPiP1 protein

Unique

PCDHB3: protocadherin beta 3

Unique


Repeated

HPRT1: hypoxanthine phosphoribosyltransferase 1 (Lesch-Nyhan syndrome)

Unique

C11orf61: chromosome 11 open reading frame 61

Repeated

CADPS2: Ca2+-dependent activator protein for secretion 2

Unique

MRNA; cDNA DKFZp686F1844 (from clone DKFZp686F1844)

Unique

COX7A1: cytochrome c oxidase subunit VIIa polypeptide 1 (muscle)

Repeated

CYP3A4: cytochrome P450, family 3, subfamily A, polypeptide 4

Unique


Repeated

CTBP1: C-terminal binding protein 1

Unique

FMO5: flavin containing monooxygenase 5

Unique

TMEM106C: transmembrane protein 106C

Unique

PLK1 /// RPL37A: polo-like kinase 1 (Drosophila) /// ribosomal protein L37a

Unique


to be involved directly or indirectly in causing Type II diabetes (Kumar et al., 2014).

4 Conclusion

The major problem with microarray data is the high redundancy in the genes, which hinders the extraction of useful and valuable information from the data. The genes in the microarray data either have the same GO id or have common ancestors, which makes them similar in molecular function. The redundant genes have been removed based on identical GO ids in the first step and then based on the average score of Pearson and semantic similarity. The reduction in the number of genes at each step is summarized in Table 10.

Table 10 Reduction in redundancy of genes at each stage

| Data | No. of genes obtained from GEO Database | No. of genes in Non-Redundant GO Set | No. of genes obtained after final step |
|---|---|---|---|
| Effect of insulin infusion on human skeletal muscle | 22215 | 6107 | 1223 |
| Human pancreatic islets from normal and Type 2 diabetic subjects (A) | 22191 | 6115 | 1210 |
| Human pancreatic islets from normal and Type 2 diabetic subjects (B) | 22550 | 13175 | 803 |
| Human skeletal muscle-type 2 diabetes | 22177 | 6113 | 1238 |
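The size of the reduction in Table 10 can be checked with a few lines of Python. The counts are taken from the table; the short data set labels and the helper name are illustrative assumptions, with rows listed in the same order as Table 1.

```python
# (initial genes, non-redundant GO set, genes after final step) per data set.
stages = {
    "Insulin infusion, skeletal muscle": (22215, 6107, 1223),
    "Pancreatic islets (A)": (22191, 6115, 1210),
    "Pancreatic islets (B)": (22550, 13175, 803),
    "Skeletal muscle, type 2 diabetes": (22177, 6113, 1238),
}


def retained_fraction(initial, final):
    """Fraction of the original genes that survive all reduction steps."""
    return final / initial


retained = {
    name: retained_fraction(initial, final)
    for name, (initial, _non_redundant, final) in stages.items()
}
```

For every data set the retained fraction falls between roughly 3.5% and 5.6% of the original gene count.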

The genes obtained after the final redundancy reduction step showed that almost 50% of the genes which got a high score in Fisher discriminant analysis (Kumar et al., 2014) are repeated; these are eliminated, leaving only the unique genes. Further, the most discriminatory genes, which may be major targets for a disease, can be identified by subjecting the unique genes to a classifier such as a support vector machine or any other machine learning approach which can identify the discriminatory genes and eliminate the others.

References

[1] Couto, F.M., Silva, M.J., Coutinho, P.M. 2007. Measuring semantic similarity between Gene Ontology terms. Data & Knowledge Engineering 61, 137-152.
[2] Edgar, R., Domrachev, M., Lash, A.E. 2002. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research 30 (1), 207-210.
[3] Gunton, J.E., Kulkarni, R.N., Yim, S., Okada, T., Hawthorne, W.J., Tseng, Y.H., Roberson, R.S., Ricordi, C., O’Connell, P.J., Gonzalez, F.J., Kahn, C.R. 2005. Loss of ARNT/HIF1beta mediates altered gene expression and pancreatic-islet dysfunction in human type 2 diabetes. Cell 122 (3), 337-349.
[4] Kumar, A., Sharmila, D.J.S., Kant, R. 2014. Selection of Discriminatory Gene Set for Type II Diabetes Using Fisher Linear Discriminant. International Journal of Advanced Computer and Mathematical Sciences 5 (2), 36-42.
[5] Mohammadi, A., Saraee, M.H., Salehi, M. 2011. Identification of disease-causing genes using microarray data mining and Gene Ontology. BMC Medical Genomics 4, 12-19.
[6] Mootha, V.K., Lindgren, C.M., Eriksson, K.F., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E., Houstis, N., Daly, M.J., Patterson, N., Mesirov, J.P., Golub, T.R., Tamayo, P., Spiegelman, B., Lander, E.S., Hirschhorn, J.N., Altshuler, D., Groop, L.C. 2003. PGC-1α Responsive Genes Involved in Oxidative Phosphorylation are Coordinately Downregulated in Human Diabetes. Nature Genetics 34 (3), 267-273.
[7] Parikh, H., Carlsson, E., Chutkow, W.A., Johansson, L.E., Storgaard, H., Poulsen, P., Saxena, R., Ladd, C., Schulze, P.C., Mazzini, M.J., Jensen, C.B., Krook, A., Björnholm, M., Tornqvist, H., Zierath, J.R., Ridderstråle, M., Altshuler, D., Lee, R.T., Vaag, A., Groop, L.C., Mootha, V.K. 2007. TXNIP regulates peripheral glucose metabolism in humans. PLOS Med 4 (5), 868-879.
[8] Schena, M., Shalon, D., Davis, R.W., Brown, P.O. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467-470.
[9] Schlicker, A., Albrecht, M. 2008. FunSimMat: a comprehensive functional similarity database. Nucleic Acids Research 36, 434-439.
[10] Schlicker, A., Domingues, F.S., Rahnenführer, J., Lengauer, T. 2006. A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics 7, 302-317.
[11] Zhang, A. 2006. Advanced Analysis of Gene Expression Microarray Data. Danvers. World Scientific Publishing Co.
