Interdiscip Sci Comput Life Sci (2015) 7: 1–6 DOI: 10.1007/s12539-014-0248-3
Algorithmic Approach for Removing the Redundancy in Diabetic Gene Categories Based on Semantic Similarity and Gene Expression Data
Atul Kumar1∗, D. Jeya Sundara Sharmila2
1 (Department of Bioinformatics, Karunya University, Coimbatore, Tamil Nadu, India)
2 (Department of Nanosciences and Technology, Tamil Nadu Agriculture University, Coimbatore, Tamil Nadu, India)
Received 14 October 2014 / Revised 27 November 2014 / Accepted 21 January 2015
Abstract: Despite considerable advances in gene expression microarray technology, the main hindrance in analysing microarray data is the limited number of samples compared with the number of genes, which is a major impediment to revealing actual gene functionality and valuable information from the data. Analysing gene expression data can indicate the factors which are differentially expressed in the diseased tissue. Most of these genes play no part in causing the disease of interest; thus, identification of disease-causing genes can reveal not just the cause of the disease but also its pathogenic mechanism. Many gene selection methods can remove irrelevant genes, but most of them are inadequate for removing redundant genes from microarray data, and this redundancy increases the computational cost and decreases the classification accuracy. Combining the gene expression data with Gene Ontology information can help in determining the redundancy, which can then be removed using the algorithm described in this work. The gene list obtained after the sequential steps of the algorithm can be analysed further to obtain the most deterministic genes responsible for Type II diabetes.
Key words: Microarray Technology, Gene Expression, Diabetes, Greedy Algorithm, Gene Ontology, Semantic Similarity, Pearson Correlation, GEO Database.
∗ Corresponding author. E-mail: [email protected]
1 Introduction
Rapid advancement in gene expression microarray technology has enabled simultaneous measurement of the expression levels of tens of thousands of genes in a single experiment (Schena et al., 1995). Analyzing gene expression data can show the factors which are differentially expressed in the diseased tissue (Zhang, 2006). The main hindrance in analyzing microarray data is the limited number of samples compared with the number of genes. Most of these genes play no role in causing the disease of interest; thus, identification of disease-causing genes can reveal not just the cause of the disease but also its pathogenic mechanism (Mohammadi et al., 2011). Available gene selection methods can remove irrelevant genes, but most of them are inadequate for removing redundant genes from microarray data, and this redundancy increases the computational cost and decreases the classification accuracy. Because of noise and the low number of samples in microarray data, the actual gene functionality and valuable information cannot easily be revealed from the data. Gene expression data in combination with Gene Ontology information can help in determining the redundancy, which can then be removed using the algorithm described in this work. Pearson correlation and a semantic similarity measure have been combined to find the similarity between any two genes (Mohammadi et al., 2011). Because of the low sample number, Pearson correlation alone cannot be relied on for finding the similarity, and because the information in Gene Ontology is incomplete, the semantic similarity measure alone is insufficient to determine the similarity between genes. The average of the Pearson correlation and semantic similarity scores is therefore used to determine the similarity between two genes:

$$R(g_i, g_j) = \frac{R_{exp}(g_i, g_j) + R_{sem}(g_i, g_j)}{2}$$

where $R_{sem}(g_i, g_j)$ and $R_{exp}(g_i, g_j)$ represent the semantic similarity and the expression similarity of genes $g_i$ and $g_j$, respectively. Semantic similarity measures can be used to calculate the similarity of two concepts organized in an ontology.
The ontology structure defines the function parents(c) that, given a concept c, returns the set of more generic concepts directly linked to c (Couto et al., 2007). Based on this, Resnik, Jiang and Conrath, and Lin proposed three different ways of calculating semantic similarity. The Pearson correlation coefficient is used to find the expression similarity of two genes; in it, $g_i^{av}$ denotes the average expression value of gene $g_i$ and $g_{ik}$ represents the expression value of gene $g_i$ in the k-th sample. In the current work, an algorithm based on the concepts described above has been designed, which helps in reducing the redundancy in microarray data samples.
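The Pearson expression itself is not written out in the text. For reference, the standard Pearson correlation coefficient, expressed with the symbols defined above, is given below. The sample count $n$ and the use of $R_{exp}$ as the name of the resulting score are assumptions made here to connect it to the averaged similarity $R(g_i, g_j)$.

```latex
R_{exp}(g_i, g_j) =
  \frac{\sum_{k=1}^{n} \left(g_{ik} - g_i^{av}\right)\left(g_{jk} - g_j^{av}\right)}
       {\sqrt{\sum_{k=1}^{n} \left(g_{ik} - g_i^{av}\right)^2}\,
        \sqrt{\sum_{k=1}^{n} \left(g_{jk} - g_j^{av}\right)^2}}
```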
2 Materials and Method
Seventy-one samples from different tissues of Homo sapiens (diabetic and normal) were collected from the GEO database (Edgar et al., 2002) and the Diabetes Genome Anatomy Project (DGAP). Of these, 37 samples are from normal subjects and 34 are from diabetic subjects (Table 1) (Kumar et al., 2014). Using the gene ontology information of all the genes in the given data sets, semantic similarity was calculated for all combinations of the GO terms present in a particular dataset, based on the three methods given by Lin, Resnik, and Jiang and Conrath, and on a combination of Resnik’s and Lin’s similarity measures (simRel) (Schlicker and Albrecht, 2008). Expression similarity for all combinations of the GO terms present in a particular dataset was calculated through the Pearson correlation coefficient. The semantic similarity values (simRel) and the Pearson values were averaged, and based on the average value a greedy algorithm was followed to obtain the genes whose value is less than the threshold of 0.8. The threshold of 0.8 was chosen after several experimental trials, which showed that a threshold greater than 0.8 left a number of similar genes in the output file, whereas a threshold less than 0.8 resulted in the loss of many important genes.

Table 1 Data set samples taken for studies (Kumar et al., 2014)
Accession | Data | No. of samples (Normal) | No. of samples (Diabetic) | No. of genes | Country
GSE7146 | Effect of insulin infusion on human skeletal muscle (Parikh et al. 2007) | 6 | 6 | 22215 | Sweden
DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (A) (Gunton et al. 2005) | 7 | 5 | 22191 | Caucasian and Asian
DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (B) (Gunton et al. 2005) | 7 | 5 | 22550 |
DGAP | Human skeletal muscle-type 2 diabetes (Mootha et al. 2003) | 17 | 18 | 22177 | Sweden
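The averaging step described in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' script: the function name `combined_similarity`, the dictionary layout keyed by gene pairs, and the toy scores are all assumptions.

```python
def combined_similarity(sem_scores, exp_scores):
    """Average semantic (simRel) and expression (Pearson) scores per gene pair.

    Both arguments map frozenset({gene_a, gene_b}) to a score. Only pairs
    present in both inputs are returned.
    """
    combined = {}
    for pair, sem in sem_scores.items():
        if pair in exp_scores:
            combined[pair] = (sem + exp_scores[pair]) / 2.0
    return combined


# Hypothetical scores for three illustrative genes (not taken from the paper).
sem = {frozenset({"g1", "g2"}): 0.9, frozenset({"g1", "g3"}): 0.2}
exp = {frozenset({"g1", "g2"}): 0.8, frozenset({"g1", "g3"}): 0.4}

scores = combined_similarity(sem, exp)

THRESHOLD = 0.8  # cutoff reported in the text
# Pairs whose averaged score exceeds the cutoff are treated as redundant.
similar_pairs = {pair for pair, score in scores.items() if score > THRESHOLD}
```

Keying the scores by an unordered pair avoids storing each combination twice, once as (a, b) and once as (b, a).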
3 Results and Discussion
3.1 Removing gene duplicity by eliminating genes having the same GO id
The data sets were submitted to the SOURCE server (http://smd.princeton.edu/cgi-bin/source/sourceBatchSearch) to obtain the gene ontology information for all the genes present in the data sets. The server returned an output file with the genes and their corresponding GO ids, and categorized the genes based on the three hierarchies that define functional attributes of gene products: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). In the data set “Effect of insulin infusion on human skeletal muscle”, out of 22172 genes, 16167 (73%) were reported to be involved in molecular function, 857 (4%) in biological process and 633 (3%) in cellular component. Among these 22172 genes there were 4515 (20%) for which no information was present in the gene ontology database (Fig. 1). Except for “Human pancreatic islets from normal and Type 2 diabetic subjects (B)”, all the data sets taken for study showed an almost identical distribution of genes across the hierarchies. This is because the gene chip array chosen for that data set was HG-U133 B (Gunton et al. 2005), unlike the others, where it was HG-U133 A (Table 2). The use of a different gene chip array caused a major change in the distribution of genes across the hierarchies for the gene set “Human pancreatic islets from normal and Type 2 diabetic subjects (B)” (Fig. 2). Out of 22550 genes, 8664 (38%) were reported to be involved in molecular function, 834 (4%) in biological process and 1016 (5%) in cellular component, whereas there were 12036 (53%) for which no information was present in the gene ontology database.

Table 2 Distribution of genes in different hierarchy for each data set under study
Accession | Data | Molecular Function | Biological Process | Cellular Component | No Gene Information
GSE7146 | Effect of insulin infusion on human skeletal muscle | 16167 | 857 | 633 | 4515
DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (A) | 16176 | 860 | 633 | 4522
DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (B) | 8664 | 834 | 1016 | 12036
DGAP | Human skeletal muscle - type 2 diabetes | 16165 | 859 | 631 | 4522

Fig. 1 Distribution of genes in different hierarchy for all data sets except “Human pancreatic islets from normal and Type 2 diabetic subjects (B)” (pie chart: Molecular function 73%, Biological function 4%, Cellular component 3%, No gene information 20%)

Fig. 2 Distribution of genes in different hierarchy for “Human pancreatic islets from normal and Type 2 diabetic subjects (B)” (pie chart: Molecular function 38%, Biological function 4%, Cellular component 5%, No gene information 53%)

The result file obtained through the SOURCE server showed high redundancy in the GO ids. Thus the Ablebits plugin, a commercial software (free trial version) (http://www.ablebits.com/), was used to generate a status column marking as Duplicate the GO ids that were repeated. Based on the Fisher score (Kumar et al., 2014), all the genes in each duplicate gene set except the top-scored gene were removed using the algorithm given below.

Input: GO file with duplicate status column
Output: Non redundant GO terms file
Initialize: Read file
push each line into an array
count till end of the file
% repeat until i > count
  split each line and put in another array
  If status column is Duplicate
    eliminate the line
End

The results obtained after executing the algorithm showed a drastic decrease in gene number by removing the redundant genes from the data sets. Table 3 summarizes the result of the first step of redundancy reduction.
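The elimination step of Section 3.1 can be sketched in Python. This is an illustrative reconstruction under stated assumptions, not the authors' code: the record layout (gene, GO id, Fisher score) and the function name `drop_duplicate_go_ids` are hypothetical, and the duplicate status is derived directly from repeated GO ids rather than read from a spreadsheet status column.

```python
def drop_duplicate_go_ids(records):
    """Keep, for every GO id, only the gene with the highest Fisher score.

    `records` is an iterable of (gene, go_id, fisher_score) tuples.
    """
    best = {}
    for gene, go_id, score in records:
        # A GO id seen before is a duplicate; keep whichever gene scores higher.
        if go_id not in best or score > best[go_id][2]:
            best[go_id] = (gene, go_id, score)
    return list(best.values())


# Toy input; the GO id labels and scores are invented for illustration only.
records = [
    ("geneA", "GO:A", 0.42),
    ("geneB", "GO:A", 0.77),
    ("geneC", "GO:B", 0.10),
]
non_redundant = drop_duplicate_go_ids(records)
```

A single pass with a dictionary keyed by GO id avoids the separate marking and deleting stages, at the cost of holding one record per distinct GO id in memory.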
3.2 Semantic similarity between genes in data sets
Out of all the gene sets obtained from the above results, the genes categorized under molecular function were taken to identify the semantic similarity among them using FunSimMat (Schlicker and Albrecht, 2008) (http://funsimmat.bioinf.mpi-inf.mpg.de/). Molecular function represents the ability or job performed by a gene product, whereas biological process and cellular component represent, respectively, recognized series of events or molecular functions, and locations at the level of subcellular structures and macromolecular complexes. Biological process genes and cellular component genes were therefore not considered for finding semantic similarity. The semantic similarity was determined using the Resnik, Jiang and Conrath, and Lin methods and a combination of the Resnik and Lin methods (Couto et al., 2007). All possible combinations of genes in the different data sets were generated, and a semantic value for each combination was generated using the above-mentioned methods. The number of combinations generated in each data set is shown in Table 4. Of these, the semantic values of Resnik-Lin (simRel) were used for calculation, as this method takes into account the relevance information and provides a high relevance in generic terms for the comparison of the exact function of different gene products (Schlicker et al., 2006).
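For reference, the measures named in this section are usually written as follows in the Gene Ontology literature. The notation is an assumption made here, not taken from this paper: $p(c)$ is the annotation probability of term $c$, $IC(c)$ is its information content, and $S(c_1, c_2)$ is the set of common ancestors of $c_1$ and $c_2$. The Jiang and Conrath measure is shown as a distance, which tools convert to a similarity in different ways.

```latex
\begin{aligned}
IC(c) &= -\log p(c) \\
sim_{Resnik}(c_1, c_2) &= \max_{c \in S(c_1, c_2)} IC(c) \\
sim_{Lin}(c_1, c_2) &= \max_{c \in S(c_1, c_2)} \frac{2\, IC(c)}{IC(c_1) + IC(c_2)} \\
dist_{JC}(c_1, c_2) &= \min_{c \in S(c_1, c_2)} \bigl( IC(c_1) + IC(c_2) - 2\, IC(c) \bigr) \\
sim_{Rel}(c_1, c_2) &= \max_{c \in S(c_1, c_2)} \frac{2\, IC(c)}{IC(c_1) + IC(c_2)} \bigl( 1 - p(c) \bigr)
\end{aligned}
```

The factor $(1 - p(c))$ in $sim_{Rel}$ down-weights common ancestors that are very general, which is the relevance information referred to in the text.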
3.3 Pearson correlation coefficient for expression similarity between genes in dataset
To find the expression similarity between two genes, the Pearson correlation coefficient was used. The same set of gene combinations as that returned by the server for semantic similarity was required, so that the average of the semantic and expression similarity values could be calculated. To obtain this set, the following algorithm was executed, which generated the genes with the same combinations as those of the semantic similarity.

Input 1: Semantic similarity file
Input 2: Non redundant GO terms file
Output: GO terms with same set of combination
Initialize: read input1 and input 2
push each line into an array
count till end of the file
% repeat until i > count
  split each line and put in another array
  if GO ids in columns of both the array is same
    print the line
End

Table 3 Number of genes in different hierarchy after first step of redundancy reduction
Accession | Data | Molecular Function | Biological Process | Cellular Component
GSE7146 | Effect of insulin infusion on human skeletal muscle | 1297 | 257 | 38
DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (A) | 1297 | 258 | 38
DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (B) | 878 | 216 | 45
DGAP | Human skeletal muscle-type 2 diabetes | 1296 | 256 | 38

Table 4 Number of different combinations of gene in data sets
Accession | Data | Total Number of Combinations
GSE7146 | Effect of insulin infusion on human skeletal muscle | 840456
DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (A) | 840456
DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (B) | 385881
DGAP | Human skeletal muscle-type 2 diabetes | 840456
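A possible Python sketch of the step in Section 3.3 computes the Pearson coefficient only for the gene pairs that already have a semantic score, so that both scores can later be averaged pair by pair. The function names, the in-memory dictionary and set standing in for the two input files, and the toy expression values are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles.

    Assumes neither profile is constant (a constant profile has zero variance).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def expression_similarity(expression, semantic_pairs):
    """Score every gene pair that also appears in the semantic similarity set."""
    scores = {}
    for a, b in combinations(sorted(expression), 2):
        pair = frozenset({a, b})
        if pair in semantic_pairs:
            scores[pair] = pearson(expression[a], expression[b])
    return scores


# Toy expression profiles over four samples (invented values).
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
semantic_pairs = {frozenset({"g1", "g2"}), frozenset({"g1", "g3"})}
exp_scores = expression_similarity(expression, semantic_pairs)
```

With n genes there are n(n-1)/2 unordered pairs, so the 1297 molecular function genes listed in Table 3 give 1297 × 1296 / 2 = 840456 pairs, the figure reported in Table 4.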
3.4 The Greedy algorithm
The expression value and the semantic value (Resnik-Lin value) for all the different combinations in each gene set were averaged, and the score obtained was used to remove the genes whose average value was more than 0.8 (Mohammadi et al., 2011). Thus, using the threshold of 0.8, only those genes were retained which were highly dissimilar and which mainly contribute to causing a disease. A greedy algorithm approach was used to obtain the unique set of genes.

Input: File with average score
Output: File with unique genes
Initialize: read input
push each line into an array
count till end of the file
Assign cutoff 0.8
% repeat until i > count
  split each line and put in another array
  splice each gene with different combination in separate arrays
  if any of the scores in combination greater than cutoff
    reject the gene
  else
    accept
End

The output file obtained contained all the unique genes, which are dissimilar to each other with a similarity score of less than 0.8. The number of unique genes in each of the data sets is summarized in Table 5.

Table 5 Number of Unique genes in each data set
Accession | Data | Number of Unique genes
GSE7146 | Effect of insulin infusion on human skeletal muscle | 1223
DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (A) | 1210
DGAP | Human pancreatic islets from normal and Type 2 diabetic subjects (B) | 803
DGAP | Human skeletal muscle-type 2 diabetes | 1238
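One way to read the greedy selection of Section 3.4 is sketched below in Python. The pseudocode does not say whether a gene is compared against all genes or only against those already accepted; this sketch assumes the latter, which keeps one representative of each group of similar genes. The function name `greedy_unique_genes`, the input ordering, and the toy scores are hypothetical.

```python
def greedy_unique_genes(genes, scores, cutoff=0.8):
    """Greedily keep genes whose score with every kept gene is <= cutoff.

    `genes` is an ordered list (earlier genes get priority), and `scores`
    maps frozenset({gene_a, gene_b}) to the averaged similarity. Pairs
    missing from `scores` are treated as dissimilar (score 0.0).
    """
    kept = []
    for gene in genes:
        too_similar = any(
            scores.get(frozenset({gene, other}), 0.0) > cutoff for other in kept
        )
        if not too_similar:
            kept.append(gene)
    return kept


# Toy averaged scores (invented): g1 and g2 are redundant, g3 is distinct.
scores = {
    frozenset({"g1", "g2"}): 0.85,
    frozenset({"g1", "g3"}): 0.30,
    frozenset({"g2", "g3"}): 0.35,
}
unique = greedy_unique_genes(["g1", "g2", "g3"], scores)
```

Because the result depends on the order of `genes`, ranking the input by Fisher score first would make the highest-scoring member of each similar group the one that survives. That ordering is a design choice of this sketch, not something stated in the text.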
The unique genes found through this approach were compared with the top 10 genes obtained through Fisher discriminant analysis (Kumar et al., 2014) to check whether the genes which obtained a high ranking in Fisher score are unique or repeated. In the dataset “Effect of insulin infusion on human skeletal muscle”, out of the top 10 genes only 5 were found to be unique. In the same way, for “Human pancreatic islets from normal and Type 2 diabetic subjects (A)” 6 genes out of the top 10, for “Human pancreatic islets from normal and Type 2 diabetic subjects (B)” 4 genes, and for “Human skeletal muscle-type 2 diabetes” 5 genes were found to be unique (Tables 6-9). Most of the unique genes identified in the data sets and reported to have a high Fisher score are involved in some pathway which is reported to be involved directly or indirectly in causing Type II diabetes (Kumar et al., 2014).

Table 6 Unique genes based on the Fisher score (Kumar et al., 2014) for “Effect of insulin infusion on human skeletal muscle”
Table 7 Unique genes based on the Fisher score (Kumar et al., 2014) for “Human pancreatic islets from normal and Type 2 diabetic subjects (A)”
Table 8 Unique genes based on the Fisher score (Kumar et al., 2014) for “Human pancreatic islets from normal and Type 2 diabetic subjects (B)”
Table 9 Unique genes based on the Fisher score (Kumar et al., 2014) for “Human skeletal muscle-type 2 diabetes”

Top 10 genes based on Fisher score (Tables 6-9, combined listing) | Unique or repeated
G0S2: G0/G1switch 2 | Unique
SLC22A6: solute carrier family 22 (organic anion transporter), member 6 | Unique
CDC6: CDC6 cell division cycle 6 homolog (S. cerevisiae) | Repeated
THRAP6: thyroid hormone receptor associated protein 6 | Unique
SCNN1G: sodium channel, nonvoltage-gated 1, gamma | Unique
LPHN3: Latrophilin 3 | Repeated
VPS36: vacuolar protein sorting 36 (yeast) | Unique
LOC441601 /// LOC652471: septin 7 pseudogene /// similar to septin 7 | Repeated
DNAJC1: DnaJ (Hsp40) homolog, subfamily C, member 1 | Unique
KIAA0692: KIAA0692 | Repeated
TLE1: transducin-like enhancer of split 1 (E (sp1) homolog, Drosophila) | Unique
UBXD8: UBX domain containing 8 | Repeated
ANKRD15: Ankyrin repeat domain 15 | Repeated
CAPS: calcyphosine | Unique
Transcribed locus | Unique
Transcribed locus, moderately similar to PREDICTED: similar to XP_517655.1 KIAA0825 protein [Pan troglodytes] | Repeated
MAP2K5: Mitogen-activated protein kinase kinase 5 | Repeated
ZNF559: Zinc finger protein 559 | Repeated
ZNF638: Zinc finger protein 638 | Repeated
ZNF605: zinc finger protein 605 | Repeated
TMEM111: Transmembrane protein 111 | Repeated
CYP7A1: cytochrome P450, family 7, subfamily A, polypeptide 1 | Unique
Hs.247983.0 | Repeated
NSUN5B: NOL1/NOP2/Sun domain family, member 5B | Repeated
ZAK: sterile alpha motif and leucine zipper containing kinase AZK | Repeated
ANKHD1 /// MASK-BP3: ankyrin repeat and KH domain containing 1 /// MASK-4EBP3 alternate reading frame gene | Repeated
ProSAPiP1: ProSAPiP1 protein | Unique
PCDHB3: protocadherin beta 3 | Unique
ZNF688: zinc finger protein 688 | Repeated
HPRT1: hypoxanthine phosphoribosyltransferase 1 (Lesch-Nyhan syndrome) | Unique
C11orf61: chromosome 11 open reading frame 61 | Repeated
CADPS2: Ca2+-dependent activator protein for secretion 2 | Unique
MRNA; cDNA DKFZp686F1844 (from clone DKFZp686F1844) | Unique
COX7A1: cytochrome c oxidase subunit VIIa polypeptide 1 (muscle) | Repeated
CYP3A4: cytochrome P450, family 3, subfamily A, polypeptide 4 | Unique
ZNF267: zinc finger protein 267 | Repeated
CTBP1: C-terminal binding protein 1 | Unique
FMO5: flavin containing monooxygenase 5 | Unique
TMEM106C: transmembrane protein 106C | Unique
PLK1 /// RPL37A: polo-like kinase 1 (Drosophila) /// ribosomal protein L37a | Unique
4 Conclusion
The major problem with microarray data is the high redundancy in the genes, which prevents useful and valuable information from being obtained from the data. The genes in microarray data either have the same GO id or have common ancestors, which makes them similar in molecular function. The redundant genes were removed first on the basis of identical GO ids and then on the basis of the average of the Pearson and semantic similarity scores. The reduction in the number of genes at each step is summarized in Table 10.

Table 10 Reduction in redundancy of genes at each stage
Data | No. of genes obtained from GEO Database | No. of genes in Non-Redundant GO Set | No. of genes obtained after final step
Effect of insulin infusion on human skeletal muscle | 22215 | 6107 | 1223
Human pancreatic islets from normal and Type 2 diabetic subjects (A) | 22191 | 6115 | 1210
Human pancreatic islets from normal and Type 2 diabetic subjects (B) | 22550 | 13175 | 803
Human skeletal muscle-type 2 diabetes | 22177 | 6113 | 1238

The genes obtained after the final redundancy reduction step showed that almost 50% of the genes which got a high score in Fisher discriminant analysis (Kumar et al., 2014) are repeated; these are eliminated, leaving only the unique genes. Further, the most discriminatory genes, which may be major targets for a disease, can be identified by subjecting the unique genes to a classifier such as a support vector machine or any other machine learning approach that can classify the discriminatory genes and eliminate the others.
References
[1] Couto, F.M., Silva, M.J., Coutinho, P.M. 2007. Measuring semantic similarity between Gene Ontology terms. Data & Knowledge Engineering 61, 137-152.
[2] Edgar, R., Domrachev, M., Lash, A.E. 2002. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research 30 (1), 207-210.
[3] Gunton, J.E., Kulkarni, R.N., Yim, S., Okada, T., Hawthorne, W.J., Tseng, Y.H., Roberson, R.S., Ricordi, C., O’Connell, P.J., Gonzalez, F.J., Kahn, C.R. 2005. Loss of ARNT/HIF1beta mediates altered gene expression and pancreatic-islet dysfunction in human type 2 diabetes. Cell 122 (3), 337-349.
[4] Kumar, A., Sharmila, D.J.S., Kant, R. 2014. Selection of Discriminatory Gene Set for Type II Diabetes Using Fisher Linear Discriminant. International Journal of Advanced Computer and Mathematical Sciences 5 (2), 36-42.
[5] Mohammadi, A., Saraee, M.H., Salehi, M. 2011. Identification of disease-causing genes using microarray data mining and Gene Ontology. BMC Medical Genomics 4, 12-19.
[6] Mootha, V.K., Lindgren, C.M., Eriksson, K.F., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstråle, M., Laurila, E., Houstis, N., Daly, M.J., Patterson, N., Mesirov, J.P., Golub, T.R., Tamayo, P., Spiegelman, B., Lander, E.S., Hirschhorn, J.N., Altshuler, D., Groop, L.C. 2003. PGC-1α Responsive Genes Involved in Oxidative Phosphorylation are Coordinately Downregulated in Human Diabetes. Nature Genetics 34 (3), 267-273.
[7] Parikh, H., Carlsson, E., Chutkow, W.A., Johansson, L.E., Storgaard, H., Poulsen, P., Saxena, R., Ladd, C., Schulze, P.C., Mazzini, M.J., Jensen, C.B., Krook, A., Björnholm, M., Tornqvist, H., Zierath, J.R., Ridderstråle, M., Altshuler, D., Lee, R.T., Vaag, A., Groop, L.C., Mootha, V.K. 2007. TXNIP regulates peripheral glucose metabolism in humans. PLOS Med 4 (5), 868-879.
[8] Schena, M., Shalon, D., Davis, R.W., Brown, P.O. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467-470.
[9] Schlicker, A., Albrecht, M. 2008. FunSimMat: a comprehensive functional similarity database. Nucleic Acids Research 36, 434-439.
[10] Schlicker, A., Domingues, F.S., Rahnenführer, J., Lengauer, T. 2006. A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics 7, 302-317.
[11] Zhang, A. 2006. Advanced Analysis of Gene Expression Microarray Data. Danvers. World Scientific Publishing Co.