Semantic Similarity Measurement between Gene Ontology Terms Based on Exclusively Inherited Shared Information
Shu-Bo Zhang, Jian-Huang Lai
PII: S0378-1119(14)01488-7
DOI: 10.1016/j.gene.2014.12.062
Reference: GENE 40176
To appear in: Gene
Received date: 16 July 2014
Revised date: 15 December 2014
Accepted date: 24 December 2014
Please cite this article as: Zhang, Shu-Bo, Lai, Jian-Huang, Semantic Similarity Measurement between Gene Ontology Terms Based on Exclusively Inherited Shared Information, Gene (2014), doi: 10.1016/j.gene.2014.12.062
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
Semantic Similarity Measurement between Gene Ontology Terms Based on Exclusively Inherited Shared Information
Shu-Bo Zhang*
Affiliation: Department of Computer Science, Guangzhou Maritime Institute, Guangzhou, P.R. China.
Postal address: Room 803 Building 88, Dashabei Road, Guangzhou, 510275, P.R. China.
Telephone: +86-20-82384406
Fax: +86-20-82384406
E-mail: [email protected]

Jian-Huang Lai
Affiliation: School of Information Science and Technology, Sun Yat-sen University, Guangzhou, P.R. China.
Postal address: Room 105 Building 110 East District, 135 Xingangxi Road, Guangzhou, 510275, China.
Telephone: +86-20-84110175
Fax: +86-20-84110175
E-mail: [email protected]

* Corresponding author
Competing Interests: The authors have declared that no competing interests exist.
ABSTRACT
Quantifying the semantic similarities between pairs of terms in the Gene Ontology (GO) structure can help to explore the functional relationships between biological entities. A common approach to this problem is to measure the information two terms have in common based on the information content of their common ancestors. However, many existing methods are limited in how they measure the information two GO terms share. This study presents a new measurement, exclusively inherited shared information (EISI), which captures the information shared by two terms based on an intuitive observation about the multiple inheritance relationships among the terms in the GO graph. EISI is derived from the information content of the exclusively inherited common ancestors (EICAs), which are screened from the common ancestors according to the attributes of their direct children. The effectiveness of EISI was evaluated against some state-of-the-art measurements on both artificial and real datasets: it produced results that correlated better with experts' scores on the artificial dataset, and supported the prior knowledge of gene function in pathways on the Saccharomyces genome database (SGD). The promising features of EISI are the following: (1) it provides a more effective way to characterize the semantic relationship between two GO terms by taking into account multiple related common ancestors, and (2) it can quickly detect all EICAs with time complexity of O(n), which is much more efficient than other methods based on disjunctive common ancestors. It is a promising alternative to multiple inheritance based methods for practical applications on large-scale datasets. The EISI algorithm was implemented in Matlab and is freely available from http://treaton.evai.pl/EISI/.
Key words: Semantic similarity measurement; Gene Ontology; Information content; Common ancestors; Exclusively inherited common ancestors;
1. INTRODUCTION
Comparison of biological entities is important in biological research, as it can help to explore the regulatory or functional relationships between gene products or genes (collectively called genes hereafter for simplicity) and contribute to the inference of the biological roles and functions of genes. The traditional approach to this issue is based on comparative experiments, which are costly and time consuming; other methods compare the sequences or structures of genes[1] by means of bioinformatics approaches. The advent of high-throughput technologies has produced a wealth of heterogeneous biological data related to the functional annotation of genes. This provides a promising way to compare genes at the functional level on aspects that could otherwise not be compared. However, comparing genes based on such a huge amount of diverse biomedical data is a challenging task, as the datasets are usually constructed in an unconsolidated way. This led to the introduction of various biological ontologies. The Gene Ontology (GO) project is one of those that provide a consolidated description of gene function for data from different resources. It can be used to explore the functional relationship between two biological entities[2], and has a variety of applications in fields such as gene function prediction[3, 4], gene expression data analysis[5, 6], gene clustering[7, 8], disease gene prioritization[9, 10], analysis of protein interactions[11, 12], and so on.

The Gene Ontology comprises two parts, the GO graph and the GO annotation[13]. The former is composed of controlled terms organized in three orthogonal aspects: biological process (BP), molecular function (MF), and cellular component (CC), and is structured as a Directed Acyclic Graph (DAG)[14]. In the GO graph, the nodes are controlled terms with specific biological meaning, such as a biological process, molecular function, or cellular localization, while the edges link nodes and characterize the relationships among terms. The most common relationships are 'is-a' and 'part-of'. The GO annotation builds a bridge between GO terms and genes: it provides annotation information for genes with controlled terms in the GO graph. When a gene is annotated with a GO term, it is also annotated with all the ancestors of that term in the GO graph[15]. Moreover, this gene is relevant to other genes that are annotated with the same term, as well as with the ancestors and descendants of that term[2]. This suggests that we can compare two genes at the functional level by measuring the semantic similarity of their GO terms.
In recent years, research on semantic similarity measurement between GO terms has drawn more and more attention from the bioinformatics community. A variety of metrics have been introduced, and several software tools have been proposed for calculating the semantic similarities of GO terms, including Fussimeg[16], FunSimMat[17], G-SESAME[18], GFSAT[19], GOSemSim[20] and SORA[1]. Traditional approaches to semantic similarity measurement between GO terms are generally classified into three categories: edge-based methods[21-24] (also called structure-based approaches), which define the semantic similarity based on the conceptual distance derived from information related to the length or type of edges in the GO graph; node-based methods[25-28] (also called annotation-based or information content-based methods), where the nodes and their properties are used to compute the information content for similarity; and hybrid methods[29-33], which combine the information content with the GO graph structure of GO terms.

Node-based semantic similarity measures are possibly the most frequently mentioned metrics in the literature[34]. This category of approaches is established on the basis of information theory, and the underlying principle is that the more information two concepts have in common, the more similar they are. The information of a concept is quantified by its information content (IC), which rests upon the probability that it occurs in the GO graph[32, 35] or in a corpus[25]. The information content is an indicator of how informative and specific a concept is, and is defined as the negative logarithm of the probability that the concept appears. Resnik[36] proposed a measure based on the information content of the most informative ancestor, which is identified by calculating the IC values of all common ancestors the two terms share and selecting the one with the maximal value. Since the similarity value of Resnik's measurement may be larger than one, Lin[26] and Jiang and Conrath[29] proposed improved schemes that normalize the similarity value to (0,1). Nevertheless, these two measurements are still built on Resnik's measurement, which only considers the information content of a single common ancestor, namely the Most Informative Common Ancestor (MICA) that is inherited by both terms. This is appropriate when the GO graph is a tree, but it becomes problematic in the DAG structure of GO, as a node may have more than one parent node and thus some biological information inherited from some ancestors will be neglected.
To address the problem caused by multiple inheritance, Couto et al. employed the concept of disjunctive common ancestors and defined a graph-based similarity measure (GraSM)[37], where the information two terms share is derived from all their disjunctive common ancestors by taking the average of their information content. They later updated GraSM and proposed a new method, dubbed Disjunctive Shared Information (DiShIn)[28], to address the computational complexity problem caused by the recursive definition of disjunctive common ancestors and the problem caused by parallel interpretations shared by two terms. Both GraSM and DiShIn can be directly integrated into any semantic similarity measure based on the MICA[28]. However, the dynamic implementation of GraSM and DiShIn is rather time consuming, as they need to search for the paths between pairs of nodes in the GO graph. To circumvent this problem, the authors performed a preliminary calculation and stored the results in a database for later computation.

The focus of this paper is to follow the information theoretic vein and propose a novel approach that measures the semantic similarity between two GO terms. We introduce a new measurement of shared information, exclusively inherited shared information (EISI), which quantifies the information two terms have in common based on some informative common ancestors. EISI is based on the observation that only those common ancestors that are inherited exclusively contribute to the shared information of two terms. The common ancestor set is first constructed, each element of which denotes a node that is inherited by both terms. Then all common ancestors are checked, and those that have direct descendants inherited exclusively by either term are considered to be exclusively inherited common ancestors (EICAs). Finally, the information content shared by the two terms is calculated by taking the average of the information content of all EICAs. Experiments were conducted on an artificial dataset and the Saccharomyces genome database (SGD), and the results showed that the similarity measurement based on EISI correlated better with experts' scores on the artificial dataset, and supported the prior knowledge of classification information in pathways on SGD. Our measurement has the following advantages: (1) it provides a more effective way to characterize the relationship between two GO terms by considering multiple common ancestors, and (2) it can quickly detect all EICAs with time complexity of O(n), which is much more efficient than other methods based on disjunctive common ancestors. It is a promising alternative to multiple inheritance based methods for practical application.
2. METHODS
To take into account multiple common ancestors in an effective way, this paper proposes a new measurement to quantify the information shared by two GO terms, based on the exclusively inherited common ancestors (EICAs) they have in common. Like GraSM and DiShIn, EISI takes into account multiple common ancestors that two GO terms share, and defines their common information as the average of the information content of these common ancestors. However, EISI considers a common ancestor to be informative only if it has a direct descendant that is inherited exclusively by either of the terms. This means that not all common ancestors are considered in the EISI algorithm, which may reduce the computational complexity of calculating the shared information content.
2.1 Related work
Over the years, great efforts have been devoted to measuring the semantic similarity between GO terms based on information content[25, 26, 28, 29, 36-39]. Among these, the methods proposed by Resnik[36], Lin[26] and Jiang and Conrath[29] have received much attention. According to Resnik, the information two terms share is derived from their most informative common ancestor. The common ancestor set of two terms can be defined as

$$CA(t_1, t_2) = \{t : t \in parent(t_1) \wedge t \in parent(t_2)\} \tag{1}$$

where $parent(t_1)$ and $parent(t_2)$ are the parent node sets of $t_1$ and $t_2$, respectively. Let $t_1$ and $t_2$ be two terms; the information they share can be calculated as the information content of their most informative common ancestor,

$$share_{Resnik}(t_1, t_2) = \{IC(t) \mid t \in CA(t_1, t_2) \wedge (\forall t_i \in CA(t_1, t_2),\ IC(t_i) \le IC(t))\} \tag{2}$$

$IC(t)$ and $IC(t_i)$ in formula (2) denote the information content of terms $t$ and $t_i$, respectively. The information content of a GO term $t$ is defined as

$$IC(t) = -\log p(t) \tag{3}$$

where $p(t)$ is the probability that $t$ occurs in a certain GO annotation corpus (e.g., the GOA database). The shared information content is then used to measure the semantic similarity between $t_1$ and $t_2$,

$$sim_{Resnik}(t_1, t_2) = share_{Resnik}(t_1, t_2) = \max\{IC(t) \mid t \in CA(t_1, t_2)\} \tag{4}$$
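As a minimal sketch of Eqs. (1)-(4), the Python snippet below computes the information content and Resnik's shared information on a small made-up ontology. The term names, the `PARENTS` map and the probabilities in `PROB` are hypothetical and are not GO data; the sketch also assumes that the parent sets in Eq. (1) are taken transitively (all ancestors) and that the logarithm is natural.

```python
import math

# Hypothetical toy ontology: child -> list of direct parents (not real GO data).
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["a"],
}

# Hypothetical annotation probabilities p(t); the root covers every annotation.
PROB = {"root": 1.0, "a": 0.5, "b": 0.4, "c": 0.1, "d": 0.2}


def ancestors(term):
    """All proper ancestors of `term`, following parent links transitively."""
    found = set()
    stack = list(PARENTS[term])
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(PARENTS[node])
    return found


def information_content(term):
    """Eq. (3): IC(t) = -log p(t); the natural logarithm is an assumption."""
    return -math.log(PROB[term])


def share_resnik(t1, t2):
    """Eqs. (2) and (4): IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(information_content(t) for t in common)


# Common ancestors of c and d are {a, root}; a is the more informative one.
print(share_resnik("c", "d"))
```

In this toy case the common ancestor `b` of neither pair contributes, which mirrors the limitation discussed in the text: only one ancestor determines the result.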
Since the maximum of Resnik's measurement can be greater than one, Lin[26] proposed a normalized version, which defines the similarity as the ratio between the information content of the most informative common ancestor and that of both terms,

$$sim_{Lin}(t_1, t_2) = \frac{2 \times share_{Resnik}(t_1, t_2)}{IC(t_1) + IC(t_2)} \tag{5}$$

Another information theoretic semantic similarity metric based on the information content was introduced by Jiang and Conrath[29]. They proposed a combined approach that takes into account both the information content and the conceptual distance. The original model is rather complex, as it considers several other factors such as node depth, local density and link type, and it is often simplified as follows by considering only the information content,

$$D_{Jiang}(t_1, t_2) = IC(t_1) + IC(t_2) - 2 \times share_{Resnik}(t_1, t_2) \tag{6}$$

Note that this measure quantifies semantic distance rather than semantic similarity. We can convert the distance value into a similarity metric according to the following formula:

$$sim_{Jiang}(t_1, t_2) = \frac{1}{1 + D_{Jiang}(t_1, t_2)} \tag{7}$$
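The normalizations in Eqs. (5)-(7) can be sketched as three small Python functions that take the shared information as a parameter, so that any definition of shared information can be plugged in. The function names and the numeric inputs are illustrative assumptions, not part of the original method description.

```python
def sim_lin(share, ic1, ic2):
    """Eq. (5): shared information normalised by the terms' own IC."""
    return 2.0 * share / (ic1 + ic2)


def dist_jc(share, ic1, ic2):
    """Eq. (6): simplified Jiang-Conrath distance."""
    return ic1 + ic2 - 2.0 * share


def sim_jc(share, ic1, ic2):
    """Eq. (7): turn the distance into a similarity."""
    return 1.0 / (1.0 + dist_jc(share, ic1, ic2))


# Hypothetical IC values, chosen only so the arithmetic is easy to follow.
share, ic1, ic2 = 2.0, 3.0, 5.0
print(sim_lin(share, ic1, ic2))  # 0.5
print(sim_jc(share, ic1, ic2))   # 0.2
```

Note that `sim_lin` is undefined when both terms have zero information content (for example, two copies of the root term); a caller would need to handle that case separately.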
To address the problem caused by multiple inheritance, Couto et al.[37] employed the concept of disjunctive common ancestors (DCAs) and proposed a graph-based similarity measure (GraSM), where they redefined the shared information content as the average of the information content of all disjunctive common ancestors. According to Couto, a common ancestor of two terms is regarded as a disjunctive common ancestor if there exists a path from one of the terms to that ancestor which is different from any other path from the other term to the same common ancestor. Using this concept, Couto defined GraSM as follows:

$$sim_{GraSM}(t_1, t_2) = \overline{\{-\log p(t) \mid t \in DCA(t_1, t_2)\}} \tag{8}$$

where the overline denotes the average and $DCA(t_1, t_2)$ denotes the disjunctive common ancestors of the two terms:

$$DCA(t_1, t_2) = \{a_1 \mid a_1 \in CA(t_1, t_2) \wedge \forall a_2 : (a_2 \in CA(t_1, t_2) \wedge IC(a_1) \le IC(a_2) \wedge a_1 \ne a_2) \Rightarrow ((a_1, a_2) \in DA(t_1) \cup DA(t_2))\}$$

Here $CA(t_1, t_2)$ represents the common ancestors of $t_1$ and $t_2$, and $DA(t)$ is the set of disjunctive ancestors of a term:

$$DA(t) = \{(a_1, a_2) \mid (\exists p : p \in paths(a_1, t) \wedge a_2 \notin p) \wedge (\exists p : p \in paths(a_2, t) \wedge a_1 \notin p)\}$$

where $paths(a, t)$ gives the set of distinct paths from $t$ to $a$ in the GO graph.
For GraSM, the disjunctive common ancestors of a term are defined in a recursive way, and the computational complexity of detecting the DCAs of two GO terms is non-linear. This limits large-scale studies in real-time scenarios. Moreover, the shared information decreases even when the disjunctive common ancestors are inherited in parallel. To overcome the problem caused by parallel inheritance, Couto and colleagues proposed an updated measurement, DiShIn, which calculates the shared information content between two terms by counting the number of distinct paths from common ancestors to the terms[28]. DiShIn increases the similarity value of terms that inherit information from a disjunctive common ancestor in parallel, as this indicates a stronger relation than when the ancestor is inherited exclusively by either term. Both GraSM and DiShIn opened a new door to semantic similarity measurement between GO terms by taking into account multiple common ancestors. Recently, DiShIn was even proposed as a solution for the next generation of similarity measures that can fully explore the semantics in ontologies[40].
2.2 Exclusively inherited shared information (EISI)
In the DAG structure of the GO graph, the nodes denote terms that represent the controlled biological vocabulary related to the molecular function, biological process or cellular component information of gene products, and the edges link different terms to each other by certain relationships, such as "part-of", "is-a", "regulates", and so on. All terms in the graph are organized hierarchically: a term closer to the root of the graph has a more general biological meaning, whereas a term far from the root has a more specific biological meaning. In the hierarchical structure of the GO graph, there exist inheritance relationships among the terms in a path; that is to say, a term in a lower position inherits some meaning from its ancestors, which are biologically more general. Thus, the information becomes more and more specific from ancestral terms to descendant terms, and the semantics of GO terms also become more and more specific.
Due to the semantic inheritance property between terms in the GO graph, information is transmitted from parents to children step by step. In a given path from ancestor to descendant, the information of a node's ancestors is redundant for its children, as that node concentrates all the information related to its child nodes from its ancestors. Based on this observation, one may naturally think that all the ancestor nodes of a given common ancestor are redundant for the two child terms under investigation. This is right in the case of single inheritance; however, it may be problematic in the case of multiple inheritance, as there may exist some paths from a common ancestor to one of the children that are not shared by the other child. That is to say, a common ancestor may provide some information to one of its descendants exclusively, and this specific information will vanish when we delete that common ancestor, even though other shared information remains unchanged. Thus, previous similarity measures that considered all the common ancestors or only the most informative ancestor may have their limitations. In order to address this problem, we define a common ancestor C as an exclusively inherited common ancestor (EICA) of its descendants A and B if C is directly inherited by some node that is inherited exclusively by A or B. According to this definition, an EICA provides some information specific to either A or B through an exclusive path connecting the EICA and one of the two descendant terms. Thus, the exclusively inherited common ancestors (EICAs) contain information that contributes to the relationship between two terms, and they can be used to derive the similarity measurement of GO terms.

Figure 1 gives three toy examples to illustrate the inheritance relationship and the exclusively inherited common ancestors of two terms in the GO graph. Figure 1(a) shows a small example of single inheritance. In this scenario, $n_3$ is an exclusively inherited common ancestor of $n_4$ and $n_5$, as it has two direct child nodes that are not included in the common ancestor set $\{n_1, n_2, n_3\}$. Since $n_1$ is inherited by $n_2$, and $n_2$ itself is inherited by $n_3$, $n_3$ is the most informative ancestral node with the most specific semantics to describe the relationship between $n_4$ and $n_5$. This example makes no difference from the case in which the ontology is a tree. Figure 1(b) is a multiple inheritance example where $n_4$ is the only EICA. In this context, $\{n_1, n_2, n_3, n_4\}$ is the common ancestor set of $n_5$ and $n_6$; the information of $n_1$ is inherited by $n_2$, and that of $n_2$ is shared by $n_3$ and $n_4$ and then combines into $n_4$, which is inherited by $n_5$ and $n_6$. All information from the common ancestors concentrates on $n_4$ and is shared by $n_5$ and $n_6$. This means that the information two terms share can be characterized by the most informative node if all direct children of a common ancestor are included in the common ancestor set. Figure 1(c) gives another example of multiple inheritance, with two exclusively inherited common ancestors. In this graph, $n_2$ inherits semantics from $n_1$ and is shared by $n_3$ and $n_4$; $n_3$ is an EICA, as it is shared by $n_5$ and $n_6$ through distinct paths. $n_4$ is a direct child of $n_2$ and is not included in the common ancestor set $\{n_1, n_2, n_3\}$, so $n_2$ is also an EICA of $n_5$ and $n_6$.
Figure 1. Toy examples of the inheritance relationship and the exclusively inherited common ancestors (EICAs) in GO. In (a), n1, n2 and n3 are common ancestors of n4 and n5, and n3 is their EICA; in (b), n1, n2, n3 and n4 are common ancestors of n5 and n6, and n4 is their EICA; in (c), n1, n2 and n3 are common ancestors of n5 and n6, and n2 and n3 are their EICAs.
Given two terms $t_1$ and $t_2$ in the GO graph, we can find their common ancestor set $CA(t_1, t_2)$ and define their exclusively inherited common ancestor set as

$$EICA(t_1, t_2) = \{t : t \in CA(t_1, t_2) \wedge \exists t_x : (t_x \in Dchild(t) \wedge t_x \notin CA(t_1, t_2) \wedge (t_x \in A(t_1) \vee t_x \in A(t_2)))\} \tag{9}$$

where $Dchild(t)$ is the set of direct descendants of node $t$, and $A(t_1)$ and $A(t_2)$ are the sets containing the ancestral nodes of $t_1$ and $t_2$, respectively, as well as the node itself. The information content shared by $t_1$ and $t_2$ is then called the exclusively inherited shared information (EISI), which can be quantified by simply taking the average of the information content of all the EICAs:

$$share_{EISI}(t_1, t_2) = \frac{1}{N} \sum_{t_i \in EICA(t_1, t_2)} IC(t_i) \tag{10}$$

where $N$ denotes the number of elements in $EICA(t_1, t_2)$.
EISI is a metric that quantifies the information shared by two GO terms; it can be used to measure the similarity of two terms in the same way as the shared information content in Resnik's measurement. Moreover, it can also be embedded into other information content-based semantic similarity measures, such as Lin's and Jiang and Conrath's measurements. After replacing the shared information content in Resnik's, Lin's and Jiang and Conrath's similarity measurements with EISI, we get three variants of similarity measures:

$$sim_{Resnik\_EISI}(t_1, t_2) = share_{EISI}(t_1, t_2) \tag{11}$$

$$sim_{Lin\_EISI}(t_1, t_2) = \frac{2 \times share_{EISI}(t_1, t_2)}{IC(t_1) + IC(t_2)} \tag{12}$$

$$sim_{JC\_EISI}(t_1, t_2) = \frac{1}{1 + IC(t_1) + IC(t_2) - 2 \times share_{EISI}(t_1, t_2)} \tag{13}$$
To explain how to identify the EICAs of two GO terms and compute the EISI based on the EICAs, we take a fragment of GO as an example. Figure 2 is a sub-graph consisting of eight GO terms and the corresponding relationships among these terms. $CA(n_6, n_7) = \{n_0, n_1, n_2, n_3\}$ is the common ancestor set shared by $n_6$ and $n_7$. Among the elements of this set, $n_0$ and $n_1$ are redundant, as each of them is inherited by a single direct child ($n_1$ and $n_2$, respectively) that is itself a common ancestor. In contrast, $n_2$ and $n_3$ are non-redundant, since their direct child sets $\{n_3, n_4\}$ and $\{n_6, n_7\}$ contain elements that are not inherited by both $n_6$ and $n_7$ ($n_4$ in $\{n_3, n_4\}$; $n_6$ and $n_7$ in $\{n_6, n_7\}$). Thus, $n_2$ and $n_3$ are the EICAs of $n_6$ and $n_7$, i.e., $EICA(n_6, n_7) = \{n_2, n_3\}$. The probability that each term occurs in the GO database and the corresponding information content are listed in Table 1. The information content shared by $n_6$ and $n_7$ is $share_{EISI}(n_6, n_7) = (IC(n_2) + IC(n_3)) / 2 = 3.3658$. Accordingly, the semantic similarity between $n_6$ and $n_7$ with and without using EICAs can be calculated as

$$sim_{Resnik}(n_6, n_7) = share_{Resnik}(n_6, n_7) = 3.6424$$

$$sim_{Lin}(n_6, n_7) = \frac{2 \times share_{Resnik}(n_6, n_7)}{IC(n_6) + IC(n_7)} = 0.6508$$

$$sim_{JC}(n_6, n_7) = \frac{1}{1 + IC(n_6) + IC(n_7) - 2 \times share_{Resnik}(n_6, n_7)} = 0.2037$$

$$sim_{Resnik\_EISI}(n_6, n_7) = share_{EISI}(n_6, n_7) = \frac{1}{N} \sum_{a_i \in EICA(n_6, n_7)} IC(a_i) = 3.3658$$

$$sim_{Lin\_EISI}(n_6, n_7) = \frac{2 \times share_{EISI}(n_6, n_7)}{IC(n_6) + IC(n_7)} = 0.6014$$

$$sim_{JC\_EISI}(n_6, n_7) = \frac{1}{1 + IC(n_6) + IC(n_7) - 2 \times share_{EISI}(n_6, n_7)} = 0.1831$$
Table 1. The information content (IC) of the GO terms presented in Figure 2

| GO term | Frequency | Probability | IC |
|---|---|---|---|
| n0: biological process | 23254 | 1 | 0 |
| n1: establishment of localization | 3864 | 0.1662 | 1.7948 |
| n2: protein localization | 1059 | 0.0455 | 3.0892 |
| n3: cellular protein localization | 609 | 0.0262 | 3.6424 |
| n4: establishment of protein localization | 889 | 0.0382 | 3.2641 |
| n5: protein transport | 866 | 0.0372 | 3.2903 |
| n6: intracellular protein transport | 465 | 0.0200 | 3.9122 |
| n7: protein localization to paranode region of axon | 16 | 0.0007 | 7.2816 |
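The figures in Table 1 and in the worked example can be re-derived with a few lines of Python. This is only a checking sketch: it assumes the natural logarithm (which is what reproduces the IC column) and that the probability of a term is its frequency divided by the frequency of the root term n0. The variable names are illustrative.

```python
import math

TOTAL = 23254  # frequency of the root term n0 in Table 1

FREQ = {"n0": 23254, "n1": 3864, "n2": 1059, "n3": 609,
        "n4": 889, "n5": 866, "n6": 465, "n7": 16}

# IC(t) = -ln(freq(t) / TOTAL); the natural log is an assumption that
# matches the published IC column to four decimals.
IC = {t: -math.log(f / TOTAL) for t, f in FREQ.items()}

ic6, ic7 = IC["n6"], IC["n7"]
share_resnik = IC["n3"]                    # the MICA of n6 and n7 is n3
share_eisi = (IC["n2"] + IC["n3"]) / 2.0   # EICA(n6, n7) = {n2, n3}

for name, share in (("Resnik", share_resnik), ("EISI", share_eisi)):
    lin = 2.0 * share / (ic6 + ic7)
    jc = 1.0 / (1.0 + ic6 + ic7 - 2.0 * share)
    print(f"{name}: share={share:.4f} Lin={lin:.4f} JC={jc:.4f}")
```

Rounded to four decimals, the printed values should agree with the six similarity values reported in the worked example.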
As shown in Figure 2, protein localization and cellular protein localization are the two EICAs of protein localization to paranode region of axon and intracellular protein transport. Intracellular protein transport integrates information inherited from protein localization through two paths: one passes through cellular protein localization, and the other passes through protein transport. In contrast, protein localization to paranode region of axon only inherits information from cellular protein localization. This means that the two terms (i.e., $n_6$ and $n_7$) are associated by their EICAs from two aspects. Since the information content of protein localization is smaller than that of cellular protein localization, the semantic similarity with EICAs is smaller than that without EICAs.
Figure 2. An example of multiple inheritance and the exclusively inherited common ancestors (EICAs) from the biological process aspect of GO. The nodes in blue (n0, n1, n2 and n3) are the common ancestors of n6 and n7, while those in blue with dashed circles (n2 and n3) are their EICAs.
2.3 Computational aspect of EISI
The procedure for the implementation of EISI includes four steps: (1) identify the common ancestors of two terms; (2) sift the exclusively inherited common ancestors from the common ancestor set; (3) compute the information content of each term in the exclusively inherited common ancestor set; (4) compute the average of the information content. Steps (1) and (3) are the most time-consuming in this procedure, as they must first find the ancestor or descendant nodes of each term with time complexity of $O(n \log n)$, and thus the time complexities of these two steps are $O(2n \log n)$ and $O(n^2 \log n)$, respectively. In order to reduce the run time in real-time applications, we can perform a preliminary computation step for steps (1) and (3) to identify the ancestors and descendants and to calculate the information content of each GO term. In step (2), each term in the common ancestor set is checked against the exclusive inheritance condition, and the time complexity of screening the EICAs is $O(n)$. Thus, in steps (1) and (3), we only need to look up the ancestor and descendant sets of each term in the pre-computed results with time complexity of $O(\log n)$, and the computation of EISI can be achieved in time $O(n + (2 + n) \log n)$. The algorithm for the implementation of EISI is shown in Figure 3.
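The preliminary computation step described above can be sketched as follows. The function names and the toy graph are hypothetical, and the sketch stores the results in Python dictionaries, whose lookups take expected constant time; this is an implementation detail that differs from the $O(\log n)$ lookup assumed in the text.

```python
def precompute_ancestors(parents):
    """Map every term to the set of all its ancestors.

    `parents` maps a term to its direct parents. Each term is resolved once
    and memoised, so later lookups are plain dictionary reads.
    """
    closure = {}

    def resolve(term):
        if term not in closure:
            result = set()
            for parent in parents[term]:
                result.add(parent)
                result |= resolve(parent)
            closure[term] = result
        return closure[term]

    for term in parents:
        resolve(term)
    return closure


def invert(parents):
    """Build the term -> direct children map from the child -> parents map."""
    children = {term: set() for term in parents}
    for term, direct_parents in parents.items():
        for parent in direct_parents:
            children[parent].add(term)
    return children


# Hypothetical toy DAG (not real GO data).
PARENTS = {"root": [], "a": ["root"], "b": ["root"], "c": ["a", "b"], "d": ["c"]}
ANCESTORS = precompute_ancestors(PARENTS)
CHILDREN = invert(PARENTS)
print(sorted(ANCESTORS["d"]))    # ['a', 'b', 'c', 'root']
print(sorted(CHILDREN["root"]))  # ['a', 'b']
```

The recursive helper is adequate for shallow hierarchies; for very deep graphs an iterative traversal in topological order would avoid Python's recursion limit.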
Algorithm cal_EISI(t1, t2)
Input: t1, t2, AncestorSet, ChildSet, ICSet
Output: EISI
Begin
    CommonAnSet ← GetCommonAnSet(t1, t2, AncestorSet)
    EICommonAnSet ← ∅
    UnionAnSet ← GetAnSet(t1, AncestorSet) ∪ GetAnSet(t2, AncestorSet)
    DiffAnSet ← UnionAnSet − CommonAnSet
    for each a in CommonAnSet do
        DirectChildSet ← GetDirectDescendant(a, ChildSet)
        tmpset ← DiffAnSet ∩ DirectChildSet
        if tmpset ≠ ∅
            EICommonAnSet ← EICommonAnSet ∪ { a }
        endif
    endfor
    EISI ← 0
    n ← 0
    for each eica in EICommonAnSet do
        EISI ← EISI + get_IC(eica, ICSet)
        n ← n + 1
    endfor
    return EISI / n
End
Figure 3. The algorithm for the implementation of EISI
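The pseudocode in Figure 3 can be transcribed into Python roughly as follows. This is a sketch under stated assumptions, not the authors' Matlab implementation: the edge list is inferred from the textual description of Figure 2, the IC values are copied from Table 1, the ancestor sets handed to the function exclude the term itself, and the function additionally returns the EICA set and guards against an empty set, neither of which appears in Figure 3.

```python
def cal_eisi(t1, t2, ancestors, children, ic):
    """Average IC of the exclusively inherited common ancestors of t1 and t2.

    ancestors: term -> set of all ancestors (the term itself excluded)
    children:  term -> set of direct children
    ic:        term -> information content
    """
    a1 = ancestors[t1] | {t1}          # A(t1) includes t1 itself
    a2 = ancestors[t2] | {t2}
    common = ancestors[t1] & ancestors[t2]
    diff = (a1 | a2) - common          # nodes outside the common ancestor set
    eicas = {a for a in common if children[a] & diff}
    if not eicas:                      # guard the division (not in Figure 3)
        return 0.0, eicas
    return sum(ic[a] for a in eicas) / len(eicas), eicas


# Child -> direct parents; edges inferred from the description of Figure 2.
PARENTS = {
    "n0": [], "n1": ["n0"], "n2": ["n1"], "n3": ["n2"], "n4": ["n2"],
    "n5": ["n4"], "n6": ["n3", "n5"], "n7": ["n3"],
}
# IC values copied from Table 1.
IC = {"n0": 0.0, "n1": 1.7948, "n2": 3.0892, "n3": 3.6424,
      "n4": 3.2641, "n5": 3.2903, "n6": 3.9122, "n7": 7.2816}


def all_ancestors(term, parents):
    """Transitive closure of the parent relation for one term."""
    found, stack = set(), list(parents[term])
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(parents[node])
    return found


ANCESTORS = {t: all_ancestors(t, PARENTS) for t in PARENTS}
CHILDREN = {t: {c for c, ps in PARENTS.items() if t in ps} for t in PARENTS}

value, eicas = cal_eisi("n6", "n7", ANCESTORS, CHILDREN, IC)
print(sorted(eicas), round(value, 4))  # ['n2', 'n3'] 3.3658
```

The screening loop touches each common ancestor once and performs one set intersection per ancestor, which corresponds to the linear screening step discussed in Section 2.3.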
2.4 The relationship between EICA and DCA
According to Couto[28], the paths from different disjunctive ancestors to a child node in the GO graph are independent, as there exists at least one node that is in one path but not in the other paths, and thus different disjunctive ancestors represent distinct interpretations of a term. For a disjunctive common ancestor of two terms, this means that the paths from this ancestor and from all its children (those that are also common ancestors of the two terms) to either term are independent, which implies that a disjunctive common ancestor can provide distinct interpretations to either term. As mentioned above, an exclusively inherited common ancestor can provide different information to its descendant nodes. In this respect, a disjunctive common ancestor is related to an exclusively inherited common ancestor, i.e., both can provide distinct interpretations of a term. However, they are not exactly the same. According to the definition of EICA, a common ancestor being an EICA of two terms means that, from this node downwards, we can find a direct child inherited exclusively by one of the terms, and that the path from the direct child to that term and the paths from the EICA or any more informative ancestor to the same term are independent. This means that the EICA is a DCA, as the EICA and each more informative ancestor are disjunctive ancestors of the term. On the other hand, a DCA may not be an EICA, because there is a tougher restriction on an EICA, i.e., an EICA should be inherited by either term through some exclusive paths. Based on the above intuition, we formalize the relationship between EICA and DCA as the following proposition.
Proposition 1: Suppose that $c_1$ and $c_2$ are two GO terms, and that $DCA(c_1, c_2)$ and $EICA(c_1, c_2)$ are their DCA and EICA sets, respectively. If $c \in EICA(c_1, c_2)$, then $c \in DCA(c_1, c_2)$ holds.
Proof. Let $Ancestor(c)$ denote the ancestor set of a term $c$, $Dchild(c)$ the set of all direct children of $c$, and $CA(c_1, c_2)$ the common ancestor set of $c_1$ and $c_2$. According to the definition of EICA, if $c \in EICA(c_1, c_2)$, then there exists a $c_x$ such that $(c_x \in Dchild(c) \wedge c_x \notin CA(c_1, c_2))$ and $(c_x \in Ancestor(c_1) \vee c_x \in Ancestor(c_2))$.
Without loss of generality, let $c_x \in Ancestor(c_1)$ with $(c_x \in Dchild(c) \wedge c_x \notin CA(c_1, c_2))$. Then $c_x$ is an ancestor of $c_1$ and is inherited only by $c_1$ and by those ancestors of $c_1$ that are more informative than $c_x$, so there must be a path from $c_x$ to $c_1$, denoted by $c_x \rightarrow c_1$, that passes through children of $c_x$ inherited only by $c_1$ and does not pass through any common ancestor. On the other hand, for $c$ and each common ancestor $c_i$ that is more informative than $c$, we can find a path from $c_i$ to $c_1$ that passes through other common ancestors but through no node on the path $c_x \rightarrow c_1$. This means that $c_x$ and every element of the common ancestor set $\{c_i \mid c_i \in CA(c_1, c_2) \wedge IC(c_i) > IC(c)\}$ are disjunctive common ancestors of $c_1$; that is to say, $c$ is a disjunctive common ancestor of $c_1$ and $c_2$, i.e., $c \in DCA(c_1, c_2)$.
Figure 4. A toy example of a directed acyclic graph showing the difference between disjunctive common ancestors (DCAs) and exclusively inherited common ancestors (EICAs). n0, n3 and n4 are the EICAs of n5 and n6, while n0, n2, n3 and n4 are the DCAs of n5 and n6.
Figure 4 illustrates a toy example of a directed acyclic graph. The common ancestor set of nodes n5 and n6 is {n0, n2, n3, n4}. By checking the inheritance property of the direct children of each element in this set, we see that n0, n3 and n4 are the EICAs of n5 and n6, while n2 is not an EICA because all its direct children (i.e., n3 and n4) are inherited by both of them. As for disjunctive common ancestors, n0 and n2 are disjunctive common ancestors of n5, since the path n2, n3, n5 does not pass through n0, and the path n0, n1, n5 does not pass through n2 either. In the same way, we have DCA(n5) = {n0, n1, n2, n3, n4} and DCA(n5, n6) = {n0, n2, n3, n4}. From this example, we see that an exclusively inherited common ancestor of two terms is also a disjunctive common ancestor, whereas the converse is not true. That is to say, the EICA set may contain fewer terms than the DCA set. Our approach therefore needs less computation to obtain the shared information content than methods based on DCAs, which makes it more feasible for large-scale datasets in practice.
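The EICA check described for the toy example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the edge list below is hypothetical (the edges of Figure 4 are not given in the text and were chosen to agree with its description), the helper names are ours, and a term is assumed to belong to its own ancestor set.

```python
from collections import defaultdict

# Hypothetical parent -> child edges, chosen to agree with the description of
# Figure 4 in the text; the real edge list of the figure is an assumption here.
EDGES = [
    ("n0", "n1"), ("n0", "n2"), ("n1", "n5"),
    ("n2", "n3"), ("n2", "n4"),
    ("n3", "n5"), ("n3", "n6"), ("n4", "n5"), ("n4", "n6"),
]

def build_graph(edges):
    """Return (children, parents) adjacency maps for a DAG."""
    children, parents = defaultdict(set), defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)
        parents[child].add(parent)
    return children, parents

def ancestors(term, parents):
    """Ancestors of `term`, including the term itself (an assumption)."""
    seen, stack = {term}, [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def common_ancestors(t1, t2, parents):
    """Nodes that are ancestors of both terms."""
    return ancestors(t1, parents) & ancestors(t2, parents)

def eica(t1, t2, children, parents):
    """Common ancestors with a direct child inherited by exactly one term."""
    anc1, anc2 = ancestors(t1, parents), ancestors(t2, parents)
    result = set()
    for c in anc1 & anc2:
        for child in children.get(c, ()):
            if (child in anc1) != (child in anc2):
                result.add(c)
                break
    return result

children, parents = build_graph(EDGES)
print(sorted(common_ancestors("n5", "n6", parents)))  # ['n0', 'n2', 'n3', 'n4']
print(sorted(eica("n5", "n6", children, parents)))    # ['n0', 'n3', 'n4']
```

On this assumed graph, n2 is rejected because both of its direct children are ancestors of n5 and of n6, which matches the reasoning given for Figure 4.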
To show the efficiency of EISI in detecting the EICAs for a pair of GO terms, we conducted three groups of experiments with different sample sizes of GO term pairs. Each group of term pairs was randomly selected from the GO graph: the first group contained 10 term pairs, the second 100 term pairs, and the third 1000 term pairs. For each sample size, the comparison experiments were conducted 10 times to detect EICAs and disjunctive common ancestors (DCAs) for each term pair by the different methods. For each run, the average time (the average time used to detect the DCAs or EICAs of a term pair), the standard deviation and the coefficient of variation were computed. The averages of these three indexes at the different sample sizes were then calculated, and the experimental results are listed in Table 2. From this table we see that our approach takes less time on average to detect the EICAs for a pair of terms, with a smaller standard deviation and coefficient of variation. Taking the annotation dataset of yeast as an example, each gene is annotated with 9.7 GO terms on average, so computing the similarity between a pair of genes with the pairwise strategy requires detecting the EICAs of about 94 pairs of terms. The time needed by EISI is about 4.7 seconds, while GraSM and DiShIn need about 795 and 175 seconds, respectively, to detect the DCAs. In this sense, EISI is more suitable for real-time applications without a preliminary calculation of the EICAs. It should be noted that, once the EICAs and the DCAs have been detected, the time complexity of computing the similarity values is O(n). Thus, if GraSM and DiShIn perform a preliminary calculation and store the results in a database, our approach does not show superiority in time complexity. However, the GO database changes often (the number of terms in the CESSM database release of August 2008 was 26563, which had increased to 41827 in the release of September 2014), so the preliminary calculation would have to be repeated frequently, at increasing cost, to keep up with these changes. Our approach does not need such a preliminary calculation and can compute the similarity values in real time.
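The yeast estimate above can be reproduced approximately with a short calculation. The sketch below assumes that the 94 term pairs come from 9.7 × 9.7 and that the per-pair cost is the mean of the three average times reported in Table 2; this derivation is our reading of the text rather than a procedure the authors state.

```python
# Average per-pair detection times (seconds) from Table 2,
# for sample sizes of 10, 100 and 1000 term pairs.
avg_time = {
    "EISI":   [0.0502, 0.0474, 0.0518],
    "GraSM":  [10.8577, 6.8311, 7.6945],
    "DiShIn": [1.9225, 1.7736, 1.9137],
}

terms_per_gene = 9.7
# Pairwise strategy: every term of one gene is compared with every term of the other.
term_pairs = round(terms_per_gene ** 2)  # 9.7 * 9.7 = 94.09, i.e. about 94 pairs

estimates = {}
for method, times in avg_time.items():
    mean_time = sum(times) / len(times)
    estimates[method] = term_pairs * mean_time
    print(f"{method}: about {estimates[method]:.1f} s for {term_pairs} term pairs")
```

Under these assumptions the script gives about 4.7 s for EISI, 795.3 s for GraSM and 175.8 s for DiShIn, in line with the figures quoted in the text.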
Table 2. Efficiencies of different methods on the three datasets of various sample sizes
| Sample size | Method | Average time (second) | Standard deviation | Coefficient of variation |
|---|---|---|---|---|
| 10 | EISI | 0.0502 | 0.0178 | 0.3469 |
| 10 | GraSM | 10.8577 | 22.4257 | 1.7100 |
| 10 | DiShIn | 1.9225 | 1.6026 | 0.7539 |
| 100 | EISI | 0.0474 | 0.0165 | 0.3485 |
| 100 | GraSM | 6.8311 | 13.4633 | 1.8666 |
| 100 | DiShIn | 1.7736 | 1.2810 | 0.7150 |
| 1000 | EISI | 0.0518 | 0.0184 | 0.3541 |
| 1000 | GraSM | 7.6945 | 19.4611 | 2.4831 |
| 1000 | DiShIn | 1.9137 | 1.6879 | 0.8748 |
The experiments were conducted in MATLAB 2008a on a machine with 2.67GHz Intel quad core processors and 4 GB of RAM.
3. VALIDATION OF OUR APPROACH
3.1 Dataset
The GO database and gene annotations
The GO Consortium provides publicly available releases of the Gene Ontology database and gene annotation datasets. In this study, the GO database and the gene annotation datasets released in April 2013 were used to test our approach. The GO database contains 25370 BP, 3295 CC, and 10445 MF terms. The gene annotation dataset contains 91133 annotations of 6381 genes for the yeast genome.
Artificial scored dataset
A dataset with artificial scored semantic similarity values was used to validate our measurement. It contains 30 pairs of terms selected from the GO database. Ten biological researchers from outside our research group were invited to score, according to their knowledge, how similar the two terms in each pair are. The scores ranged from 0 to 10: the score was 0 if the two terms were orthogonal, and it reached the maximum of 10 when the two terms were identical. The average of the scores for each term pair was then taken as the artificial scored semantic similarity. The 30 pairs of GO terms and their artificial semantic similarity values are listed in Table 3.
Table 3. The GO terms and the corresponding scores in the artificial dataset
| id | GO ID (term1) | term1 | GO ID (term2) | term2 | score |
|---|---|---|---|---|---|
| 1 | GO:0015422 | oligosaccharide-transporting ATPase activity | GO:0015423 | maltose-transporting ATPase activity | 8.3 |
| 2 | GO:0015279 | store-operated calcium channel activity | GO:0005245 | voltage-gated calcium channel activity | 8.0 |
| 3 | GO:0016871 | cycloartenol synthase activity | GO:0009982 | pseudouridine synthase activity | 4.35 |
| 4 | GO:0015576 | glucitol transporter activity | GO:0015170 | propanediol transporter activity | 2.15 |
| 5 | GO:0015164 | glucuronoside transmembrane transporter activity | GO:0015170 | propanediol transmembrane transporter activity | 2.15 |
| 6 | GO:0004620 | phospholipase activity | GO:0004630 | phospholipase D activity | 8.45 |
| 7 | GO:0045517 | interleukin-20 receptor binding | GO:0045518 | interleukin-22 receptor binding | 7.5 |
| 8 | GO:0004004 | ATP-dependent RNA helicase activity | GO:0015611 | D-ribose-importing ATPase activity | 3.3 |
| 9 | GO:0003693 | P-element binding | GO:0004525 | ribonuclease III activity | 2.1 |
| 10 | GO:0000034 | adenine deaminase activity | GO:0019239 | deaminase activity | 1.5 |
| 11 | GO:0016618 | hydroxypyruvate reductase activity | GO:0004450 | isocitrate dehydrogenase (NADP+) activity | 5.25 |
| 12 | GO:0046524 | sucrose-phosphate synthase activity | GO:0045509 | interleukin-27 receptor activity | 0.8 |
| 13 | GO:0045509 | interleukin-27 receptor activity | GO:0001609 | G-protein coupled adenosine receptor activity | 3.3 |
| 14 | GO:0005117 | wishful thinking binding | GO:0005127 | ciliary neurotrophic factor receptor binding | 4.7 |
| 15 | GO:0019381 | atrazine catabolic process | GO:0018965 | s-triazine compound metabolic process | 2.6 |
| 16 | GO:0015595 | spermidine-importing ATPase activity | GO:0046923 | ER retention sequence binding | 0.9 |
| 17 | GO:0004438 | phosphatidylinositol-3-phosphatase activity | GO:0004478 | methionine adenosyltransferase activity | 2.1 |
| 18 | GO:0004631 | phosphomevalonate kinase activity | GO:0000033 | alpha-1,3-mannosyltransferase activity | 2.9 |
| 19 | GO:0015648 | lipid-linked peptidoglycan transporter activity | GO:0015233 | pantothenate transmembrane transporter activity | 2.53 |
| 20 | GO:0004791 | thioredoxin-disulfide reductase activity | GO:0034061 | DNA polymerase activity | 1.35 |
| 21 | GO:0008296 | 3'-5'-exodeoxyribonuclease activity | GO:0008296 | 3'-5'-exodeoxyribonuclease activity | 10 |
| 22 | GO:0005245 | voltage-gated calcium channel activity | GO:0008830 | dTDP-4-dehydrorhamnose 3,5-epimerase activity | 0.5 |
| 23 | GO:0047458 | beta-pyrazolylalanine synthase activity | GO:0046408 | chlorophyll synthetase activity | 6.8 |
| 24 | GO:0046565 | 3-dehydroshikimate dehydratase activity | GO:0045075 | regulation of interleukin-12 biosynthetic process | 0 |
| 25 | GO:0008031 | eclosion hormone activity | GO:0046905 | phytoene synthase activity | 0.2 |
| 26 | GO:0008142 | oxysterol binding | GO:0003909 | DNA ligase activity | 0.2 |
| 27 | GO:0019194 | sorbose transmembrane transporter activity | GO:0017150 | tRNA dihydrouridine synthase activity | 0.4 |
| 28 | GO:0019184 | nonribosomal peptide biosynthetic process | GO:0018251 | peptidyl-tyrosine dehydrogenation | 0.5 |
| 29 | GO:0000975 | regulatory region DNA binding | GO:0051021 | GDP-dissociation inhibitor binding | 1.2 |
| 30 | GO:0031300 | intrinsic to organelle membrane | GO:0044440 | endosomal part | 1.7 |
Pathway dataset
The Saccharomyces Genome Database (SGD) (http://pathway.yeastgenome.org/biocyc/) was used for validation in this study. It is a collection of manually curated metabolic pathways and enzymes of Saccharomyces cerevisiae. The pathway dataset contains classification and annotation information for the genes in each pathway. There are 187 biological pathways in the SGD database (as of September 23, 2013). Most of these pathways contain more than three genes that are manually annotated with both Enzyme Commission (EC) numbers and molecular function GO terms. For instance, there are six genes, GPX1, GPX2, HYR1, GLR1, GTT1 and GTT2, in the glutathione-glutaredoxin redox reactions. Among these genes, GPX1, GPX2 and HYR1 are annotated with the same EC number and mostly with the same GO terms; analogously, GTT1 and GTT2 are annotated with another EC number and mostly with the same group of GO terms. Conversely, the EC number and GO terms used to annotate GLR1 are less similar to those annotating the other five genes. According to SGD, the six genes in this pathway are manually divided into three classes, as illustrated in Figure 5. This kind of prior knowledge provides similarity information between genes at the functional level, i.e., genes with the same EC number are more similar than those with different EC numbers. Based on this prior functional information, we constructed a model tree for each pathway and took it as the ground truth to validate our approach by comparing it with the clustering tree derived from EISI. To demonstrate the clustering results of our measurement, entries lacking an EC number or gene name, as well as pathways with fewer than three genes, were removed from SGD. The final dataset contains 109 pathways with at least three genes annotated with EC numbers and GO terms.
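The grouping step behind the model tree can be illustrated with the six genes of the glutathione-glutaredoxin pathway and the EC numbers shown in Figure 5. This is a minimal sketch: the function name `model_classes` is ours, and the text does not specify how the classes are arranged into a tree beyond grouping genes that share an EC number.

```python
from collections import defaultdict

# Gene -> EC number for the glutathione-glutaredoxin redox reactions (Figure 5a).
gene_to_ec = {
    "GPX1": "1.11.1.9", "GPX2": "1.11.1.9", "HYR1": "1.11.1.9",
    "GTT1": "2.5.1.18", "GTT2": "2.5.1.18",
    "GLR1": "1.8.1.7",
}

def model_classes(gene_to_ec):
    """Group genes that share an EC number into one functional class."""
    classes = defaultdict(list)
    for gene, ec in gene_to_ec.items():
        classes[ec].append(gene)
    return {ec: sorted(genes) for ec, genes in classes.items()}

for ec, genes in sorted(model_classes(gene_to_ec).items()):
    print(ec, genes)
```

The three resulting classes correspond to the three groups into which SGD divides the genes of this pathway.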
[Figure 5, panel (a): glutathione-glutaredoxin redox reactions, involving glutathione oxidoreductase GLR1 (EC 1.8.1.7), glutathione peroxidases GPX1, GPX2 and HYR1 (EC 1.11.1.9), and glutathione transferases GTT1 and GTT2 (EC 2.5.1.18). Panel (b): manual clustering result of pathway (a).]
Figure 5. Functions of genes in an S. cerevisiae pathway and the corresponding manual clustering result. (b) is the model tree, which is consistent with the clustering result based on the Resnik measure with EISI.
3.2 Validation criteria
In this study, three indicators were used to assess the performance of our similarity measure: the Pearson correlation coefficient, the Robinson-Foulds (RF) distance measure[41] and the Percentage of Correct pathway (PC). The Pearson correlation coefficient quantified the strength of the relationship between the artificial scores and the semantic similarity values derived from EISI on the artificial scored dataset; a larger coefficient means that the semantic similarity values obtained with EISI agree more closely with the experts’ scores. The RF and PC measures were used to validate the performance of our method on the pathway dataset.
As mentioned previously, the EC number provides similarity information between genes at the functional level, i.e., genes with the same EC number tend to perform the same biological function in a pathway. We can therefore cluster genes into different functional classes according to their EC numbers; the clustering result based on this kind of prior knowledge is referred to as a model tree in this study. To assess our semantic similarity measurement, we clustered the genes in each pathway based on the similarity values derived from EISI and compared the structures of our clustering trees with those of the model trees: the more consistent they were, the better the performance of our approach. RF and PC were adopted to quantify to what extent the model trees and our results were different and consistent, respectively. The RF distance between two trees quantifies the number of bipartitions that differ between them. It is defined as follows. Suppose $T$ is a tree whose leaves are labeled by a set $S$ of entities, and $e = (u, v)$ is an edge in $T$. The deletion of $e$ produces a bipartition $\pi_e$ on the leaves, dividing $S$ into two subsets: the subset of all leaves on one side of the edge, and the subset of all the other leaves. We can thus define the bipartition set $Bp(T) = \{\pi_e : e \in Edge(T)\}$, where $Edge(T)$ is the set of all edges in $T$. If $MTree$ is a model tree and $ITree$ is a tree inferred by a certain classification method, we define the false negatives to be the cardinality of the set $Bp(MTree) \setminus Bp(ITree)$ and the false positives to be that of $Bp(ITree) \setminus Bp(MTree)$. The false negative rate and the false positive rate can then be defined as follows:
$$FNR = \frac{|Bp(MTree) \setminus Bp(ITree)|}{N_1} \qquad (14)$$
$$FPR = \frac{|Bp(ITree) \setminus Bp(MTree)|}{N_2} \qquad (15)$$
where $N_1$ and $N_2$ are the numbers of branches in $MTree$ and $ITree$, respectively; $FNR$ and $FPR$ take values between 0 and 1. When $MTree$ and $ITree$ are both binary trees, we have $FPR = FNR$ and $N_1 = N_2 = N$, and the RF distance between them is simply the average of these two rates:
$$RF(MTree, ITree) = \frac{FPR + FNR}{2} = \frac{|Bp(ITree) \setminus Bp(MTree)| + |Bp(MTree) \setminus Bp(ITree)|}{2N} \qquad (16)$$
An RF value of 0 implies that the two trees have identical topology, while a value of 1 means that they have no bipartitions in common. That is to say, the smaller the RF value, the greater the similarity between the model tree and the inferred one, and the better the performance of our semantic similarity measurement.
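The RF computation in formulas (14) to (16) can be sketched as follows. The sketch makes several assumptions that are not stated in the text: trees are written as nested tuples of leaf names, only non-trivial bipartitions (both sides containing at least two leaves) are counted as branches, and the RF value is taken as the plain average of FNR and FPR even when the trees are not binary. The function names are ours.

```python
def leaves(tree):
    """Leaf labels of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def bipartitions(tree):
    """Non-trivial leaf bipartitions induced by the internal edges of `tree`."""
    all_leaves = leaves(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_leaves - side
        if len(side) > 1 and len(other) > 1:  # skip trivial splits
            splits.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return splits

def rf_distance(model_tree, inferred_tree):
    """Average of the false negative and false positive rates."""
    bp_m, bp_i = bipartitions(model_tree), bipartitions(inferred_tree)
    fnr = len(bp_m - bp_i) / len(bp_m) if bp_m else 0.0
    fpr = len(bp_i - bp_m) / len(bp_i) if bp_i else 0.0
    return (fnr + fpr) / 2

# Model tree of the glutathione-glutaredoxin pathway, and two hypothetical
# inferred trees used only to exercise the function.
model = (("GPX1", "GPX2", "HYR1"), ("GTT1", "GTT2"), "GLR1")
partly_wrong = (("GPX1", "GPX2", "HYR1"), ("GTT1", "GLR1"), "GTT2")
all_wrong = (("GPX1", "GPX2", "GTT1"), ("HYR1", "GTT2"), "GLR1")

print(rf_distance(model, model))         # 0.0
print(rf_distance(model, partly_wrong))  # 0.5
print(rf_distance(model, all_wrong))     # 1.0
```

Storing each bipartition as an unordered pair of leaf sets means that the two child edges of a root, which induce the same split, are counted only once.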
In order to compute the PC measure, we first defined the fitness of a clustering tree as
$$Fit\_tree = \begin{cases} 1, & RF(MTree, ITree) = 0 \\ 0, & \text{otherwise} \end{cases} \qquad (17)$$
Note that the index $Fit\_tree$ indicates whether or not the clustering result is completely consistent with the model tree. The PC measure is the percentage of pathways that are correctly clustered, computed as
$$PC = 100 \times \frac{1}{n} \sum_{k=1}^{n} Fit\_tree(k) \qquad (18)$$
where $n$ is the number of pathways in SGD.
3.3 Results and discussion
To show the effectiveness of our approach, the semantic measurements derived from EISI were evaluated against those of Resnik[36], Lin[26] and Jiang and Conrath[29], as they are also based on shared information and are widely used for comparison in many investigations. In addition, because our method is based on the multiple inheritance attribute of the GO structure and is similar to previous methods proposed by Couto et al. [28, 37], we also compared our results with those produced by these approaches. In the next two subsections, we provide comparisons between our measurements and those of others on the artificial dataset and the SGD dataset, respectively. In our experiments, the information content of a GO term was estimated by a commonly used strategy applied to GO[42], in which the frequency of a term occurring in a corpus is determined by the number of gene products annotated with it or with any of its descendants in the GO graph.
3.3.1 Results on artificial scored dataset
In the present study, similarity measurements based on different kinds of shared information content (i.e., EISI, GraSM, DiShIn and MICA) were adopted for evaluation. For each of them, the information content that two terms have in common was computed and integrated into Resnik’s, Lin’s and Jiang and Conrath’s methods to derive the corresponding semantic similarity measures. Thus, for each kind of shared information, we obtained three kinds of semantic similarity values, which are referred to as Resnik, Lin and Jiang for convenience. The semantic similarities of the three kinds were then compared with the artificial scores to demonstrate the advantage of our approach. Table 4 lists the Pearson correlation coefficients between the artificial scores and the different similarity measures. As shown in this table, Resnik’s semantic similarity measure seemed to show the best performance among the three kinds of measures, with EISI having the maximum coefficient of 0.9036; conversely, Jiang’s measure performed the worst, with the minimum coefficient of 0.4355 obtained with DiShIn. As for the different types of shared information content, the measurements based on multiple common ancestors generally performed better than those based on MICA. EISI achieved greater correlation coefficients than GraSM and DiShIn, and DiShIn also outperformed GraSM, which was followed by MICA in general. It is worth noting that the gap in correlation coefficient between Lin’s and Jiang’s measures based on EISI was about 0.08, much smaller than the gaps for GraSM and DiShIn (about 0.3 and 0.4, respectively). This implies that EISI produced more stable results than the other methods based on multiple common ancestors.
Table 4. Pearson’s correlation coefficients between the artificial scores and semantic similarity values produced by EISI and other methods
| Measure | MICA | GraSM | DiShIn | EISI |
|---|---|---|---|---|
| Resnik | 0.8229 | 0.8348 | 0.8798 | 0.9036 |
| Lin | 0.8336 | 0.8361 | 0.8543 | 0.9028 |
| Jiang | 0.7830 | 0.5484 | 0.4355 | 0.8248 |
3.3.2 Results on pathway dataset
To demonstrate the effectiveness of EISI in real biological scenarios, we applied the similarity measure to investigate the relationships among genes in each pathway of SGD. As a gene may be annotated with multiple GO terms, the pairwise strategy with the average-maximum rule[43] was adopted to estimate the semantic similarity between two genes in this study. After obtaining the semantic similarity for each pair of genes in a pathway, we constructed a similarity matrix M, in which each element Mij denotes the similarity value between genes i and j. The spectral graph clustering algorithm[44] was then applied to M to cluster the genes into classes and form a clustering tree, and the RF distance and fitness of the clustering tree were computed according to formulas (16) and (17), respectively. Next, the PC value and the average RF distance were calculated over all pathways to evaluate the consistency of the clustering trees with the model trees, and thus to validate the performance of our method. This procedure was carried out for Resnik’s, Lin’s and Jiang’s similarity measures based on EISI, MICA, GraSM and DiShIn. The experimental results of our method and those of the others on the 109 pathways are shown in Table 5. According to this table, the measures based on EISI obtained the smallest average RF distance values and the largest Percentage of Correct pathway (PC) values in all scenarios. In general, the methods based on information derived from multiple common ancestors (i.e., EISI, GraSM and DiShIn) tended to outperform those based on a single common ancestor (i.e., MICA), with smaller average RF values and larger PC values. Among the methods based on multiple inheritance, our approach performed better than GraSM and DiShIn, which suggests that semantic similarities based on EISI are more consistent with prior knowledge of gene function in the pathway.
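The step from term-level to gene-level similarity can be sketched as below. The exact form of the average-maximum rule of [43] is not given in the text, so the sketch assumes a common best-match-average reading: each term is matched with its most similar term in the other gene, and the best scores from both directions are averaged. The exact-match `identity_sim` is only a placeholder for a real term similarity such as Resnik with EISI, and the annotation sets are the molecular function terms listed in the text for two of the gene groups.

```python
def gene_similarity(terms_a, terms_b, term_sim):
    """Best-match average of pairwise term similarities (assumed reading of [43])."""
    if not terms_a or not terms_b:
        return 0.0
    best_a = [max(term_sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(term_sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) + sum(best_b)) / (len(terms_a) + len(terms_b))

def similarity_matrix(gene_terms, term_sim):
    """Similarity matrix M with M[i][j] the similarity between genes i and j."""
    genes = sorted(gene_terms)
    matrix = [
        [gene_similarity(gene_terms[g], gene_terms[h], term_sim) for h in genes]
        for g in genes
    ]
    return genes, matrix

def identity_sim(term_a, term_b):
    """Placeholder term similarity: 1 for identical terms, 0 otherwise."""
    return 1.0 if term_a == term_b else 0.0

# Molecular function annotations taken from the discussion of the pathway;
# treating them as the full annotation set of each gene is an assumption.
gene_terms = {
    "GPX1": ["GO:0004602", "GO:0016491", "GO:0004601", "GO:0047066"],
    "GLR1": ["GO:0004362", "GO:0016491"],
    "GTT1": ["GO:0004364", "GO:0016741"],
}

genes, matrix = similarity_matrix(gene_terms, identity_sim)
for gene, row in zip(genes, matrix):
    print(gene, [round(value, 3) for value in row])
```

A matrix of this form is what the clustering step takes as input; the clustering itself is not sketched here.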
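Table 5. The comparison results on pathway dataset produced by EISI and other methods
| Indicator | Measure | MICA | GraSM | DiShIn | EISI |
|---|---|---|---|---|---|
| PC value (%) | Resnik | 66.97 | 67.31 | 67.94 | 68.81 |
| PC value (%) | Lin | 62.39 | 64.24 | 65.35 | 66.06 |
| PC value (%) | Jiang | 62.39 | 63.31 | 64.13 | 64.22 |
| Average RF distance | Resnik | 0.0639 | 0.0604 | 0.0612 | 0.0595 |
| Average RF distance | Lin | 0.0727 | 0.0684 | 0.0693 | 0.0676 |
| Average RF distance | Jiang | 0.0732 | 0.0803 | 0.0713 | 0.0696 |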
The PC value in Table 5 denotes the percentage of trees whose Robinson-Foulds (RF) distance equals zero; a larger PC value means that the topologies of more trees are correctly inferred. The average RF distance is the mean RF distance between the inferred trees and the model trees; a smaller average RF distance means better performance of the corresponding similarity measure.
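The two indicators reported in Table 5 follow directly from the per-pathway RF distances. The sketch below uses a short list of hypothetical RF values, chosen only to exercise the functions; they are not results from this study, and the function names are ours.

```python
def fit_tree(rf):
    """Fitness of a clustering tree: 1 if it matches the model tree exactly."""
    return 1 if rf == 0 else 0

def pc_value(rf_distances):
    """Percentage of pathways whose clustering tree has an RF distance of zero."""
    return 100 * sum(fit_tree(rf) for rf in rf_distances) / len(rf_distances)

def average_rf(rf_distances):
    """Mean RF distance over all pathways."""
    return sum(rf_distances) / len(rf_distances)

# Hypothetical RF distances for four pathways (illustrative values only).
rf_distances = [0.0, 0.0, 0.5, 1.0]
print(pc_value(rf_distances))    # 50.0
print(average_rf(rf_distances))  # 0.375
```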
Compared with Lin’s and Jiang’s measures, Resnik’s measure produced better results on the pathway dataset, with larger PC values and smaller average RF distance values. In the four scenarios of shared information content, Resnik’s measure obtained PC values about 3 percentage points higher and average RF distance values about 0.01 lower. Meanwhile, Lin’s measure also outperformed Jiang’s on these two indicators. It is interesting to note that previous investigations conducted by Lord[45] and Sevilla[46], which compared the performance of different semantic similarity measures, including those proposed by Resnik, Lin, and Jiang and Conrath, both suggested that Resnik’s measure was superior to the others. This is consistent with the experimental results of the present study.
Additionally, we give a biological interpretation of the clustering result of our method on the above-mentioned pathway, the glutathione-glutaredoxin redox reactions, to show how the clustering tree is consistent with the corresponding model tree. The clustering tree based on the Resnik measure with EISI is consistent with the model tree of the genes in this pathway, and the clustering result is illustrated in Figure 5 (b). The six genes are clustered into three groups: the first group contains three genes (GPX1, GPX2 and HYR1), the second is composed of two genes (GTT1 and GTT2), and the last has a single gene, GLR1. According to the annotation information in the SGD database, the six genes are functionally related to glutathione activity. Specifically, GPX1, GPX2 and HYR1 tend to be involved in peroxidase activity, as they are annotated with glutathione peroxidase activity (GO:0004602), oxidoreductase activity (GO:0016491), peroxidase activity (GO:0004601) and phospholipid-hydroperoxide glutathione peroxidase activity (GO:0047066). By contrast, GTT1 and GTT2 are inclined to participate in transfer activity, as they are annotated with glutathione transferase activity (GO:0004364) and transferase activity (GO:0016741). As for GLR1, its function is more similar to those of the first group, since it is annotated with glutathione-disulfide reductase activity (GO:0004362) and oxidoreductase activity (GO:0016491). In terms of biological process, GPX1, GPX2 and HYR1 tend to be involved in oxidation-reduction, as they are annotated with oxidation-reduction process (GO:0055114), cellular response to oxidative stress (GO:0034599) and response to oxidative stress (GO:0006979), while GTT1 and GTT2 are involved in metabolic processes, as they are annotated with glutathione metabolic process (GO:0006749). GLR1 tends to participate in both metabolic and oxidation-reduction processes, since it is annotated with cell redox homeostasis (GO:0045454), cellular response to oxidative stress (GO:0034599), oxidation-reduction process (GO:0055114) and glutathione metabolic process (GO:0006749). In terms of subcellular localization, the six genes are mainly localized in the cytoplasm or endoplasmic reticulum, being annotated with cytoplasm (GO:0005737) or endoplasmic reticulum (GO:0005783). In comparison with the other three genes, GPX1, GPX2 and HYR1 are mainly localized on the membrane outside the mitochondrion, as they are annotated with extrinsic component of mitochondrial outer membrane (GO:0005741) and peroxisomal matrix (GO:0005782), while GTT1 and GTT2 are localized inside the mitochondrion, as they are annotated with mitochondrion (GO:0005739). GLR1 acts on the mitochondrion and the nucleus, as it is annotated with nucleus (GO:0005634) and mitochondrion (GO:0005739). Thus, from the viewpoint of biological meaning, GPX1, GPX2 and HYR1 are more similar to each other than to the other three genes, GTT1 and GTT2 are also more similar to each other, while GLR1 tends to have the biological properties of both groups and forms another class biologically.
It is worth noting that recent studies proposed DiShIn as a solution for the next generation of similarity measures that can fully explore the semantics in ontologies[40]. Like DiShIn, EISI aims to exploit more of the information related to the semantic similarity between two concepts in an ontology. The basic idea of DiShIn is that two disjunctive ancestors represent two distinct interpretations of a concept. EISI addresses the issue from another angle: it originates from the idea that a common ancestor exclusively inherited by two child concepts provides distinct information to its descendant concepts, some of which is inherited exclusively by one of the children. We therefore believe that EISI has an ability similar to that of DiShIn to fully explore the semantics in ontologies.
4. CONCLUSIONS
This paper presented a novel semantic similarity measurement based on exclusively inherited shared information (EISI) to address the problem of multiple inheritance in calculating the similarity between two terms in GO. EISI quantifies the shared information content derived from the common ancestors that are exclusively inherited by either term; it is founded on the observation that only the exclusively inherited common ancestors provide distinguishing information to their descendants. Our approach is effective and efficient: the semantic similarity measures between GO terms correlated well with experts’ scores on the artificial dataset, and those between genes were consistent with prior knowledge of the functional relationships among genes in the pathways of the SGD database. Moreover, our results are consistent with previous investigations, which concluded that Resnik’s measure outperformed Lin’s and Jiang’s measures. Our method has O(n (2 n) log n) time complexity, which makes it feasible for real-time application in large-scale investigations; it is a promising alternative to other methods based on the multiple inheritance of GO.
Acknowledgements
The authors thank the biologists for their enthusiastic help and support in constructing the artificial dataset. This work was supported in part by the National Natural Science Foundation of China grants 60675016 and 60633030, and fully by the Natural Sciences Foundation of Guangzhou Maritime Institute grant K31012B09.
Conflict of interest statement
The authors declare that they have no conflict of interest.
References
[1] Z. Teng, M. Guo, X. Liu, Q. Dai, C. Wang, P. Xuan, Measuring gene functional similarity based on group-wise comparison of GO terms. Bioinformatics 29 (2013) 1424-1432.
[2] K. Taha, GOtoGene: a method for determining the functional similarity among gene products. In: Proceedings of the Tenth Australasian Data Mining Conference-Volume; 2012. 43-51.
[3] N. Nariai, E. D. Kolaczyk, S. Kasif, Probabilistic protein function prediction from heterogeneous genome-wide data. PLoS One 2 (2007) e337.
[4] Y. Tao, L. Sam, J. Li, C. Friedman, Y. A. Lussier, Information theory applied to the sparse gene ontology annotation network to predict novel gene function. Bioinformatics 23 (2007) i529-i538.
[5] A. Alexa, J. Rahnenführer, T. Lengauer, Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22 (2006) 1600-1607.
[6] P. Khatri, S. Drăghici, Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 21 (2005) 3587-3595.
[7] D. W. Huang, B. T. Sherman, Q. Tan, The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome biology 8 (2007) R183.
[8] D. Yang, Y. Li, H. Xiao, L. Qing, Q. Zhang, M. Zhu, J. Ma, W. Yao, J. Wang, D. Wang, Z. Guo, B. Yang, Gaining confidence in biological interpretation of the microarray data: the functional consistence of the significant GO categories. Bioinformatics 24 (2008) 265-271.
[9] S. Mathur, D. Dinakarpandian, Finding disease similarity based on implicit semantic similarity. Journal of biomedical informatics 45 (2012) 363-371.
[10] A. Schlicker, T. Lengauer, M. Albrecht, Improving disease gene prioritization using the semantic similarity of Gene Ontology terms. Bioinformatics 26 (2010) i561-i567.
[11] A. Schlicker, C. Huthmacher, F. Ramírez, T. Lengauer, M. Albrecht, Functional evaluation of domain–domain interactions and human protein interaction networks. Bioinformatics 23 (2007) 859-865.
[12] H. Wang, H. Zheng, F. Browne, Integration of Gene Ontology-based similarities for supporting analysis of protein–protein interaction networks. Pattern Recognition Letters 31 (2010) 2073-2082.
[13] E. Camon, M. Magrane, D. Barrell, V. Lee, E. Dimmer, J. Maslen, D. Binns, N. Harte, R. Lopez, R. Apweiler, The Gene Ontology annotation (GOA) database: sharing knowledge in Uniprot with Gene Ontology. Nucleic acids research 32 (2004) D262-D266.
[14] N. R.C.G. Massjouni, T. M. Murali, VIRGO: computational prediction of gene functions. Nucleic acids research 34 (2006) W340-W344.
[15] E. Zeng, C. Ding, G. Narasimhan, S. R. Holbrook, Estimating support for protein-protein interaction data with applications to function prediction. In: Proceedings of 2008 Computer Systems Bioinformatics (CSB) Conference; 2008. 73-84.
[16] F. M. Couto, M. J. Silva, P. M. Coutinho, Implementation of a functional semantic similarity measure between gene-products. In: Tech Rep DI/FCUL TR 03-29: Department of Informatics, University of Lisbon; 2003.
[17] A. Schlicker, M. Albrecht, FunSimMat: a comprehensive functional similarity database. Nucleic acids research 36 (2008) D434-D439.
[18] Z. Du, L. Li, C. F. Chen, S. Y. Philip, J. Z. Wang, G-SESAME: web tools for GO-term-based gene similarity analysis and knowledge discovery. Nucleic acids research 37 (2009) W345-W349.
[19] Y. Xu, M. Guo, W. Shi, X. Liu, C. Wang, A novel insight into Gene Ontology semantic similarity. Genomics 101 (2013) 368-375.
[20] G. Yu, F. Li, Y. Qin, X. Bo, Y. Wu, S. Wang, GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26 (2010) 976-978.
[21] R. Rada, H. Mili, E. Bicknell, M. Blettner, Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics 19 (1989) 17-30.
[22] A. Nagar, H. Al-Mubaid, A new path length measure based on go for gene similarity with evaluation using sgd pathways. In: Proceedings of the 21st International Symposium on Computer-Based Medical Systems, 2008. CBMS'08. (2008) 590-595.
[23] V. Pekar, S. Staab, Taxonomy learning: factoring the structure of a taxonomy into a semantic classification decision. In: Proceedings of the 19th international conference on Computational linguistics. (2002) 1-7.
[24] S. Jain, G. Bader, An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology. BMC bioinformatics 11 (2010) 562.
[25] P. Resnik, Using information content to evaluate semantic similarity in a taxonomy. In: Proceeding of the 14th International Joint Conference on Artificial Intelligence. (1995) 448-453.
[26] D. Lin, An information-theoretic definition of similarity. In: Proceedings of the 15th international conference on Machine Learning. (1998) 296-304.
[27] F. M. Couto, M. J. Silva, P. M. Coutinho, Semantic similarity over the gene ontology: family correlation and selecting disjunctive ancestors. In: Proceedings of the 14th ACM international conference on Information and knowledge management. (2005) 343-344.
[28] F. M. Couto, M. J. Silva, Disjunctive shared information between ontology concepts: application to Gene Ontology. Journal of Biomedical Semantics 2 (2011) 1-5.
[29] J. J. Jiang, D. W. Conrath, Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings of International Conference Research on Computational Linguistics. (1997) 19-33.
[30] J. Z. Wang, Z. Du, R. Payattakool, P. S. Yu, C. F. Chen, A new method to measure the semantic similarity of GO terms. Bioinformatics 23 (2007) 1274-1281.
[31] S. J. Bien, C. H. Park, H. J. Shim, W. Yang, J. Kim, J. H. Kim, Bi-directional semantic similarity for gene ontology to optimize biological and clinical analyses. Journal of the American Medical Informatics Association 19 (2012) 765-774.
[32] R. M. Othman, S. Deris, R. M. Illias, A genetic similarity algorithm for searching the Gene Ontology terms and annotating anonymous protein sequences. Journal of biomedical informatics 41 (2008) 65-81.
[33] X. Wu, E. Pang, K. Lin, Z. M. Pei, Improving the Measurement of Semantic Similarity between Gene Ontology Terms and Gene Products: Insights from an Edge-and IC-Based Hybrid Method. PLoS One 8 (2013) e66745.
[34] S. Benabderrahmane, M. Smail-Tabbone, O. Poch, A. Napoli, M. D. Devignes, IntelliGO: a new vector-based semantic similarity measure including annotation origin. BMC bioinformatics 11 (2010) 588.
[35] N. Seco, T. Veale, J. Hayes, An intrinsic information content metric for semantic similarity in WordNet. In: Proceedings of the 16th European Conference on Artificial Intelligence. (2004) 1089–1090.
[36] P. Resnik, Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research 11 (1999) 95-130.
[37] F. M. Couto, M. J. Silva, P. M. Coutinho, Measuring semantic similarity between Gene Ontology terms. Data & knowledge engineering 61 (2007) 137-152.
[38] A. Schlicker, F. S. Domingues, J. Rahnenführer, T. Lengauer, A new measure for functional similarity of gene products based on Gene Ontology. BMC bioinformatics 7 (2006) 302.
[39] H. Yu, R. Jansen, G. Stolovitzky, M. Gerstein, Total ancestry measure: quantifying the similarity in tree-like classification, with genomic applications. Bioinformatics 23 (2007) 2163-2173.
[40] F. M. Couto, H. S. Pinto, The next generation of similarity measures that fully explore the semantics in biomedical ontologies. Journal of bioinformatics and computational biology 11 (2013) 1-12.
[41] D. F. Robinson, L. R. Foulds, Comparison of phylogenetic trees. Mathematical Biosciences 53 (1981) 131-147.
ACCEPTED MANUSCRIPT [42] C. Pesquita, D. Faria, A. O. Falcao, P. Lord, F. M. Couto, Semantic similarity in biomedical ontologies. PLoS computational biology 5 (2009) e1000443. [43] F. Azuaje, H. Wang, O. Bodenreider, Ontology-driven similarity approaches to supporting gene functional assessment. In: Proceedings of the ISMB'2005 SIG meeting on Bio-ontologies. (2005) 9-10.
T
[44] S.B. Zhang, S.Y. Zhou, J. G. He, J. H. Lai, Phylogeny inference based on spectral graph clustering. Journal of Computational Biology 18 (2011) 627-637.
RI P
[45] P. W. Lord, R. D. Stevens, A. Brass, C. A. Goble, Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 19 (2003) 1275-1283.
AC CE
PT
ED
MA
NU
SC
[46] J. L. Sevilla, V. Segura, A. Podhorski, E. Guruceaga, J. M. Mato, L. A. Martinez-Cruz, F. J. Corrales, A. Rubio, Correlation between gene expression and GO semantic similarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2 (2005) 330-338.
List of abbreviations
BP      Biological Process
MF      Molecular Function
CC      Cellular Component
DAG     Directed Acyclic Graph
GO      Gene Ontology
GraSM   Graph-based Similarity Measure
DiShIn  Disjunctive Shared Information
MICA    Most Informative Common Ancestor
DCA     Disjunctive Common Ancestor
DCAs    Disjunctive Common Ancestors
SGD     Saccharomyces Genome Database
EISI    Exclusively Inherited Shared Information
EICA    Exclusively Inherited Common Ancestor
EICAs   Exclusively Inherited Common Ancestors
RF      Robinson-Foulds
PC      Percentage of Correct pathway
Highlights
A semantic similarity measurement between two GO terms is proposed.
The similarity value is quantified by the information shared by the two terms.
The measurement takes into account multiple common ancestors that are exclusively inherited by either term.
The algorithm is efficient, with a time complexity of O(n).
The results on a real dataset agree with prior knowledge of biological pathways.