Semantic similarity measurement between gene ontology terms based on exclusively inherited shared information.

Semantic Similarity Measurement between Gene Ontology Terms Based on Exclusively Inherited Shared Information

Shu-Bo Zhang, Jian-Huang Lai

PII: S0378-1119(14)01488-7
DOI: doi: 10.1016/j.gene.2014.12.062
Reference: GENE 40176
To appear in: Gene
Received date: 16 July 2014
Revised date: 15 December 2014
Accepted date: 24 December 2014

Please cite this article as: Zhang, Shu-Bo, Lai, Jian-Huang, Semantic Similarity Measurement between Gene Ontology Terms Based on Exclusively Inherited Shared Information, Gene (2014), doi: 10.1016/j.gene.2014.12.062

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT


Semantic Similarity Measurement between Gene Ontology Terms Based on Exclusively Inherited Shared Information

Shu-Bo Zhang*


Affiliation: Department of Computer Science, Guangzhou Maritime Institute, Guangzhou, P.R. China. Postal address: Room 803 Building 88, Dashabei Road, Guangzhou, 510275, P.R. China. Telephone: +86-20-82384406 Fax: +86-20-82384406 E-mail: [email protected]

Jian-Huang Lai


Affiliation: School of Information Science and Technology, Sun Yat-sen University, Guangzhou, P.R. China. Postal address: Room 105 Building 110 East District, 135 Xingangxi Road, Guangzhou, 510275, China. Telephone: +86-20-84110175 Fax: +86-20-84110175 E-mail: [email protected]


* Corresponding author

Competing Interests: The authors have declared that no competing interests exist.

ABSTRACT

Quantifying the semantic similarities between pairs of terms in the Gene Ontology (GO) structure can help to explore the functional relationships between biological entities. A common approach to this problem is to measure the information two terms have in common based on the information content of their common ancestors. However, many existing methods are limited in how they measure the information two GO terms share. This study presents a new measurement, exclusively inherited shared information (EISI), which captures the information shared by two terms based on an intuitive observation about the multiple inheritance relationships among the terms in the GO graph. EISI is derived from the information content of the exclusively inherited common ancestors (EICAs), which are screened from the common ancestors according to the attributes of their direct children. The effectiveness of EISI was evaluated against some state-of-the-art measurements on both artificial and real datasets: it produced results that correlated better with experts' scores on the artificial dataset, and it supported the prior knowledge of gene function in pathways on the Saccharomyces genome database (SGD). The promising features of EISI are the following: (1) it provides a more effective way to characterize the semantic relationship between two GO terms by taking into account multiple related common ancestors, and (2) it can quickly detect all EICAs with a time complexity of O(n), which is much more efficient than other methods based on disjunctive common ancestors. It is a promising alternative to multiple inheritance based methods for practical applications on large-scale datasets. The EISI algorithm was implemented in Matlab and is freely available from http://treaton.evai.pl/EISI/.

Key words: Semantic similarity measurement; Gene Ontology; Information content; Common ancestors; Exclusively inherited common ancestors;


1. INTRODUCTION

Comparison of biological entities is important in biological research, as it can help to explore the regulatory or functional relationships between gene products or genes (collectively called genes hereafter for simplicity) and contribute to the inference of the biological roles and functions of genes. The traditional approach to this issue is based on comparative experiments, which are costly and time consuming; other methods compare the sequences or structures of genes[1] by means of bioinformatics approaches. The advent of high-throughput technologies has produced a wealth of heterogeneous biological data related to the functional annotation of genes. This provides a promising way to compare genes at the functional level on aspects that could otherwise not be compared. However, comparing genes based on such a huge amount of diverse biomedical data is a challenging task, as the datasets are usually constructed in an unconsolidated way. This led to the introduction of various biological ontologies. The Gene Ontology (GO) project is one of those that provide a consolidated description of gene function for data from different resources. It can be used to explore the functional relationship between two biological entities[2], and has a variety of applications in fields such as gene function prediction[3, 4], gene expression data analysis[5, 6], gene clustering[7, 8], disease gene prioritization[9, 10], and the analysis of protein interactions[11, 12].

The Gene Ontology comprises two parts, the GO graph and the GO annotation[13]. The former is composed of controlled terms organized in three orthogonal aspects: biological process (BP), molecular function (MF), and cellular component (CC), and is structured as a directed acyclic graph (DAG)[14]. In the GO graph, the nodes are controlled terms with specific biological meanings, such as a biological process, molecular function, or cellular localization; the edges link nodes and characterize the relationships among terms. The most common relationships are 'is-a' and 'part-of'. The GO annotation builds a bridge between GO terms and genes: it provides annotation information for genes with controlled terms in the GO graph. When a gene is annotated with a GO term, it is also annotated with all the ancestors of that term in the GO graph[15]. Moreover, this gene is relevant to other genes that are annotated with the same term, as well as with the ancestors and descendants of that term[2]. This suggests that we can compare two genes at the functional level by measuring the semantic similarity of their GO terms.

In recent years, research on semantic similarity measurement between two GO terms has drawn more and more attention from the bioinformatics community. A variety of metrics have been introduced, and several software tools have been proposed for calculating the semantic similarities of GO terms, including FuSSiMeG[16], FunSimMat[17], G-SESAME[18], GFSAT[19], GOSemSim[20] and SORA[1]. Traditional approaches to semantic similarity measurement between GO terms are generally classified into three categories: edge-based methods[21-24] (also called structure-based approaches), which define the semantic similarity based on the conceptual distance derived from the length or type of the edges in the GO graph; node-based methods[25-28] (also called annotation-based or information content-based methods), in which the nodes and their properties are used to compute the information content for similarity; and hybrid methods[29-33], which combine the information content with the GO graph structure of GO terms.

Node-based semantic similarity measures are possibly the most frequently mentioned metrics in the literature[34]. This category of approaches is established on the basis of information theory, and the underlying principle is that the more information two concepts have in common, the more similar they are. The information of a concept is quantified by its information content (IC), which rests upon the probability that it occurs in the GO graph[32, 35] or in a corpus[25]. The information content is an indicator of how informative and specific a concept is, and is defined as the negative logarithm of the probability that the concept appears. Resnik[36] proposed a measure based on the information content of the most informative ancestor, which is identified by calculating the IC values of all the common ancestors two terms share and selecting the one with the maximal value. Since the similarity value of Resnik's measurement may be larger than one, Lin[26] and Jiang and Conrath[29] proposed improved schemes that normalize the similarity value to (0,1). Nevertheless, these two measurements are still built on Resnik's measurement, which only considers the information content of a single common ancestor, namely the Most Informative Common Ancestor (MICA) that is inherited by both terms. This is appropriate when the GO graph is a tree, but it becomes problematic in the DAG structure of GO, as a node may have more than one parent node, and thus some biological information inherited from some ancestors will be neglected.

To address the problem caused by multiple inheritance, Couto et al. employed the concept of disjunctive common ancestors and defined a graph-based similarity measure (GraSM)[37], in which the information two terms share is derived from all their disjunctive common ancestors by taking the average of their information content. They later updated GraSM and proposed a new method, dubbed Disjunctive Shared Information (DiShIn)[28], to address the computational complexity caused by the recursive definition of disjunctive common ancestors and the problem caused by parallel interpretations shared by two terms. Both GraSM and DiShIn can be directly integrated into any semantic similarity measure based on the MICA[28]. However, the dynamic implementation of GraSM and DiShIn is rather time consuming, as they need to search for the paths between pairs of nodes in the GO graph. To circumvent this problem, the authors performed a preliminary calculation and stored the results in a database for later computation.

The focus of this paper is to follow the information theoretic vein and propose a novel approach that measures the semantic similarity between two GO terms. We introduce a new measurement of shared information, exclusively inherited shared information (EISI), which quantifies the information two terms have in common based on some informative common ancestors. EISI is based on the observation that only those common ancestors that are inherited exclusively contribute to the shared information of two terms. The common ancestor set is constructed first, each element of which denotes a node that is inherited by both terms. Then all common ancestors are checked, and those that have direct descendants inherited by either term exclusively are considered to be exclusively inherited common ancestors (EICAs). Finally, the information content shared by the two terms is calculated by taking the average of the information content of all EICAs. Experiments were conducted on an artificial dataset and the Saccharomyces genome database (SGD), and the results showed that the similarity measurement based on EISI correlated better with experts' scores on the artificial dataset and supported the prior knowledge of classification information in pathways on SGD. Our measurement has the following advantages: (1) it provides a more effective way to characterize the relationship between two GO terms by considering multiple common ancestors, and (2) it can quickly detect all EICAs with a time complexity of O(n), which is much more efficient than other methods based on disjunctive common ancestors. It is a promising alternative to multiple inheritance based methods for practical application.

2. METHODS

To take into account multiple common ancestors in an effective way, this paper proposes a new measurement to quantify the information shared by two GO terms, based on the exclusively inherited common ancestors (EICAs) they have in common. Like GraSM and DiShIn, EISI takes into account multiple common ancestors that two GO terms share, and defines their common information as the average of the information content of those common ancestors. However, EISI considers a common ancestor to be informative only if it has a direct descendant that is inherited by either of the terms exclusively. This means that not all common ancestors are considered in the EISI algorithm, which may reduce the computational complexity of calculating the shared information content.

2.1 Related work

Over the years, great efforts have been devoted to measuring the semantic similarity between GO terms based on information content[25, 26, 28, 29, 36-39]. Among these, the methods proposed by Resnik[36], Lin[26] and Jiang and Conrath[29] have received much attention. According to Resnik, the information two terms share is derived from the most informative of their common ancestors. The common ancestor set is defined as

$$CA(t_1, t_2) = \{t : t \in parent(t_1) \wedge t \in parent(t_2)\} \tag{1}$$

where $parent(t_1)$ and $parent(t_2)$ are the parent node sets of $t_1$ and $t_2$, respectively.

Let $t_1$ and $t_2$ be two terms. The information they share can be calculated as the information content of their most informative common ancestor,

$$share_{Resnik}(t_1, t_2) = \{IC(t) \mid t \in CA(t_1, t_2) \wedge (\forall t_i \in CA(t_1, t_2),\ IC(t_i) \le IC(t))\} \tag{2}$$

where $IC(t)$ and $IC(t_i)$ denote the information content of terms $t$ and $t_i$, respectively. The information content of a GO term $t$ is defined as

$$IC(t) = -\log p(t) \tag{3}$$

where $p(t)$ is the probability that $t$ occurs in a certain GO annotation corpus (e.g., the GOA database). The shared information content is then used to measure the semantic similarity between $t_1$ and $t_2$,

$$sim_{Resnik}(t_1, t_2) = share_{Resnik}(t_1, t_2) = \max\{IC(t) \mid t \in CA(t_1, t_2)\} \tag{4}$$

Since the maximum of Resnik's measurement can be greater than one, Lin[26] proposed a normalized version, which defines the similarity as the ratio between the information content of the most informative common ancestor and that of both terms,

$$sim_{Lin}(t_1, t_2) = \frac{2 \times share_{Resnik}(t_1, t_2)}{IC(t_1) + IC(t_2)} \tag{5}$$

Another information theoretic semantic similarity metric based on the information content was introduced by Jiang and Conrath[29]. They proposed a combined approach that takes into account both the information content and the conceptual distance. The primary model is rather complex, as it considers several other factors, such as node depth, local density and link type, and it is often simplified as follows by considering only the information content,

$$D_{Jiang}(t_1, t_2) = IC(t_1) + IC(t_2) - 2 \times share_{Resnik}(t_1, t_2) \tag{6}$$

Note that this measure quantifies semantic distance rather than semantic similarity. We can convert this distance value into a similarity metric according to the following formula:

$$sim_{Jiang}(t_1, t_2) = \frac{1}{1 + D_{Jiang}(t_1, t_2)} \tag{7}$$
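The three MICA-based measures in formulas (4), (5) and (7) can be sketched as follows. This is a minimal illustration rather than the authors' Matlab implementation: the function names, the four-term ontology and its annotation counts are hypothetical, and the natural logarithm is assumed for formula (3).

```python
import math

def information_content(count, total):
    """Formula (3): IC(t) = -log p(t), with p(t) estimated as count / total."""
    return -math.log(count / total)

def share_resnik(common_ancestors, ic):
    """Formulas (2)/(4): IC of the most informative common ancestor (MICA)."""
    return max(ic[t] for t in common_ancestors)

def sim_lin(shared, ic1, ic2):
    """Formula (5): shared information normalized by the IC of both terms."""
    return 2.0 * shared / (ic1 + ic2)

def sim_jiang(shared, ic1, ic2):
    """Formulas (6)-(7): semantic distance converted to a similarity."""
    distance = ic1 + ic2 - 2.0 * shared
    return 1.0 / (1.0 + distance)

# Hypothetical annotation counts for a tiny ontology: root -> a -> {t1, t2}.
counts = {"root": 1000, "a": 100, "t1": 10, "t2": 20}
ic = {term: information_content(c, counts["root"]) for term, c in counts.items()}

common = {"root", "a"}              # common ancestors of t1 and t2
shared = share_resnik(common, ic)   # equals IC("a"), the MICA
lin = sim_lin(shared, ic["t1"], ic["t2"])
jiang = sim_jiang(shared, ic["t1"], ic["t2"])
print(shared, lin, jiang)
```

Passing the shared information in as an argument keeps `sim_lin` and `sim_jiang` independent of how that quantity is obtained, so a different definition of shared information can be substituted without changing them.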

To address the problem caused by multiple inheritance, Couto et al.[37] employed the concept of disjunctive common ancestors (DCAs) and proposed a graph-based similarity measure (GraSM), in which they redefined the shared information content as the average of the information content of all disjunctive common ancestors of the two terms. According to Couto, a common ancestor of two terms is regarded as a disjunctive common ancestor if there exists a path from one of the terms to that ancestor which is different from any of the paths from the other term to the same common ancestor. Using this concept, Couto defined GraSM as follows:

$$sim_{GraSM}(t_1, t_2) = \overline{\{-\log p(t) \mid t \in DCA(t_1, t_2)\}} \tag{8}$$

where the overline denotes the average and $DCA(t_1, t_2)$ denotes the disjunctive common ancestors of the two terms:

$$DCA(t_1, t_2) = \{a_1 \mid a_1 \in CA(t_1, t_2) \wedge \forall a_2 : (a_2 \in CA(t_1, t_2) \wedge IC(a_1) \le IC(a_2) \wedge a_1 \ne a_2) \Rightarrow ((a_1, a_2) \in DA(t_1) \cup DA(t_2))\}$$

Here $CA(t_1, t_2)$ represents the common ancestors of $t_1$ and $t_2$, and $DA(t)$ is the set of disjunctive ancestors of a term:

$$DA(t) = \{(a_1, a_2) \mid (\exists p : p \in paths(a_1, t) \wedge a_2 \notin p) \wedge (\exists p : p \in paths(a_2, t) \wedge a_1 \notin p)\}$$

where $paths(a, t)$ gives the set of distinct paths from $t$ to $a$ in the DAG.
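The definitions of $DA(t)$ and $DCA(t_1, t_2)$ above can be made concrete with a brute-force sketch that enumerates paths explicitly. It is meant only to illustrate the definitions, not to reproduce the GraSM implementation, and path enumeration is exponential in the worst case. The edge list is an assumption chosen to be consistent with the toy graph discussed with Figure 4 in Section 2.4, and the IC values are hypothetical.

```python
def all_paths(children, src, dst):
    """Enumerate every directed path (as a node list) from src down to dst."""
    if src == dst:
        return [[dst]]
    paths = []
    for child in children.get(src, ()):
        for tail in all_paths(children, child, dst):
            paths.append([src] + tail)
    return paths

def ancestors(children, term):
    """Proper ancestors of term: nodes with at least one path down to it."""
    return {n for n in children if n != term and all_paths(children, n, term)}

def disjunctive_pair(children, a1, a2, term):
    """(a1, a2) in DA(term): each reaches term by some path avoiding the other."""
    return (any(a2 not in p for p in all_paths(children, a1, term)) and
            any(a1 not in p for p in all_paths(children, a2, term)))

def dca(children, ic, t1, t2):
    """Disjunctive common ancestors of t1 and t2, following the set definition."""
    common = ancestors(children, t1) & ancestors(children, t2)
    result = set()
    for a1 in common:
        rivals = [a2 for a2 in common if a2 != a1 and ic[a1] <= ic[a2]]
        if all(disjunctive_pair(children, a1, a2, t1) or
               disjunctive_pair(children, a1, a2, t2) for a2 in rivals):
            result.add(a1)
    return result

# Assumed edges (parent -> direct children) and hypothetical IC values.
children = {"n0": ["n1", "n2"], "n1": ["n5"], "n2": ["n3", "n4"],
            "n3": ["n5", "n6"], "n4": ["n5", "n6"]}
ic = {"n0": 0.0, "n1": 1.0, "n2": 1.0, "n3": 2.0, "n4": 2.0, "n5": 3.0, "n6": 3.0}

dca_set = dca(children, ic, "n5", "n6")
print(sorted(dca_set))
```

Under these assumed edges the sketch returns the same set that Section 2.4 reports for $DCA(n_5, n_6)$.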

For GraSM, the disjunctive common ancestors of a term are defined in a recursive way, and the computational complexity of detecting the DCAs of two GO terms is non-linear. This limits its use in large-scale studies and real-time scenarios. Moreover, the shared information will decrease even when the disjunctive common ancestors are inherited in parallel. To overcome the problem caused by parallel inheritance, Couto and colleagues proposed an updated measurement, DiShIn, which calculates the shared information content between two terms by counting the number of distinct paths from the common ancestors to the terms[28]. DiShIn increases the similarity value of terms that inherit information from a disjunctive common ancestor in parallel, as this indicates a stronger relation than inheritance by either term exclusively. Both GraSM and DiShIn opened a new door to semantic similarity measurement between GO terms by taking into account multiple common ancestors. Recently, DiShIn was even proposed as a solution for the next generation of similarity measures that can fully explore the semantics in ontologies[40].

2.2 Exclusively inherited shared information (EISI)

In the DAG structure of the GO graph, the nodes denote terms that represent the controlled biological vocabulary related to the molecular function, biological process or cellular component information of gene products, and the edges link different terms to each other by certain relationships, such as "part-of", "is-a", "regulates", and so on. All terms in the graph are organized in a hierarchical way: a term closer to the root of the graph has a more general biological meaning, whereas a term far from the root has a more specific biological meaning. In the hierarchical structure of the GO graph, there exist inheritance relationships among the terms in a path; that is to say, a term in a lower position inherits some meaning from its ancestors, which are biologically more general. Thus, the information becomes more and more specific from ancestral terms to descendant terms, and the semantics of GO terms also become more and more specific.

Due to the semantic inheritance property between terms in the GO graph, information is transmitted from parents to children step by step. In a given path from ancestor to descendant, the information of a node's ancestors is redundant for its children, as that node concentrates all the information related to its child nodes from its ancestors. Based on this observation, one may naturally think that all the ancestor nodes of a given common ancestor are redundant for the two descendant terms under investigation. This is true in the case of single inheritance; however, it may be problematic in the case of multiple inheritance, as there may exist some paths from a common ancestor to one of the two terms that are not shared by the other. That is to say, a common ancestor may provide some information to one of its descendants exclusively, and this specific information will vanish if we delete that common ancestor, even though the other shared information remains unchanged. Thus, previous similarity measures that considered all the common ancestors or only the most informative ancestor may have their limitations.

In order to address this problem, we define a common ancestor C as an exclusively inherited common ancestor (EICA) of its descendants A and B if C is directly inherited by some node that is inherited by A or B exclusively. According to this definition, an EICA provides some information specific to either A or B through an exclusive path connecting the EICA and one of the two descendant terms. Thus, the exclusively inherited common ancestors (EICAs) contain information that contributes to the relationship between the two terms, and they can be used to derive the similarity measurement of GO terms.

Figure 1 gives three toy examples to illustrate the inheritance relationship and the exclusively inherited common ancestors of two terms in the GO graph. Figure 1(a) shows a small example of single inheritance. In this scenario, $n_3$ is an exclusively inherited common ancestor of $n_4$ and $n_5$, as it has two direct child nodes that are not included in the common ancestor set $\{n_1, n_2, n_3\}$. Since $n_1$ is inherited by $n_2$, and $n_2$ itself is inherited by $n_3$, $n_3$ is the most informative ancestral node, with the most specific semantics to describe the relationship between $n_4$ and $n_5$. This example is no different from the case in which the ontology is a tree. Figure 1(b) is a multiple inheritance example in which $n_4$ is the only EICA. In this context, $\{n_1, n_2, n_3, n_4\}$ is the common ancestor set of $n_5$ and $n_6$: the information of $n_1$ is inherited by $n_2$, and that of $n_2$ is shared by $n_3$ and $n_4$ and then combined into $n_4$, which is inherited by $n_5$ and $n_6$. All the information from the common ancestors is concentrated in $n_4$ and shared by $n_5$ and $n_6$. This means that the information two terms share can be characterized by the most informative node if all direct children of a common ancestor are included in the common ancestor set. Figure 1(c) gives another example of multiple inheritance, with two exclusively inherited common ancestors. In this graph, $n_2$ inherits semantics from $n_1$ and is shared by $n_3$ and $n_4$. $n_3$ is an EICA, as it is shared by $n_5$ and $n_6$ through distinct paths. $n_4$ is a direct child of $n_2$ and is not included in the common ancestor set $\{n_1, n_2, n_3\}$, so $n_2$ is also an EICA of $n_5$ and $n_6$.

[Figure 1: three toy graphs, panels (a), (b) and (c), over nodes $n_1$ to $n_6$.]

Figure 1. Toy examples of inheritance relationships and the exclusively inherited common ancestors (EICAs) in GO. In (a), $n_1$, $n_2$ and $n_3$ are the common ancestors of $n_4$ and $n_5$, and $n_3$ is their EICA. In (b), $n_1$, $n_2$, $n_3$ and $n_4$ are the common ancestors of $n_5$ and $n_6$, and $n_4$ is their EICA. In (c), $n_1$, $n_2$ and $n_3$ are the common ancestors of $n_5$ and $n_6$, and $n_2$ and $n_3$ are their EICAs.

Given two terms $t_1$ and $t_2$ in the GO graph, we can find their common ancestor set $CA(t_1, t_2)$ and define their exclusively inherited common ancestor set as

$$EICA(t_1, t_2) = \{t : t \in CA(t_1, t_2) \wedge \exists t_x : (t_x \in Dchild(t) \wedge t_x \notin CA(t_1, t_2) \wedge (t_x \in A(t_1) \vee t_x \in A(t_2)))\} \tag{9}$$

where $Dchild(t)$ is the set of direct descendants of node $t$, and $A(t_1)$ and $A(t_2)$ are the sets containing the ancestral nodes of $t_1$ and $t_2$, respectively, as well as the node itself. The information content shared by $t_1$ and $t_2$ is called the exclusively inherited shared information (EISI), which can be quantified by simply taking the average of the information content of all the EICAs:

$$share_{EISI}(t_1, t_2) = \frac{1}{N} \sum_{t_i \in EICA(t_1, t_2)} IC(t_i) \tag{10}$$

where $N$ denotes the number of elements in $EICA(t_1, t_2)$.

EISI is a metric that quantifies the information shared by two GO terms. It can be used to measure the similarity of two terms in the same way as the shared information content in Resnik's measurement. Moreover, it can also be embedded into other information content-based semantic similarity measures, such as Lin's and Jiang and Conrath's measurements. After replacing the shared information content in Resnik's, Lin's and Jiang and Conrath's similarity measurements with EISI, we get three variants of similarity measures:

$$sim_{Resnik\_EISI}(t_1, t_2) = share_{EISI}(t_1, t_2) \tag{11}$$

$$sim_{Lin\_EISI}(t_1, t_2) = \frac{2 \times share_{EISI}(t_1, t_2)}{IC(t_1) + IC(t_2)} \tag{12}$$

$$sim_{JC\_EISI}(t_1, t_2) = \frac{1}{1 + IC(t_1) + IC(t_2) - 2 \times share_{EISI}(t_1, t_2)} \tag{13}$$

To explain how to identify the EICAs of two GO terms and compute the EISI based on them, we take a fragment of GO as an example. Figure 2 is a sub-graph consisting of eight GO terms and the corresponding relationships among these terms. $CA(n_6, n_7) = \{n_0, n_1, n_2, n_3\}$ is the common ancestor set shared by $n_6$ and $n_7$. Among its elements, $n_0$ and $n_1$ are redundant, as each is inherited by a single direct child ($n_1$ and $n_2$, respectively). In contrast, $n_2$ and $n_3$ are non-redundant, since their direct child sets $\{n_3, n_4\}$ and $\{n_6, n_7\}$ contain elements that are not inherited by both $n_6$ and $n_7$ ($n_4$ in $\{n_3, n_4\}$; $n_6$ and $n_7$ in $\{n_6, n_7\}$). Thus, $n_2$ and $n_3$ are the EICAs of $n_6$ and $n_7$, i.e., $EICA(n_6, n_7) = \{n_2, n_3\}$. The probability that each term occurs in the GO database and the corresponding information content are listed in Table 1. The information content shared by $n_6$ and $n_7$ is $share_{EISI}(n_6, n_7) = (IC(n_2) + IC(n_3)) / 2 = 3.3658$. Accordingly, the semantic similarity between $n_6$ and $n_7$ without and with EICAs can be calculated as

$$sim_{Resnik}(n_6, n_7) = share_{Resnik}(n_6, n_7) = 3.6424$$

$$sim_{Lin}(n_6, n_7) = \frac{2 \times share_{Resnik}(n_6, n_7)}{IC(n_6) + IC(n_7)} = 0.6508$$

$$sim_{JC}(n_6, n_7) = \frac{1}{1 + IC(n_6) + IC(n_7) - 2 \times share_{Resnik}(n_6, n_7)} = 0.2037$$

$$sim_{Resnik\_EISI}(n_6, n_7) = share_{EISI}(n_6, n_7) = \frac{1}{N} \sum_{a_i \in EICA(n_6, n_7)} IC(a_i) = 3.3658$$

$$sim_{Lin\_EISI}(n_6, n_7) = \frac{2 \times share_{EISI}(n_6, n_7)}{IC(n_6) + IC(n_7)} = 0.6014$$

$$sim_{JC\_EISI}(n_6, n_7) = \frac{1}{1 + IC(n_6) + IC(n_7) - 2 \times share_{EISI}(n_6, n_7)} = 0.1831$$

Table 1. The information content (IC) of the GO terms presented in Figure 2

| GO term | Frequency | Probability | IC |
|---|---|---|---|
| n0: biological process | 23254 | 1 | 0 |
| n1: establishment of localization | 3864 | 0.1662 | 1.7948 |
| n2: protein localization | 1059 | 0.0455 | 3.0892 |
| n3: cellular protein localization | 609 | 0.0262 | 3.6424 |
| n4: establishment of protein localization | 889 | 0.0382 | 3.2641 |
| n5: protein transport | 866 | 0.0372 | 3.2903 |
| n6: intracellular protein transport | 465 | 0.0200 | 3.9122 |
| n7: protein localization to paranode region of axon | 16 | 0.0007 | 7.2816 |
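The arithmetic of this worked example can be checked with a few lines of Python. The sketch below takes the frequencies from Table 1 and the sets $CA(n_6, n_7)$ and $EICA(n_6, n_7)$ as stated in the text; it assumes that the probability is the frequency divided by the frequency of the root term and that formula (3) uses the natural logarithm, which is what reproduces the IC column. Variable names are illustrative.

```python
import math

# Annotation frequencies copied from Table 1.
frequency = {"n0": 23254, "n1": 3864, "n2": 1059, "n3": 609,
             "n4": 889, "n5": 866, "n6": 465, "n7": 16}
total = frequency["n0"]  # assumption: the root frequency is the corpus size

# Formula (3) with the natural logarithm.
ic = {term: -math.log(f / total) for term, f in frequency.items()}

common_ancestors = ["n0", "n1", "n2", "n3"]  # CA(n6, n7), as given in the text
eicas = ["n2", "n3"]                         # EICA(n6, n7), as given in the text

share_resnik = max(ic[t] for t in common_ancestors)      # formula (4)
share_eisi = sum(ic[t] for t in eicas) / len(eicas)      # formula (10)

ic_sum = ic["n6"] + ic["n7"]
sim_lin = 2 * share_resnik / ic_sum                      # formula (5)
sim_jc = 1 / (1 + ic_sum - 2 * share_resnik)             # formula (7)
sim_lin_eisi = 2 * share_eisi / ic_sum                   # formula (12)
sim_jc_eisi = 1 / (1 + ic_sum - 2 * share_eisi)          # formula (13)

for name, value in [("share_resnik", share_resnik), ("share_eisi", share_eisi),
                    ("sim_lin", sim_lin), ("sim_jc", sim_jc),
                    ("sim_lin_eisi", sim_lin_eisi), ("sim_jc_eisi", sim_jc_eisi)]:
    print(f"{name} = {value:.4f}")
```

Rounded to four decimals, the printed values agree with the figures quoted above (3.6424, 3.3658, 0.6508, 0.2037, 0.6014 and 0.1831).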

As shown in Figure 2, protein localization and cellular protein localization are the two EICAs of protein localization to paranode region of axon and intracellular protein transport. Intracellular protein transport integrates information inherited from protein localization through two paths: one passes through cellular protein localization, and the other passes through protein transport. In contrast, protein localization to paranode region of axon only inherits information from cellular protein localization. This means that the two terms (i.e., $n_6$ and $n_7$) are associated by their EICAs from two aspects. Since the information content of protein localization is smaller than that of cellular protein localization, the semantic similarity with EICAs is smaller than that without EICAs.

[Figure 2: GO sub-graph over the eight terms $n_0$ to $n_7$ listed in Table 1.]

Figure 2. An example of multiple inheritance and the exclusively inherited common ancestors (EICAs) from the biological process aspect of GO. The nodes in blue ($n_0$, $n_1$, $n_2$ and $n_3$) are the common ancestors of $n_6$ and $n_7$, while those in blue with dashed circles ($n_2$ and $n_3$) are their EICAs.

2.3 Computational aspect of EISI

The procedure for the implementation of EISI includes four steps: (1) identify the common ancestors of the two terms; (2) sift the exclusively inherited common ancestors from the common ancestor set; (3) compute the information content of each term in the exclusively inherited common ancestor set; (4) compute the average of the information content. Steps (1) and (3) are the most time-consuming in this procedure, as they must first find the ancestor or descendant nodes of each term with a time complexity of $O(n \log n)$; the time complexities of these two steps are therefore $O(2n \log n)$ and $O(n^2 \log n)$, respectively. In order to reduce the run time in real-time applications, we can perform a preliminary computation step to identify the ancestors and descendants and to calculate the information content of each GO term for steps (1) and (3). In step (2), each term in the common ancestor set is checked for whether the exclusive inheritance condition is satisfied, and the time complexity of screening the EICAs is $O(n)$. Thus, in steps (1) and (3), we only need to look up the ancestor and descendant sets of each term in the pre-computed results with a time complexity of $O(\log n)$, and the computation of EISI can be achieved in time $O(n + (2 + n) \log n)$. The algorithm for the implementation of EISI is shown in Figure 3.
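The preliminary computation step can be sketched as a memoized traversal that builds, for every term, the set of its ancestors (including the term itself, as in $A(t)$ of formula (9)) together with the direct child sets. This is one possible way to do the precomputation, not necessarily the one used in the authors' implementation; the function names are illustrative, and the edges are assumed from the description of Figure 1(b).

```python
def build_ancestor_sets(parents):
    """Map every term to the set of all its ancestors plus the term itself."""
    cache = {}

    def visit(term):
        # Each term is expanded once; later lookups reuse the cached closure.
        if term not in cache:
            closure = {term}
            for parent in parents.get(term, ()):
                closure |= visit(parent)
            cache[term] = closure
        return cache[term]

    for term in parents:
        visit(term)
    return cache

def build_child_sets(parents):
    """Invert the direct parent lists into direct child sets."""
    children = {term: set() for term in parents}
    for term, direct_parents in parents.items():
        for parent in direct_parents:
            children.setdefault(parent, set()).add(term)
    return children

# Direct parents assumed from the description of Figure 1(b).
parents = {"n1": [], "n2": ["n1"], "n3": ["n2"], "n4": ["n2", "n3"],
           "n5": ["n4"], "n6": ["n4"]}

ancestor_sets = build_ancestor_sets(parents)
child_sets = build_child_sets(parents)
common = ancestor_sets["n5"] & ancestor_sets["n6"]
print(sorted(common))
```

Once these dictionaries exist, the per-pair work reduces to set lookups and intersections, which is the situation the complexity discussion above relies on.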


Algorithm cal_EISI(t1, t2)
Input: t1, t2, AncestorSet, ChildSet, ICSet
Output: EISI
Begin
    CommonAnSet ← GetCommonAnSet(t1, t2, AncestorSet)
    EICommonAnSet ← ∅
    UnionAnSet ← GetAnSet(t1, AncestorSet) ∪ GetAnSet(t2, AncestorSet)
    DiffAnSet ← UnionAnSet − CommonAnSet
    for each a in CommonAnSet do
        DirectChildSet ← GetDirectDescendant(a, ChildSet)
        tmpset ← DiffAnSet ∩ DirectChildSet
        if tmpset ≠ ∅
            EICommonAnSet ← EICommonAnSet ∪ {a}
        endif
    endfor
    EISI ← 0
    n ← 0
    for each eica in EICommonAnSet do
        EISI ← EISI + get_IC(eica, ICSet)
        n ← n + 1
    endfor
    return EISI / n
End

Figure 3. The algorithm for the implementation of EISI
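A direct Python transliteration of Figure 3 is sketched below, using set operations in place of the helper routines. It assumes that the ancestor sets include the term itself, as $A(t)$ does in formula (9), so that a direct child that is one of the two query terms is counted. The return value of 0 for an empty EICA set is an added assumption, since Figure 3 does not cover that case. The edges and IC values are hypothetical and chosen to match the toy graph discussed with Figure 4 in Section 2.4.

```python
def find_eicas(t1, t2, ancestor_set, child_set):
    """Screening loop of Figure 3: keep common ancestors with an exclusive child."""
    common = ancestor_set[t1] & ancestor_set[t2]
    # Nodes inherited by exactly one of the two terms (DiffAnSet in Figure 3).
    diff = (ancestor_set[t1] | ancestor_set[t2]) - common
    return {a for a in common if diff & child_set.get(a, set())}

def cal_eisi(t1, t2, ancestor_set, child_set, ic_set):
    """Average information content of the EICAs, as in formula (10)."""
    eicas = find_eicas(t1, t2, ancestor_set, child_set)
    if not eicas:
        return 0.0  # assumption: this case is not specified in Figure 3
    return sum(ic_set[a] for a in eicas) / len(eicas)

# Pre-computed inputs for an assumed toy graph (ancestor sets include the term).
ancestor_set = {"n5": {"n0", "n1", "n2", "n3", "n4", "n5"},
                "n6": {"n0", "n2", "n3", "n4", "n6"}}
child_set = {"n0": {"n1", "n2"}, "n1": {"n5"}, "n2": {"n3", "n4"},
             "n3": {"n5", "n6"}, "n4": {"n5", "n6"}}
ic_set = {"n0": 0.0, "n1": 1.0, "n2": 1.0, "n3": 2.0, "n4": 2.0}  # hypothetical

eica_set = find_eicas("n5", "n6", ancestor_set, child_set)
eisi = cal_eisi("n5", "n6", ancestor_set, child_set, ic_set)
print(sorted(eica_set), eisi)
```

With these inputs the screening keeps $n_0$, $n_3$ and $n_4$ and drops $n_2$, whose direct children are both common ancestors.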

2.4 The relationship between EICA and DCA

According to Couto[28], the paths from different disjunctive ancestors to a child node in the GO graph are independent, as there exists at least one node in one path that is not in the other paths; thus different disjunctive ancestors represent distinct interpretations of a term. For a disjunctive common ancestor of two terms, this means that the paths from this ancestor and from all its children (those that are also common ancestors of the two terms) to either term are independent, which implies that a disjunctive common ancestor can provide distinct interpretations to either term. As mentioned above, an exclusively inherited common ancestor can provide different information to its descendant nodes. In this respect, a disjunctive common ancestor is related to an exclusively inherited common ancestor, i.e., both can provide distinct interpretations of a term. However, they are not exactly the same. According to the definition of EICA, a common ancestor is an EICA of two terms if, from this node downwards, we can find a direct child inherited by one of the terms exclusively. The path from that direct child to the term is independent of the paths from the EICA, or from any more informative ancestor, to the same term. This means that the EICA is a DCA, as the EICA and each more informative ancestor are disjunctive ancestors of the term. On the other hand, a DCA may not be an EICA, because the restriction for an EICA is tougher: an EICA must be inherited by one of the terms through some paths exclusively. Based on the above intuition, we formalize the relationship between EICA and DCA as the following proposition.

Proposition 1: Suppose that $c_1$ and $c_2$ are two GO terms, and $DCA(c_1, c_2)$ and $EICA(c_1, c_2)$ are their DCA set and EICA set, respectively. If $c \in EICA(c_1, c_2)$, then $c \in DCA(c_1, c_2)$.

Proof. Let $Ancestor(c)$ denote the ancestor set of a term $c$, $Dchild(c)$ the set of all direct children of $c$, and $CA(c_1, c_2)$ the common ancestor set of $c_1$ and $c_2$. By the definition of EICA, if $c \in EICA(c_1, c_2)$, there exists a term $c_x$ such that $(c_x \in Dchild(c) \wedge c_x \notin CA(c_1, c_2))$ and $(c_x \in Ancestor(c_1) \vee c_x \in Ancestor(c_2))$.

Without loss of generality, let $c_x \in Ancestor(c_1)$ with $(c_x \in Dchild(c) \wedge c_x \notin CA(c_1, c_2))$. Then $c_x$ is an ancestor of $c_1$ and is inherited only by $c_1$ and by those ancestors of $c_1$ that are more informative than $c_x$. There must therefore be a path from $c_x$ to $c_1$, denoted by $\langle c_x \to c_1 \rangle$, that passes only through children of $c_x$ inherited by $c_1$ alone and through no common ancestor. On the other hand, for $c$ and for each common ancestor $c_i$ that is more informative than $c$, we can find a path from that node to $c_1$ that passes through other common ancestors but through no node of $\langle c_x \to c_1 \rangle$. Consequently, $c$ and every element of the set $\{c_i \mid c_i \in CA(c_1, c_2) \wedge IC(c_i) > IC(c)\}$ are disjunctive ancestors of $c_1$. That is to say, $c$ is a disjunctive common ancestor of $c_1$ and $c_2$, i.e., $c \in DCA(c_1, c_2)$.

Figure 4. A toy directed acyclic graph illustrating the difference between disjunctive common ancestors (DCAs) and exclusively inherited common ancestors (EICAs). Nodes $n_0$, $n_3$ and $n_4$ are the EICAs of $n_5$ and $n_6$, while $n_0$, $n_2$, $n_3$ and $n_4$ are the DCAs of $n_5$ and $n_6$.

Figure 4 shows a toy directed acyclic graph. The common ancestor set of nodes $n_5$ and $n_6$ is $\{n_0, n_2, n_3, n_4\}$. Checking the inheritance property of the direct children of each element in this set shows that $n_0$, $n_3$ and $n_4$ are EICAs of $n_5$ and $n_6$, whereas $n_2$ is not an EICA because all of its direct children (i.e., $n_3$ and $n_4$) are inherited by both terms. As for disjunctive ancestors, $n_0$ and $n_2$ are disjunctive ancestors of $n_5$, since the path $\langle n_2, n_3, n_5 \rangle$ does not pass through $n_0$ and the path $\langle n_0, n_1, n_5 \rangle$ does not pass through $n_2$. In the same way, we obtain $DCA(n_5) = \{n_0, n_1, n_2, n_3, n_4\}$ and $DCA(n_5, n_6) = \{n_0, n_2, n_3, n_4\}$. This example shows that an EICA of two terms is also a DCA, whereas the converse is not true. The EICA set may therefore contain fewer terms than the DCA set, so our approach needs less computation to obtain the shared information content than methods based on DCAs, which makes it more feasible for large-scale datasets in practice.
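The EICA check on the toy graph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the edge set `CHILDREN` is an assumption chosen to be consistent with the sets quoted for Figure 4, the helper names `ancestors`, `common_ancestors` and `eica` are hypothetical, and a term is treated as part of its own lineage.

```python
# Toy DAG in the spirit of Figure 4 (parent -> direct children).
# ASSUMPTION: the exact edges are not given in the text; this edge set is
# one choice that reproduces the common-ancestor and EICA sets it quotes.
CHILDREN = {
    "n0": {"n1", "n2"},
    "n1": {"n5"},
    "n2": {"n3", "n4"},
    "n3": {"n5", "n6"},
    "n4": {"n5", "n6"},
    "n5": set(),
    "n6": set(),
}


def ancestors(term):
    """Return every node from which `term` is reachable (term excluded)."""
    found = set()
    frontier = [term]
    while frontier:
        node = frontier.pop()
        for parent, kids in CHILDREN.items():
            if node in kids and parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found


def common_ancestors(t1, t2):
    """Common ancestor set CA(t1, t2)."""
    return ancestors(t1) & ancestors(t2)


def eica(t1, t2):
    """Common ancestors that have a direct child inherited by exactly one term."""
    # ASSUMPTION: a term belongs to its own lineage, so a direct child that
    # is t1 or t2 itself counts as exclusively inherited.
    lineage1 = ancestors(t1) | {t1}
    lineage2 = ancestors(t2) | {t2}
    exclusive = lineage1 ^ lineage2  # nodes on the lineage of only one term
    return {c for c in common_ancestors(t1, t2) if CHILDREN[c] & exclusive}


print(sorted(common_ancestors("n5", "n6")))
print(sorted(eica("n5", "n6")))
```

Here `n2` is filtered out because both of its direct children are themselves common ancestors, which mirrors the argument in the text.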

To assess the efficiency of EISI in detecting the EICAs of a pair of GO terms, we conducted three groups of experiments with different numbers of GO term pairs. The term pairs in each group were randomly selected from the GO graph: the first group contained 10 term pairs, the second 100 term pairs, and the third 1000 term pairs. For each sample size, the comparison was repeated 10 times, detecting the EICAs and the disjunctive common ancestors (DCAs) of each term pair with the different methods. In each run we computed the average time needed to detect the DCAs or EICAs of a term pair, together with the standard deviation and the coefficient of variation. These three indexes were then averaged over the runs for each sample size; the results are listed in Table 2. Our approach takes less time on average to detect the EICAs of a pair of terms, with a smaller standard deviation and coefficient of variation. Taking the yeast annotation dataset as an example, each gene is annotated with 9.7 GO terms on average, so computing the similarity between a pair of genes with the pairwise strategy requires detecting the EICAs of about 94 pairs of terms. EISI needs about 4.7 seconds for this, whereas GraSM and DiShIn need about 795 and 175 seconds, respectively, to detect the DCAs. In this sense, EISI is better suited to real-time applications that have no precomputed EICAs. It should be noted that, once the EICAs or the DCAs have been detected, the time complexity of computing the similarity values is O(n). Thus, if GraSM and DiShIn perform a preliminary calculation and store the results in a database, our approach shows no advantage in time complexity. However, the GO database changes frequently (the number of terms in the CESSM database was 26563 in the August 2008 release and had increased to 41827 in the September 2014 release), so the preliminary calculation has to be repeated often and takes increasingly long. Our approach needs no such preliminary calculation and can compute the similarity values in real time.

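Table 2. Efficiencies of different methods on the three datasets of various sample sizes

| index | sample size | EISI | GraSM | DiShIn |
|---|---|---|---|---|
| average time (second) | 10 | 0.0502 | 10.8577 | 1.9225 |
| average time (second) | 100 | 0.0474 | 6.8311 | 1.7736 |
| average time (second) | 1000 | 0.0518 | 7.6945 | 1.9137 |
| standard deviation | 10 | 0.0178 | 22.4257 | 1.6026 |
| standard deviation | 100 | 0.0165 | 13.4633 | 1.2810 |
| standard deviation | 1000 | 0.0184 | 19.4611 | 1.6879 |
| coefficient of variation | 10 | 0.3469 | 1.7100 | 0.7539 |
| coefficient of variation | 100 | 0.3485 | 1.8666 | 0.7150 |
| coefficient of variation | 1000 | 0.3541 | 2.4831 | 0.8748 |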

The experiments were conducted in MATLAB 2008a on a machine with 2.67GHz Intel quad core processors and 4 GB of RAM.
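The per-gene-pair estimates quoted for yeast follow from the Table 2 averages by simple arithmetic. The sketch below reproduces that calculation; the function name `gene_pair_cost` and the choice to average the three sample sizes are assumptions made for illustration, not something the text states explicitly.

```python
from statistics import mean

# Average detection time per term pair (seconds), read from Table 2
# for sample sizes 10, 100 and 1000.
AVG_TIME = {
    "EISI": [0.0502, 0.0474, 0.0518],
    "GraSM": [10.8577, 6.8311, 7.6945],
    "DiShIn": [1.9225, 1.7736, 1.9137],
}

TERMS_PER_GENE = 9.7  # average number of GO terms per yeast gene (from the text)


def gene_pair_cost(method):
    """Estimated seconds to detect the ancestors needed for one gene pair."""
    term_pairs = round(TERMS_PER_GENE ** 2)  # 9.7 * 9.7 is about 94 term pairs
    # ASSUMPTION: the per-pair time is the mean over the three sample sizes.
    return term_pairs * mean(AVG_TIME[method])


for method in AVG_TIME:
    print(f"{method}: {gene_pair_cost(method):.1f} s per gene pair")
```

Under these assumptions the estimates come out at roughly 4.7 s for EISI, 795 s for GraSM and just under 176 s for DiShIn, in line with the figures quoted in the text.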

3. VALIDATION OF OUR APPROACH

3.1 Dataset

The GO database and gene annotations

The GO consortium provides publicly available releases of the Gene Ontology database and gene annotation datasets. In this study, the GO database and the gene annotation datasets released in April 2013 were used to test our approach. The GO database contains 25370 BP, 3295 CC, and 10445 MF terms. The gene annotation dataset contains 91133 annotations of 6381 genes for the yeast genome.

Artificial scored dataset

A dataset with artificially scored semantic similarities was used to validate our measurement. It contains 30 pairs of terms selected from the GO database. Ten biological researchers outside our research group were invited to score, according to their knowledge, how similar the two terms of each pair are. The scores ranged from 0 to 10: the score was 0 if the two terms were orthogonal, and it reached the maximum of 10 when the two terms were identical. The average of the scores for each term pair was taken as its artificial semantic similarity. The 30 pairs of GO terms and their artificial semantic similarity values are listed in Table 3.

Table 3. The GO terms and the corresponding scores in the artificial dataset

| id | GO ID | term1 | GO ID | term2 | score |
|---|---|---|---|---|---|
| 1 | GO:0015422 | oligosaccharide-transporting ATPase activity | GO:0015423 | maltose-transporting ATPase activity | 8.3 |
| 2 | GO:0015279 | store-operated calcium channel activity | GO:0005245 | voltage-gated calcium channel activity | 8.0 |
| 3 | GO:0016871 | cycloartenol synthase activity | GO:0009982 | pseudouridine synthase activity | 4.35 |
| 4 | GO:0015576 | glucitol transporter activity | GO:0015170 | propanediol transporter activity | 2.15 |
| 5 | GO:0015164 | glucuronoside transmembrane transporter activity | GO:0015170 | propanediol transmembrane transporter activity | 2.15 |
| 6 | GO:0004620 | phospholipase activity | GO:0004630 | phospholipase D activity | 8.45 |
| 7 | GO:0045517 | interleukin-20 receptor binding | GO:0045518 | interleukin-22 receptor binding | 7.5 |
| 8 | GO:0004004 | ATP-dependent RNA helicase activity | GO:0015611 | D-ribose-importing ATPase activity | 3.3 |
| 9 | GO:0003693 | P-element binding | GO:0004525 | ribonuclease III activity | 2.1 |
| 10 | GO:0000034 | adenine deaminase activity | GO:0019239 | deaminase activity | 1.5 |
| 11 | GO:0016618 | hydroxypyruvate reductase activity | GO:0004450 | isocitrate dehydrogenase (NADP+) activity | 5.25 |
| 12 | GO:0046524 | sucrose-phosphate synthase activity | GO:0045509 | interleukin-27 receptor activity | 0.8 |
| 13 | GO:0045509 | interleukin-27 receptor activity | GO:0001609 | G-protein coupled adenosine receptor activity | 3.3 |
| 14 | GO:0005117 | wishful thinking binding | GO:0005127 | ciliary neurotrophic factor receptor binding | 4.7 |
| 15 | GO:0019381 | atrazine catabolic process | GO:0018965 | s-triazine compound metabolic process | 2.6 |
| 16 | GO:0015595 | spermidine-importing ATPase activity | GO:0046923 | ER retention sequence binding | 0.9 |
| 17 | GO:0004438 | phosphatidylinositol-3-phosphatase activity | GO:0004478 | methionine adenosyltransferase activity | 2.1 |
| 18 | GO:0004631 | phosphomevalonate kinase activity | GO:0000033 | alpha-1,3-mannosyltransferase activity | 2.9 |
| 19 | GO:0015648 | lipid-linked peptidoglycan transporter activity | GO:0015233 | pantothenate transmembrane transporter activity | 2.53 |
| 20 | GO:0004791 | thioredoxin-disulfide reductase activity | GO:0034061 | DNA polymerase activity | 1.35 |
| 21 | GO:0008296 | 3'-5'-exodeoxyribonuclease activity | GO:0008296 | 3'-5'-exodeoxyribonuclease activity | 10 |
| 22 | GO:0005245 | voltage-gated calcium channel activity | GO:0008830 | dTDP-4-dehydrorhamnose 3,5-epimerase activity | 0.5 |
| 23 | GO:0047458 | beta-pyrazolylalanine synthase activity | GO:0046408 | chlorophyll synthetase activity | 6.8 |
| 24 | GO:0046565 | 3-dehydroshikimate dehydratase activity | GO:0045075 | regulation of interleukin-12 biosynthetic process | 0 |
| 25 | GO:0008031 | eclosion hormone activity | GO:0046905 | phytoene synthase activity | 0.2 |
| 26 | GO:0008142 | oxysterol binding | GO:0003909 | DNA ligase activity | 0.2 |
| 27 | GO:0019194 | sorbose transmembrane transporter activity | GO:0017150 | tRNA dihydrouridine synthase activity | 0.4 |
| 28 | GO:0019184 | nonribosomal peptide biosynthetic process | GO:0018251 | peptidyl-tyrosine dehydrogenation | 0.5 |
| 29 | GO:0000975 | regulatory region DNA binding | GO:0051021 | GDP-dissociation inhibitor binding | 1.2 |
| 30 | GO:0031300 | intrinsic to organelle membrane | GO:0044440 | endosomal part | 1.7 |

Pathway dataset

The Saccharomyces Genome Database (SGD) (http://pathway.yeastgenome.org/biocyc/), a collection of manually curated metabolic pathways and enzymes of Saccharomyces cerevisiae, was used for validation in this study. The pathway dataset contains classification and annotation information for the genes in each pathway. There are 187 biological pathways in the SGD database (as of September 23, 2013). Most of these pathways contain more than three genes that are manually annotated with both Enzyme Commission (EC) numbers and molecular function GO terms. For instance, the glutathione-glutaredoxin redox reactions pathway contains six genes: GPX1, GPX2, HYR1, GLR1, GTT1 and GTT2. Among these genes, GPX1, GPX2 and HYR1 are annotated with the same EC number and mostly with the same GO terms; likewise, GTT1 and GTT2 are annotated with another EC number and mostly with the same group of GO terms. By contrast, the EC number and GO terms annotating GLR1 are less similar to those annotating the other five genes. According to SGD, the six genes in this pathway are manually divided into three classes, as illustrated in Figure 5. This kind of prior knowledge provides similarity information between genes at the functional level, i.e., genes with the same EC number are more similar than genes with different EC numbers. Based on this prior functional information, we constructed a model tree for each pathway and took it as the ground truth for validating our approach, by comparing it with the clustering tree derived from EISI. To prepare the data for clustering, entries lacking an EC number or gene name, as well as pathways with fewer than three genes, were removed from SGD. The final dataset contains 109 pathways with at least three genes annotated with EC numbers and GO terms.

[Figure 5, panel (a): glutathione-glutaredoxin redox reactions. Enzyme labels: glutathione oxidoreductase GLR1 (EC 1.8.1.7); glutathione peroxidase GPX1, GPX2, HYR1 (EC 1.11.1.9); glutathione transferase GTT1, GTT2 (EC 2.5.1.18). Panel (b): manual clustering result of pathway (a), with leaves GPX1, GPX2, HYR1, GTT1, GTT2 and GLR1.]

Figure 5. Functions of genes in an S. cerevisiae pathway and the corresponding manual clustering result. (b) is the model tree, which is consistent with the clustering result based on the Resnik measure with EISI.
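The grouping behind a model tree, in which genes sharing an EC number fall into the same functional class, can be sketched as follows. The EC numbers are those shown in Figure 5 (a); the function name `model_classes` and the representation of classes as frozensets are assumptions made for this illustration.

```python
from collections import defaultdict

# EC numbers of the six genes in the glutathione-glutaredoxin redox
# reactions pathway, as labelled in Figure 5 (a).
EC_NUMBER = {
    "GPX1": "1.11.1.9",
    "GPX2": "1.11.1.9",
    "HYR1": "1.11.1.9",
    "GTT1": "2.5.1.18",
    "GTT2": "2.5.1.18",
    "GLR1": "1.8.1.7",
}


def model_classes(ec_by_gene):
    """Group genes by EC number; each group is one class of the model tree."""
    groups = defaultdict(set)
    for gene, ec in ec_by_gene.items():
        groups[ec].add(gene)
    return {frozenset(genes) for genes in groups.values()}


for cls in sorted(model_classes(EC_NUMBER), key=lambda c: sorted(c)):
    print(sorted(cls))
```

For this pathway the grouping yields the three classes described in the text.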

3.2 Validation criteria

In this study, three indicators were used to assess the performance of our similarity measure: the Pearson correlation coefficient, the Robinson-Foulds (RF) distance [41] and the Percentage of Correct pathways (PC). The Pearson correlation coefficient quantifies the strength of the relationship between the artificial scores and the semantic similarity values derived from EISI on the artificial scored dataset; a larger coefficient means that the EISI-based similarity values agree better with the experts' scores. The RF and PC measures were used to validate the performance of our method on the pathway dataset.
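For readers who want to reproduce the first indicator, the Pearson correlation coefficient can be computed as in the sketch below. The `expert` values are the first four scores of Table 3; the `computed` values are hypothetical placeholders, since the per-pair similarity values are not listed in the text.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


expert = [8.3, 8.0, 4.35, 2.15]   # first four scores of Table 3
computed = [0.9, 0.7, 0.5, 0.1]   # HYPOTHETICAL similarity values
print(f"r = {pearson(expert, computed):.4f}")
```

Because the coefficient is invariant to positive linear rescaling, the 0 to 10 expert scale and a 0 to 1 similarity scale can be compared directly.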

As mentioned previously, the EC number provides similarity information between genes at the functional level, i.e., genes with the same EC number tend to perform the same biological function in a pathway. We can therefore cluster genes into functional classes according to their EC numbers; the clustering result based on this prior knowledge is referred to as a model tree in this study. To assess our semantic similarity measurement, we clustered the genes in each pathway based on the similarity values derived from EISI and compared the structures of our clustering trees with those of the model trees: the more consistent they are, the better our approach performs. RF and PC were adopted to quantify the extent to which the model trees and our results differ and agree, respectively.

The RF distance between two trees counts the bipartitions that differ between them. It is defined as follows. Suppose $T$ is a tree whose leaves are labeled by a set $S$ of entities, and $e = (u, v)$ is an edge in $T$. Deleting $e$ produces a bipartition $\pi_e$ of the leaves, dividing $S$ into two subsets: the leaves on one side of the edge and all the other leaves. We can thus define the bipartition set $Bp(T) = \{\pi_e : e \in Edge(T)\}$, where $Edge(T)$ is the set of all edges in $T$. If $MTree$ is a model tree and $ITree$ is a tree inferred by a certain classification method, we define the false negatives to be the cardinality of the set $Bp(MTree) - Bp(ITree)$ and the false positives to be that of $Bp(ITree) - Bp(MTree)$. The false negative rate and false positive rate are then defined as

$$FNR = \frac{|Bp(MTree) - Bp(ITree)|}{N_1} \qquad (14)$$

$$FPR = \frac{|Bp(ITree) - Bp(MTree)|}{N_2} \qquad (15)$$

where $N_1$ and $N_2$ are the numbers of branches in $MTree$ and $ITree$, respectively; $FNR$ and $FPR$ take values between 0 and 1. When $MTree$ and $ITree$ are both binary trees, we have $FPR = FNR$ and $N_1 = N_2 = N$, and the RF distance between them is simply the average of these two rates:

$$RF(MTree, ITree) = \frac{FPR + FNR}{2} = \frac{|Bp(ITree) - Bp(MTree)| + |Bp(MTree) - Bp(ITree)|}{2N} \qquad (16)$$

An RF value of 0 implies that the two trees have identical topology, while a value of 1 means that they have no bipartitions in common. In other words, the smaller the RF value, the more similar the model tree and the inferred tree, and the better the performance of our semantic similarity measurement.
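The RF computation can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: trees are written as nested tuples of leaf names, only non-trivial bipartitions (both sides containing at least two leaves) are counted, and $N_1$ and $N_2$ are taken to be the numbers of such bipartitions. The function names and the example `inferred` tree are hypothetical.

```python
def leaves(tree):
    """Set of leaf labels below a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(sub) for sub in tree))


def bipartitions(tree):
    """Non-trivial leaf bipartitions induced by the internal edges of a tree."""
    all_leaves = leaves(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_leaves - side
        # ASSUMPTION: splits that isolate a single leaf are shared by every
        # tree on the same leaf set, so they are left out.
        if len(side) > 1 and len(other) > 1:
            splits.add(frozenset([side, other]))
        for sub in node:
            visit(sub)

    visit(tree)
    return splits


def rf_distance(model, inferred):
    """Average of the false negative and false positive rates."""
    bp_model, bp_inferred = bipartitions(model), bipartitions(inferred)
    fnr = len(bp_model - bp_inferred) / len(bp_model) if bp_model else 0.0
    fpr = len(bp_inferred - bp_model) / len(bp_inferred) if bp_inferred else 0.0
    return (fpr + fnr) / 2


model = (("GPX1", "GPX2", "HYR1"), ("GTT1", "GTT2"), "GLR1")
inferred = ((("GPX1", "GPX2"), "HYR1"), (("GTT1", "GLR1"), "GTT2"))  # hypothetical
print(rf_distance(model, model), rf_distance(model, inferred))
```

When both trees are binary with the same number of internal branches, the two rates coincide and the function reduces to the closed form with denominator $2N$.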

To compute the PC measure, we first define the fitness of a clustering tree as

$$Fit\_tree = \begin{cases} 1, & RF(MTree, ITree) = 0 \\ 0, & \text{otherwise} \end{cases} \qquad (17)$$

Note that the index $Fit\_tree$ indicates whether the clustering result is completely consistent with the model tree. The PC measure is the percentage of pathways that are correctly clustered, computed as

$$PC = 100 \times \frac{1}{n} \sum_{k=1}^{n} Fit\_tree(k) \qquad (18)$$

where $n$ is the number of pathways in SGD.

3.3 Results and discussion

To show the effectiveness of our approach, the semantic similarity measures derived from EISI were evaluated against those of Resnik [36], Lin [26] and Jiang and Conrath [29], which are also based on shared information and are widely used for comparison. Because our method relies on the multiple-inheritance property of the GO structure, as do the methods previously proposed by Couto et al. [28, 37], we also compared our results with those produced by these approaches. The next two subsections compare our measurements with the others on the artificial dataset and the SGD dataset, respectively. In our experiments, the information content of a GO term was estimated with a strategy commonly applied to GO [42], in which the frequency of a term in a corpus is determined by the number of gene products annotated with the term or with any of its descendants in the GO graph.

3.3.1 Results on artificial scored dataset

In the present study, similarity measurements based on different kinds of shared information content (i.e., EISI, GraSM, DiShIn and MICA) were evaluated. For each kind, the information content that two terms have in common was computed and plugged into Resnik's, Lin's and Jiang and Conrath's methods to derive the corresponding semantic similarity measures. Thus, for each kind of shared information we obtained three kinds of semantic similarity values, which are referred to as Resnik, Lin and Jiang for convenience. These semantic similarities were then compared with the artificial scores. Table 4 lists the Pearson correlation coefficients between the artificial scores and the different similarity measures. Resnik's semantic similarity measure showed the best performance among the three kinds of measures, with EISI reaching the maximum coefficient of 0.9036; conversely, Jiang's measure performed worst, with the minimum coefficient of 0.4355 obtained with DiShIn. As for the different types of shared information content, the measurements based on multiple common ancestors generally performed better than those based on MICA. EISI achieved greater correlation coefficients than GraSM and DiShIn; DiShIn in turn generally outperformed GraSM, which was followed by MICA. It is worth noting that the gap in correlation coefficient between Lin's and Jiang's measures was about 0.08 with EISI, much smaller than with GraSM and DiShIn (about 0.3 and 0.4, respectively). This implies that EISI produced more stable results than the other methods based on multiple common ancestors.
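How one shared-information value feeds the three measures can be sketched as below. The formulas are the standard textbook forms of the Resnik, Lin and Jiang-Conrath measures, with the shared information content passed in as a parameter so that it may come from MICA, GraSM, DiShIn or EISI. Two points are assumptions of this sketch and are not confirmed by the text: the conversion of the Jiang-Conrath distance into a similarity via 1 / (1 + distance), and the value returned by `lin` when both terms have zero information content.

```python
from math import log


def information_content(annotated, total):
    """IC = -log p, with p the share of gene products annotated with the
    term or with any of its descendants."""
    return -log(annotated / total)


def resnik(shared_ic):
    """Resnik similarity is the shared information content itself."""
    return shared_ic


def lin(shared_ic, ic1, ic2):
    """Lin similarity: shared IC relative to the ICs of the two terms."""
    if ic1 + ic2 == 0:
        return 1.0  # ASSUMPTION: two uninformative terms count as identical
    return 2 * shared_ic / (ic1 + ic2)


def jiang_conrath(shared_ic, ic1, ic2):
    """Jiang-Conrath distance mapped to a similarity in (0, 1]."""
    distance = ic1 + ic2 - 2 * shared_ic
    return 1 / (1 + distance)  # ASSUMPTION: one common distance-to-similarity map


# Hypothetical annotation counts out of 1000 gene products.
ic_a = information_content(10, 1000)
ic_b = information_content(20, 1000)
shared = information_content(100, 1000)  # hypothetical shared IC
print(resnik(shared), lin(shared, ic_a, ic_b), jiang_conrath(shared, ic_a, ic_b))
```

Swapping the source of `shared` while keeping the three functions fixed mirrors the experimental design: four kinds of shared information times three measures.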

Table 4. Pearson's correlation coefficients between the artificial scores and semantic similarity values produced by EISI and other methods

| | MICA | GraSM | DiShIn | EISI |
|---|---|---|---|---|
| Resnik | 0.8229 | 0.8348 | 0.8798 | 0.9036 |
| Lin | 0.8336 | 0.8361 | 0.8543 | 0.9028 |
| Jiang | 0.7830 | 0.5484 | 0.4355 | 0.8248 |

3.3.2 Results on pathway dataset

To demonstrate the effectiveness of EISI in real biological scenarios, we applied the similarity measure to investigate the relationships among genes in each pathway of SGD. As a gene may be annotated with multiple GO terms, the pairwise strategy with the average-maximum rule [43] was adopted to estimate the semantic similarity between two genes. After obtaining the semantic similarity for each pair of genes in a pathway, we constructed a similarity matrix M, in which each element Mij denotes the similarity between genes i and j. The spectral graph clustering algorithm [44] was then applied to M to cluster the genes into classes and form a clustering tree, and the RF distance and fitness of each clustering tree were computed according to formulas (16) and (17), respectively. Finally, the PC value and the average RF distance were calculated over all pathways to evaluate the consistency of the clustering trees with the model trees, and thereby the performance of our method.

Resnik's, Lin's and Jiang's similarity measures based on EISI, MICA, GraSM and DiShIn were each used to construct a gene similarity matrix for every pathway in SGD; the genes were then clustered into a clustering tree, which was compared with the corresponding model tree, and the RF distance and the Percentage of Correct trees (PC) were calculated. The results of our method and of the others on the 109 pathways are shown in Table 5. The measures based on EISI obtained the smallest average RF distances and the highest PC values in all scenarios. In general, the methods based on information derived from multiple common ancestors (i.e., EISI, GraSM and DiShIn) tended to outperform those based on a single common ancestor (i.e., MICA), with smaller average RF values and larger PC values. Among the methods based on multiple inheritance, our approach performed better than GraSM and DiShIn, which suggests that semantic similarity based on EISI is more consistent with prior knowledge of gene function in the pathways.
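The step from term similarities to a gene similarity matrix can be sketched as follows. The exact form of the average-maximum rule of [43] is not spelled out in the text, so this sketch assumes one common reading, in which each term is matched to its best counterpart in the other gene and the best-match scores are averaged over both directions. The toy term similarities, the annotations and all function names are hypothetical.

```python
def gene_similarity(terms_a, terms_b, term_sim):
    """Average of best-match term similarities in both directions.

    ASSUMPTION: this is one common reading of an average-maximum rule;
    the variant used in the paper may differ in detail.
    """
    if not terms_a or not terms_b:
        return 0.0
    best_a = [max(term_sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(term_sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))


def similarity_matrix(genes, annotations, term_sim):
    """Matrix M with M[i][j] the similarity between genes i and j."""
    return [
        [gene_similarity(annotations[gi], annotations[gj], term_sim) for gj in genes]
        for gi in genes
    ]


# Hypothetical term similarities and annotations, for illustration only.
TOY_SIM = {frozenset(["t1", "t3"]): 0.8, frozenset(["t2", "t3"]): 0.4}


def toy_term_sim(a, b):
    return 1.0 if a == b else TOY_SIM.get(frozenset([a, b]), 0.0)


ANNOTATIONS = {"geneA": ["t1", "t2"], "geneB": ["t3"]}
M = similarity_matrix(["geneA", "geneB"], ANNOTATIONS, toy_term_sim)
print(M)
```

Any term-level measure, such as a Resnik score computed from EISI, can be passed in as `term_sim`; the resulting matrix is what a clustering algorithm would then consume.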

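Table 5. The comparison results on pathway dataset produced by EISI and other methods

| index | measure | MICA | GraSM | DiShIn | EISI |
|---|---|---|---|---|---|
| PC value (%) | Resnik | 66.97 | 67.31 | 67.94 | 68.81 |
| PC value (%) | Lin | 62.39 | 64.24 | 65.35 | 66.06 |
| PC value (%) | Jiang | 62.39 | 63.31 | 64.13 | 64.22 |
| Average RF distance | Resnik | 0.0639 | 0.0604 | 0.0612 | 0.0595 |
| Average RF distance | Lin | 0.0727 | 0.0684 | 0.0693 | 0.0676 |
| Average RF distance | Jiang | 0.0732 | 0.0803 | 0.0713 | 0.0696 |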

The PC value in Table 5 denotes the percentage of trees whose Robinson-Foulds (RF) distance equals zero; a larger PC value means that more tree topologies are correctly inferred. The average RF distance is the mean RF distance between the inferred trees and the model trees; a smaller average RF distance means better performance of the corresponding similarity measure.
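The two summary indicators can be computed from a list of per-pathway RF distances as in the sketch below. The list `rf_values` is hypothetical, and the function names are chosen for illustration.

```python
def fit_tree(rf):
    """Fitness of a clustering tree: 1 if it matches the model tree exactly."""
    return 1 if rf == 0 else 0


def pc_value(rf_values):
    """Percentage of pathways whose inferred tree has RF distance zero."""
    return 100.0 * sum(fit_tree(rf) for rf in rf_values) / len(rf_values)


def average_rf(rf_values):
    """Mean RF distance over all pathways."""
    return sum(rf_values) / len(rf_values)


rf_values = [0.0, 0.0, 0.25, 0.0]  # HYPOTHETICAL per-pathway RF distances
print(pc_value(rf_values), average_rf(rf_values))
```

As a sanity check on the scale of the reported numbers, 75 exactly matching trees out of 109 pathways would give a PC of 100 × 75 / 109, which rounds to 68.81.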

Compared with Lin's and Jiang's measures, Resnik's measure produced better results on the pathway dataset, with larger PC values and smaller average RF distances. Across the four kinds of shared information content, Resnik's measure obtained PC values about 3% higher and average RF distances about 0.01 lower. Lin's measure also outperformed Jiang's on these two indicators. It is interesting to note that previous investigations by Lord [45] and Sevilla [46], which compared the performance of different semantic similarity measures including those proposed by Resnik, Lin, and Jiang and Conrath, both suggested that Resnik's measure was superior to the others. This is consistent with the experimental results of the present study.

Additionally, we give a biological interpretation of the clustering result of our method on the above-mentioned pathway, glutathione-glutaredoxin redox reactions, to show how the clustering tree agrees with the corresponding model tree. The clustering tree based on the Resnik measure with EISI is consistent with the model tree of the genes in this pathway; the clustering result is illustrated in Figure 5 (b). The six genes are clustered into three groups: the first contains three genes (GPX1, GPX2 and HYR1), the second two genes (GTT1 and GTT2), and the last the single gene GLR1. According to the annotation information in the SGD database, the six genes are functionally related to glutathione activity.

Specifically, GPX1, GPX2 and HYR1 tend to be involved in peroxidase activity, as they are annotated with glutathione peroxidase activity (GO:0004602), oxidoreductase activity (GO:0016491), peroxidase activity (GO:0004601) and phospholipid-hydroperoxide glutathione peroxidase activity (GO:0047066). By contrast, GTT1 and GTT2 are inclined to participate in transferase activity, as they are annotated with glutathione transferase activity (GO:0004364) and transferase activity (GO:0016741). The function of GLR1 is more similar to those of the first group, since it is annotated with glutathione-disulfide reductase activity (GO:0004362) and oxidoreductase activity (GO:0016491).

In terms of biological process, GPX1, GPX2 and HYR1 tend to be involved in oxidation-reduction, as they are annotated with oxidation-reduction process (GO:0055114), cellular response to oxidative stress (GO:0034599) and response to oxidative stress (GO:0006979), whereas GTT1 and GTT2 are involved in metabolic processes, as they are annotated with glutathione metabolic process (GO:0006749). GLR1 tends to participate in both metabolic and oxidation-reduction processes, since it is annotated with cell redox homeostasis (GO:0045454), cellular response to oxidative stress (GO:0034599), oxidation-reduction process (GO:0055114) and glutathione metabolic process (GO:0006749).

In terms of subcellular localization, the six genes are mainly localized in the cytoplasm or endoplasmic reticulum, being annotated with cytoplasm (GO:0005737) or endoplasmic reticulum (GO:0005783). In comparison with the other three genes, GPX1, GPX2 and HYR1 are mainly localized on the membrane outside the mitochondrion, as they are annotated with extrinsic component of mitochondrial outer membrane (GO:0005741) and peroxisomal matrix (GO:0005782), while GTT1 and GTT2 are localized inside the mitochondrion, as they are annotated with mitochondrion (GO:0005739). GLR1 acts in the mitochondrion and nucleus, as it is annotated with nucleus (GO:0005634) and mitochondrion (GO:0005739).

Thus, from a biological point of view, GPX1, GPX2 and HYR1 are more similar to each other than to the other three genes, GTT1 and GTT2 are also more similar to each other, while GLR1 tends to share biological properties of both groups and forms a class of its own.

It is worth noting that a recent study proposed DiShIn as a solution for the next generation of similarity measures that can fully explore the semantics in ontologies [40]. Like DiShIn, EISI aims to exploit more of the information relevant to the semantic similarity between two concepts in an ontology. The basic idea of DiShIn is that two disjunctive ancestors represent two distinct interpretations of a concept. EISI addresses the issue from another angle: it starts from the idea that an exclusively inherited common ancestor of two concepts provides distinct information to its descendants, some of which is inherited exclusively by one of the two concepts. We therefore believe that EISI has an ability similar to that of DiShIn to explore the semantics in ontologies.

4. CONCLUSIONS

This paper presented a novel semantic similarity measurement based on exclusively inherited shared information (EISI) to address the problem of multiple inheritance in calculating the similarity between two terms in GO. EISI quantifies the shared information content derived from the common ancestors that are exclusively inherited by either term; it rests on the observation that only the exclusively inherited common ancestors provide distinguishing information to their descendants. Our approach is effective and efficient: the semantic similarity between GO terms correlated well with experts' scores on the artificial dataset, and the similarity between genes was consistent with prior knowledge of the functional relationships among genes in the pathways of the SGD database. Moreover, our results agree with previous investigations, which concluded that Resnik's measure outperforms Lin's and Jiang's measures. Our method has O(n  (2  n) log n) time complexity, which makes it feasible for real-time application in large-scale investigations and a promising alternative to other methods based on the multiple inheritance of GO.

Acknowledgements

The authors thank the biologists for their enthusiastic help and support in constructing the artificial dataset. This work was supported in part by the National Natural Sciences Foundation of China grants 60675016 and 60633030, and fully by the Natural Sciences Foundation of Guangzhou Maritime Institute grants K31012B09.

Conflict of interest statement

The authors declare that they have no conflict of interest.

ACCEPTED MANUSCRIPT

Reference

T

[1] Z. Teng, M. Guo, X. Liu, Q. Dai, C. Wang, P. Xuan, Measuring gene functional similarity based on group-wise comparison of GO terms. Bioinformatics 29 (2013) 1424-1432.

RI P

[2] Taha K. GOtoGene: a method for determining the functional similarity among gene products. In: Proceedings of the Tenth Australasian Data Mining Conference-Volume; 2012. 43-51. [3] N. Nariai, E. D. Kolaczyk, S. Kasif, Probabilistic protein function prediction from heterogeneous genome-wide data. PLoS One 2 (2007) e337.

SC

[4] Y. Tao, L. Sam, J. Li, C. Friedman, Y. A. Lussier, Information theory applied to the sparse gene ontology annotation network to predict novel gene function. Bioinformatics 23 (2007) i529-i538.

NU

[5] A. Alexa, J. Rahnenführer, T. Lengauer, Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22 (2006) 1600-1607. [6] P. Khatri, S. Drăghici, Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 21 (2005) 3587-3595.

MA

[7] D.W. Huang, B.T. Sherman, Q. Tan, The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome biology 8 (2007) R183.

ED

[8] D. Yang, Y. Li, H. Xiao, L. Qing, Q. Zhang, M. Zhu, J. Ma, W. Yao, J. Wang, D. Wang, Z. Guo, B. Yang, Gaining confidence in biological interpretation of the microarray data: the functional consistence of the significant GO categories. Bioinformatics 24 (2008) 265-271. [9] S. Mathur, D. Dinakarpandian, Finding disease similarity based on implicit semantic similarity. Journal of biomedical informatics 45 (2012) 363-371.

PT

[10] A. Schlicker, T. Lengauer, M. Albrecht, Improving disease gene prioritization using the semantic similarity of Gene Ontology terms. Bioinformatics 26 (2010) i561-i567.

AC CE

[11] A. Schlicker , C. Huthmacher, F. Ramírez, T. Lengauer, M. Albrecht, Functional evaluation of domain–domain interactions and human protein interaction networks. Bioinformatics 23 (2007) 859-865. [12] H. Wang, H. Zheng, F. Browne, Integration of Gene Ontology-based similarities for supporting analysis of protein–protein interaction networks. Pattern Recognition Letters 31 (2010) 2073-2082. [13] E. Camon, M. Magrane, D. Barrell, V. Lee, E. Dimmer, J. Maslen, D. Binns, N. Harte, R. Lopez, R. Apweiler, The Gene Ontology annotation (GOA) database: sharing knowledge in Uniprot with Gene Ontology. Nucleic acids research 32 (2004) D262-D266. [14] N. R.C.G. Massjouni, T. M. Murali, VIRGO: computational prediction of gene functions. Nucleic acids research 34 (2006) W340-W344. [15] E. Zeng, C. Ding, G. Narasimhan, S. R. Holbrook, Estimating support for protein-protein interaction data with applications to function prediction. In: Proceedings of 2008 Computer Systems Bioinformatics (CSB) Conference; 2008. 73-84. [16] F. M. Couto, M. J. Silva, P. M. Coutinho, Implementation of a functional semantic similarity measure between gene-products. In: Tech Rep DI/FCUL TR 03-29: Department of Informatics, University of Lisbon; 2003. [17] A. Schlicker, M. Albrecht, FunSimMat: a comprehensive functional similarity database. Nucleic acids research 36 (2008) D434-D439. [18] Z. Du, L. Li, C. F. Chen, S. Y. Philip, J. Z. Wang, G-SESAME: web tools for GO-term-based gene similarity analysis and knowledge discovery. Nucleic acids research 37 (2009) W345-W349. [19] Y. Xu, M. Guo, W. Shi, X. Liu, C. Wang, A novel insight into Gene Ontology semantic similarity. Genomics 101 (2013) 368-375. [20] G. Yu, F. Li, Y. Qin, X. Bo, Y. Wu, S. Wang, GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26 (2010) 976-978.

[21] R. Rada, H. Mili, E. Bicknell, M. Blettner, Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics 19 (1989) 17-30.

[22] A. Nagar, H. Al-Mubaid, A new path length measure based on GO for gene similarity with evaluation using SGD pathways. In: Proceedings of the 21st International Symposium on Computer-Based Medical Systems, 2008. CBMS'08. (2008) 590-595.

[23] V. Pekar, S. Staab, Taxonomy learning: factoring the structure of a taxonomy into a semantic classification decision. In: Proceedings of the 19th international conference on Computational linguistics. (2002) 1-7.

[24] S. Jain, G. Bader, An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology. BMC Bioinformatics 11 (2010) 562.

[25] P. Resnik, Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence. (1995) 448-453.

[26] D. Lin, An information-theoretic definition of similarity. In: Proceedings of the 15th international conference on Machine Learning. (1998) 296-304.

[27] F. M. Couto, M. J. Silva, P. M. Coutinho, Semantic similarity over the gene ontology: family correlation and selecting disjunctive ancestors. In: Proceedings of the 14th ACM international conference on Information and knowledge management. (2005) 343-344.

[28] F. M. Couto, M. J. Silva, Disjunctive shared information between ontology concepts: application to Gene Ontology. Journal of Biomedical Semantics 2 (2011) 1-5.

[29] J. J. Jiang, D. W. Conrath, Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings of International Conference Research on Computational Linguistics. (1997) 19-33.

[30] J. Z. Wang, Z. Du, R. Payattakool, P. S. Yu, C. F. Chen, A new method to measure the semantic similarity of GO terms. Bioinformatics 23 (2007) 1274-1281.

[31] S. J. Bien, C. H. Park, H. J. Shim, W. Yang, J. Kim, J. H. Kim, Bi-directional semantic similarity for gene ontology to optimize biological and clinical analyses. Journal of the American Medical Informatics Association 19 (2012) 765-774.

[32] R. M. Othman, S. Deris, R. M. Illias, A genetic similarity algorithm for searching the Gene Ontology terms and annotating anonymous protein sequences. Journal of Biomedical Informatics 41 (2008) 65-81.

[33] X. Wu, E. Pang, K. Lin, Z. M. Pei, Improving the Measurement of Semantic Similarity between Gene Ontology Terms and Gene Products: Insights from an Edge- and IC-Based Hybrid Method. PLoS ONE 8 (2013) e66745.

[34] S. Benabderrahmane, M. Smail-Tabbone, O. Poch, A. Napoli, M. D. Devignes, IntelliGO: a new vector-based semantic similarity measure including annotation origin. BMC Bioinformatics 11 (2010) 588.

[35] N. Seco, T. Veale, J. Hayes, An intrinsic information content metric for semantic similarity in WordNet. In: Proceedings of the 16th European Conference on Artificial Intelligence. (2004) 1089-1090.

[36] P. Resnik, Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research 11 (1999) 95-130.

[37] F. M. Couto, M. J. Silva, P. M. Coutinho, Measuring semantic similarity between Gene Ontology terms. Data & Knowledge Engineering 61 (2007) 137-152.

[38] A. Schlicker, F. S. Domingues, J. Rahnenführer, T. Lengauer, A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics 7 (2006) 302.

[39] H. Yu, R. Jansen, G. Stolovitzky, M. Gerstein, Total ancestry measure: quantifying the similarity in tree-like classification, with genomic applications. Bioinformatics 23 (2007) 2163-2173.

[40] F. M. Couto, H. S. Pinto, The next generation of similarity measures that fully explore the semantics in biomedical ontologies. Journal of Bioinformatics and Computational Biology 11 (2013) 1-12.

[41] D. F. Robinson, L. R. Foulds, Comparison of phylogenetic trees. Mathematical Biosciences 53 (1981) 131-147.

[42] C. Pesquita, D. Faria, A. O. Falcao, P. Lord, F. M. Couto, Semantic similarity in biomedical ontologies. PLoS Computational Biology 5 (2009) e1000443.

[43] F. Azuaje, H. Wang, O. Bodenreider, Ontology-driven similarity approaches to supporting gene functional assessment. In: Proceedings of the ISMB'2005 SIG meeting on Bio-ontologies. (2005) 9-10.

[44] S. B. Zhang, S. Y. Zhou, J. G. He, J. H. Lai, Phylogeny inference based on spectral graph clustering. Journal of Computational Biology 18 (2011) 627-637.

[45] P. W. Lord, R. D. Stevens, A. Brass, C. A. Goble, Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 19 (2003) 1275-1283.

[46] J. L. Sevilla, V. Segura, A. Podhorski, E. Guruceaga, J. M. Mato, L. A. Martinez-Cruz, F. J. Corrales, A. Rubio, Correlation between gene expression and GO semantic similarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2 (2005) 330-338.

Graphical abstract

List of abbreviations

BP: Biological Process
MF: Molecular Function
CC: Cellular Component
DAG: Directed Acyclic Graph
GO: Gene Ontology
GraSM: Graph-based Similarity Measure
DiShIn: Disjunctive Shared Information
MICA: Most Informative Common Ancestor
DCA: Disjunctive Common Ancestor
DCAs: Disjunctive Common Ancestors
SGD: Saccharomyces Genome Database
EISI: Exclusively Inherited Shared Information
EICA: Exclusively Inherited Common Ancestor
EICAs: Exclusively Inherited Common Ancestors
RF: Robinson-Foulds
PC: Percentage of Correct pathway

Highlights

• A semantic similarity measurement between two GO terms is proposed.
• The similarity value is quantified by the information shared by the two terms.
• The measurement takes into account the multiple common ancestors that are inherited exclusively by either term.
• The algorithm is efficient, with a time complexity of O(n).
• Results on a real dataset agree with prior knowledge of biological pathways.
