OntoPIN: an ontology-annotated PPI database.

Interdiscip Sci Comput Life Sci (2013) 5: 187–195 DOI: 10.1007/s12539-013-0173-x

OntoPIN: An Ontology-Annotated PPI Database

Pietro Hiram Guzzi∗, Pierangelo Veltri, Mario Cannataro
(Bioinformatics Laboratory, Department of Medical and Surgical Sciences, Magna Graecia University of Catanzaro, 88100 Catanzaro, Italy)

Received 7 January 2013 / Revised 30 March 2013 / Accepted 12 June 2013

Abstract: Protein-protein interaction (PPI) data stored in publicly available databases are usually queried through simple interfaces that allow only key-based queries. A typical query on such databases uses protein identifiers and retrieves one or more proteins. Nevertheless, a lot of biological information is available, spread across different sources and encoded in different ontologies such as Gene Ontology. Integrating existing PPI databases with this biological information may result in richer querying interfaces and could subsequently enable the development of novel algorithms that exploit it. The OntoPIN project showed the effectiveness of a framework for the ontology-based management and querying of protein-protein interaction data. The OntoPIN framework first merges PPI data with annotations extracted from existing ontologies (e.g. Gene Ontology) and stores the annotated data in a database. Then, a semantic-based query interface enables users to query these data by using biological concepts. OntoPIN makes it possible to: (a) extend existing PPI databases by using ontologies, (b) perform key-based querying of annotated data, and (c) use a novel query interface based on semantic similarity among annotations.

Key words: semantic similarity, annotation, protein interaction databases.

∗ Corresponding author. E-mail: [email protected]

1 Introduction

After the introduction of methods and tools for determining the biochemical characteristics of proteins, such as their spatial conformation, researchers' interest has shifted to the roles that proteins play within a living organism. Proteins usually carry out their role by interacting with one another or with other macromolecules. An interaction usually involves contact between the surfaces of two or more proteins. Different experimental techniques have therefore been introduced to determine interactions among proteins, named protein-protein interactions (PPI) (Cannataro et al., 2010a). As a result, a large number of experimental datasets have been produced, prompting the introduction of computer science methods to manage, store and analyse PPI data (Ciriello et al., 2012). The whole set of protein interactions of a single species is also referred to as a Protein-to-protein Interaction Network (PIN). PINs are commonly modeled as undirected graphs (West, 2000) in which nodes represent proteins and edges represent interactions among proteins.

PPI data have been collected in many public databases, such as the Database of Interacting Proteins (DIP) (Salwinski et al., 2004) and the Molecular Interaction Database (MINT) (Licata et al., 2012). Usually PPI databases contain raw data, e.g. the identifiers of the interacting proteins, and some kind of annotation related to the reliability of the stored data, but these data are usually not annotated with information already available in other sources, such as Gene Ontology. PPI databases are often publicly available on the Internet and let the user retrieve data of interest through simple querying interfaces. Users can search by entering: (i) one or more protein identifiers, (ii) a protein sequence, or (iii) the name of an organism. Results may consist of, respectively, a list of proteins that interact directly with the seed protein or that are at distance k from the seed protein, or the list of all the interactions of an organism. It is often impossible to formulate even simple queries involving biological concepts, such as all the interactions related to glucose synthesis.

The OntoPIN project (Cannataro et al., 2009), conversely, demonstrated the effectiveness of using ontologies to annotate interactions, starting from the annotation of nodes, and of subsequently using these annotations for querying interaction data (Cannataro et al., 2010b and 2010c; Mina and Guzzi, 2012). The OntoPIN project is based on three main modules.

A framework able to extend PPI databases with annotations: at the bottom of the proposed software platform there is an annotation module able to extend an existing PPI database with annotations extracted from the Gene Ontology Annotation Database (GOA) (Camon et al., 2004). For each protein, three kinds of annotations are currently provided: biological process, cellular component and molecular function.

A system to annotate interactions starting from the annotations of interacting proteins: usually annotated databases contain annotations only for single proteins, not for interactions. Here we define the annotation of an interaction as the terms shared by the annotations of the interacting proteins: let A(P1) and A(P2) be the sets of annotations of the proteins P1 and P2; then the annotations of (P1, P2) are given by A(P1,2) = A(P1) ∩ A(P2).

A system for querying such data based on semantic similarity: two main approaches are proposed for querying data: (a) retrieving all the interactions annotated with one or more terms given as input, (b) retrieving all the interactions annotated with a term similar to one given as input.

This paper proposes a software platform for the annotation, retrieval and analysis of PPI data enriched with Gene Ontology annotations. Such a system is useful not only for the semantic search of data, but also for the analysis of PPI data. The analysis of PINs is usually done by using graph-based algorithms (Aittokallio and Schwikowski, 2006) and by associating graph properties (e.g. random graphs or scale-free networks) with biological properties of the modeled PIN (Erdős and Rényi, 1960). The availability of annotated data may enable the development of novel algorithms able to exploit such information.
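The undirected-graph model of a PIN and the seed-based retrieval mentioned in this section can be illustrated with a minimal sketch. It is not the implementation of any existing PPI database: the function name `neighbors_within`, the edge-list input and the protein labels are assumptions made for illustration, and the function returns every protein within distance k of the seed together with its distance.

```python
from collections import deque


def neighbors_within(edges, seed, k):
    """Return {protein: distance} for proteins at distance 1..k from seed.

    edges is a list of (protein_a, protein_b) pairs describing an
    undirected PIN; the representation is a hypothetical one.
    """
    # Build an adjacency map; each interaction is stored in both directions.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    # Breadth-first search, stopping the expansion at depth k.
    distance = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue
        for nxt in adjacency.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)

    del distance[seed]
    return distance


# Hypothetical chain of interactions P1 - P2 - P3 - P4.
toy_edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
direct_partners = neighbors_within(toy_edges, "P1", 1)
two_hop = neighbors_within(toy_edges, "P1", 2)
```

With k = 1 the function yields only the direct interactors of the seed; larger values of k widen the neighbourhood.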

2 Related work

2.1 Semantic similarity measures of proteins in OntoPIN

While sequence- or structure-based similarity of genes and proteins has been extensively investigated, similarity based on function presents a more complex scenario. In fact, while primary and tertiary structures can be compared in terms of the number of shared amino acids or in terms of spatial conformation, the comparison of functions requires a metric for comparing terms that are often expressed in natural language. The adoption of ontologies for managing annotations provides a means to compare entities on aspects that would otherwise not be comparable. For instance, if two gene products are annotated within the same schema, we can compare them by comparing the terms with which they are annotated (Guzzi et al., 2012; Pesquita et al., 2009).

The annotations of biological concepts are currently organized in simple taxonomies or in more complex ontologies, such as Gene Ontology. The use of ontologies makes it possible to compare annotations by analysing the ontology schema, so the problem of defining the semantic similarity of two terms can be reduced to an analysis of the underlying ontology. While the semantic similarity between two biomedical or biological concepts is not a trivial problem, the semantic similarity between terms that come from a common schema, e.g. a taxonomy, has been extensively investigated and can be computed efficiently. In the same way, if two biological concepts, e.g. proteins, are annotated with terms organized in an ontology, their semantic similarity can be determined from the semantic similarity of the annotating terms.

Several approaches are available to quantify semantic similarity between terms or annotated entities in an ontology represented as a directed acyclic graph, such as GO. The most common measures are those of Resnik (1995), Lin (1998) and Jiang and Conrath (1997). Resnik's similarity measure of two GO terms T1 and T2 is based on the Information Content (IC) of their most informative common ancestor (MICA):

$$ \mathrm{sim}_{Res}(T_1, T_2) = IC(MICA(T_1, T_2)) \tag{1} $$

Thus the calculation of the Resnik measure involves three main steps: (i) the determination of the common ancestors of the given terms, (ii) the calculation of the Information Content of these ancestors, and (iii) the selection of the most informative one. A drawback of Resnik's measure is that it considers only the common ancestor and does not take into account the distance between the compared terms and the shared ancestor. Lin's measure addresses this problem by considering the IC of both terms, yielding the following formula:

$$ \mathrm{sim}_{Lin}(T_1, T_2) = \frac{2 \cdot IC(MICA(T_1, T_2))}{IC(T_1) + IC(T_2)} \tag{2} $$

In a similar way, Jiang and Conrath's measure takes this distance into account by means of the following formula:

$$ \mathrm{sim}_{JC}(T_1, T_2) = \frac{1}{1 + IC(T_1) + IC(T_2) - 2 \cdot IC(MICA(T_1, T_2))} \tag{3} $$

Proteins and genes are annotated with sets of GO terms, so assessing the functional similarity between gene products requires comparing sets of terms rather than single terms. All the proposed approaches are based on the comparison of terms and on the combination of the results, i.e. the pairwise similarities of annotations calculated using an existing measure. The simplest way to measure the semantic similarity between two gene products is to calculate the pairwise semantic similarity among the terms that annotate them and then to combine these pairwise similarities by using a formula such as the average, the maximum or the sum. Other approaches are based on the representation of two gene products as the induced subgraph of annotations or as a point in a vector space induced by annotations (Popescu et al., 2006; Yu et al., 2010).

2.2 Using semantic similarity for searching interactions

In this work we use semantic similarity to guide the execution of queries, in order to build a system able to retrieve all the interactions that can be considered similar to a set of annotations given as input. In the scenario we envision, the user chooses a protein identifier P1 and an annotation A and receives as a result the set of interactions of P1 whose annotations are similar to A. Finding a similar interaction, as implemented in OntoPIN, comprises three steps:
(i) the user chooses an annotation term A;
(ii) the system compares A with all the stored interactions using Lin's measure;
(iii) the interaction reporting the maximum similarity is returned to the user.
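The three measures of Section 2.1 and the search steps above can be sketched on a toy ontology. This is an illustration under stated assumptions, not the OntoPIN code: the term names, the parent map `PARENTS`, the term probabilities `PROB` and the interaction labels are all hypothetical, Lin's measure is taken in its standard form with the factor 2, and an interaction is scored by the best Lin similarity among its annotating terms, which is only one of the possible combination strategies.

```python
import math

# Hypothetical toy ontology (a DAG): term -> set of parent terms.
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "protein_binding": {"binding"},
    "ion_binding": {"binding"},
    "zinc_binding": {"ion_binding"},
}
# Hypothetical probability of each term in an annotation corpus.
PROB = {"root": 1.0, "binding": 0.5, "catalysis": 0.5,
        "protein_binding": 0.25, "ion_binding": 0.25, "zinc_binding": 0.125}


def ic(term):
    """Information Content: rarer terms are more informative."""
    return math.log2(1.0 / PROB[term])


def ancestors(term):
    """The term itself plus all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def mica_ic(t1, t2):
    """IC of the most informative common ancestor of t1 and t2."""
    return max(ic(a) for a in ancestors(t1) & ancestors(t2))


def sim_resnik(t1, t2):
    return mica_ic(t1, t2)


def sim_lin(t1, t2):
    denominator = ic(t1) + ic(t2)
    # Both terms are the root: treat them as identical.
    return 2.0 * mica_ic(t1, t2) / denominator if denominator > 0 else 1.0


def sim_jc(t1, t2):
    return 1.0 / (1.0 + ic(t1) + ic(t2) - 2.0 * mica_ic(t1, t2))


def most_similar_interaction(query_term, annotated):
    """Return (interaction, score) with the highest Lin similarity."""
    def score(terms):
        return max((sim_lin(query_term, t) for t in terms), default=0.0)
    best = max(annotated, key=lambda pair: score(annotated[pair]))
    return best, score(annotated[best])


# Hypothetical annotated interactions.
ANNOTATED = {
    ("P1", "P2"): {"protein_binding"},
    ("P1", "P3"): {"ion_binding", "catalysis"},
}
best_pair, best_score = most_similar_interaction("zinc_binding", ANNOTATED)
```

In this toy setting the query term shares its most specific ancestor with `ion_binding`, so the interaction annotated with that term is ranked first.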


3 The OntoPIN architecture

OntoPIN is a framework for the annotation, retrieval and analysis of PPI data enriched with Gene Ontology knowledge. The main modules of OntoPIN, depicted in Fig. 1, are the following.

Annotation Module: it is responsible for searching interactions stored in an existing PPI database, for integrating them with annotations available in existing knowledge bases and finally for populating the annotated PPI database.

Annotated PPI Database: it stores the annotated protein interaction data.

Querying Module: it receives the queries from the user, retrieves the corresponding interactions and returns them as results.

[Fig. 1 The architecture for annotating PPI databases.]

3.1 The annotation module

The annotation of a PIN consists of three main phases, as depicted in Fig. 2: (i) retrieval of PPI data, performed by the Data Extraction Module, (ii) retrieval of existing annotations, performed by the Metadata Extraction Module, (iii) generation of annotated interactions and storage in the annotated PPI database. Initially, the system queries the existing interaction database (PPI DB) and retrieves the data about interactions. Then the protein identifiers are used to find related annotations. For instance, the GOA is queried by using the Uniprot identifiers. Finally, PPI data and annotations are merged and stored in the annotated database (AnnotatedDB).

[Fig. 2 The annotation of PPI data.]

In particular, the first version of the proposed annotated database contains the following annotations extracted from GOA: (i) Gene Ontology Cellular Component, (ii) Gene Ontology Biological Process, (iii) Gene Ontology Molecular Function.

3.2 The annotated PPI database

Existing PPI databases store protein interactions as a list of binary interactions (P1, P2), where the pair (P1, P2) identifies the interacting proteins. It should be noted that the lack of commonly accepted identifiers for interactions may make it difficult to define how an interaction is annotated (e.g. with the kind of interaction, with the direction of the biochemical reaction, etc.). Usually such PPI databases do not contain annotations about proteins (the nodes Pi) or about interactions (the pairs (Pi, Pj)). The availability of protein identifiers makes it possible to retrieve further information, such as the protein sequence.

The resulting annotated PPI database stores both protein interaction data and annotations. It contains, for each protein, the protein identifier and the main annotations extracted from Gene Ontology, such as Biological process, Molecular function and Cellular component. When available, the binary interactions (i.e. the edges of the PPI network described by the PPI database) are annotated using information extracted from other public ontologies or knowledge bases.

First of all, we need to define the annotation of an interaction (annotated interaction). The simplest way to define an annotated interaction is to consider the annotations of both interactors and derive the annotation of the interaction from them, taking the terms shared by the interactors. Let us consider, for example, proteins P39768 and Q9VAU9 of Drosophila melanogaster, which are known to be interacting partners, and their annotations reported in GOA, as summarised in Table 1 (for simplicity, only a subset of the annotations of P39768 is reported). The related interaction (P39768, Q9VAU9) is annotated with the terms shared by these proteins, i.e. the intersection of the annotations of the two interactors.
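The intersection-based annotation described above can be sketched in a few lines. The function name `annotate_interactions` and the dictionary-of-sets representation are assumptions made for illustration and do not reflect the actual OntoPIN storage model; the GO terms are the ones listed for the two proteins in Table 1.

```python
def annotate_interactions(interactions, protein_annotations):
    """Annotate each binary interaction with the terms shared by its interactors.

    interactions: iterable of (protein_a, protein_b) pairs.
    protein_annotations: dict mapping a protein identifier to a set of GO terms.
    A protein without known annotations contributes an empty set.
    """
    annotated = {}
    for p1, p2 in interactions:
        terms_1 = protein_annotations.get(p1, set())
        terms_2 = protein_annotations.get(p2, set())
        annotated[(p1, p2)] = terms_1 & terms_2  # A(P1,2) = A(P1) ∩ A(P2)
    return annotated


# Annotations as reported in Table 1 (only a subset for P39768).
protein_annotations = {
    "P39768": {"GO:0005515", "GO:0008270", "GO:0046872"},
    "Q9VAU9": {"GO:0005515", "GO:0008270", "GO:0046872"},
}
annotated_db = annotate_interactions([("P39768", "Q9VAU9")], protein_annotations)
```

Using sets keeps the intersection explicit and makes an interaction between proteins with no shared terms end up with an empty annotation.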

Table 1 Annotation of the interaction (P39768, Q9VAU9)

Term                Annotations
P39768              GO:0005515, GO:0008270, GO:0046872, ...
Q9VAU9              GO:0005515, GO:0008270, GO:0046872
(P39768, Q9VAU9)    GO:0005515, GO:0008270, GO:0046872

3.3 The querying module

The querying interface is a key component of OntoPIN because it defines the granularity with which queries may be formulated. While powerful queries may be formulated by using high-level ontology query languages, these require training and are therefore not user-friendly. Conversely, in our opinion, a query interface should hide the main details of the underlying data model and query languages, so that queries can be formulated in a simple way. For instance, the user has to be able to formulate simple queries such as: find all the interactions in which at least one interactor is involved in a specified process and that take place in the nucleus. Moreover, the user should not have to deal with syntactical problems, so queries should be formulated visually. Nor should the user have to remember all the possible annotations; he/she should be able to retrieve and browse them.

We defined two possible strategies for querying: (a) searching for interactions starting from a protein identifier or an annotation (Key-Based Querying), (b) searching for interactions that are annotated with a term similar to one given as input (Approximate Querying). In the first case OntoPIN searches for interactions annotated with one or more terms chosen by the user, while in the second case the system searches for interactions annotated with the terms most similar to the input terms.

Following the annotation scheme described above, we designed a query module that receives as input a conjunctive list of annotations, A = {T1, T2, ..., Tn}, and returns all the interactions that are annotated with all the terms contained in A. For instance, let us consider the proteins P1, P2 and P3, their annotations, respectively A(P1) = {T1, T2, T3}, A(P2) = {T1, T2, T4} and A(P3) = {T7}, and the interactions (P1, P2) and (P2, P3). As described previously, the interaction (P1, P2) is annotated with the terms T1 and T2, while the interaction (P2, P3), whose interactors share no term, has an empty annotation. If the user queries the system by using the terms T1, T2, the system returns the interaction (P1, P2).

3.4 Key based querying

The simplest way to query the database is to insert a protein identifier (e.g. a Uniprot identifier). OntoPIN retrieves all the binary interactions related to the protein, together with the related annotations. Currently these annotations are grouped by Gene Ontology taxonomy. For instance, by inserting the protein identifier P39768 the system retrieves the information summarized in Table 2.

Table 2 Interacting proteins of P39768

Query Protein    Interacting Partner
P39768           Q9VAU9
P39768           Q8T3Y0
P39768           Q01642
P39768           P08175

The second way to query the database is to specify a protein identifier, e.g. P39768, and an annotation, e.g. GO:0005515 (the system presents a list of all the stored annotations, so the user can simply choose one from this list). The system retrieves all the binary interactions in which the protein participates and that are annotated with the chosen term. Table 3 summarizes the results. This case can easily be extended by inserting more annotations to refine the query results.

Table 3 Interactions of protein P39768 annotated with GO:0005515

P39768    Q9VAU9

The third way to query the database is to specify a list of annotations, e.g. GO:0030528 and GO:0043565. The system retrieves all the binary interactions that are annotated with the chosen term(s). Table 4 summarizes these results.

Table 4 Interactions annotated with A and B (here GO:0030528 and GO:0043565)

P09089    P42003
P09089    O96660
P09089    Q9XZS4
Q9VS38    P09089
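The three key-based query modes (by protein, by protein and term, by a conjunctive list of terms) can be sketched with an in-memory SQLite database. The schema, the table and function names, and the symbolic identifiers P1 to P4 and T1, T2 are assumptions made for illustration; the storage schema actually used by OntoPIN is not described here.

```python
import sqlite3

# Hypothetical schema: one row per interaction, one row per annotating term.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE interaction (id INTEGER PRIMARY KEY, p1 TEXT, p2 TEXT);
CREATE TABLE interaction_annotation (interaction_id INTEGER, go_term TEXT);
""")
conn.executemany("INSERT INTO interaction VALUES (?, ?, ?)",
                 [(1, "P1", "P2"), (2, "P2", "P3"), (3, "P1", "P4")])
# Interaction 2 has no shared terms, so it has no annotation rows.
conn.executemany("INSERT INTO interaction_annotation VALUES (?, ?)",
                 [(1, "T1"), (1, "T2"), (3, "T1")])


def by_protein(protein):
    """All binary interactions in which the protein participates."""
    sql = "SELECT p1, p2 FROM interaction WHERE p1 = ? OR p2 = ? ORDER BY id"
    return conn.execute(sql, (protein, protein)).fetchall()


def by_terms(terms, protein=None):
    """Interactions annotated with ALL the given terms (conjunctive query),
    optionally restricted to those involving a given protein."""
    terms = sorted(set(terms))
    placeholders = ",".join("?" * len(terms))
    sql = ("SELECT i.p1, i.p2 FROM interaction i "
           "JOIN interaction_annotation a ON a.interaction_id = i.id "
           f"WHERE a.go_term IN ({placeholders}) ")
    params = list(terms)
    if protein is not None:
        sql += "AND (i.p1 = ? OR i.p2 = ?) "
        params += [protein, protein]
    # An interaction matches only if every requested term was found for it.
    sql += "GROUP BY i.id HAVING COUNT(DISTINCT a.go_term) = ? ORDER BY i.id"
    params.append(len(terms))
    return conn.execute(sql, params).fetchall()


partners_of_p1 = by_protein("P1")
p4_with_t1 = by_terms(["T1"], protein="P4")
with_t1_and_t2 = by_terms(["T1", "T2"])
```

The `GROUP BY ... HAVING COUNT(DISTINCT ...)` clause is one common way to express a conjunctive list of annotations in SQL; other designs, such as one join per term, would work as well.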

OntoPIN provides a graphical user interface that makes the query building phase easy and provides simple graph manipulation operations, as depicted in Fig. 3.

[Fig. 3 Visualization of results in OntoPIN. Panel labels: the list of binary interactions; simple graph manipulation operations; the system automatically builds the related interaction graph.]

3.5 Approximate querying

This section explains how the annotated database is queried by using the semantic similarity between a set of GO labels chosen as input by the user and the interactions stored in the annotated PPI database. Given a set of GO terms, the database retrieves all the interactions that are annotated with a set of labels similar to the query terms. As previously discussed, OntoPIN implements the following steps:
(i) the user chooses an annotation term A and a similarity threshold σ;
(ii) the system compares A with all the stored interactions using Lin's measure;
(iii) the interactions reporting a similarity score greater than σ are returned to the user.

The first release of OntoPIN stores the interactions of Homo sapiens, extracted from the IntAct database. It currently stores 56,924 interactions and 11,338 proteins.

3.6 Example on a human subset of OntoPIN

We show the annotation process and the query possibilities of OntoPIN through an example: a set of proteins and interactions related to a well-known mechanism of regulation and transformation of sugar in the human body. In particular, we consider a subset of OntoPIN that comprises the proteins TPI1, GAPDH, PGK1, PGK2, PGAM1, PGAM2, PGAM4, ENO1, ENO2, ENO3, and PKM2. These proteins constitute a functional module related to glycolysis metabolism, as reported in the KEGG database. Each protein of this dataset is annotated with the GO terms listed in Table 5. OntoPIN stores the interactions and the related annotations as summarized in Table 6. In order to retrieve interactions related to glycolysis and, for instance, to catalytic activity (GO:0003824), a researcher can extract all the human proteins that are annotated with this term. Using, for instance, the GOA database, the researcher finds 38 proteins and more than 200 interactions, and then has to filter these proteins to retrieve the related interactions. Using OntoPIN, conversely, the number of retrieved interactions decreases to 5, and the user only has to find the desired interaction(s) among these.

Table 5 GO terms for proteins of dataset 1

Protein    GO terms
TPI1       GO:0008152 GO:0004807 GO:0016853 GO:0003824
GAPDH      GO:0006094 GO:0006006 GO:0005975 GO:0055114 GO:0008152 GO:0006096 GO:0051402 GO:0050821 GO:0035606 GO:0006915 GO:0005515 GO:0016620 GO:0004365 GO:0005488 GO:0016740 GO:0051287 GO:0016491 GO:0008943 GO:0035605 GO:0003824 GO:0005829 GO:0005634 GO:0005829 GO:0005737 GO:0016020
PGK1       GO:0006096 GO:0016310 GO:0006094 GO:0006006 GO:0005975 GO:0006096 GO:0004618 GO:0016740 GO:0004618 GO:0000166 GO:0005524 GO:0005737 GO:0005829
PGK2       GO:0006096 GO:0016310 GO:0004618 GO:0005524 GO:0016740 GO:0005737 GO:0005829
PGAM1      GO:0006096 GO:0008152 GO:0046538 GO:0003824 GO:0016868 GO:0016853
PGAM2      GO:0006096 GO:0008152 GO:0016853 GO:0003824 GO:0016868
PGAM4      GO:0006096 GO:0008152 GO:0004619 GO:0003824 GO:0016868 GO:0016853 GO:0004083 GO:0004082 GO:0016787 GO:0005575
ENO1       GO:0006096 GO:0000287 GO:0004634 GO:0000287 GO:0016829 GO:0000015
ENO2       GO:0006096 GO:0042493 GO:0014070 GO:0006094 GO:0005975 GO:0032355 GO:0004634 GO:0000287 GO:0042803 GO:0046872 GO:0046982 GO:0016829 GO:0005622 GO:0000015 GO:0005829 GO:0043204 GO:0005886 GO:0005737 GO:0043025 GO:0001917 GO:0005626 GO:0005625 GO:0019717 GO:0016020
ENO3       GO:0006096 GO:0004634 GO:0000287 GO:0016829 GO:0000015
PKM2       GO:0006096 GO:0004743 GO:0003824 GO:0004743 GO:0030955 GO:0000287

Table 6 Annotation of the interactions of dataset 1

Proteins       Annotations
TPI1, PGK2     GO:0008152
TPI1, PGAM1    GO:0003824 GO:0016853 GO:0008152
TPI1, PGAM2    GO:0008152 GO:0016853 GO:0003824
TPI1, PGAM4    GO:0008152 GO:0016853 GO:0003824
TPI1, PKM2     GO:0003824

4 Experimental evaluation

To assess the effectiveness of the OntoPIN approach, it is essential to have a validated Gold Standard Dataset (GSTD), in order to evaluate the precision and recall of the querying systems on the basis of the relevant and non-relevant interactions extracted. Unfortunately, to the best of our knowledge, no Gold Standard Dataset is available for such experiments, so we defined a test dataset by assembling human protein interaction data: we considered as GSTD a set of 263 proteins and 1891 interactions. Then we defined some measures to test the improvement obtained by using ontologies during search.

In information retrieval, the classical benchmark measures are precision and recall, which are defined in terms of a set of retrieved documents (e.g. the list of documents produced by a web search engine for a query) and a set of relevant documents (e.g. the list of all documents on the Internet that are relevant for a certain topic). For example, for a text search on a set of documents, precision is the number of correct results divided by the number of all returned results. Similarly, recall is defined as the number of correct results divided by the number of results that should have been returned. In our context we defined such measures in terms of retrieved interactions (RetInt) (e.g. the list of all the interactions retrieved from a query) and relevant interactions (RelInt) (e.g. the list of all the interactions produced by the search engine that are annotated with a certain GO term). Formally, we define:

$$ Precision = \frac{|RelInt \cap RetInt|}{|RetInt|} \tag{4} $$

and

$$ Recall = \frac{|RelInt \cap RetInt|}{|RelInt|} \tag{5} $$

Finally, we defined the F-measure as the traditional harmonic mean of the previous two:

$$ F_{score} = \frac{2 \cdot precision \cdot recall}{precision + recall} \tag{6} $$
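The three measures defined above can be computed directly from the sets of retrieved and relevant interactions. The sketch below is only an illustration: the function name `precision_recall_f` and the interaction labels i1 to i5 are hypothetical, and empty sets are mapped to a score of zero by convention.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-score) for two collections of interactions."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # |RelInt ∩ RetInt|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical query: 3 relevant interactions, 5 retrieved, all 3 relevant found.
relevant = {"i1", "i2", "i3"}
retrieved = {"i1", "i2", "i3", "i4", "i5"}
p, r, f = precision_recall_f(retrieved, relevant)
# p = 0.6, r = 1.0, f is approximately 0.75
```

A query that returns exactly the relevant interactions and nothing else scores 1 on all three measures.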

Let us consider the following example to better explain the calculation of precision and recall: a dataset of proteins Pj, their interactions, and a set of annotations for both proteins (as contained in other databases) and interactions (as contained in OntoPIN), as summarized in Table 7. Let us suppose that a researcher needs to find all the interactions that are related to the function coded by the term T2. In a traditional database, the researcher inserts the term T2 and retrieves all the interactions for which at least one protein has this term among its annotations.

Table 7 Example

Proteins    Annotations
P1          T1, T2, T3, T10, T11
P2          T2, T3, T5
P3          T1, T2, T7
P4          T1, T2, T4, T5
P5          T3, T8, T9
P6          T10, T11, T12
P7          T2

(P1, P2)    T2, T3
(P3, P4)    T1, T2
(P1, P3)    T1, T2
(P1, P5)    T3
(P1, P6)    T10, T11
(P1, P7)    T2

The researcher then finds the following interactions: (P1, P2), (P3, P4), (P1, P3), (P1, P5), (P1, P6). The dataset contains 3 relevant interactions, and the query retrieves 3 relevant interactions out of a total of 5 retrieved interactions. Following the previous definitions, we can calculate the precision as the fraction of retrieved interactions that are relevant to the search:

$$ Precision = \frac{3}{5} = 0.6 \tag{7} $$

Similarly, recall is the fraction of the interactions relevant to the query that are successfully retrieved:

$$ Recall = \frac{3}{3} = 1 \tag{8} $$

Finally, the F-score is:

$$ F_{score} = \frac{2 \cdot 0.6 \cdot 1}{0.6 + 1} = 0.75 \tag{9} $$

Alternatively, let us consider the querying strategy of OntoPIN. By inserting T2, OntoPIN returns (P1, P2), (P3, P4), (P1, P3), yielding the following values:

$$ Precision = \frac{3}{3} = 1 \tag{10} $$

$$ Recall = \frac{3}{3} = 1 \tag{11} $$

$$ F_{score} = \frac{2 \cdot 1 \cdot 1}{1 + 1} = 1 \tag{12} $$

4.1 Benchmark of key based querying

We compared OntoPIN with existing databases on the dataset defined above, using a set of queries with an increasing number of GO terms. In particular, we defined a set of GO terms belonging to three ontologies, GODataset hereafter. Then we defined three classes of GO terms, Class1, Class2 and Class3. Each class contains ten groups of GO terms randomly extracted from GODataset, with the number of GO terms per group increasing from class to class: groups of Class1 contain three GO terms, groups of Class2 contain five GO terms and groups of Class3 contain seven GO terms. For each class we queried OntoPIN, searching for interactions annotated with these terms. Queries in OntoPIN have the following structure:

    Select "Interactions" From OntoPIN where Interaction.annotation contains=GOTerm AND Interaction.Species="Human"

Then we queried MINT and IntAct, searching for proteins annotated with the same terms. Finally, we evaluated precision and recall for each class as the average of the precision and recall of all the groups belonging to that class. Table 8 summarizes these results.

4.2 Benchmark of approximate querying

Finally, we benchmarked OntoPIN against existing databases, considering the dataset defined above and a set of GO terms belonging to three ontologies, GODataset hereafter. Then we defined three classes of GO terms, Class1, Class2 and Class3. Each class contains ten groups of GO terms randomly extracted from GODataset, with the number of GO terms per group increasing from class to class: groups of Class1 contain three GO terms, groups of Class2 contain five GO terms and groups of Class3 contain seven GO terms. For each class we queried OntoPIN, searching for interactions whose annotations are similar to these terms. Queries in OntoPIN have the following structure:

    Select "Interactions" From OntoPIN where Interaction.annotation isSimilar to= GOTerm AND Interaction.Species="Human"

Finally, we evaluated precision and recall for each class as the average of the precision and recall of all the groups belonging to that class. Table 9 summarizes these results.

Table 8 Benchmark of OntoPIN

            One term                   Two terms                  Three terms
Database    Prec    Rec     F-Score    Prec    Rec     F-Score    Prec    Rec     F-Score
OntoPIN     0.75    0.62    0.8        0.72    0.65    0.68       0.6     0.68    0.6375
MINT        0.80    0.60    0.68       0.79    0.5     0.61       0.65    0.5     0.565217391
Intact      0.80    0.55    0.68       0.75    0.51    0.6        0.6     0.51    0.551351351

Table 9 Benchmark of OntoPIN

            One term                   Two terms                  Three terms
Database    Prec    Rec     F-Score    Prec    Rec     F-Score    Prec    Rec     F-Score
OntoPIN     0.65    0.62    0.63       0.66    0.64    0.68       0.6     0.68    0.63

5 Conclusions

As discussed before, a lot of biological information about interactions is currently spread across different databases or organized through ontologies. Nevertheless, such information is often not used for querying interaction databases or for analyzing interaction data. Existing databases, in fact, have query interfaces with poor semantics. For instance, they enable the retrieval of one or more proteins that interact with a query protein, but they do not allow proteins or interactions to be selected on the basis of their functions or of a similar function. Moreover, existing algorithms usually require interactions or proteins to be manually integrated with semantic information. The paper presented OntoPIN, a system to annotate PPI databases with biological information extracted from Gene Ontology, together with a powerful query interface for the resulting annotated PPI database. The main contribution of OntoPIN is to enable the semantic search of interactions, either by exactly matching terms or by similar terms.

References

[1] Aittokallio, T., Schwikowski, B. 2006. Graph-based methods for analyzing networks in cell biology. Brief Bioinform 7, 243-255.
[2] Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., Binns, D., Harte, N., Lopez, R., Apweiler, R. 2004. The gene ontology annotation (GOA) database: Sharing knowledge in uniprot with gene ontology. Nucl Acid Res 32, D262-D266.
[3] Cannataro, M., Guzzi, P.H., Veltri, P. 2009. Using ontologies for annotating and retrieving protein-protein interactions data. In: Proceedings of the 22nd IEEE International Symposium on Computer-Based Medical Systems, Albuquerque, USA, 1-5.
[4] Cannataro, M., Guzzi, P.H., Veltri, P. 2010a. Protein-to-protein interactions: Technologies, databases, and algorithms. ACM Comput Surv 43, 1-42.
[5] Cannataro, M., Guzzi, P.H., Veltri, P. 2010b. Impreco: Distributed prediction of protein complexes. Future Generation Comp Syst 26, 434-440.
[6] Cannataro, M., Guzzi, P.H., Veltri, P. 2010c. Using ontologies for querying and analyzing protein-protein interaction data. Procedia CS 1, 997-1004.
[7] Ciriello, G., Mina, M., Guzzi, P.H., Cannataro, M., Guerra, C. 2012. Alignnemo: A local network alignment method to integrate homology and topology. PLoS ONE 7, e38107.
[8] Erdős, P., Rényi, A. 1960. On the evolution of random graphs. Publ Math Inst Hung Acad Sci 5, 17-61.
[9] Guzzi, P.H., Mina, M., Guerra, C., Cannataro, M. 2012. Semantic similarity analysis of protein data: assessment with biological features and issues. Brief Bioinform 13, 569-585.
[10] Jiang, J.J., Conrath, D.W. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings of the International Conference Research on Computational Linguistics, Taiwan, 23-43.
[11] Licata, L., Briganti, L., Peluso, D., Perfetto, L., Iannuccelli, M., Galeota, E., Sacco, F., Palma, A., Nardozza, A.P.P., Santonico, E., Castagnoli, L., Cesareni, G. 2012. MINT, the molecular interaction database: 2012 update. Nucl Acid Res 40, D857-D861.
[12] Lin, D. 1998. An information-theoretic definition of similarity. In: Shavlik, J.W. (Editor), ICML, Morgan Kaufmann, San Francisco, 296-304.
[13] Mina, M., Guzzi, P.H. 2012. Alignmcl: Comparative analysis of protein interaction networks through markov clustering. In: Proceedings of IEEE BIBM Workshops, Philadelphia, USA, 174-181.
[14] Pesquita, C., Faria, D., Falcão, A.O., Lord, P., Couto, F.M. 2009. Semantic similarity in biomedical ontologies. PLoS Comput Biol 5, DOI: 10.1371/journal.pcbi.1000443.
[15] Popescu, M., Keller, J.M., Mitchell, J.A. 2006. Fuzzy measures on the gene ontology for gene product similarity. IEEE/ACM Trans Comput Biol Bioinformatics 3, 263-274.
[16] Resnik, P. 1995. Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montréal, 448-453.
[17] Salwinski, L., Miller, C.S., Smith, A.J., Pettit, F.K., Bowie, J.U., Eisenberg, D. 2004. The database of interacting proteins: 2004 update. Nucl Acid Res 32, D449-D451.
[18] West, D.B. 2000. Introduction to Graph Theory. 2nd Edition, Prentice Hall, NY.
[19] Yu, G., Li, F., Qin, Y., Bo, X., Wu, Y., Wang, S. 2010. GOSemSim: An R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26, 976-978.
