JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 21, Number 11, 2014, pp. 809–822
© Mary Ann Liebert, Inc.
DOI: 10.1089/cmb.2014.0181

Evaluating the Significance of Protein Functional Similarity Based on Gene Ontology

BOGUMIL M. KONOPKA,1 TOMASZ GOLDA,1 and MALGORZATA KOTULSKA1

1 Institute of Biomedical Engineering and Instrumentation, Wroclaw University of Technology, Wroclaw, Poland.

ABSTRACT

Gene ontology is among the most successful ontologies in the biomedical domain. It is used to describe, unambiguously, protein molecular functions, cellular localizations, and the processes in which proteins participate. The hierarchical structure of gene ontology allows protein functional similarity to be quantified by applying algorithms that calculate semantic similarities. The scores, however, are meaningless without a given context. Here, we propose how to evaluate the significance of protein function semantic similarity scores by comparing them to reference distributions calculated for randomly chosen proteins. In the study, thresholds for significant functional semantic similarity were estimated in four representative annotation corpora. We also show that score significance is influenced by the number and specificity of the gene ontology terms annotated to the compared proteins. While proteins with a greater number of terms tend to yield higher similarity scores, proteins with more specific terms produce lower scores. The estimated significance thresholds were validated using protein sequence–function and structure–function relationships. Taking into account the term number and term specificity improves the distinction between significant and insignificant semantic similarity comparisons.

Key words: gene ontology, protein function, semantic similarity.

1. INTRODUCTION

Rapid growth of reliable biological databases, which are freely available to the scientific community, is unquestionable proof of the impressive progress in the field of biomedical research, especially in high-throughput sequencing technologies. Optimistic estimates state that the human interactome will be solved within the next decade (Nebel, 2012). However, if the available experimental data are to be useful, they need careful annotation and further processing. Computer science has provided the ontology framework to support processing and analyzing large amounts of data. Ontologies allow for modeling and describing any domain of interest in a unified and standardized way (Gruber, 1995). One of the most often used biomedical ontologies is the ontology of genes and gene products, the Gene Ontology (GO) (Ashburner et al., 2000). GO is used to describe three different aspects of the proteins that are coded by genes: molecular function, which standardizes the naming of molecular actions performed by proteins; biological process, which groups terms describing the biological processes in which proteins participate; and cellular component, with terms that can be used to describe the location within the cell where the protein is active.



Those three vocabularies form separate subontologies; each is a directed acyclic graph with terms interconnected mainly by is_a and part_of relations. The most general terms form the upper part of GO, and term specificity increases with depth in the ontology. GO is widely used in analyzing and interpreting the outcome of high-throughput genomic or proteomic experiments, for example, gene expression microarray experiments (Haugen et al., 2010; Gruca et al., 2011; Warita et al., 2012). Huang et al. (2009) provided a comprehensive review of methods used for the so-called GO enrichment analysis. GO annotations have also been used in studies that investigated protein sequence, structure, and function relationships (Hvidsten et al., 2009; Pascual-Garcia et al., 2010). In Konopka et al. (2012), we applied protein GO terms in a protein model quality assessment program.

The GO term hierarchy, which represents relations between terms, allows for calculating similarities between term semantics. This supports automated inference based on annotated data and allows for the quantification of concepts that would be hard to quantify otherwise. Functional similarity of proteins is an example of such a concept. A number of approaches for calculating the semantic similarity (SemSim) of protein annotations have been proposed. They can be organized into algorithms based on term annotation frequency (Lord et al., 2003; Schlicker et al., 2006), ontology structure (Pekar and Staab, 2002; Wu et al., 2005, 2006), and the vector space model (Chabalier et al., 2007; Benabderrahmane et al., 2010). There are also a number of hybrid approaches (Pesquita et al., 2007; Wang et al., 2007; Othman et al., 2008). Pesquita et al. (2009) provided a comprehensive review of those methods.

Similarity between proteins in the aspects covered by GO, that is, function, process, and location, is hard to quantify without the formal description supplied by the ontology. For instance, there are no experiments that directly assess the functional similarity of proteins. Thus, there is no gold standard for the comparison and assessment of algorithms calculating SemSim. The resemblance of proteins in the GO-described aspects can be estimated indirectly, based on other protein features. Sequence similarity has been used extensively as a benchmark measure (Lord et al., 2003; Xu et al., 2008; Pesquita et al., 2008; del Pozo et al., 2008). It was established that sequence similarity is more strongly related to molecular function than to biological process and cellular component (Lord et al., 2003). On the other hand, biological process and cellular component similarities are more strongly related to protein–protein interactions (PPI) (Wu et al., 2006; Xu et al., 2008). The assumption that underlies the use of PPI in the assessment of SemSim is that proteins that contribute to the same biological process are more likely to interact. Several other sources of information have also been used to evaluate the performance of SemSim scores, for example, gene expression correlations (Xu et al., 2008) and Pfam or Enzyme Commission annotations (Alvarez et al., 2011).

However, regardless of how well an algorithm correlates protein GO-based SemSim with actual biological similarities, a calculated similarity score is meaningless without a biological context. In this work, we propose a method to evaluate the significance of SemSim scores.
In this approach, the significance of a SemSim score is estimated in relation to a reference distribution of semantic similarities calculated for a large number of randomly chosen protein pairs. We evaluate the hypothesis that a SemSim score at the 0.95 quantile of such a distribution is not only significant from the statistical point of view but can also be treated as a biologically meaningful threshold.

2. METHODS

2.1. Data sets

Four data sets of protein GO term annotations were compiled for the study: a multispecies set of proteins deposited in the Protein Data Bank (PDB) (Berman et al., 2000), and three organism-specific sets: human (Homo sapiens), ecoli (Escherichia coli strain K12), and athaliana (Arabidopsis thaliana). Protein sequence redundancy was controlled. The PDB, human, and athaliana annotation files were downloaded from ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/ (PDB as of March 5, 2012; human and athaliana as of April 17, 2012). The ecoli annotations were downloaded from www.geneontology.org/GO.downloads.annotations.shtml as of April 17, 2012. The human and athaliana annotation projects originally used protein Uniprot IDs. PDB and ecoli protein IDs, which were not originally Uniprot IDs, were mapped to unique Uniprot IDs. The respective protein amino acid sequences were downloaded from uniprot.org, and the sequence redundancy of the sets was reduced with CD-HIT (Li and Godzik, 2006).


We followed the three-step procedure suggested by the authors of the application (Li, 2012). First, the redundancy was limited from 100% to 80%, then from 80% to 60%, and finally from 60% to 30%. In the analysis, we used the 100%, 60%, and 30% redundancy sets. We focused only on the protein function–structure relation; hence, only annotations of molecular function (MF) terms were considered in the study. In terms of evidence codes, all available annotations were used (including those inferred from electronic annotation).
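For concreteness, the stepwise reduction might look like the following sketch. The file names, the word sizes, and the use of the psi-cd-hit wrapper for the final 30% step are assumptions on our part; the paper only names the identity thresholds.

```python
import subprocess

# Hypothetical command sequence for the stepwise redundancy reduction (100% -> 80% -> 60% -> 30%).
# cd-hit itself cannot cluster below a 40% identity threshold, so the last step is assumed to use
# the psi-cd-hit wrapper; exact options used by the authors are not given in the paper.
steps = [
    ["cd-hit",        "-i", "pdb100.fasta", "-o", "pdb80.fasta", "-c", "0.8", "-n", "5"],
    ["cd-hit",        "-i", "pdb80.fasta",  "-o", "pdb60.fasta", "-c", "0.6", "-n", "4"],
    ["psi-cd-hit.pl", "-i", "pdb60.fasta",  "-o", "pdb30.fasta", "-c", "0.3"],
]
for cmd in steps:
    subprocess.run(cmd, check=True)  # run each reduction step in sequence
```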

2.2. Protein GO annotation set parameters

Every protein in our data set was annotated with a set of GO terms. We investigated the influence of those annotation set parameters on the value of SemSim. These parameters were (1) annotation set size (#GO), the number of unique GO terms annotated to the protein, and (2) annotation set specificity (SPEC), the GO hierarchy depth of the most specific term; the specificity of a single GO term was its depth in the GO hierarchy.
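As a rough illustration of how these two parameters can be computed, the sketch below assumes a parent mapping for the ontology and takes term depth as the longest path to the root; the toy terms are not real GO identifiers, and the longest-path convention is an assumption (the paper does not specify how depth is measured).

```python
# Minimal sketch of the two annotation-set parameters. PARENTS maps each term to its
# parent terms (is_a/part_of edges toward the molecular-function root); toy data only.
PARENTS = {"GO:child": ["GO:mid"], "GO:mid": ["GO:root"], "GO:root": []}

def term_depth(term):
    """Depth of a term in the GO hierarchy (root = 0), taken as the longest path to the root."""
    if not PARENTS.get(term):
        return 0
    return 1 + max(term_depth(p) for p in PARENTS[term])

def annotation_parameters(go_terms):
    """#GO: number of unique annotated terms; SPEC: depth of the most specific term."""
    unique = set(go_terms)
    return len(unique), max(term_depth(t) for t in unique)

print(annotation_parameters(["GO:child", "GO:mid", "GO:child"]))  # -> (2, 2)
```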

2.3. Semantic similarity

Pairwise SemSim of GO terms was calculated with Wang's algorithm (Wang et al., 2007). It is a GO graph-based approach. For each term GO_i in the ontology, a semantic value SV(GO_i) was calculated based on semantic contributions S_{GO_i}(t) of GO_i ancestor terms:

$$
SV(GO_i) = \sum_{t \in T_{GO_i}} S_{GO_i}(t), \qquad (1)
$$

where T_{GO_i} is the set of ancestor terms of the term GO_i. The semantic contribution of an ancestor term t to the term GO_i, S_{GO_i}(t), is defined iteratively as

$$
\begin{cases}
S_{GO_i}(GO_i) = 1, \\
S_{GO_i}(t) = \max\{\, w_e \cdot S_{GO_i}(t') \mid t' \in \mathrm{children\ of}(t) \,\}, & t \neq GO_i,
\end{cases}
\qquad (2)
$$

where w_e is the weight of the relation between the terms. The weights suggested by Wang et al. (2007) are 0.8 and 0.6 for is_a and part_of relations, respectively. The SemSim of two terms was calculated as

$$
Similarity_{Wang}(GO_i, GO_j) = \frac{\sum_{t \in T_{GO_i} \cap T_{GO_j}} \left( S_{GO_i}(t) + S_{GO_j}(t) \right)}{SV(GO_i) + SV(GO_j)}. \qquad (3)
$$

The SemSim of proteins is the similarity of the GO term sets annotated to the compared proteins. In Wang's algorithm, the best-matching average approach is applied. First, for each term in one set, the best matching term in the second set is identified:

$$
Similarity_{Wang}(GO, GOset) = \max_{i=1,\dots,m} Similarity_{Wang}(GO, GO_i), \qquad (4)
$$

where m is the number of terms in the GO set. Then, all best-matching pairwise similarities are averaged:

$$
SemSim(GOset_A, GOset_B) = \frac{\sum_{i=1}^{n} Similarity_{Wang}(go_i, GOset_B) + \sum_{j=1}^{m} Similarity_{Wang}(go_j, GOset_A)}{m + n}, \qquad (5)
$$

where m and n are the numbers of terms annotated to proteins A and B. From now on in the text, we refer to SemSim(GOsetA, GOsetB) as SemSim; this is the semantic similarity value analyzed in the study.
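A minimal, self-contained sketch of the measure defined by Equations (1)-(5) is given below. The toy ontology, its edge types, and the example term sets are illustrative assumptions; the edge weights follow Wang et al. (2007).

```python
# Wang's semantic similarity with best-match averaging, on a toy DAG.
# PARENTS: child -> list of (parent, relation); weights 0.8 (is_a) and 0.6 (part_of).
PARENTS = {
    "GO:B": [("GO:ROOT", "is_a")],
    "GO:C": [("GO:ROOT", "is_a")],
    "GO:D": [("GO:B", "is_a"), ("GO:C", "part_of")],
    "GO:E": [("GO:B", "is_a")],
}
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}

def s_values(term):
    """Semantic contributions S_term(t) for the term itself and all of its ancestors (Eq. 2)."""
    s = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, rel in PARENTS.get(child, []):
            contrib = WEIGHTS[rel] * s[child]
            if contrib > s.get(parent, 0.0):   # keep the maximal contribution along any path
                s[parent] = contrib
                stack.append(parent)
    return s

def term_sim(go_i, go_j):
    """Wang term-term similarity (Eq. 3): shared contributions over the sum of semantic values."""
    si, sj = s_values(go_i), s_values(go_j)
    common = set(si) & set(sj)
    if not common:
        return 0.0
    return sum(si[t] + sj[t] for t in common) / (sum(si.values()) + sum(sj.values()))

def sem_sim(set_a, set_b):
    """Best-match average over two annotation sets (Eqs. 4-5)."""
    best_a = [max(term_sim(a, b) for b in set_b) for a in set_a]
    best_b = [max(term_sim(b, a) for a in set_a) for b in set_b]
    return (sum(best_a) + sum(best_b)) / (len(set_a) + len(set_b))

print(round(sem_sim({"GO:D", "GO:E"}, {"GO:C"}), 3))
```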

2.4. Score significance: reference distributions

To evaluate the significance of the SemSim of two proteins, we propose to compare the calculated SemSim to a reference distribution of SemSim scores acquired for a large set of protein pairs. We assume that two proteins are significantly similar in terms of SemSim if their similarity is greater than the threshold corresponding to a p-value of 0.05, derived from a reference SemSim distribution.


The 0.95 quantile is an estimator of this threshold. The threshold means that only 5% of all protein pairs from a test set produce a greater SemSim score. In the study, we used two approaches to generating random reference distributions, which we call the all-against-all random distribution (all-vs-all) and the subset-against-all random distribution (subset-vs-all). These methods are described below.

2.5. The all-vs-all random distribution

In the all-vs-all approach, SemSim was calculated between 2500 randomly chosen protein pairs from a given set. An exemplary distribution is presented in Figure 1b. A summary of each distribution was obtained, which included the 0.05, 0.25, 0.5, 0.75, and 0.95 quantiles, and the mean (Fig. 1a). From here on, we refer to these parameters as min, max, Q0.05, Q0.25, median, Q0.75, Q0.95, and mean. The procedure was repeated 50 times for a data set, and then the average values of all parameters and their standard deviations were calculated. The all-vs-all approach was used to obtain general estimates of SemSim significance thresholds.
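The sampling and summarization step could be sketched as follows; the `annotations` mapping and the `sem_sim` callable (for example, the Wang best-match-average sketch above) are assumed inputs, not part of the original software.

```python
import random
import numpy as np

def reference_summary(annotations, sem_sim, n_pairs=2500, n_repeats=50, seed=0):
    """All-vs-all reference distribution: sample random protein pairs, score them,
    and summarize each replicate with quantiles and the mean, averaged over repeats.
    `annotations` maps protein ID -> set of GO terms."""
    rng = random.Random(seed)
    proteins = list(annotations)
    quantiles = (0.05, 0.25, 0.5, 0.75, 0.95)
    summaries = []
    for _ in range(n_repeats):
        scores = []
        for _ in range(n_pairs):
            a, b = rng.sample(proteins, 2)               # two distinct random proteins
            scores.append(sem_sim(annotations[a], annotations[b]))
        scores = np.asarray(scores)
        summaries.append(list(np.quantile(scores, quantiles)) + [scores.mean()])
    # averaged [Q0.05, Q0.25, median, Q0.75, Q0.95, mean] over the 50 replicates
    return np.mean(summaries, axis=0)
```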

2.6. The subset-vs-all distribution

In the subset-vs-all approach, random proteins from a subset of proteins were compared against random proteins from a more general protein set (e.g., subset: proteins with 3 annotations; general set: proteins with any annotations). Again, 2500 pairs were compared, and the resulting distribution was parametrized with Q0.05, Q0.25, median, Q0.75, Q0.95, and mean (Fig. 1). The parameters were averaged over 50 repetitions of distribution generation. The subset-vs-all approach was used in the study to investigate the influence of the annotation parameters SPEC and #GO on the results of SemSim comparisons.
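A corresponding sketch for the subset-vs-all sampling is shown below; skipping self-comparisons is our assumption, as the paper does not state how identical picks were handled.

```python
import random

def subset_vs_all_scores(subset_ids, all_ids, annotations, sem_sim, n_pairs=2500, seed=0):
    """Subset-vs-all sampling: one protein drawn from the subset of interest
    (e.g., proteins with exactly 3 annotations), the other from the general set."""
    rng = random.Random(seed)
    subset_ids, all_ids = list(subset_ids), list(all_ids)
    scores = []
    while len(scores) < n_pairs:
        a, b = rng.choice(subset_ids), rng.choice(all_ids)
        if a != b:                                        # skip self-comparisons (assumption)
            scores.append(sem_sim(annotations[a], annotations[b]))
    return scores
```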

2.7. Corrected SemSim significance thresholds

In each data set, proteins were grouped based on their annotation set #GO and SPEC. Significance thresholds were estimated for each group of proteins separately, following the subset-vs-all protocol. The calculated Q0.95 values were organized into a 2D table, in which the significance threshold is a function of the #GO and SPEC parameters. If the number of proteins in a particular group was lower than 10, the threshold estimated for this group was considered not representative and was replaced by the static significance threshold estimated for the respective data set (PDB, human, ecoli, or athaliana).
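A possible lookup scheme with the fallback rule described above is sketched here. The dictionary-based table layout and the binning of proteins with more than 9 terms mirror section 3.3, but both are our assumptions rather than the authors' data structures.

```python
def corrected_threshold(n_go, spec, q95_table, group_sizes, static_q95, min_group=10):
    """Corrected SemSim significance threshold for a protein with a given #GO and SPEC.
    `q95_table` and `group_sizes` are assumed to be dicts keyed by (#GO bin, SPEC);
    groups with fewer than 10 proteins fall back to the corpus-wide static threshold."""
    key = (min(n_go, 10), spec)          # proteins with >9 terms share one bin (assumption)
    if group_sizes.get(key, 0) < min_group:
        return static_q95
    return q95_table[key]
```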


FIG. 1. An exemplary distribution of semantic similarities calculated for randomly chosen proteins. The boxplot in (a) shows a summary of the distribution in (b). Quantiles 0.05, 0.25, 0.5, 0.75, and 0.95 are marked by the whiskers and the box; the mean is marked by a yellow diamond. The distribution shape in (b) was fitted with an extreme value distribution model.


2.8. Model fitting

Distributions generated in the all-vs-all and subset-vs-all experiments were fitted with model distributions. Of the several models tested, the best fit was provided by the generalized extreme value (GEV) distribution. We used the evd package in R (Stephenson, 2012) for fitting. A GEV model is defined by three parameters: location, scale, and shape. Similarly to the distribution statistics, these model parameters were averaged over the 50 generated reference distributions. The goodness of fit was assessed using the Anderson–Darling goodness-of-fit test from the ADGofTest package in R.
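The fitting step can be approximated in Python as follows; SciPy's genextreme stands in for the R evd package used by the authors (note that the shape-parameter sign conventions differ), and the input scores are simulated placeholders for one replicate of reference SemSim values.

```python
import numpy as np
from scipy import stats

# Simulated placeholder for one replicate of 2500 reference SemSim scores.
scores = stats.genextreme.rvs(-0.1, loc=0.3, scale=0.12, size=2500, random_state=0)

# Fit a GEV model and compare the model-based and empirical 0.95 quantiles.
shape, loc, scale = stats.genextreme.fit(scores)
q95_model = stats.genextreme.ppf(0.95, shape, loc=loc, scale=scale)
q95_empirical = np.quantile(scores, 0.95)
print(round(q95_model, 3), round(q95_empirical, 3))

# An Anderson-Darling goodness-of-fit check comparable to R's ADGofTest can be run with
# scipy.stats.goodness_of_fit(stats.genextreme, scores, statistic="ad") on SciPy >= 1.10.
```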

2.9. Protein sequence–function relation

BLAST was used to run an all-against-all sequence comparison for proteins in the PDB annotation set. Hits that yielded an e-value below 10⁻⁴ were retained. For measuring protein sequence similarity, the relative reciprocal BLAST score (RRBS) proposed by Pesquita et al. (2008) was used. It was defined as

$$
RRBS(A, B) = \frac{\mathrm{BLAST\ bit\ score}(A, B) + \mathrm{BLAST\ bit\ score}(B, A)}{\mathrm{BLAST\ bit\ score}(A, A) + \mathrm{BLAST\ bit\ score}(B, B)}, \qquad (6)
$$

where A and B are the compared sequences.
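A direct transcription of Equation (6), assuming the reciprocal BLAST bit scores (including self-hits) have already been collected into a lookup table; the protein IDs and scores below are made up for illustration.

```python
def rrbs(bit_scores, a, b):
    """Relative reciprocal BLAST score (Eq. 6). `bit_scores` maps (query, subject) pairs
    to BLAST bit scores and must include the self-hits (a, a) and (b, b)."""
    return (bit_scores[(a, b)] + bit_scores[(b, a)]) / (bit_scores[(a, a)] + bit_scores[(b, b)])

# Toy usage with made-up bit scores:
scores = {("P1", "P2"): 180.0, ("P2", "P1"): 175.0, ("P1", "P1"): 400.0, ("P2", "P2"): 380.0}
print(round(rrbs(scores, "P1", "P2"), 3))  # ~0.455
```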

2.10. Protein structure–function relation

The structure–function relation was examined on a set of representative, nonredundant SCOP database structures (Murzin et al., 1995) (30% sequence similarity cutoff, 5901 protein structures) and their structural neighbors (SNs) identified with DALI (Holm and Rosenström, 2010). SNs of the SCOP proteins, along with their structural similarity scores, were retrieved from the DALI database server (http://ekhidna.biocenter.helsinki.fi/dali/start, as of May 2012) (Holm and Rosenström, 2010). SemSim values between SCOP proteins and their SNs were calculated as defined above. Structural resemblance of proteins was measured with the DALI Z-score. As reported by the DALI authors, all hits with Z-score values greater than 2 are significant. The authors have also estimated an empirical Z-score cutoff for structural "strong match" hits:

$$
Z_{cutoff} = \frac{n}{10} - 4, \qquad (7)
$$

where n is the length of the protein (Holm et al., 2008).
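Equation (7) translates directly into code; the example length of 316 residues is simply the value at which the cutoff equals the 27.6 reported later for the analyzed SCOP set.

```python
def dali_strong_match_cutoff(length):
    """Empirical DALI 'strong match' Z-score cutoff (Eq. 7); `length` is the protein length in residues."""
    return length / 10.0 - 4.0

def is_strong_match(z_score, length):
    """True if a DALI hit exceeds the length-dependent strong-match cutoff."""
    return z_score > dali_strong_match_cutoff(length)

print(dali_strong_match_cutoff(316))  # 27.6
```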

3. RESULTS AND DISCUSSION

3.1. Estimating static significance thresholds

In order to assess the significance of SemSim scores, we generated and investigated reference distributions of SemSim between random proteins (see the all-vs-all generation in Methods). Four annotation corpora were analyzed: the PDB, human, ecoli, and athaliana sets. Each corpus was analyzed at three levels of sequence redundancy: a redundant set, and sets with redundancy reduced to 60% and 30% (for the procedure, see the Methods section). The distributions acquired for all corpora share the same general shape (Fig. 2). The difference between the median and average values indicates that the distributions are asymmetric. This is also confirmed by the difference between the min–median and median–Q0.95 distances; the latter distance is significantly greater. With the exception of the human set, we were able to fit the distributions with GEV model distributions (see Methods). The p-values provided by the Anderson–Darling goodness-of-fit test ranged from 0.05 to 0.12.

The shapes of the distributions are quite similar; however, for the PDB set, the distribution is significantly shifted toward greater values (Fig. 2 and Table 1). In this set, the median and Q0.95 values are much higher than those acquired for the other sets. We believe that the reason for this difference is the multiorganism content of the PDB. The set contains proteins that play similar roles in different organisms, which increases the overall similarity of proteins in the set.

There is no clear, consistent relation between the parameters of the distributions and the sequence redundancy of the sets. In ecoli, there are almost no differences in distribution parameters between different redundancy cutoffs. In the athaliana and human sets, the median and average values rise when redundancy decreases, while in the PDB set, the median and average do not change significantly between GOA and GOA60, and in the least redundant set, GOA30, they fall. In general, the threshold of significant similarity, quantile 0.95, falls in the range from 0.602 (human GOA) to 0.752 (athaliana GOA30). In the PDB set, the value does not change with redundancy and equals approximately 0.7. In the human and athaliana sets, Q0.95 moves toward higher values with decreasing redundancy. The athaliana set turned out to be special: in the less redundant versions of the set, we observed increasing ratios of proteins sharing exactly the same annotations. This resulted in a greater number of protein pairs that acquired the maximal SemSim (i.e., 1) when generating reference distributions, which in turn increased the values of all statistics calculated for the athaliana sets, including Q0.95.


FIG. 2. The summary of SemSim distributions calculated for randomly chosen proteins. SemSim scores were investigated in four annotation corpora: one multiorganism set (the PDB [red]), and three single-organism sets, that is, human (green), ecoli (blue), and athaliana (magenta). Three different redundancy cutoffs were investigated: 100 (thick box), 60 (dash), and 30. The boxes show quantiles of distributions Q0.25, median, and Q0.75; whiskers show Q0.05 and Q0.95; average values are marked by squares. The parameters were averaged over 50 experiments.

Table 1. Differences in Semantic Similarity Score Median and Q0.95 Values Between Different Gene Ontology Annotation Sets

Data set     Redundancy    Median    Q0.95
PDB          100           0.360     0.695
PDB          60            0.362     0.700
PDB          30            0.348     0.705
Human        100           0.212     0.602
Human        60            0.262     0.651
Human        30            0.284     0.678
Ecoli        100           0.277     0.621
Ecoli        60            0.277     0.621
Ecoli        30            0.278     0.618
Athaliana    100           0.256     0.645
Athaliana    60            0.271     0.688
Athaliana    30            0.288     0.752


The analysis of the different organism data sets showed that the reference distributions of SemSim are very similar in shape; however, they vary in the values of their statistics. The differences in parameter values may be partially explained by the fact that each annotation set comes from a different annotation project. Proteins are annotated with different techniques, and the scientists involved in each project may focus on annotating different aspects or different types of proteins. We observe that score significance differs between annotation corpora; therefore, significance thresholds should be estimated separately for every corpus prior to using SemSim on proteins of that set.

3.2. GO term specificity influence


Although the SemSim scores for two different pairs of proteins may be the same, the actual meaning of the comparisons may differ. For example, high similarity of proteins described with detailed GO terms, that is, terms located at lower levels of the ontology graph, is more meaningful than the same similarity value calculated between two proteins annotated with generic terms. This is called the shallow annotation problem (Wang et al., 2007). To test the influence of the specificity of annotations on the significance thresholds, we first grouped proteins by their specificity levels (see Methods for the specificity definition). Then, we generated reference SemSim distributions by calculating SemSim values between proteins with a certain specificity and proteins sampled from the whole set (see Methods, subset-vs-all distribution). In all analyzed corpora, we observed a general trend of Q0.95 values decreasing as the specificity of annotations increased (Fig. 3). This confirmed the shallow annotation problem described by Wang et al. (2007). Detailed terms, which are located deep down the GO graph, are semantically less similar to other terms than are more generic terms located in the middle or top part of the GO hierarchy. That is why subset-vs-all distributions are shifted toward lower values for more specifically described proteins.


FIG. 3. The influence of the specificity of protein annotations on distributions of semantic similarity. Proteins with a given specificity were compared against all other proteins in their annotation corpora: (a) PDB, (b) ecoli, (c) human, and (d) athaliana. The general tendency is that distributions move toward lower values as the specificity of annotations increases. The boxplot whiskers mark the Q0.05 and Q0.95 quantiles; the box marks the Q0.25, Q0.5, and Q0.75 quantiles. The mean values are marked by diamonds.


Although the trend holds in general, some exceptions can be noticed. For instance, in the PDB set, the Q0.95 at SPEC equal to 2 is significantly higher than in all remaining subsets (Fig. 3a). We found that this is because many comparisons in this subset yield the maximum value of 1: a great number of proteins are annotated with a single term, GO:0005515 ("protein binding"), which yields multiple perfect-match scores.

3.3. GO term number influence


In order to evaluate the influence of the number of GO terms annotated to proteins (#GO) on their semantic comparisons, we carried out two experiments. These experiments were run separately for each corpus. First, the annotations were divided into subsets based on #GO. Each subset, with the exception of the last one, comprised proteins with exactly the same #GO; the last subset grouped all proteins with more than 9 GO terms. Then, the reference distributions were generated for each subset with (1) the all-vs-all procedure (see Methods) and (2) the subset-vs-all procedure (see Supplementary Material and Supplementary Fig. S1, available online at www.liebertpub.com/cmb). Both experiments led to consistent conclusions, but here we present only the results for the all-vs-all approach; the full description of the subset-vs-all experiment can be found in the Supplementary Material.

In general, in all analyzed data sets, the semantic similarities showed a positive correlation with #GO (Fig. 4). A clear rising trend of the median and mean values was observed. The exceptions were the subsets of athaliana with #GO > 8. In those subsets, the reference distributions did not follow the rising trend observed in the PDB, human, and ecoli annotation corpora. However, the numbers of proteins in those athaliana subsets were low; therefore, the acquired results may not be representative.

The experiments showed that the number of GO terms describing a protein has a significant influence on the calculated semantic similarities. Proteins that have more terms tend to be more similar to all other proteins than proteins with only one or two terms. This may be caused by the fact that a greater number of terms annotating a protein raises the probability of finding well-matching terms among those annotating other proteins.


FIG. 4. The influence of the number of annotated GO terms on the SemSim distribution—the number of terms was controlled in both proteins of each compared pair. The study was performed in four annotation corpora: (a) PDB, (b) ecoli, (c) human, and (d) athaliana. Proteins with more GO terms tend to yield higher semantic similarities. GO, gene ontology.



3.4. Specificity/GO term number combined influence


We investigated the joint influence of #GO and annotation specificity on the acquired reference distributions in all corpora. Each annotation set was divided into bins containing proteins with the same specificity level and #GO. Then, for each bin, reference distributions were generated using the subset-vs-all procedure (see Methods). Figure 5 shows the calculated Q0.95 values, which are the SemSim significance thresholds for the given subsets of proteins.

For a fixed #GO, the Q0.95 quantiles (Fig. 5) and median values (Supplementary Fig. S2) of the reference distributions decrease as specificity increases; however, the larger the #GO, the weaker the effect. The experiment again shows that the significance and the actual meaning of a SemSim score may differ depending on the annotation specificity and the number of GO terms annotated to the proteins. For instance, if a protein is described with a single GO term with a specificity of 7, then a comparison that yields a SemSim score greater than 0.4 means that the similarity of the two proteins is significant and nonrandom, because only 5% of all proteins can produce a greater score (Fig. 5). Conversely, a SemSim of 0.4 is not meaningful if the protein is described with a less specific GO term, for example, one with a specificity of 3; in that case, plenty of proteins can yield a similar SemSim score purely by chance. Similarly, a SemSim of 0.4 is less significant if the specificity of the description remains unchanged but there are more GO terms. The most rapid change in median and Q0.95 values occurs between #GO of 1 and 2.

GO term number and term specificity have antagonistic influences on the SemSim of proteins. The Q0.95 values estimated in this last experiment could be used to account for these effects when evaluating the significance of SemSim scores. Instead of using a static threshold to determine whether the similarity between a protein of interest and another protein is significant, we can use a threshold that depends on the characteristics of the GO term description of the protein of interest, that is, the number of GO terms and their specificity, such as the thresholds presented in Figure 5.


FIG. 5. Combined influence of protein annotation specificity and the number of GO terms on SemSim distributions in the PDB set. The heat map shows the changes of the reference distribution Q0.95 values, depending on #GO and the specificity of protein annotation. Red dots mark subsets of proteins (of given specificity and set size) that were smaller than 10; for those small subsets, the acquired parameter values might not be representative.



3.5. Protein sequence–function relation

We examined how the statistical significance of SemSim reflects similarities in other aspects of proteins. We used sequence and structure (section 3.6) similarities as benchmarks to validate the estimated significance thresholds.

We ran an all-against-all BLAST search for the protein sequences in all data sets. All BLAST hits with e-values below 10⁻⁴ were retained. For each protein pair, the RRBS sequence similarity was calculated as proposed by Pesquita et al. (2008) (for details, see Methods). Functional SemSim is strongly related to sequence similarity, and it increases rapidly with increasing RRBS. Figure 6 presents the sequence–function relation for proteins in the PDB set. Unexpectedly, even for low sequence similarity values, that is, in the RRBS range (0, 0.1), the functional resemblance of proteins is quite high, and the distribution statistics of all box plots in that range are greater than those of the SemSim distribution acquired for random proteins from the PDB set presented in section 3.1 (Fig. 2). Pascual-Garcia et al. (2010) reported that below a certain level of sequence similarity, a "structure divergence explosion" may be observed. This may occur because, after multiple mutations, the evolutionary information in the sequence is lost; in such cases, sequence similarity measures are unable to distinguish between random and related proteins.

Confusion matrices in Figure 7 summarize the quantitative evaluation of the compliance between significant sequence and function similarities in every annotation corpus. In all examined data sets, in the region where RRBS is more reliable (≥ 0.1), the great majority of protein pairs showed significant functional SemSim. The lowest rate observed was 84.1% (PDB set), which means that 84.1% of protein pairs with significant sequence similarity also had significantly similar molecular functions. However, in the RRBS "twilight" region (0, 0.1), the rate of compliance was much lower, 34.2% in the PDB set. This might be because the BLAST comparisons were filtered with a stringent e-value cutoff (e-value < 10⁻⁴), so some pairs that were treated as insignificant in our study may in fact be nonrandom. Still, the overall accuracies reported in the bottom-right-hand corners of the matrices confirmed the general compliance between sequence and function similarity. Although high sequence similarity is a strong rationale for assuming functional similarity, functional similarity does not necessarily imply significant sequence similarity. The study confirmed that if two proteins share significant sequence similarity, their molecular functions are very likely to be significantly similar as well.
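The compliance rates quoted above could be tallied as in the sketch below; treating RRBS ≥ 0.1 as significant sequence similarity and using the PDB Q0.95 of about 0.695 (Table 1) as the functional cutoff are reconstructions from the text, not necessarily the authors' exact protocol.

```python
import numpy as np

def compliance_matrix(rrbs_scores, semsim_scores, rrbs_cutoff=0.1, semsim_q95=0.695):
    """2x2 confusion matrix of 'significant sequence similarity' (RRBS >= cutoff) vs.
    'significant functional similarity' (SemSim > corpus Q0.95), plus the compliance
    rate among sequence-significant pairs. Cutoff values are illustrative."""
    rrbs = np.asarray(rrbs_scores)
    sem = np.asarray(semsim_scores)
    seq_sig = rrbs >= rrbs_cutoff
    fun_sig = sem > semsim_q95
    matrix = np.array([
        [np.sum(seq_sig & fun_sig),  np.sum(seq_sig & ~fun_sig)],
        [np.sum(~seq_sig & fun_sig), np.sum(~seq_sig & ~fun_sig)],
    ])
    compliance = matrix[0, 0] / max(seq_sig.sum(), 1)  # fraction of seq-significant pairs also function-significant
    return matrix, compliance
```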


FIG. 6. Sequence–function relation in the PDB set investigated with function SemSim. A strict BLAST search criterion was used to select protein pairs for comparison (e-value < 10⁻⁴). The Q0.95 SemSim significance threshold is marked.

3.6. Protein structure–function relation

As reported by the DALI authors, structural neighbors with a Z-score greater than 2 are considered significant hits (Z > 2) (Holm and Rosenström, 2010). In general, the calculated values of functional SemSim are in agreement with this statement (Fig. 8); however, the SemSim boxplots for the least similar SNs are very similar to the boxplots derived for randomly chosen proteins in the PDB set (Fig. 2). This suggests that either those hits are not in fact significant or that structural similarity at such a low level does not translate into functional resemblance. For the analyzed set of proteins, the overall strong-match Z-score cutoff equals 27.6 (for details, see Methods). Based on that, a confusion matrix of structural–functional compliance was built. An overwhelming majority of DALI strong matches had a SemSim above our estimated Q0.95 SemSim cutoff (Fig. 9), which strongly supports our approach to estimating SemSim significance.

FIG. 9. Confusion matrix evaluating the agreement between significant structural and functional similarity. A DALI Z-score of 27.6 was used as the structural strong-match cutoff, while a SemSim of 0.69 was used as the significant functional similarity cutoff.

4. CONCLUSIONS

In this study, we proposed a novel methodology for evaluating the significance of the SemSim of protein ontological descriptions. We applied it to protein molecular function GO terms and Wang's semantic similarity; however, the methodology can also be used with other GO subontologies and other measures of semantic similarity. Our approach is based on the statistical analysis of semantic similarity reference distributions. We proposed to use the 0.95 quantile as a threshold between significant and insignificant semantic similarities: similarities greater than this threshold can be considered significant and nonrandom, since only 5% of randomly chosen protein pairs could produce similarity scores that are equal or greater.


Four representative annotation corpora were analyzed (PDB, human, ecoli, and athaliana). The significance thresholds differ between annotation sets; therefore, such thresholds should be calculated separately for every set. We also showed that the significance of SemSim may change depending on the number and the level of detail of the GO terms used to annotate a protein. Proteins annotated with detailed GO terms tend to yield lower SemSim, while a greater number of GO terms annotated to a protein is usually associated with higher SemSim. Based on sequence–function and structure–function relations, we have shown that taking these two effects into account can improve the usefulness of SemSim measures.

AUTHORS' CONTRIBUTIONS

B.M.K. proposed the concept of the study, prepared the data sets, carried out data analysis, and participated in writing the article. B.M.K. and T.G. developed the software and ran the calculations. M.K. participated in the design of the study, data analysis, and writing the article. All authors read and approved the final article.

ACKNOWLEDGMENTS

B.M.K. would like to acknowledge the financial support from the "Mloda Kadra" Fellowship, cofinanced by the European Union within the European Social Fund. Part of the calculations were performed at the Wroclaw Centre for Networking and Supercomputing.

AUTHOR DISCLOSURE STATEMENT

The authors declare no competing interests.

REFERENCES

Alvarez, M.A., Qi, X., and Yan, C. 2011. A shortest-path graph kernel for estimating gene product semantic similarity. J. Biomed. Semantics 2, 3.
Ashburner, M., Ball, C.A., Blake, J.A., et al. 2000. Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25–29.
Benabderrahmane, S., Smail-Tabbone, M., Poch, O., et al. 2010. IntelliGO: a new vector-based semantic similarity measure including annotation origin. BMC Bioinform. 11, 588.
Berman, H.M., Westbrook, J., Feng, Z., et al. 2000. The Protein Data Bank. Nucleic Acids Res. 28, 235–242.
Chabalier, J., Mosser, J., and Burgun, A. 2007. A transversal approach to predict gene product networks from ontology-based similarity. BMC Bioinform. 8, 235.
del Pozo, A., Pazos, F., and Valencia, A. 2008. Defining functional distances over Gene Ontology. BMC Bioinform. 9, 50.
Gruber, T.R. 1995. Toward principles for the design of ontologies used for knowledge sharing. Int. J. Hum. Comput. Stud. 43, 907–928.
Gruca, A., Sikora, M., and Polanski, A. 2011. RuleGO: a logical rules-based tool for description of gene groups by means of Gene Ontology. Nucleic Acids Res. 39, W293–W301.
Haugen, A.C., Di Prospero, N.A., Parker, J.S., et al. 2010. Altered gene expression and DNA damage in peripheral blood cells from Friedreich's ataxia patients: cellular model of pathology. PLoS Genet. 6, e1000812.
Holm, L., and Rosenström, P. 2010. Dali server: conservation mapping in 3D. Nucleic Acids Res. 38, 545–549.
Holm, L., Kääriäinen, S., Rosenström, P., et al. 2008. Searching protein structure databases with DaliLite v.3. Bioinformatics 24, 2780–2781.
Huang da, W., Sherman, B.T., and Lempicki, R.A. 2009. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37, 1–13.
Hvidsten, T.R., Lægreid, A., Kryshtafovych, A., et al. 2009. A comprehensive analysis of the structure-function relationship in proteins based on local structure similarity. PLoS ONE 4, e6266.
Konopka, B.M., Nebel, J.-C., and Kotulska, M. 2012. Quality assessment of protein model-structures based on structural and functional similarities. BMC Bioinform. 13, 242.


Li, W. CD-HIT User's Guide. Available at: http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide. Accessed April 23, 2012.
Li, W., and Godzik, A. 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659.
Lord, P.W., Stevens, R.D., Brass, A., et al. 2003. Semantic similarity measures as tools for exploring the gene ontology. Pac. Symp. Biocomput. 8, 601–612.
Murzin, A.G., Brenner, S.E., Hubbard, T., et al. 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
Nebel, J.C. 2012. Proteomics and bioinformatics soon to resolve the human structural interactome. J. Proteomics Bioinform. 5, xi–xii.
Othman, R.M., Deris, S., and Illias, R.M. 2008. A genetic similarity algorithm for searching the gene ontology terms and annotating anonymous protein sequences. J. Biomed. Inform. 41, 65–81.
Pascual-Garcia, A., Abia, D., Mendez, R., et al. 2010. Quantifying the evolutionary divergence of protein structures: the role of function change and function conservation. Proteins 78, 181–196.
Pekar, V., and Staab, S. 2002. Taxonomy learning: factoring the structure of a taxonomy into a semantic classification decision. Proceedings of the Nineteenth Conference on Computational Linguistics.
Pesquita, C., Faria, D., Bastos, H., et al. 2007. Evaluating GO-based semantic similarity measures. ISMB/ECCB SIG Meeting Program Materials. Available at: www.psb.ugent.be/cbd/cco/Bio-Ontologies2007.pdf
Pesquita, C., Faria, D., Bastos, H., et al. 2008. Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinform. 9, S4.
Pesquita, C., Faria, D., Falcão, A.O., et al. 2009. Semantic similarity in biomedical ontologies. PLoS Comput. Biol. 5, e1000443.
Schlicker, A., Domingues, F.S., Rahnenführer, J., et al. 2006. A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinform. 7, 302.
Stephenson, A. 2012. Functions for extreme value distributions. Available at: http://cran.r-project.org/web/packages/evd/
Wang, J.Z., Du, Z., Payattakool, R., et al. 2007. A new method to measure the semantic similarity of GO terms. Bioinformatics 23, 1274–1281.
Warita, K., Mitsuhashi, T., Tabuchi, Y., et al. 2012. Microarray and Gene Ontology analyses reveal downregulation of DNA repair and apoptotic pathways in diethylstilbestrol-exposed testicular Leydig cells. J. Toxicol. Sci. 37, 287–295.
Wu, H., Su, Z., Mao, F., et al. 2005. Prediction of functional modules based on comparative genome analysis and gene ontology application. Nucleic Acids Res. 33, 2822–2837.
Wu, X., Zhu, L., Guo, J., et al. 2006. Prediction of yeast protein–protein interaction network: insights from the Gene Ontology and annotations. Nucleic Acids Res. 34, 2137–2150.
Xu, T., Du, L., and Zhou, Y. 2008. Evaluation of GO-based functional similarity measures using S. cerevisiae protein interaction and expression profile data. BMC Bioinform. 9, 472.

Address correspondence to:
Dr. Bogumil M. Konopka
Institute of Biomedical Engineering and Instrumentation
Wroclaw University of Technology
Wybrzeze Wyspianskiego 27
50-370 Wroclaw
Poland

E-mail: [email protected]
