Systematic Biology Advance Access published September 8, 2015 Syst. Biol. 0(0):1–14, 2015 © The Author(s) 2015. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: [email protected] DOI:10.1093/sysbio/syv044

Integrating Sequence Evolution into Probabilistic Orthology Analysis IKRAM ULLAH1 , JOEL SJÖSTRAND2 , PETER ANDERSSON3 , BENGT SENNBLAD4 , AND JENS LAGERGREN5,∗ 1 School

of Computer Science and Communication, Science for Life Laboratory, KTH Royal Institute of Technology, Stockholm, Sweden; 2 Department of Numerical Analysis and Computer Science, Science for Life Laboratory, Stockholm University, Stockholm, Sweden; 3 School of Computer Science and Communication, KTH Royal Institute of Technology, Stockholm, Sweden; 4 Atherosclerosis Research Unit, Dept. of Medicine, Science for Life Laboratory, Karolinska Institutet, Solna, Sweden; and 5 School of Computer Science and Communication, Science for Life Laboratory, Swedish e-Science Research Center (SeRC), KTH Royal Institute of Technology, Stockholm, Sweden; ∗ Correspondence to be sent to: School of Computer Science and Communication, KTH Royal Institute of Technology, SE - 100 44 Stockholm, Sweden; E-mail: [email protected].

Abstract.—Orthology analysis, that is, finding out whether a pair of homologous genes are orthologs — stemming from a speciation — or paralogs — stemming from a gene duplication - is of central importance in computational biology, genome annotation, and phylogenetic inference. In particular, an orthologous relationship makes functional equivalence of the two genes highly likely. A major approach to orthology analysis is to reconcile a gene tree to the corresponding species tree, (most commonly performed using the most parsimonious reconciliation, MPR). However, most such phylogenetic orthology methods infer the gene tree without considering the constraints implied by the species tree and, perhaps even more importantly, only allow the gene sequences to influence the orthology analysis through the a priori reconstructed gene tree. We propose a sound, comprehensive Bayesian Markov chain Monte Carlo-based method, DLRSOrthology, to compute orthology probabilities. It efficiently sums over the possible gene trees and jointly takes into account the current gene tree, all possible reconciliations to the species tree, and the, typically strong, signal conveyed by the sequences. We compare our method with PrIME-GEM, a probabilistic orthology approach built on a probabilistic duplication-loss model, and MRBAYESMPR, a probabilistic orthology approach that is based on conventional Bayesian inference coupled with MPR. We find that DLRSOrthology outperforms these competing approaches on synthetic data as well as on biological data sets and is robust to incomplete taxon sampling artifacts. [Comparative genomics; gene duplication; gene loss; orthology; paralogy; phylogenetics; probabilistic modeling; relaxed molecular clock; sequence evolution; tree realization; tree reconciliation]

Fitch (1970) introduced the notion of orthology in an attempt to capture the — intuitively straightforward but formally problematic — concept of two genes having the same function in different species. He formalized this notion by defining two genes as being orthologs if their last common ancestor (LCA) in the phylogenetic gene tree corresponds to a speciation event. If instead the LCA corresponds to a gene duplication, the genes are called paralogs. The connection to function is that while single-copy genes are exposed to purifying selection, paralogous genes initially share the same function, implying that the purifying selection may be temporarily relaxed until one of them either becomes nonfunctional or acquires a new function. Paralogous genes are therefore more likely to have different functions. To directly identify orthologous or paralogous genes based on this definition, a correct gene tree and a classification of its vertices (nodes) into speciations and duplications are required. In this context, it should be noted that, orthology and speciation refer to the same phenomenon; saying that genes g1 and g2 are orthologs is synonymous to saying that their LCA is a speciation vertex (note that g1 and/or g2 may refer to either a single gene or a set of genes). Orthology analysis (i.e., the determination of the orthology or paralogy status for homologous genes of interest) is of central importance in computational biology, genome annotation, and comparative genomics. For instance, there is much functional and genetic diversity within the tree-of-life and it is not possible

to experimentally test every gene to discover its function. However, once we discover a certain gene’s function, we can infer, and possibly annotate, the function of other genes that are orthologous to the first gene. Apart from this, putative orthologs are often used to estimate species phylogenies, because, by definition, orthologous genes should follow the species evolution unless affected by confounding factors such as horizontal gene transfer (Gogarten and Townsend 2005), incomplete lineage sorting (Maddison 1997), compositional heterogeneity across sequences (Jermiin et al. 2004), and long-branch attraction (Ho and Jermiin 2004).

Downloaded from http://sysbio.oxfordjournals.org/ at University of Winnipeg on September 21, 2015

Received 7 June 2014; reviews returned 25 August 2014; accepted 24 June 2015 Associate Editor: Lars Jermiin

Main Approaches to Orthology Analysis Orthology analysis methods can be divided into (i) direct sequence similarity heuristics and (ii) phylogenybased methods. Direct sequence similarity heuristics are often based on so-called Reciprocal Best Hits, using an appropriate local alignment algorithm (e.g., BLAST: Altschul et al. 1990, or Smith–Waterman: Smith and Waterman 1981) and a particular scoring scheme (Hulsen et al. 2008). Some examples of such methods are ORTHOLUGE (Fulton et al. 2006), COG (Tatusov et al. 2003), and INPARANOID (Remm et al. 2001). These methods are typically hard to generalize to more than two species, although the GETHOGs algorithm (Altenhoff et al. 2013) computes multiple-species orthology relations from a so-called orthology graph. 1

[19:08 3/9/2015 Sysbio-syv044.tex]

Page: 1

1–14

2

SYSTEMATIC BIOLOGY

sequence alignment (MSA) and a species tree. The DLRS model incorporates submodels for gene tree evolution, sequence evolution and a relaxed molecular clock. This model enables the use of the gene sequences directly in the reconciliation analysis, and consequently, also directly in the orthology analysis. Notice, however, that neither the method of Åkerborg et al. (2009), nor that of Rasmussen and Kellis (2011), provides reconciliations or orthology probabilities. Here, we present the algorithms and software tools necessary to include the DLRS model in orthology analysis and demonstrate its performance using synthetic as well as biological data. We provide two tools. The first takes as input an MSA together with a species tree and performs an orthology analysis that takes all gene trees into account. The second also allows a user to specify a single gene tree that should be used in the orthology analysis (which still takes the sequences into account). Although the first tool may be more general and powerful, no model can capture all aspects of biology. Consequently, there are many types of information that might lead to a specific gene tree being preferred, thus motivating the second tool. Since we want a baseline to compare our first tool with and there are no previous Bayesian tools for performing an orthology analysis with respect to all gene trees, we devise such a tool, by combining MRBAYES (Huelsenbeck and Ronquist 2001) and MPR. For a fixed gene tree, the most advanced orthology analysis tool is PrIME-GEM (Sennblad and Lagergren 2009) and, consequently, we use this for the comparison with our second tool. Reconciliations and Realizations A reconciliation can be viewed as a hypothesis of how a gene tree has evolved inside a species tree (Goodman et al. 1979); formally it is a function, , that maps every gene tree vertex either to a species tree vertex (in which case it is a speciation) or a species tree edge (in which case it is a duplication; cf. Fig. 1a,b). Notice that, in a reconciliation, a gene tree vertex representing a duplication event may be placed anywhere on the corresponding species tree edge. In other words, a reconciliation does not provide exact information

Downloaded from http://sysbio.oxfordjournals.org/ at University of Winnipeg on September 21, 2015

Phylogeny-based methods use sequence information to generate a gene tree, and then classify its vertices into speciations and duplications by contrasting it with the corresponding species tree. While, in theory, methods that model lateral gene transfer could be used for orthology analysis augmented with xenology (i.e., gene homology through lateral gene transfer), our study, like most previous studies, aims to distinguish between orthology and paralogy only, and we will for the remainder of the article focus only on such attempts. Goodman et al. (1979) pioneered gene tree — species tree comparisons by introducing the most parsimonious reconciliation (MPR). Given a gene tree and a species tree, MPR maps each gene tree vertex to either a species tree vertex (in which case it represents a speciation), or a species tree edge (in which case it represents a duplication) in a way that minimizes the number of duplications (a recent MPR extension also incorporates incomplete allele sorting events, Wu et al. 2014). Unfortunately, minimizing the number of duplications often leads to incorrect orthology assignments, as was recently shown in Sennblad and Lagergren (2009), where a probabilistic orthology analysis method, PrIME-GEM, was presented as an improved alternative. Both MPR-based and PrIME-GEM orthology analysis require a gene tree, which typically has been reconstructed from gene sequence data. However, neither of these methods take the sequences directly into account during the reconciliation analysis, even though sequence data may facilitate dating of gene tree vertices and thereby also classification of these into speciations and duplications. The recently proposed socalled species tree aware–gene tree reconstruction (STA– GTR) methods (Åkerborg et al. 2009; Rasmussen and Kellis 2011; Boussau et al. 2013), which take advantage of a species tree, appear to outperform earlier methods in terms of accuracy. These can be used to obtain more reliable gene trees and, thereby, improve any gene tree-based orthology analysis. As a basis for a STA-GTR method, Åkerborg et al. (2009) presented a probabilistic model, the Duplication-Loss (DL) model with i.i.d. Rates across gene tree edges and sequence evolution (DLRS), that defines a posterior distribution over gene trees and reconciliations for a given multiple

FIGURE 1. Illustration of reconciliation and realization of a gene tree. (a) and (b) illustrate two different reconciliations for same gene tree while (c) and (d) show two different realizations for the reconciliation in (a). The species tree is laid out with wide edges and ellipses indicating speciations, while the gene tree has narrow edges and smaller vertices (circles indicating speciations and squares indicating duplications). The bands in (c–d) denotes discretization vertices.

[19:08 3/9/2015 Sysbio-syv044.tex]

Page: 2

1–14

2015

ULLAH ET AL.—DLRSOrthology

concerning when a duplication event occurred along an edge of the species tree. The precise evolution of a gene tree within a species tree is captured by the concept of a realization, a function, , that maps each gene tree vertex to an exact point in the species tree where the specific event (duplication or speciation) has occurred (cf. Fig. 1c,d). It should be noted that, given a species tree, there are finitely many reconciliations for a gene tree, while there can be infinitely many possible realizations for a single reconciliation.

MATERIALS AND METHODS

The DL model.—This model (formerly known as the Gene Evolution Model, GEM; Arvestad et al. 2003) explains the evolution of a gene tree, G, inside a species tree, S, given with divergence times. A linear birth–death model with birth rate  and death rate  is used to model duplication and loss rates, respectively, over the edges of S, as follows (see also Fig. 2a–e): A single gene lineage starts at the edge predating the root of S; when a gene lineage reaches a speciation vertex vS in S, it splits into two independent lineages; one for each outgoing edge of vS . This process

continues recursively until it reaches the leaves of S. Any lineage in G that does not reach the leaves of S is pruned, resulting in (i) an ultrametric binary gene tree, G and (ii) a reconciliation explaining each bifurcation event in G as either a speciation or a duplication. Building on the DL model, (Sennblad and Lagergren 2009) created the orthology analysis framework PrIME-GEM, which takes as input a species tree and a gene tree and uses MCMC to integrate over remaining model parameters. The DLR model.—In the Duplication-Loss model with i.i.d. Rates across gene tree edges (DLR), the ultrametric edge times of the gene tree from the DL model are converted into non-ultrametric edge lengths, expressed in expected number of substitutions per site (i.e., as used by most standard substitution models, Felsenstein 1981). For each edge, edge rates are drawn from a “relaxed clock” model, that is, a model for variation of substitution rate over gene tree edges, and an edge length is obtained as the product of the edge time and edge rate. The substitution rates across edges are modeled using an i.i.d.-gamma distribution with mean m and variance  (e.g., Åkerborg et al. 2009; Linder et al. 2011, see also Fig. 2f). No orthology method based on the DLR model currently exists, but it would be similar to that of PrIMEGEM, albeit with the additional use of edge lengths as input. The DLRS model.—This model (formerly known as GSR; Arvestad et al. 2003; Åkerborg et al. 2009) is the most comprehensive of our probabilistic models.

Downloaded from http://sysbio.oxfordjournals.org/ at University of Winnipeg on September 21, 2015

We now proceed to describe three nested models, by stepwise extensions, where the last is the DLRS model (Åkerborg et al. 2009) that our framework for probabilistic orthology analysis builds on. This is followed by a description of this framework, and a presentation of the synthetic and biological data used to evaluate our methods.

3

FIGURE 2. Illustration of gene tree generation under the DL, DLR, and DLRS models. (a–e) depict different events in the DL model of (Arvestad et al. 2003). The process starts with the evolution of a single lineage at the start of species tree in (a), undergoes duplication in (b), speciation in (c), and loss in (d), ultimately reaching the leaves of species tree in (e) to yield a binary gene tree and a realization. (f) demonstrates the extension into the DLR model where variable edge lengths are obtained by applying a relaxed molecular clock for the edge substitution rates. Last, in (g), under the DLRS model, sequence evolution takes place along the relaxed gene tree edges. Tree layout as in Figure 1.

[19:08 3/9/2015 Sysbio-syv044.tex]

Page: 3

1–14

4

SYSTEMATIC BIOLOGY

It extends the DLR model by including a standard substitution model generating nucleotide, codon, or protein sequences. In the present article, we consistently use the Jones et al. (JTT JTT: 1992) model for our analyses of protein data; the JTT model is also used in the generation of synthetic data. We base our DLRSOrthology method on the DLRS model.

probability, P[ u and v are orthologs|D,S]  P[lcaG (u,v) is a speciation|G,D,S]P[G|D,S] = G

=



P[lcaG (u,v) is a speciation|G,l,,S]

G

Variable-tree variant.—A gene tree, even if it is the maximum a posteriori (MAP) tree, may cover a relatively small portion of the posterior distribution of possible gene trees. The problem can be even worse if we are working with a nonoptimal gene tree, something that, as we show below, can often be the case for gene trees proposed by conventional (e.g., distance, parsimony, or probabilistic) methods ignoring the constraints implied by the species tree. Consequently, orthology analysis based on such a tree may give a biased reflection of the speciation probabilities supported by the data. It is therefore desirable to consider all gene trees in orthology analysis and let their relative influence be determined by the posterior distribution. The variable-tree analysis takes as input an MSA of nucleotides, amino acids, or codons for a gene family, a species tree and a mapping of the genes to their host species in the species tree, and infers orthology probabilities by, in addition to gene tree edge lengths l and parameters  = (,,m,), marginalizing out the gene tree. The parameter  consists of the gene evolution parameters (,), the parameters of the substitution rate model (m,), as well as any parameters of the substitution model; however, in our analyses, we use the parameterfree JTT substitution model. Importantly, we are not interested in a particular vertex of the gene tree (since the gene tree varies, this is simply not meaningful) but in the LCA of one or several pairs of gene tree leaves (i.e., genes), for which the orthology analysis should be performed. For any tree T and u,v ∈ V(T), where V(T) is the set of vertices of T, let lcaT (u,v) denote the LCA of u and v in T. Mathematically, we are interested in the following

[19:08 3/9/2015 Sysbio-syv044.tex]

×p[G,l,|D,S]d(l,) = Ep[G,l,|D,S] [P[lcaG (u,v) is a speciation |G,l,,S]] where D denotes the sequence data, P[.] and p[.] denote probability and probability density functions, respectively, and Ep[·] [X] is the expected value of the variable X taken over the probability distribution p[·]. In a subsequent subsection, we briefly describe the framework from (Åkerborg et al. 2009) for sampling from p[G,l,|D,S] and also a novel algorithm for computing P[u is a speciation|G,l,,S], efficiently for all possible vertices u of G. Together with the algorithms of (Åkerborg et al. 2009), this new algorithm allows our method to estimate the variable-tree variant orthology probability, for any pair of gene tree leaves u and v. That is, we first obtain a collection Cv = {(Gi ,li ,i )}ni=1 of samples from the posterior distribution p[G,l,|D,S]; the orthology probability is then approximated by computing the following sum P[u and v are orthologs |D,S] ≈

1 n



P[lcaGi (u,v)

(Gi ,li ,i )∈Cv

is a speciation |Gi ,li ,i ,S].

(1)

Fixed-tree variant.—Apart from a species tree and the MSA, the fixed-tree variant of DLRSOrthology requires a gene tree G as input, and the analysis is carried out with respect to G as follows. Since, in this case, the LCA will not vary, we can focus on a single vertex v, or a set of vertices. For a single vertex v, the speciation probability conditioned by G can be expressed as follows,

Downloaded from http://sysbio.oxfordjournals.org/ at University of Winnipeg on September 21, 2015

DLRSOrthology The main contribution of this article is our probabilistic framework for orthology analysis named DLRSOrthology. The original algorithms and MCMC framework of Åkerborg et al. (2009) (i.e., PrIME-DLRS) infers gene trees under the DLRS model by integrating reconciliations as a nuisance parameter (we will briefly describe the original PrIME-DLRS MCMC framework below; for a more thorough explanation, the reader is referred to Åkerborg et al. 2009; Arvestad et al. 2009; Sjöstrand et al. 2012). DLRSOrthology enables orthology inference with respect to the DLRS model by extending the original PrIME-DLR framework with new algorithms, and comes in two variants: the variable-tree variant and fixed-tree variant.

P[v is a speciation|G,D,S]  = P[v is a speciation|G,l,,S]p[l,|G,D,S]d(l,) = Ep[l,|G,D,S] [P[v is a speciation|G,l,,S]] The framework from (Åkerborg et al. 2009) is used also in this case, but now with the modification that we use samples from p[l,|G,D,S], in combination with our novel algorithm for computing P[v is a speciation|l,,G,S]. We obtain a collection Cf = {(li ,i )}ni=1 of samples from the posterior distribution p[l,|G,D,S] and the orthology probability

Page: 4

1–14

2015

is then approximated by computing the following sum P[v is a speciation|G,D,S] ≈

1 n



=

(li ,i )∈Cf

(2) =

p[G,l,,|D,S] =

∈(G)

∈(G)

∈A()

p[G,l,,a|D,S]da. (3)

To compute the right hand side of Equation (3), (Åkerborg et al. 2009) developed a discretization strategy whereby each edge in the species tree is divided into a finite number of discretization intervals with a new discretization vertex in the middle of each interval. The so modified species tree is referred to as the discretized species tree, denoted S , and realizations of the gene tree that map its vertices to vertices of S are called discretized realizations (Fig. 1c,d show two possible discretized realizations for the reconciliation in Fig. 1a). Using discretized realizations, the right-hand side of Equation (3) may be approximated as   ∈(G)



∈A()



p[G,l,,a|D,S]da 

p[G,l,,d|D,S ] (d),

∈(G) ∈D()

where D() is the set of -compatible discretized realizations, and (d) is the product, over all vertices, v, of G, of the length of the interval that d maps v to (see further the Appendix). The multiplication with (d) provides an approximation of the integration over the interval lengths. Novel algorithms in DLRSOrthology.—We now describe an algorithm for computing the probability that a specific vertex is a speciation (which may be the LCA in G of two homologous genes or a fixed vertex); the remaining details of the method can be found in the Appendix. We express the probability P[v is a speciation|G,l,,S] as

[19:08 3/9/2015 Sysbio-syv044.tex]

= ≈

p[v is a speciation, G,l|,S] p[G,l|,S]  ∈(G) p[v is a speciation,G,l,|,S] 







p[G,l|,S]

∈(G) ∈A() p[v

∈(G)

is a speciation,G,l,a|,S]da

p[G,l|,S] ∈D() p[v

is a speciation,G,l,d|,S ] (d) p[G,l|,S ]

.

To compute this sum of sums efficiently for all vertices of a gene tree G, we extend the dynamic programming (DP) algorithm of (Åkerborg et al. 2009) so that the following three probabilities can be computed for v ∈ V(G) and x ∈ V(S) or e ∈ E(S), where E(S) are the edges of S: o(x,v) the probability that Gv is created, given that the event creating v has occurred on x; b(e,v) the probability that a single lineage starting in e, infinitely close to x, generates both the incoming edge of v and Gv ; and a(x,v) the sum of the probabilities of all discretized   realizations d of G\ Gv \v in S such that d(v) = x; where Gv is the subtree of G rooted at v. Thus, the probability conditioned on S and  that (G,l) is generated and u occurs at x is o(x,v)a(x,v). Recursions for all of these probabilities are described in the Appendix. Note that by computing o(·,·) and b(·,·) in postorder (i.e., for the vertices of a tree, the computation for a child is always performed before that of its parent) with respect to G and S (using (A.1) and (A.2) from the Appendix), we obtain an approximation of p[G,l|,S] (i.e., through p[G,l|,S ] = b(f ,r) where f is the most ancestral edge of S and r is the root vertex of G). Using o(·,·) and a(·,·), we can now compute the speciation probability. From the definition of MPR (Hallett and Lagergren 2000), it follows that, in any reconciliation, a vertex v of G can only be a speciation if the MPR maps it to a vertex of the species tree S. Otherwise, this vertex is an obligate duplication. In other words, the probability that v is a speciation is the probability that a realization maps it to the same species tree vertex as MPR does and it can be computed as follows. Assume that MPR maps v to the species tree vertex x, then P[v is a speciation|G,l,,S ] =

o(x,v)a(x,v) . P[G,l|,S ]

Downloaded from http://sysbio.oxfordjournals.org/ at University of Winnipeg on September 21, 2015

DLRSOrthology Algorithms The previous PrIME-DLRS framework.—The major technical contribution of (Åkerborg et al. 2009) is an algorithm for computing p[G,l,|D,S] and the software implementation of the entire MCMC framework. Let (G) be the set of possible reconciliations of G in S. For any reconciliation  ∈ (G), let A() be the set of possible realizations compatible with . (Åkerborg et al. 2009) express the probability density p[G,l,|D,S] as  

follows: P[v is a speciation|G,l,,S]

P[lcaG (u,v)

is a speciation |G,li ,i ,S].



5

ULLAH ET AL.—DLRSOrthology

(4)

Notice that for the trivial case of obligate duplications, P[v is a speciation|G,l,,S ] will always be zero.

Page: 5

1–14

6

SYSTEMATIC BIOLOGY

Thresholding DLRSOrthology and the Maxmin Point

sensitivity(ω) =



v∈V (sp) (G):P[v is a speciation|G,D,S ]≥ω P[v



specificity(ω) =

v∈V (sp) (G) P[v



is a speciation|G,D,S ]

v∈V (sp) (G):P[v

P[G|D,S ]

is a speciation|G,D,S ] 0.5) from the fixed and variable tree analyses for the AGP\HS data using DLRSOrthology, PrIME-GEM and MRBAYESMPR, respectively LCA of gene tree leavesa,b

Vertexc

(Gga_agp1, Tgu_agp1) (Cfa_agp1, (Mmu_agp1, Mmu_agp2, Mmu_agp3)) (2, Mdo_agp1) (3, Oan_agp1) (1, 4) (2, Oaa_agp1) a Numbers

1 2 3 4 5 6

Fixed tree

Variable tree

DLRSOd

PGEMe

DLRSO

MBMPRf

0.97 0.71 0.96 0.98 0.96 –

0.99 0.99 0.99 0.99 0.99 –

0.76 0.40 0.55 0.56 0.63 –

1.0 1.0 – – 1.0 0.97

refer to descendants of corresponding subtree in column 2. in Figure 5c.

Downloaded from http://sysbio.oxfordjournals.org/ at University of Winnipeg on September 21, 2015

b Obligate duplications are not shown. c Bold subtree numbers refers to vertices d .DLRSOrthology e PrIME-GEM. f MR BAYESMPR.

a)

b)

X Tgu_agp1 Gga_agp1 Oan_agp1 Mdo_agp1 Mmu_agp2 Mmu_agp3 Mmu_agp1 Hsa_agp2 Hsa_agp1 Cfa_agp1

Hsa_Abca13 Hsa_Abca12 Dre_Abca12 Hsa_Abca2 Dre_Abca2 Hsa_Abca4 Dre_Abca4b Dre_Abca4a Hsa_Abca7 Dre_Abca7 Mmu_Abca1 Hsa_Abca1 Dre_Abca1b Dre_Abca1a Cin_47790 Mmu_Abca16 Mmu_Abca15 Mmu_Abca14 Mmu_Abca17 Mmu_Abca3 Hsa_Abca3 Dre_Abca3 Cin_43769 Hsa_Abca9 Hsa_Abca8 Hsa_Abca10 Hsa_Abca6 Hsa_Abca5 Dre_Abca5 Cin_36165

FIGURE 6. Summary of the high-posterior gene trees in DLRS-based posterior distribution for the biological data. a) ABCA. The posterior distribution was dominated by two trees: The first tree had a posterior probability of 77% and is shown in solid outline. The second tree had a posterior probability of 13%, the dashed line indicates the difference to the first tree. b) AGP and AGP\HS . The posterior distribution for the AGP data comprised two trees: The first tree (A), shown in solid outline, had a posterior probability of 0.78. The second tree (B) had a posterior probability of 0.17; the dashed line indicate the difference to tree A. The AGP\HS also comprised two trees: The top tree (A ) had a posterior probability of 0.56, while the second best tree (B ) had a posterior probability of 0.37. A and B are the subtrees of the AGP trees A and B, respectively, with the human genes removed (indicated by a cross).

[19:08 3/9/2015 Sysbio-syv044.tex]

Page: 11

1–14

12

SYSTEMATIC BIOLOGY

DISCUSSION In this article, we have described and evaluated the performance of a probabilistic orthology analysis method named DLRSOrthology. The method takes as input a multiple sequence alignment together with a dated species tree and is based on the STAGTR method PrIME-DLRS (Åkerborg et al. 2009). In the case where the input species tree lacks dating information, it may be dated using MAPDP (Åkerborg et al. 2008), TimeTree (Hedges et al. 2006, http://www.timetree.org), or PhyloBayes (Lartillot and Philippe 2004). The method has several benefits compared to its predecessor, the probabilistic orthology method PrIME-GEM (Sennblad and Lagergren 2009). First, it uses the DLRS model which, compared to the DL model (Arvestad et al. 2003) used by PrIME-GEM, also includes rate variation and sequence evolution. This makes the orthology analysis more biologically realistic and robust. Second, DLRSOrthology can integrate orthology probabilities over the posterior distribution of gene trees, while PrIME-GEM rely on a precomputed gene tree. Furthermore, in contrast to a natural Bayesian generalization of the MPR method, MRBAYESMPR, presented here, DLRSOrthology uses probabilistic orthology analysis, which, as already shown in (Sennblad and Lagergren 2009) and corroborated here, is superior to MPR orthology analysis. Our comparison of orthology probabilities from DLRSOrthology with those from PrIME-GEM and MRBAYESMPR demonstrates clearly better performance.

[19:08 3/9/2015 Sysbio-syv044.tex]

In particular, the parsimony-based orthology analysis tool MRBAYESMPR is sensitive to incomplete taxon sampling, while DLRSOrthology is clearly less so. This is in line with previous results showing that gene tree reconstruction using MRBAYES (Huelsenbeck and Ronquist 2001) is inferior to that of PrIMEDLRS (Åkerborg et al. 2009), partially since the latter method is species tree aware (Åkerborg et al. 2009; Rasmussen and Kellis 2011; Boussau et al. 2013). This is particularly evident in the comparisons of gene tree reconstruction using the ABCA data as described in the “Results” section. The importance of accounting for uncertainty in gene tree estimation is further emphasized by the observation that the published gene tree for ABCA (Annilo et al. 2006) was not among the most probable gene trees, having posterior probabilities of ≤0.01. An analysis relying on such low confidence trees can hardly provide accurate orthology estimates. The observation that probabilistic orthology analysis is superior to MPR orthology analysis has consequences for any MPR-based method, for example, that of Rasmussen and Kellis (2011). Moreover, in contrast to our method, their orthology method requires whole-genome analysis. We note that some commonly used databases provide orthology information (Ensembl, Flicek et al. 2013, and TreeFam, Ruan et al. 2008) based on orthology methods relying on MPR and reconstruction of a single gene tree using methods considered to be less accurate than MRBAYES. Given that our method clearly outperformed MRBAYESMPR implies that a shift to probabilistic species tree aware orthology methods should improve current annotations. It is pertinent to mention here that DLRSOrthology analysis is computationally demanding for large data sets. This is a known challenge in MCMC-based methods (Huelsenbeck and Ronquist 2001; Heled and Drummond 2010; Boussau et al. 2013) and different heuristics have been proposed (Heled and Drummond 2010; Boussau et al. 2013) to tackle the problem; we intend to include these in future releases of the software. Due to the lack of large-scale comparisons, currently no firm conclusion can be drawn. Nevertheless, we believe that, based on its sound mathematical model and competitively very good performance, DLRSOrthology is currently the most reliable probabilistic orthology estimation tool and that it is ideal for situations where accuracy is desirable.

Downloaded from http://sysbio.oxfordjournals.org/ at University of Winnipeg on September 21, 2015

corresponding to the two trees from the AGP analysis with human sequences removed, referred to here as A and B . Gene tree A , shown in Figure 5c, had posterior probability of 0.56, while B had a posterior probability of 0.37. This is in contrast with the MRBAYESMPR results, where B was the the MAP gene tree and had a posterior probability of 0.84, while gene tree A merely had posterior probability 0.13. Although the DLRSOrthology’s speciation probabilities for AGP\HS (Table 3) are, in general, smaller than those for AGP, viewed as rankings they show a high degree of consistency with the AGP results. In particular, using the DLRSOrthology maxmin threshold of AGP\HS (Supplementary Table S2), the LCA of the Mmu_agp1, 2 and 3 and Cfa_agp1 genes (corresponding to vertex 2 in Fig. 5c) is classified as a duplication, in line with the full data case. In contrast, for AGP\HS MRBAYESMPR classifies this vertex as a speciation with probability 1, which contradicts the MRBAYESMPR estimation for the full AGP data set. Furthermore, it is clear that any reasonable threshold for the MRBAYESMPR results will classify vertex 2 as a speciation. This clearly demonstrates the superior robustness of DLRSOrthology in incomplete taxon sampling scenarios.

SOFTWARE AVAILABILITY C++ source code for DLRSOrthology, along with the data set used in current study and the programs required for the generation of synthetic data, are publicly available at http://www.nada.kth.se/∼ikramu/dlrsorthology/. A short tutorial, describing software setup and a demo run, may also be found there.

Page: 12

1–14

2015 SUPPLEMENTARY MATERIAL

Data available from the Dryad data Repository: http://dx.doi.org/10.5061/dryad.4r910.

ACKNOWLEDGMENTS

FUNDING This study was supported by the Swedish eScience Research Center, The Swedish Research Council (2010-4757) and University of Engineering and Technology, Peshawar, Pakistan. Bengt Sennblad’s position is supported by a Karolinska Institute distinguished professor award to Anders Hamsten. Bengt Sennblad acknowledges funding from the Magnus Bergvall Foundation and the Foundation for Old Servants.

APPENDIX We will here describe in detail the algorithms for DLRSOrthology. We use the following notation. For any tree T, L(T) denotes its leaves. For an edge e = (x,y) ∈ E(T) where y is closer to the leaves of T than x, we refer to x as the tail of e and y as the head of e, and say that x is the parent of y, denoted (y), and that y is the child of x. We, moreover, call e an outgoing edge of x and the incoming edge of y. The sibling of a vertex x is denoted by (x).Let S be the discretized species tree and G be the gene tree. The subtree of G rooted at u is denoted by Gu . We denote the time of a vertex x ∈ V(S ) by t(x), and assume time 0 at the leaves and > 0 for internal vertices. Any realization maps each leaf of the gene tree to the species vertex that is induced by the gene and species relation. From the definition of the model follows that the stem tip (the most ancestral point) of the gene tree corresponds to the stem tip of the species tree. The “on” probability.—For x ∈ V(S ) and u ∈ V(G), we define the on probability, denoted as o(x,u), to be the probability of Gu given that the event creating u occurred on x. Assume that u is a vertex of G and that v,w ∈ V(G) are the children of u. Assume further that x is a vertex of S and that y,z ∈ V(S ) are the children of x. Finally, assume that e,f ∈ E(S ) are outgoing edges of x and that

(x) is the time length of the discretization interval associated with x. The probability o(x,u) can now be

expressed as ⎧ b(e,v)b(f ,w)+b(e,w)b(f ,v) ⎪ ⎪ ⎪ ⎪ ⎪ ⎨2b(e,v)b(e,w) (x) o(x,u) = 1 ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ 0

if x is a speciation vertex, if x is a discretization vertex, if u ∈ L(G),x ∈ L(S )and the gene u belongs to species x, otherwise. (A.1)

where b(.,.) denotes the below probability defined as follows. The “below” probability For e ∈ E(S ) with tail x ∈ V(S ) and u ∈ V(G), we define the below probability, denoted as b(e,u), to be the probability that a single lineage starting in e, infinitely close to x, generates both the incoming edge of u and Gu . The probability b(e,u) can be expressed as ⎧ p11 (e,z)( l( (u),u)) ⎪ t(x) ) ⎪ ⎪ ⎨ b(e,u) =  ⎪ p11 (e,z)( l( (u),u)) ⎪ t(x)−t(z) )o(z,u) ⎪ ⎩ z∈D(x) 0

if u is a leaf and z its species in V(S), if u is an internal vertex, otherwise, (A.2)

where D(x) is the set of descendants of x; l( (u),u) denotes length of the edge between (u) and u, and (·) denotes the gamma distribution’s probability density function for substitution rates. p11 (e,z) is the probability that a single gene lineage starting at the tail of e ∈ E(S ) has k descendants at z out of which (k −1) will go extinct before reaching the extant species while one lineage may or may not go extinct. The “above” probability For z ∈ V(S ), u ∈ V(G) and Gu = G\(Gu \u), the above probability, denoted a(z,u), is the sum of the probabilities of all discretized realizations d of Gu in S such that d(u) = z. Gu may be viewed as the subtree of G excluding u and the descendants of u, while above probability may be viewed as the complement to the on probability. The computation of above is based on the observation that, for v ∈ V(G), if we have computed above for its parent u = (v) and below for its sibling w = (v), we can combine those two probabilities and the incoming edge of v to obtain the above probability of v. That is, formally, ⎧ l(u,v) ⎪ ⎪p11 (e(x),z)( t(x)−t(z) ) ⎪ ⎪ ⎨ a(z,v) =  l(u,v) ⎪ x∈A(z) p11 (e(x),z)( t(x)−t(z)| ) ⎪

⎪ ⎪ ⎩b(e(x),w)a(x,u) 1 (x)(2 (x)−1)+1 disc

Downloaded from http://sysbio.oxfordjournals.org/ at University of Winnipeg on September 21, 2015

The authors wish to thank Lars Jermiin, Nicolas Lartillot, Mathieu Groussin, and the anonymous reviewers for constructive comments on the manuscript, and Lars Arvestad at Scilifelab for useful discussions during the study. Tarmo Annilo and Michael Dean kindly provided their ABCA sequence alignment.

[19:08 3/9/2015 Sysbio-syv044.tex]

13

ULLAH ET AL.—DLRSOrthology

if v is the root of G and x is the stem tip of S , otherwise, (A.3)

where A(z) is the set of ancestors of z, e(x) is the outgoing edge of x in S , and 1disc (x) is the indicator function for membership in V(S )\V(S). An illustration of the above probability is given in Supplementary Figure S1. The probability of speciation A vertex v of G can only be a speciation if the MPR maps it to a vertex of the species tree S. A vertex v of G is a speciation with respect to a

Page: 13

1–14

14

SYSTEMATIC BIOLOGY

realization  if and only if (v) is the vertex that MPR maps v to. So the probability that v is a speciation is the probability that a realization maps it to the same species tree vertex as MPR does. It can be computed as follows. Assume that MPR maps v to the species tree vertex x, then P[v is a speciation|G,l,,S ] =

o(x,v)a(x,v) . P[G]

(A.4)

REFERENCES

[19:08 3/9/2015 Sysbio-syv044.tex]

Page: 14

Downloaded from http://sysbio.oxfordjournals.org/ at University of Winnipeg on September 21, 2015

Åkerborg O., Sennblad B., Arvestad L., Lagergren J. 2009. Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. Proc. Natl. Acad. Sci. USA 106:5714–5719. Åkerborg O., Sennblad B., Lagergren J. 2008. Birth-death prior on phylogeny and speed dating. BMC Evol. Biol. 8:77. Alex L., Natalia B., Sylvia R. 2007. Fully bayesian mixture model for differential gene expression: simulations and model checks. Stat. Appl. Genet. Mol. Biol. 6:1–28. Altenhoff A.M., Gil M., Gonnet G.H., Dessimoz C. 2013. Inferring hierarchical orthologous groups from orthologous gene pairs. PLoS ONE 8:e53786. Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403–410. Annilo T., Chen Z.-Q., Shulenin S., Costantino J., Thomas L., Lou H., Stefanov S., Dean M. 2006. Evolution of the vertebrate ABC gene family: analysis of gene birth and death. Genomics 88:1–11. Arvestad L., Berglund A.-C., Lagergren J., Sennblad B. 2003. Bayesian gene/species tree reconciliation and orthology analysis using MCMC. Bioinform. 19:i7–i15. Arvestad L., Lagergren J., Sennblad B. 2009. The gene evolution model and computing its associated probabilities. J. ACM 56:1–44. Blair J.E., Shah P., Hedges S.B. 2005. Evolutionary sequence analysis of complete eukaryote genomes. BMC Bioinform. 6:53. Boussau B., Szöllosi G.J., Duret L., Gouy M., Tannier E., Daubin V. 2013. Genome-scale coestimation of species and gene trees. Genome Res. 23:323–330. Bradley A.P. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn. 30:1145– 1159. Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376. Fitch W.M. 1970. Distinguishing homologous from analogous proteins. Syst. Zool. 19:99–113. Flicek P., Ahmed I., Amode M.R., Barrell D., Beal K., Brent S., Carvalho-Silva D., Clapham P., Coates G., Fairley S., Fitzgerald S., Gil L., García-Girón C., Gordon L., Hourlier T., Hunt S., Juettemann T., Kähäri K., Keenan A.S., Komorowska M., Kulesha E., Longden I., Maurel T., McLaren W.M., Muffato M., Nag R., Overduin B., Pignatelli M., Pritchard B., Pritchard E., Riat S., Ritchie H.G.R.S., Ruffier M., Schuster M., Sheppard D., Sobral D., Taylor K., Thormann A., Trevanion S., White S., Wilder S.P., Aken B.L., Birney E., Cunningham F., Dunham I., Harrow J., Herrero J., Hubbard T.J.P., Johnson N., Kinsella R., Parker A., Spudich G., Yates A., Zadissa A., Searle S.M.J. 2013. Ensembl 2013. Nucleic Acids Res. 41:D48–D55. Fournier T., Medjoubi-N N., Porquet D. 2000. Alpha-1-acid glycoprotein. Biochim. Biophys. Acta 1482:157–171. Fulton D.L., Li Y., Laird Y.R., Horsman M.S., Roche B.G.F.M., Brinkman L.F.S. 2006. Improving the specificity of high-throughput ortholog prediction. BMC Bioinform. 7:270.

Gogarten J.P., Townsend J.P. 2005. Horizontal gene transfer, genome innovation and evolution. Nat. Rev. Microbiol. 3:679–687. Goodman M., Czelusniak J., Moore G.W., Romero-Herrera A.E., Matsuda G. 1979. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Biol. 28:132–163. Hallett M.T., Lagergren J. 2000. New algorithms for the duplicationloss model. RECOMB 2000: Proceedings of the Fourth Annual International Conference on Computational Molecular Biology. New York: ACM Press, 138–146. Hedges S.B., Dudley J., Kumar S. 2006. TimeTree: a public knowledgebase of divergence times among organisms. Bioinform. 22: 2971–2972. Heled J., Drummond A.J. 2010. Bayesian inference of species trees from multilocus data. Mol. Biol. Evol. 27:570–580. Ho S.Y., Jermiin L.S. 2004. Tracing the decay of the historical signal in biological sequence data. Syst. Biol. 53:623–637. Huelsenbeck J.P., Ronquist F. 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinform. 17:754–755. Hulsen T., de Vlieg, J., Leunissen J.A.M., Groenen P.M.A. 2008. OPTIC: orthologous and paralogous transcripts in clades. BMC Bioinform. 36:D267–270. Jermiin L.S., Ho S.Y., Ababneh F., Robinson J., Larkum A.W. 2004. The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst. Biol. 53:638–643. Jones D.T., Taylor W.R., Thornton J.M. JTT: 1992. The rapid generation of mutation data matrices from protein sequences. Comp. Appl. Biosci. 8:275–282. Lartillot N., Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol. 21:1095–1109. Linder M., Britton T., Sennblad B. 2011. Evaluation of Bayesian models of substitution rate evolution–parental guidance versus mutual independence. Syst. Biol. 60:329–342. Maddison P.W. 1997. Gene trees in species trees. Syst. Biol. 46:523–536. Mahmudi O., Sjöstrand J., Sennblad B., Lagergren J. 2013. Genomewide probabilistic reconciliation analysis across vertebrates. BMC Bioinform. 14 (Suppl 15) S10. Rasmussen M.D., Kellis M. 2011. A Bayesian approach for fast and accurate gene tree reconstruction. Mol. Biol. Evol. 28:273–290. Remm M., Storm C.E., Sonnhammer E.L. 2001. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314:1041–1052. Ruan J., Li H., Chen Z., Coghlan A., Coin L.J.M., Guo Y., Hériché J.-K., Hu Y., Kristiansen K., Li R., Liu T., Moses A., Qin J., Vang S., Vilella A.J., Ureta-Vidal A., Bolund L., Wang J., Durbin R. 2008. TreeFam: 2008 update. Nucleic Acids Res. 36:D735–D740. Sennblad B., Lagergren J. 2009. Probabilistic orthology analysis. Syst. Biol. 58:411–424. Sjöstrand J., Sennblad B., Arvestad L., Lagergren J. 2012. DLRS: Gene tree evolution in light of a species tree. Bioinform. 28: 2994–2995. Smith T., Waterman M. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195–197. Tatusov R.L., Fedorova N.D., Jackson J.D., A.R. Jacobs, B. Kiryutin, Koonin E.V., Krylov D.M., Mazumder R., Mekhedov S.L., Nikolskaya A.N., Rao B.S., Smirnov S., Sverdlov A.V., Vasudevan S., Wolf Y.I., Yin J.J., Natale D.A. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinform. 4:41. Wu Y.-C., Rasmussen M.D., Bansal M.S., Kellis M. 2014. Most parsimonious reconciliation in the presence of gene duplication, loss, and deep coalescence using labeled coalescent trees. Genome Res. 24:475–486.

1–14

Integrating Sequence Evolution into Probabilistic Orthology Analysis.

Orthology analysis, that is, finding out whether a pair of homologous genes are orthologs - stemming from a speciation - or paralogs - stemming from a...
544KB Sizes 2 Downloads 6 Views