Research Articles

JOURNAL OF COMPUTATIONAL BIOLOGY, Volume 21, Number 10, 2014. © Mary Ann Liebert, Inc. Pp. 723–731. DOI: 10.1089/cmb.2014.0093

Maximal Acyclic Agreement Forests

JOSH VOORKAMP

ABSTRACT

Finding the hybridization number of a pair or set of trees, $\mathcal{P}$, is a well-studied problem in phylogenetics and is equivalent to finding a maximum acyclic agreement forest (MAAF) for $\mathcal{P}$. This article defines a new type of acyclic agreement forest called a maximal acyclic agreement forest (mAAF). The property for which mAAFs are "simplest" is more general and could be considered more biologically relevant than the corresponding property for MAAFs, and the set of MAAFs for any $\mathcal{P}$ is a subset of the set of mAAFs for $\mathcal{P}$. This article also presents two new algorithms: one finds a mAAF for any $\mathcal{P}$ in polynomial time, and the other is an exhaustive search that finds all mAAFs for some $\mathcal{P}$, which is also a new approach to finding the hybridization number when applied to a pair of trees. The exhaustive search algorithm is applied to a real-world data set, and the findings are compared to previous results.

Key words: agreement forests, algorithms, exhaustive search, partitions, phylogenetics.

1. INTRODUCTION

As groups of taxa are found whose ancestry does not appear to be sufficiently summarized by a single tree, the problem of inferring ancestral relationships requires solutions that can account for hybridization, reticulation, and/or horizontal gene transfer. Given a set of trees $\mathcal{P}$ describing the ancestral relationships between some set of species, an established problem is to determine the minimum number of hybridization events that must have occurred. If $\mathcal{P}$ consists of a pair of trees, the solution to this problem is equivalent to finding a maximum acyclic agreement forest (MAAF) for $\mathcal{P}$ (Baroni et al., 2005). Although this problem is NP-hard (Bordewich and Semple, 2006), there has been a succession of results, including Bordewich et al. (2007), Collins (2009), Collins et al. (2011), Wu and Wang (2010), and Albrecht et al. (2011), which have significantly decreased the required calculation time.

However, if the space of acyclic agreement forests (AAFs) of $\mathcal{P}$ is considered, then MAAFs may not capture the complete picture. Let $\mathcal{N}(\{\mathcal{T}, \mathcal{T}'\})$ be the set of all phylogenetic networks that display $\mathcal{T}$ and $\mathcal{T}'$. The set of MAAFs corresponds to the networks in $\mathcal{N}(\{\mathcal{T}, \mathcal{T}'\})$ that have the fewest hybridization events of all networks in $\mathcal{N}(\{\mathcal{T}, \mathcal{T}'\})$; call this set of networks $\mathcal{N}_{\mathrm{MAAF}}(\{\mathcal{T}, \mathcal{T}'\})$. Suppose there is a set of taxa $L \subseteq L(\mathcal{T})$ that labels a common subtree of $\mathcal{T}$ and $\mathcal{T}'$, and it is known that for any $\ell, \ell' \in L$ the path connecting $\ell$ to $\ell'$ does not contain any hybridization events. It is possible that no network in $\mathcal{N}_{\mathrm{MAAF}}(\{\mathcal{T}, \mathcal{T}'\})$ preserves this property. However, if $\mathcal{N}_{\mathrm{mAAF}}(\{\mathcal{T}, \mathcal{T}'\})$ is the set of all networks corresponding to mAAFs, then there is a network in $\mathcal{N}_{\mathrm{mAAF}}(\{\mathcal{T}, \mathcal{T}'\})$ with such a property. So whereas MAAFs are informally the collections of treelike relationships such that there is no other AAF with fewer members, the mAAFs are the collections of treelike relationships such that there is no other AAF with fewer members that also includes the same treelike relationships.

Author affiliation: Department of Mathematics & Statistics, University of Otago, Dunedin, New Zealand.

2. NOTATION

Other than the exceptions noted here, the notation and definitions match those in Semple and Steel (2003). A consistent difference is the usage of the calligraphic font. Here the calligraphic font is reserved for mappings that return, or variables that are, either leaf-labeled digraphs or sets of leaf-labeled digraphs; for example, $\mathcal{L}(T)$ becomes $L(\mathcal{T})$.

The notation $v \rightsquigarrow_{\mathcal{D}} v'$ indicates that there exists a directed path from $v$ to $v'$ in a digraph $\mathcal{D}$. When $\mathcal{D}$ is clear from context it may be omitted. In the case that either $v \rightsquigarrow_{\mathcal{D}} v'$ or $v = v'$ is true, the notation $v \,\underline{\rightsquigarrow}_{\mathcal{D}}\, v'$ will be used.

All trees are assumed to be directed binary phylogenetic trees and planted (the root is labeled and has outdegree 1). A tree $\mathcal{S}$ is a subtree of $\mathcal{T}$, denoted $\mathcal{S} \leq \mathcal{T}$, if $\mathcal{S} \cong \mathcal{T}|L(\mathcal{S})$, a slightly stronger definition than appears elsewhere. If $\mathcal{S}$ and $\mathcal{S}'$ are two subtrees of a phylogenetic tree $\mathcal{T}$, then $\mathcal{S}$ and $\mathcal{S}'$ are said to overlap with regard to $\mathcal{T}$ if $A(\mathcal{T}(L(\mathcal{S}))) \cap A(\mathcal{T}(L(\mathcal{S}'))) \neq \emptyset$, that is, the sets of arcs of $\mathcal{T}$ that correspond to those in $\mathcal{S}$ and $\mathcal{S}'$ respectively have a nonempty intersection.

A profile $\mathcal{P}$ is a set of phylogenetic trees such that $L(\mathcal{T}) = L(\mathcal{T}')$ for any pair of trees $\mathcal{T}, \mathcal{T}' \in \mathcal{P}$. A forest $\mathcal{F} = \{t_0, t_1, \ldots, t_k\}$ is a collection of phylogenetic trees $t_0, t_1, \ldots, t_k$ such that $\{L(t_0), L(t_1), \ldots, L(t_k)\}$ is a partition of a set $L(\mathcal{F}) = \bigcup_{t_i \in \mathcal{F}} L(t_i)$. It follows that at most one tree in $\mathcal{F}$ has a labeled root and thus is planted, and in the forests in this article exactly one does. A forest $\mathcal{F}$ of a phylogenetic tree $\mathcal{T}$ is a forest such that $L(\mathcal{F}) = L(\mathcal{T})$, $t_i \leq \mathcal{T}$ for every $t_i \in \mathcal{F}$, and no pair of trees in $\mathcal{F}$ overlap with regard to $\mathcal{T}$. An agreement forest $\mathcal{F}$ of a profile $\mathcal{P}$ is a forest that is a forest of every element of $\mathcal{P}$.

If $\mathcal{F}$ is an agreement forest of a profile $\mathcal{P}$, the ancestor-descendant digraph $\mathcal{AG}(\mathcal{F}, \mathcal{P})$ is given by

$$\mathcal{AG}(\mathcal{F}, \mathcal{P}) = \bigl(\mathcal{F},\ \{(t_i, t_j) : \mathrm{mrca}_{\mathcal{T}} L(t_i) \rightsquigarrow \mathrm{mrca}_{\mathcal{T}} L(t_j) \text{ for some } \mathcal{T} \in \mathcal{P}\}\bigr) \qquad (1)$$

If $\mathcal{AG}(\mathcal{F}, \mathcal{P})$ is acyclic then $\mathcal{F}$ is said to be acyclic with respect to $\mathcal{P}$; otherwise it is cyclic. Two forests are isomorphic, written $\mathcal{F} \cong \mathcal{F}'$, if the elements can be paired into isomorphic trees.

2.1. Maximal acyclic agreement forest

Throughout, mAAFs will be described as the maximal elements according to a partial ordering on the space of forests.

Definition 2.1. A forest $\mathcal{F}'$ is a subforest of a forest $\mathcal{F}$, denoted $\mathcal{F}' \leq \mathcal{F}$, if $L(\mathcal{F}') = L(\mathcal{F})$ and there is a partition $P$ of $\mathcal{F}'$ such that every $A \in P$ is a forest of some $t_i \in \mathcal{F}$.

Lemma 2.1. If $\mathcal{F}$ is an AAF of a profile $\mathcal{P}$ and $\mathcal{F}' \leq \mathcal{F}$ then $\mathcal{F}'$ is an AAF of $\mathcal{P}$.

Proof. There are three properties to check: that each tree in $\mathcal{F}'$ is a subtree of the trees in $\mathcal{P}$, that $\mathcal{F}'$ is nonoverlapping in the trees in $\mathcal{P}$, and that $\mathcal{F}'$ is acyclic with respect to $\mathcal{P}$.

For every tree $t' \in \mathcal{F}'$ there is a tree $t \in \mathcal{F}$ such that $t' \leq t$, and $t \leq \mathcal{T}$ for all $t \in \mathcal{F}$ and $\mathcal{T} \in \mathcal{P}$, and so the first property holds.

Consider any pair of trees $t'_i, t'_j \in \mathcal{F}'$. Either there is a single tree $t \in \mathcal{F}$ such that $t'_i \leq t$ and $t'_j \leq t$, or there are distinct trees $t_i$ and $t_j$ such that $t'_i \leq t_i$ and $t'_j \leq t_j$. In the former case, $t'_i$ and $t'_j$ are nonoverlapping with respect to $t$, and since $t \leq \mathcal{T}$ for every $\mathcal{T} \in \mathcal{P}$ it follows that $t'_i$ and $t'_j$ are nonoverlapping with respect to $\mathcal{P}$. If $t'_i$ and $t'_j$ are subtrees of $t_i$ and $t_j$ in $\mathcal{F}$ respectively, then they are subtrees of trees that are nonoverlapping with respect to every tree in $\mathcal{P}$, and hence are themselves nonoverlapping.

Assume that $\mathcal{F}'$ is cyclic with respect to $\mathcal{P}$ and let $C' = (t'_0, t'_1, \ldots, t'_n)$ for $n > 0$ be a cycle in $\mathcal{AG}(\mathcal{F}', \mathcal{P})$, where $t'_i \rightsquigarrow_{\mathcal{AG}(\mathcal{F}', \mathcal{P})} t'_{i+1}$ for $i \in [0, n-1]$ and $t'_n \rightsquigarrow_{\mathcal{AG}(\mathcal{F}', \mathcal{P})} t'_0$. If there is a tree $t \in \mathcal{F}$ such that $t'_i \leq t$ for all $i$, then it follows that $\mathrm{mrca}_t L(t'_0) = \mathrm{mrca}_t L(t'_1) = \cdots = \mathrm{mrca}_t L(t'_n)$, and as $t$ is binary it follows that the trees in $\mathcal{F}'$ are overlapping with regard to $t$, which contradicts the definition. Alternatively, assume the trees in $C'$ are subtrees of at least two trees in $\mathcal{F}$. Consider the sequence of trees $(t_0, t_1, \ldots, t_n)$ for $t_i \in \mathcal{F}$ where $t'_i \leq t_i$ for all $i$. It can be shown that this sequence satisfies $t_i \,\underline{\rightsquigarrow}_{\mathcal{AG}(\mathcal{F}, \mathcal{P})}\, t_{i+1}$ for $i \in [0, n-1]$ and $t_n \,\underline{\rightsquigarrow}_{\mathcal{AG}(\mathcal{F}, \mathcal{P})}\, t_0$. It was already established that there are at least two distinct trees in this sequence, and so it represents a cycle in $\mathcal{AG}(\mathcal{F}, \mathcal{P})$, which contradicts the hypothesis, and so $\mathcal{F}'$ must be acyclic. ∎


The justification for mAAFs should now be clear. If $\mathcal{F}$ is a mAAF of $\mathcal{P}$ then every subforest of $\mathcal{F}$ is also an AAF, and every AAF is a subforest of at least one mAAF; the latter is a property that MAAFs, in particular, do not share.

Definition 2.2. An AAF $\mathcal{F}$ of a profile $\mathcal{P}$ is a maximal acyclic agreement forest of $\mathcal{P}$ if, whenever there exists an AAF $\mathcal{F}'$ of $\mathcal{P}$ with $\mathcal{F} \leq \mathcal{F}'$, then $\mathcal{F} \cong \mathcal{F}'$.

3. FINDING MAXIMAL ACYCLIC FORESTS

Although finding an MAAF is NP-hard (Bordewich and Semple, 2006), a mAAF can be found in polynomial time, as will be demonstrated in the next few sections. Informally the algorithm is based on the following idea. If $\mathcal{F}$ is not a mAAF of a profile, there must exist an AAF of which $\mathcal{F}$ is a nonisomorphic subforest. If such a forest exists, then there is a pair of trees in $\mathcal{F}$ that can combine to produce an AAF. Starting with a forest consisting of the elements of $L(\mathcal{F})$ as isolated vertices, the algorithm iteratively checks pairs of trees to see if replacing them with some combination of the pair results in an AAF. When every current pair is checked and no pair can be combined, the resulting forest is a mAAF. This checking, even if done naïvely, can be done in $O(|L(\mathcal{F})|^3)$ time, so all that remains is to show that the checks at each step only take polynomial time.

To accomplish this, the algorithm find_mAAF uses one subalgorithm for each check that must be applied to the forest: every tree in the forest must be a subtree of the profile, the trees in the forest are nonoverlapping with respect to the profile, and the forest is acyclic with respect to the profile. To reduce running time, the proofs and implemented algorithm take advantage of the fact that, since only one pair of trees is combined at each step, the tests only need to compare this new tree against the remaining ones. The proofs (see footnote 2) also make extensive use of the MRCA result against the original trees, which decreases the processing that is required and makes the operation linear time in the size of the set of labels, using a result from Berkman and Vishkin (1989), which allows for $\mathrm{mrca}_{\mathcal{T}} L$ to be found in $O(|L|)$ time after an $O(|L(\mathcal{T})|)$ preprocessing step.

3.1. Combining subtrees

This algorithm is responsible for finding, for a specified pair $t_i, t_j \in \mathcal{F}$, a tree $t$, if it exists, such that $t \leq \mathcal{T}$ for all $\mathcal{T} \in \mathcal{P}$ and $t_i$ and $t_j$ are nonoverlapping subtrees of $t$. Informally, for a specific tree $\mathcal{T}$ in $\mathcal{P}$ it first determines which of $t_i$ and $t_j$ is pendant; then, if only one is pendant, it identifies the arc of the nonpendant tree to which the other attaches via a set of descendant leaves. If such a tree can exist, it then verifies that either both are pendant or the descendant leaves are the same for the remaining trees in $\mathcal{P}$.

Lemma 3.1. Let $t_i$ and $t_j$ be two nonoverlapping subtrees of a tree $\mathcal{T}$ and $\mathcal{S}$ be the subtree of $\mathcal{T}$ given by $\mathcal{S} = \mathcal{T}|(L(t_i) \cup L(t_j))$. At least one of $t_i$ and $t_j$ is a pendant subtree of $\mathcal{S}$.

Proof. Omitted. ∎

Lemma 3.2. Let $t_i$ and $t_j$ be two nonoverlapping subtrees of all trees in a profile $\mathcal{P}$. Determining the tree $\mathcal{S}$ such that $\mathcal{S} \cong \mathcal{T}|(L(t_i) \cup L(t_j))$ for all $\mathcal{T} \in \mathcal{P}$, if it exists, takes polynomial time.

Proof. From Lemma 3.1 there are three possible relative arrangements of $t_i$ and $t_j$ in $\mathcal{T}|(L(t_i) \cup L(t_j))$ for each $\mathcal{T} \in \mathcal{P}$; namely, only $t_i$ is pendant, only $t_j$ is pendant, or both are pendant. For the rest of the proof let $\mathcal{T}'$ be a specific tree in $\mathcal{P}$, $v_i = \mathrm{mrca}_{\mathcal{T}'} L(t_i)$, $v_j = \mathrm{mrca}_{\mathcal{T}'} L(t_j)$, and $v_{ij} = \mathrm{mrca}_{\mathcal{T}'}(L(t_i) \cup L(t_j)) = \mathrm{mrca}_{\mathcal{T}'}\{v_i, v_j\}$.

Suppose both $t_i$ and $t_j$ are pendant in $\mathcal{T}'|(L(t_i) \cup L(t_j))$, in which case $v_{ij} \rightsquigarrow v_i$ and $v_{ij} \rightsquigarrow v_j$ with $v_i \neq v_j$; thus $v_i$, $v_j$, and $v_{ij}$ are all distinct. If $\mathcal{S}$ exists it must be obtained by attaching the roots of $t_i$ and $t_j$ to a newly created vertex. To verify that $\mathcal{S}$ exists it is sufficient to check that for all remaining $\mathcal{T} \in \mathcal{P}$ the vertices $\mathrm{mrca}_{\mathcal{T}} L(t_i)$, $\mathrm{mrca}_{\mathcal{T}} L(t_j)$, and $\mathrm{mrca}_{\mathcal{T}}(L(t_i) \cup L(t_j))$ are distinct.

In the case that only one is pendant, first determine which by checking whether $v_{ij} = v_i$, in which case $v_{ij} = v_i \rightsquigarrow v_j$ and $t_j$ is pendant, or $v_{ij} = v_j$, in which case $v_{ij} = v_j \rightsquigarrow v_i$ and $t_i$ is pendant. Without loss of generality assume $t_i$ is pendant. The set of points $M = \{\mathrm{mrca}_{\mathcal{T}'}\{v_i, \ell_j\} : \ell_j \in L(t_j)\}$ lies along the path from $v_j$ to $v_i$. Let $m \in M$ be the vertex such that there is no $m' \in M$ with $m \rightsquigarrow m'$. The purpose of $m$ is to determine the edge in $t_j$ to bisect and attach $t_i$ in order to obtain the required $\mathcal{S}$. Let $A = \{\ell_j \in L(t_j) : \mathrm{mrca}_{\mathcal{T}'}(L(t_i) \cup \{\ell_j\}) = m\}$ and $B = \{\ell_j \in L(t_j) : \mathrm{mrca}_{\mathcal{T}'}(L(t_i) \cup \{\ell_j\}) \neq m\} = L(t_j) \setminus A$. If $\mathcal{S}$ exists it is obtained by bisecting the in-arc of $\mathrm{mrca}_{t_j} A$ and attaching $t_i$ to the newly created vertex. To establish that $\mathcal{S}$ exists in this case, two steps need to be taken for all remaining trees $\mathcal{T}$. First check that the same tree is pendant by verifying $\mathrm{mrca}_{\mathcal{T}}(L(t_i) \cup L(t_j)) = \mathrm{mrca}_{\mathcal{T}} L(t_j) \rightsquigarrow \mathrm{mrca}_{\mathcal{T}} L(t_i)$; then find $m_{\mathcal{T}} = \mathrm{mrca}_{\mathcal{T}} A$ and verify that $\mathrm{mrca}_{\mathcal{T}}\{\ell_b, m_{\mathcal{T}}\} \neq m_{\mathcal{T}}$ for all $\ell_b \in B$. If either check fails, then $\mathcal{S}$ does not exist.

When $v_i$, $v_j$, and $v_{ij}$ are distinct the running time is $O(|L(t_i)| + |L(t_j)|)$ per tree. Otherwise the running time is $O(|L(t_i)| + |L(t_j)|)$ per $\mathcal{T} \in \mathcal{P}$, as values involving the MRCA need to be calculated once for all the labels in one tree and twice for all the labels in the other. The total running time follows as $O(|\mathcal{P}|(|L(t_i)| + |L(t_j)|))$, after the preprocessing step of $O(|\mathcal{P}||L(\mathcal{P})|)$. ∎

Footnote 2. The linear-time MRCA result is not yet implemented in the algorithm, and so in the implementation each calculation of an MRCA takes $O(|L(\mathcal{T})|^2)$ time.
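The tree $\mathcal{S}$ of Lemma 3.2 can also be found by brute force, which may be useful as a cross-check of an MRCA-based implementation. The Python sketch below is not the procedure of the proof: it simply restricts every tree of the profile to $L(t_i) \cup L(t_j)$ and compares the results. It assumes that trees are written as nested tuples with string leaf labels and that the labeled planted root is left out (it could be modeled as one more leaf attached above the root). The names `leaves`, `restrict`, and `combine` are illustrative and do not come from the article.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Return tree|keep in a canonical form, or None if no leaf is kept."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]  # suppress the vertex left with a single child
    # Sorting the children makes isomorphic trees compare equal.
    return tuple(sorted(kids, key=repr))


def combine(profile, labels_i, labels_j):
    """Return S with S = T|(L_i u L_j) for every T in the profile, else None."""
    keep = set(labels_i) | set(labels_j)
    shapes = {restrict(tree, keep) for tree in profile}
    return shapes.pop() if len(shapes) == 1 else None


# Assumed toy profile on four taxa.
T1 = ((("a", "b"), "c"), "d")
T2 = (("a", "b"), ("c", "d"))
print(combine([T1, T2], {"a"}, {"b"}))            # both trees agree on {a, b}
print(combine([T1, T2], {"a", "b"}, {"c", "d"}))  # None: the trees disagree
```

This naive version costs time linear in the size of each tree per call, whereas the proof above only touches the labels of $t_i$ and $t_j$ after preprocessing.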

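3.2. Determining overlap

This algorithm checks if a specified tree overlaps with any other tree in the forest with regard to a profile. Informally it determines if there is any path from the MRCA of the label set of one tree to one of its leaves that passes through the MRCA of the label set of the other tree in any member of the profile.

Lemma 3.3. Let $t_i$ and $t_j$ be subtrees of all trees in a profile $\mathcal{P}$. It is the case that $t_i$ and $t_j$ overlap in $\mathcal{P}$ if and only if, for some $\mathcal{T} \in \mathcal{P}$, either there exists $\ell_i \in L(t_i)$ such that $\mathrm{mrca}_{\mathcal{T}} L(t_i) \rightsquigarrow \mathrm{mrca}_{\mathcal{T}} L(t_j) \rightsquigarrow \ell_i$, or there exists $\ell_j \in L(t_j)$ such that $\mathrm{mrca}_{\mathcal{T}} L(t_j) \rightsquigarrow \mathrm{mrca}_{\mathcal{T}} L(t_i) \rightsquigarrow \ell_j$, or $\mathrm{mrca}_{\mathcal{T}} L(t_i) = \mathrm{mrca}_{\mathcal{T}} L(t_j)$.

Proof. Let $\mathcal{T}'$ be a tree in $\mathcal{P}$. Assume $t_i$ and $t_j$ overlap in $\mathcal{T}'$. It follows that $A(\mathcal{T}'(L(t_i))) \cap A(\mathcal{T}'(L(t_j))) \neq \emptyset$ and thus either $\mathrm{mrca}_{\mathcal{T}'} L(t_i) \rightsquigarrow \mathrm{mrca}_{\mathcal{T}'} L(t_j)$, $\mathrm{mrca}_{\mathcal{T}'} L(t_j) \rightsquigarrow \mathrm{mrca}_{\mathcal{T}'} L(t_i)$, or $\mathrm{mrca}_{\mathcal{T}'} L(t_i) = \mathrm{mrca}_{\mathcal{T}'} L(t_j)$. They clearly overlap in the last case, so of the remaining two and without loss of generality pick the former. Assume there is no $\ell_i \in L(t_i)$ such that $\mathrm{mrca}_{\mathcal{T}'} L(t_i) \rightsquigarrow \mathrm{mrca}_{\mathcal{T}'} L(t_j) \rightsquigarrow \ell_i$, in which case there is no path from $\mathrm{mrca}_{\mathcal{T}'} L(t_j)$ to any $\ell_i$ and $t_j$ is a pendant subtree of $\mathcal{T}'|(L(t_i) \cup L(t_j))$. In this case the subtrees do not overlap, and so such an $\ell_i$ must exist.

For the other direction assume that there exists a leaf $\ell_i \in L(t_i)$ such that $\mathrm{mrca}_{\mathcal{T}} L(t_i) \rightsquigarrow \mathrm{mrca}_{\mathcal{T}} L(t_j) \rightsquigarrow \ell_i$. As both trees are binary, both, and thus all, out-arcs of $\mathrm{mrca}_{\mathcal{T}} L(t_j)$ are in $A(\mathcal{T}(L(t_j)))$, and as the path from $\mathrm{mrca}_{\mathcal{T}} L(t_i)$ to $\ell_i$ passes through this vertex, one of the out-arcs must also be in $A(\mathcal{T}(L(t_i)))$, and so the trees overlap.

Lemma 3.4. Let $\mathcal{F}$ be a forest of a profile $\mathcal{P}$ and let $t \in \mathcal{F}$ be a specific tree such that $\mathcal{F} \setminus \{t\}$ is an AAF of $\mathcal{P}|(L(\mathcal{F}) \setminus L(t))$. Determining if there is a pair of trees in $\mathcal{F}$ that overlap in a tree in $\mathcal{P}$ takes polynomial time.

Proof. Since $\mathcal{F} \setminus \{t\}$ is an AAF of $\mathcal{P}|(L(\mathcal{F}) \setminus L(t))$, if a pair of trees in $\mathcal{F}$ overlap then one of the pair must be $t$. For each tree $\mathcal{T} \in \mathcal{P}$ and for each $t' \in \mathcal{F}$ calculate $v = \mathrm{mrca}_{\mathcal{T}} L(t)$ and $v' = \mathrm{mrca}_{\mathcal{T}} L(t')$. If $v = v'$ the trees overlap. Next calculate $w = \mathrm{mrca}_{\mathcal{T}}\{v, v'\}$. If $w \neq v$ and $w \neq v'$ then the trees do not overlap because they are both pendant in $\mathcal{T}|(L(t) \cup L(t'))$. Without loss of generality assume $w = v$; then for each $\ell \in L(t)$ check if $\mathrm{mrca}_{\mathcal{T}}\{v', \ell\} = v'$. If such a leaf exists then the two trees overlap in $\mathcal{T}$. Thus after $O(|\mathcal{P}||L(\mathcal{P})|)$ preprocessing steps the algorithm takes $O(|\mathcal{P}||\mathcal{F}|(|L(t)| + |L(\mathcal{P})|))$ time. ∎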
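The criterion of Lemma 3.3 for a single tree can be sketched in Python as follows. This is an illustration under stated assumptions rather than the author's implementation: trees are nested tuples with string leaf labels (the planted root is omitted), each vertex is identified with the subtree below it, and MRCAs are computed naively by walking parent pointers instead of using the Berkman and Vishkin (1989) structure. All function names are hypothetical.

```python
def parent_map(tree):
    """Map every vertex (identified with the subtree below it) to its parent."""
    parent = {tree: None}
    stack = [tree]
    while stack:
        v = stack.pop()
        if not isinstance(v, str):
            for child in v:
                parent[child] = v
                stack.append(child)
    return parent


def ancestors(parent, v):
    """Return v followed by its proper ancestors, ending at the root."""
    path = []
    while v is not None:
        path.append(v)
        v = parent[v]
    return path


def mrca(parent, labels):
    """Most recent common ancestor of a nonempty collection of leaf labels."""
    labels = list(labels)
    common = ancestors(parent, labels[0])
    for label in labels[1:]:
        up = set(ancestors(parent, label))
        common = [v for v in common if v in up]
    return common[0]  # deepest vertex shared by all root paths


def overlaps_in_tree(tree, labels_i, labels_j):
    """Lemma 3.3 for one tree: equal MRCAs, or one MRCA lies on a
    path from the other MRCA down to one of the other tree's leaves."""
    parent = parent_map(tree)
    v_i, v_j = mrca(parent, labels_i), mrca(parent, labels_j)
    if v_i == v_j:
        return True

    def passes(top, mid, leaf_labels):
        # top ~> mid ~> leaf for some leaf in leaf_labels
        if top not in ancestors(parent, mid)[1:]:
            return False
        return any(mid in ancestors(parent, leaf) for leaf in leaf_labels)

    return passes(v_i, v_j, labels_i) or passes(v_j, v_i, labels_j)


def overlaps_in_profile(profile, labels_i, labels_j):
    """Two label sets overlap in the profile if they overlap in some tree."""
    return any(overlaps_in_tree(t, labels_i, labels_j) for t in profile)


# Assumed toy trees.
T1 = (("a", "b"), ("c", "d"))
T2 = ((("a", "b"), "c"), "d")
print(overlaps_in_tree(T1, {"a", "b"}, {"c", "d"}))  # False: both pendant
print(overlaps_in_tree(T2, {"a", "d"}, {"b", "c"}))  # True: shared arc
```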

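3.3. Checking for acyclicity

This algorithm is responsible for finding any cycles in $\mathcal{AG}(\mathcal{F}, \mathcal{P})$. Unlike the previous two subalgorithms, this algorithm has two distinct parts. The first is construction of $\mathcal{AG}(\mathcal{F}, \mathcal{P})$ and the second is checking it for a cycle that passes through a specified tree.

Lemma 3.5. Let $\mathcal{F}$ be an agreement forest of $\mathcal{P}$. Constructing $\mathcal{AG}(\mathcal{F}, \mathcal{P})$ takes polynomial time.

Proof. There is an arc $(t_i, t_j) \in A(\mathcal{AG}(\mathcal{F}, \mathcal{P}))$ precisely when, for some $\mathcal{T} \in \mathcal{P}$, $\mathrm{mrca}_{\mathcal{T}} L(t_i) \rightsquigarrow \mathrm{mrca}_{\mathcal{T}} L(t_j)$, equivalently $\mathrm{mrca}_{\mathcal{T}} L(t_i) = \mathrm{mrca}_{\mathcal{T}}(L(t_i) \cup L(t_j))$. Calculating this for every pair $t_i, t_j \in \mathcal{F}$ and every $\mathcal{T} \in \mathcal{P}$ can be done in $O(|\mathcal{F}|^2 |\mathcal{P}|)$ after the $O(|\mathcal{P}||L(\mathcal{P})|)$ preprocessing step. ∎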


Lemma 3.6. If $\mathcal{F}$ is an agreement forest for $\mathcal{P}$, then checking that $\mathcal{F}$ is acyclic with respect to $\mathcal{P}$, given that $\mathcal{F} \setminus \{t\}$ is an acyclic agreement forest of $\mathcal{P}|(L(\mathcal{F}) \setminus L(t))$, takes polynomial time.

Proof. From Lemma 3.5, $\mathcal{AG}(\mathcal{F}, \mathcal{P})$ may be found in polynomial time. A digraph $\mathcal{D}$ may be checked for acyclicity by iteratively removing vertices with out-degree zero. When no such vertex remains, if $V(\mathcal{D}) = \emptyset$ then $\mathcal{D}$ is acyclic; otherwise it is cyclic. This can clearly be done in $O(|V(\mathcal{D})|)$ time. The running time for checking if $\mathcal{F}$ is cyclic with respect to $\mathcal{P}$ is then $O(|\mathcal{F}|^2 |\mathcal{P}|)$ after the $O(|L(\mathcal{P})||\mathcal{P}|)$ preprocessing step. ∎
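The two parts of this subalgorithm, building the ancestor-descendant digraph of equation (1) and removing vertices of out-degree zero, can be sketched in Python as below. The sketch makes several assumptions that are not in the article: trees are nested tuples with string leaf labels (planted root omitted), each forest component is represented only by its label set, and a vertex of a tree is identified by the set of leaves below it, so that one vertex is a strict ancestor of another exactly when its leaf set is a proper superset. Function names are illustrative.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def mrca_cluster(tree, labels):
    """Leaf set below mrca_T(labels); this set identifies the MRCA vertex."""
    node = tree
    while not isinstance(node, str):
        below = [child for child in node if labels <= leaves(child)]
        if not below:
            break  # no single child contains all labels: node is the MRCA
        node = below[0]
    return leaves(node)


def ancestor_descendant_arcs(forest, profile):
    """Arcs (a, b) such that mrca_T(a) is a strict ancestor of mrca_T(b)
    for some tree T of the profile, as in equation (1)."""
    arcs = set()
    for tree in profile:
        top = {part: mrca_cluster(tree, part) for part in forest}
        for a in forest:
            for b in forest:
                if top[a] > top[b]:  # proper superset means strict ancestor
                    arcs.add((a, b))
    return arcs


def is_acyclic(vertices, arcs):
    """Repeatedly delete vertices of out-degree zero; acyclic iff none remain."""
    remaining = set(vertices)
    while remaining:
        sinks = {v for v in remaining
                 if not any(x == v and y in remaining for x, y in arcs)}
        if not sinks:
            return False  # every remaining vertex still has an out-arc
        remaining -= sinks
    return True


# Assumed toy profile in which {a,b} sits above {c,d} in one tree
# and below it in the other, giving a 2-cycle.
T1 = (("a", ("c", "d")), "b")
T2 = (("c", ("a", "b")), "d")
AB, CD = frozenset("ab"), frozenset("cd")
cyclic_forest = [AB, CD]
acyclic_forest = [AB, frozenset("c"), frozenset("d")]
print(is_acyclic(cyclic_forest, ancestor_descendant_arcs(cyclic_forest, [T1, T2])))
print(is_acyclic(acyclic_forest, ancestor_descendant_arcs(acyclic_forest, [T1, T2])))
```

The sink-removal loop here rescans the arc set on every pass, so it is quadratic or worse; it is meant to mirror the description in the proof, not to match the stated running time.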

3.4. Putting it all together

To find a mAAF in polynomial time, start with a forest $\mathcal{F}$ consisting of singleton vertices corresponding to the label set of the trees of $\mathcal{P}$. Iterate the following steps. Pick a pair of trees $t_i, t_j \in \mathcal{F}$ from the current forest $\mathcal{F}$. Combine them to obtain $t$ with the results from section 3.1 if possible, then check if $\mathcal{F} \cup \{t\} \setminus \{t_i, t_j\}$ overlaps with respect to $\mathcal{P}$ via the results in section 3.2. Finally, use the results in section 3.3 to determine if $\mathcal{F} \cup \{t\} \setminus \{t_i, t_j\}$ is acyclic with respect to $\mathcal{P}$. If these steps are successful, replace $\mathcal{F}$ with $\mathcal{F} \cup \{t\} \setminus \{t_i, t_j\}$ and repeat. If unsuccessful, repeat with the original $\mathcal{F}$ using a different pair of trees for $t_i$ and $t_j$. If all pairs of trees have been tried, then the algorithm stops and the current forest is a mAAF.

The final result directly related to this algorithm will be to show that any mAAF is a possible output of find_mAAF. To obtain this result, the way in which pairs of trees are selected for combining will be formalized as a permutation $\sigma = (\{\ell_1, \ell'_1\}, \{\ell_2, \ell'_2\}, \ldots, \{\ell_n, \ell'_n\})$ of all the 2-element subsets of $L(\mathcal{P})$. To select a pair of trees at the $i$th step, look at the $i$th element of this permutation and select the pair of trees in $\mathcal{F}$ that contain the labels $\ell_i$ and $\ell'_i$. If they already appear in the same tree, move on to the next pair. The following results prove that for every mAAF there is at least one permutation that results in find_mAAF returning the mAAF, and that the output of find_mAAF is guaranteed to be a mAAF.

Lemma 3.7. If $\mathcal{F}$ is a mAAF of $\mathcal{P}$, then it is a possible output of find_mAAF applied to $\mathcal{P}$.

Proof. For each $t_i \in \mathcal{F}$ and each $\ell_i \in L(t_i) \setminus \{\min L(t_i)\}$, add $(\min L(t_i), \ell_i)$ to the permutation. Then add all remaining pairs of leaves to complete the permutation. The first elements added will cause find_mAAF to build $\mathcal{F}$, since each tree $t_i \in \mathcal{F}$ is specified by $\min L(t_i)$. The other leaves of $t_i$ are then added to these trees one at a time. Since $\mathcal{F}$ is already known to be a mAAF, this method creates subforests of $\mathcal{F}$ that are AAFs, and so the iterating step will always be successful in replacing the pair of trees with a new one. However, no other pair after this will combine a pair of trees successfully, as $\mathcal{F}$ is a mAAF, and so the required result will be the one returned.

Lemma 3.8. If $\mathcal{F}$ is an output of find_mAAF($\mathcal{P}$) then it is a maximal acyclic agreement forest of $\mathcal{P}$.

Proof. This follows as no pair of leaves whose associated trees were unsuccessfully selected for merging need be considered later. Let $\sigma$ be the random permutation of leaf pairs. Let $\mathcal{F}_j$ be the forest obtained by find_mAAF after processing the permutation up to and including $\{\ell_j, \ell'_j\}$. If $\mathcal{F}_n$ is not maximal, there must exist a pair $\{\ell_i, \ell'_i\}$ such that the trees $t_n, t'_n$ containing the respective labels can be combined and replaced by some $t''_n$ to give an AAF $\mathcal{F}_{n+1}$. It then follows that $\mathcal{F}_{i-1} \cong \mathcal{F}_i$ and that the trees $t_{i-1}, t'_{i-1} \in \mathcal{F}_{i-1}$ containing the leaves could not be combined and replaced by a tree $t''_{i-1}$ that would result in an AAF. However, since $\mathcal{F}_{n+1}$ is a mAAF of $\mathcal{P}$, $t_{i-1} \leq t_n$, $t'_{i-1} \leq t'_n$, and $t''_n$ is a member of $\mathcal{F}_{n+1}$, it follows that $t''_{i-1}$ must exist as $t''_n|(L(t_{i-1}) \cup L(t'_{i-1}))$. Moreover, it can be shown that $\mathcal{F}_{i-1} \cup \{t''_{i-1}\} \setminus \{t_{i-1}, t'_{i-1}\} \leq \mathcal{F}_{n+1}$ and is thus an AAF, so find_mAAF would have found and continued operating with $\mathcal{F}_{i-1} \cup \{t''_{i-1}\} \setminus \{t_{i-1}, t'_{i-1}\}$, and so $\{\ell_i, \ell'_i\}$ cannot exist. ∎

Combining the running times, including the preprocessing step, and noting that if $\mathcal{F}$ is an AAF of $\mathcal{P}$ then $|\mathcal{F}| \leq |L(\mathcal{P})|$, the running time of find_mAAF is $O(|L(\mathcal{P})|^6 |\mathcal{P}|)$.
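The greedy procedure of section 3.4 can be sketched end to end in Python. This is a simplified illustration and not the find_mAAF implementation: it works on partitions of the label set rather than on tree objects, re-checks the whole forest from scratch after each trial merge instead of testing only the new tree, and uses brute-force restriction and leaf-set comparisons instead of the MRCA-based subalgorithms. Trees are assumed to be nested tuples with string leaf labels, with the planted root omitted; `is_aaf` and `find_maaf_sketch` are hypothetical names.

```python
from itertools import combinations


def leaves(t):
    """Return the set of leaf labels of a nested-tuple tree."""
    return frozenset([t]) if isinstance(t, str) else frozenset().union(*map(leaves, t))


def clusters(t):
    """Leaf sets below every vertex of the tree (one per vertex)."""
    out = [leaves(t)]
    if not isinstance(t, str):
        for child in t:
            out.extend(clusters(child))
    return out


def restrict(t, keep):
    """Return t|keep in canonical form, or None if no leaf is kept."""
    if isinstance(t, str):
        return t if t in keep else None
    kids = [r for r in (restrict(c, keep) for c in t) if r is not None]
    if len(kids) <= 1:
        return kids[0] if kids else None
    return tuple(sorted(kids, key=repr))


def is_aaf(profile, parts):
    """Check the three AAF properties for a partition of the label set."""
    # 1. Each part induces the same subtree in every tree of the profile.
    if any(len({restrict(t, a) for t in profile}) != 1 for a in parts):
        return False
    ag_arcs = set()
    for t in profile:
        cl = clusters(t)
        # 2. Arcs of T(A), each named by the leaf set below its head vertex.
        arcs = {a: {c for c in cl if c & a and not a <= c} for a in parts}
        if any(arcs[a] & arcs[b] for a, b in combinations(parts, 2)):
            return False
        # 3. Ancestor-descendant arcs: compare the leaf sets of the MRCAs.
        top = {a: min((c for c in cl if a <= c), key=len) for a in parts}
        ag_arcs |= {(a, b) for a in parts for b in parts if top[a] > top[b]}
    remaining = set(parts)
    while remaining:  # strip vertices of out-degree zero
        sinks = {v for v in remaining
                 if not any(x == v and y in remaining for x, y in ag_arcs)}
        if not sinks:
            return False
        remaining -= sinks
    return True


def find_maaf_sketch(profile, pair_order=None):
    """One pass over a permutation of label pairs, merging when allowed."""
    labels = sorted(leaves(profile[0]))
    parts = [frozenset([x]) for x in labels]
    for x, y in pair_order or list(combinations(labels, 2)):
        a = next(p for p in parts if x in p)
        b = next(p for p in parts if y in p)
        if a == b:
            continue  # the two labels are already in the same tree
        trial = [p for p in parts if p not in (a, b)] + [a | b]
        if is_aaf(profile, trial):
            parts = trial
    return set(parts)


# Assumed toy profile: the two trees differ in where c and d attach.
T1 = ((("a", "b"), "c"), "d")
T2 = ((("a", "b"), "d"), "c")
print(find_maaf_sketch([T1, T2]))
```

Passing a different `pair_order` plays the role of the permutation $\sigma$: different orders may lead to different maximal forests.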

4. EXHAUSTIVE SEARCH

Scornavacca et al. (2012) developed an algorithm that calculates all MAAFs for some profile, arguing that since there can be more than one MAAF it is important for biologists to be able to calculate all of them. The following sections will describe an algorithm for the related problem of finding all mAAFs. Previous results for calculating a MAAF (including Albrecht et al., 2011; Bordewich et al., 2007; Collins, 2009; Collins et al., 2011; Wu and Wang, 2010) have operated in a way that could be described as outside-in; that is, they start with a structure that does not represent an AAF and then recurse until a structure is found that is an AAF. The algorithm that will be described here can be thought of as inside-out; it starts with a structure that does represent an AAF and recurses until it finds a structure that does not.

In order to reduce the computational time, a result is presented that ensures each AAF is checked exactly once since, although it is easy to make statements like "visit each vertex in a graph once," it naturally raises the question of how to best accomplish it. If the graph is sufficiently small, vertices that have been checked can be stored and compared against; however, this is not generally feasible for the set of AAFs of a profile.

Definition 4.1. If $P$ and $Q$ are two partitions of a set $S$, then $Q$ is a refinement of $P$, denoted $Q \leq P$, if every part in $Q$ is a subset of a part in $P$.

Definition 4.2. For a profile $\mathcal{P}$, a partition $P$ of $L(\mathcal{P})$ is called an agreement partition if for any $\mathcal{T} \in \mathcal{P}$ the forest $\mathcal{F} = \{\mathcal{T}|A : A \in P\}$ is an agreement forest for $\mathcal{P}$. Similarly, if $\mathcal{F}$ is a (m/M)AAF then $P$ is a (maximal/maximum) acyclic agreement partition ((m/M)AAP).

It is not uncommon for algorithms that calculate MAAFs to do so via partitions of the label set rather than forests as they have been defined here. It is useful to distinguish the two in this work since, if $\mathcal{F}$ and $\mathcal{F}'$ are agreement forests of a profile $\mathcal{P}$, then $\mathcal{F}' \leq \mathcal{F}$ implies $\{L(t') : t' \in \mathcal{F}'\} \leq \{L(t) : t \in \mathcal{F}\}$ but not generally vice versa.

Definition 4.3. Let $P$ be a partition of a set to which some complete ordering can be applied. The greatest merged element (GME) of $P$ is defined as

$$\mathrm{gme}\, P = \max_{A \in P \,:\, |A| \geq 2} \max A \qquad (2)$$

Definition 4.4. Let $\mathcal{D}_{\mathrm{gme}} S$ be the digraph whose vertex set is the set of all partitions of $S$ and in which two partitions $P$, $Q$ are connected by an arc $(P, Q)$ if $P = (Q \setminus \{A\}) \cup \{A \setminus \{\mathrm{gme}\, Q\}, \{\mathrm{gme}\, Q\}\}$, where $\mathrm{gme}\, Q \in A \in Q$. That is, if $\mathrm{gme}\, Q$ is turned into its own part then $P$ is obtained.

Lemma 4.1. For a set $S$, let $\mathcal{D}$ be the digraph such that the vertex set is the set of partitions of $S$ and there is an arc $(A, B)$ if $A \leq B$. Then $\mathcal{D}_{\mathrm{gme}} S$ is a spanning tree of $\mathcal{D}$.

Proof. It is sufficient to show that each arc in $A(\mathcal{D}_{\mathrm{gme}} S)$ expresses a refinement, and that there is a unique path from a specific vertex, the root, in $\mathcal{D}_{\mathrm{gme}} S$ to every other partition of $S$. The root of $\mathcal{D}_{\mathrm{gme}} S$ is the partition $P$ that has no parts with two or more members, since in that case $\mathrm{gme}\, P$ is undefined. This corresponds to the partition containing only singleton elements. Since the GME, when it exists, is unique, if $Q$ is an element of $V(\mathcal{D}_{\mathrm{gme}} S)$ other than the root, then it has in-degree 1, and so any path that exists from the root to $Q$ must be unique. The following inductive proof will show that such a path always exists. The root of $\mathcal{D}_{\mathrm{gme}} S$ is trivially reachable from itself. Suppose there is a unique path from the root to any partition with $n + 1$ parts and let $P$ be any partition with $n$ parts in $V(\mathcal{D}_{\mathrm{gme}} S)$. Since the parent of $P$ is obtained by isolating an element of a part as its own set, it follows that the parent of $P$ has $n + 1$ parts, and the result follows.

The knowledge of this particular spanning tree permits a very simple recursive algorithm that can reduce, but not completely prevent, unnecessary checks of whether a partition is an AAP. In particular, starting with the partition $P$ of singleton elements, recurse by picking each singleton $\{x\} \in P$ such that $x > \mathrm{gme}\, P$ if $\mathrm{gme}\, P$ is defined, and unconstrained otherwise; then pick every $A \in P$ such that either $|A| \geq 2$ or $x > a$ with $\{a\} = A$, and check if combining $\{x\}$ and $A$ yields an AAP. The test of $x > a$ is to avoid double counting caused by adding $x$ to $\{a\}$ as well as $a$ to $\{x\}$ when both $a$ and $x$ are greater than $\mathrm{gme}\, P$ when it is defined. If a partition $P$ is reached that is not an AAP, then none of its children in the search tree need to be checked, since no partition that has $P$ as a refinement may be an AAP if $P$ is not. This permits searching specifically within AAP space and visiting each point in said space only once.
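The traversal of the GME spanning tree can be sketched in a few lines of Python. The sketch assumes partitions are frozensets of frozensets over comparable elements, and it takes the AAP test as a caller-supplied predicate `is_valid` (a stand-in, since the real test needs the profile); the names `gme`, `children`, and `search` are illustrative. With the default predicate every partition is accepted, so the search should visit each partition of the set exactly once.

```python
def gme(partition):
    """Greatest merged element: the largest element of any part of size >= 2,
    or None for the all-singletons partition (equation (2))."""
    merged = [max(part) for part in partition if len(part) >= 2]
    return max(merged) if merged else None


def children(partition):
    """Children in the GME spanning tree: merge a singleton {x}, with x
    greater than the current GME, into a part whose elements are all below x."""
    g = gme(partition)
    for single in partition:
        if len(single) != 1:
            continue
        (x,) = single
        if g is not None and x <= g:
            continue
        for part in partition:
            # max(part) < x covers both rules: parts of size >= 2 lie
            # below the GME, and a singleton {a} is used only if x > a.
            if part != single and max(part) < x:
                yield (partition - {single, part}) | {part | single}


def search(partition, is_valid=lambda p: True):
    """Depth-first walk of the spanning tree, pruning below invalid partitions."""
    if not is_valid(partition):
        return
    yield partition
    for child in children(partition):
        yield from search(child, is_valid)


# All partitions of a 4-element set, then only those with parts of size <= 2.
root = frozenset(frozenset([x]) for x in range(4))
everything = list(search(root))
small_parts = list(search(root, lambda p: all(len(a) <= 2 for a in p)))
print(len(everything), len(small_parts))
```

Pruning is sound for any predicate that, like the AAP property, is inherited by refinements: every ancestor of a valid partition in the spanning tree is then also valid, so no valid partition is cut off.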

4.1. Results and future work

The exhaustive search algorithm was implemented and applied to the Grasses Phylogeny Working Group (2001) dataset. A summary of the results can be found in Table 1. See Figure 1 for plots that illustrate the size of the spaces for datasets that finished.


Table 1. The Results for Exhaustive Search Applied to the Grasses Phylogeny Working Group (2001) Dataset

Datasets      Taxa   Hybridization number   Number of mAAFs   Number of MAAFs   Time
ndhF/ITS      46     19                     ≥ 3               (*)               ≥ 2d
ndhF/phyt     40     14                     ≥ 18              (*)               ≥ 2d
ndhF/rbcL     36     13                     ≥ 2               (*)               ≥ 2d
ndhF/rpoC2    34     12                     ≥ 54              (*)               ≥ 2d
rpoC2/ITS     31     15                     ≥ 43              (*)               ≥ 2d
phyt/ITS      30     8                      ≥ 5               (*)               ≥ 2d
rbcL/ITS      29     14                     ≥ 519             (*)               ≥ 2d
rbcL/rpoC2    26     13                     ≥ 384             (*)               ≥ 2d
phyt/rbcL     21     4                      ≥ 66              (*)               ≥ 2d
phyt/rpoC2    21     7                      164               1                 11.04h
ndhF/waxy     19     9                      606               396               6.23h
waxy/ITS      15     8                      213               18                3.18m
phyt/waxy     14     3                      18                6                 10.02m
rbcL/waxy     12     7                      64                35                7.93s
rpoC2/waxy    10     1                      4                 1                 5.03s

(*) For the nine datasets that did not complete, the source lists six values in the order (2268), (48), (27), (9), (9), (4); their assignment to individual rows is not recoverable from the flattened table.

The results were compared to those in Linz et al. (2010). In the cases in which the exhaustive algorithm did not complete, the results from that article are given in parentheses.

FIG. 1. Summary log plot of partition, AAF, and mAAF space, indicated in black, gray, and white respectively (panels a to f).


Based on the results in Table 1, it appears that the number of mAAFs of a profile is of similar magnitude to the number of MAAFs, and both are significantly smaller than the number of AAFs. Currently, an obvious drawback is the long running time for a number of the datasets, so a useful next result would be, if possible, to extend the reduction results in Bordewich and Semple (2007). If this, or some other method, is successful in reducing the running time, it would be interesting to revisit the method in Linz et al. (2010), wherein the goal was to construct a particular type of network (time consistent) given a MAAF and profile. However, that article found that for some profiles none of the MAAFs had a corresponding time-consistent network; maybe some mAAFs do? Scornavacca et al. (2012) argue that constructing all phylogenetic networks for all MAAFs is an appealing problem, and it remains so at the time of writing. Hopefully the content of this article has successfully suggested the possibility that finding all mAAFs may be a similarly useful tractable problem.

4.2. Implementation

The algorithm is implemented in C, and GPL-licensed source code can be obtained online. The executable to build some mAAF is maximalforestbuilder, and the executable to exhaustively search the partition space is efss. For help building an executable for a specific operating system, suggestions or requests for specific program output, or notification of bugs, please contact the author.

ACKNOWLEDGMENTS

This article is a briefer version of chapter 9 of Voorkamp (2014). In particular, pseudocode, some observations, and some graphs have been omitted. I would like to acknowledge my PhD supervisors Michael Hendy and Barbara Holland. This work was financially supported by the New Zealand Marsden fund (09-MAU-037 to Dr. Barbara Holland).

AUTHOR DISCLOSURE STATEMENT

No competing financial interests exist.

REFERENCES

Albrecht, B., Scornavacca, C., Cenci, A., et al. 2011. Fast computation of minimum hybridization networks. Bioinformatics 28, 191–197.
Baroni, M., Grünewald, S., Moulton, V., et al. 2005. Bounding the number of hybridisation events for a consistent evolutionary history. J. Math. Biol. 51, 171–182.
Berkman, O., and Vishkin, U. 1989. Recursive star-tree parallel data structure. Technical Report CS-TR-2437, Computer Science, University of Maryland.
Bordewich, M., Linz, S., St John, K., et al. 2007. A reduction algorithm for computing the hybridization number of two trees. Evolutionary Bioinformatics 3, 86–98.
Bordewich, M., and Semple, C. 2006. Computing the minimum number of hybridization events for a consistent evolutionary history. Discrete Appl. Math. 155, 914–928.
Bordewich, M., and Semple, C. 2007. Computing the hybridization number of two phylogenetic trees is fixed-parameter tractable. IEEE/ACM Transactions on Computational Biology and Bioinformatics 4, 458–466.
Collins, J. 2009. Rekernelisation algorithms in hybrid phylogenies [Master's thesis]. University of Canterbury, Christchurch, New Zealand.
Collins, J., Linz, S., and Semple, C. 2011. Quantifying hybridization in realistic time. J. Comput. Biol. 18, 1305–1318.
Grasses Phylogeny Working Group. 2001. Phylogeny and subfamilial classification of the grasses (Poaceae). Ann. Mo. Bot. Gard. 88, 373–457.
Linz, S., Semple, C., and Stadler, T. 2010. Analyzing and reconstructing reticulation networks under timing constraints. J. Math. Biol. 61, 715–737.
Scornavacca, C., Linz, S., and Albrecht, B. 2012. A first step towards computing all hybridization networks for two rooted binary phylogenetic trees. J. Comput. Biol. 19, 1227–1262.
Semple, C., and Steel, M. 2003. Phylogenetics. Oxford University Press, Oxford, United Kingdom.
Voorkamp, J. 2014. Untangling evolution [Ph.D. thesis]. University of Otago, Dunedin, New Zealand.
Wu, Y., and Wang, J. 2010. Fast computation of the exact hybridization number of two phylogenetic trees, 203–214. In Borodovsky, M., Gogarten, J., Przytycka, T., and Rajasekaran, S., eds., Bioinformatics Research and Applications, Lecture Notes in Computer Science, Vol. 6053. Springer, Berlin/Heidelberg.

Address correspondence to: Josh Voorkamp E-mail: [email protected]
