JOTUN HEIN
Center for Molecular Genetics, University of California at San Diego, La Jolla, California 92093

Received 1 April 1989; revised 27 August 1989

ABSTRACT

The parsimony principle states that a history of a set of sequences that minimizes the amount of evolution is a good approximation to the real evolutionary history of the sequences. This principle is applied to the reconstruction of the evolution of homologous sequences where recombinations or horizontal transfer can occur. First it is demonstrated that the appropriate structure to represent the evolution of sequences with recombinations is a family of trees, each describing the evolution of a segment of the sequence. Two trees for neighboring segments will differ by exactly the transfer of a subtree within the whole tree. This leads to a metric between trees based on the smallest number of such operations needed to convert one tree into the other. An algorithm is presented that calculates this metric. This metric is used to formulate a dynamic programming algorithm that finds the most parsimonious history that fits a given set of sequences. The algorithm is potentially very practical, since many groups of sequences defy analysis by methods that ignore recombinations. These methods give ambiguous or contradictory results because the sequence history cannot be described by one phylogeny, but only by a family of phylogenies that each describe the history of a segment of the sequences. The generalization of the algorithm to reconstruct gene conversions and the possibility for heuristic versions of the algorithm for larger data sets are discussed.

MATHEMATICAL BIOSCIENCES 98:185-200 (1990)

INTRODUCTION

Application of the principle of parsimony has given realistic phylogenies when applied to different groups of sequences such as globins, cytochromes, ribosomal RNAs, and many others. A condition that must be fulfilled before applying phylogeny methods is that the sequences can be described by one single phylogeny. This is not the case if the sequences have been subject to recombinations or gene conversions in their evolutionary history. Sequences that most obviously violate this condition are members of multigene families and [...]. The following problem is proposed: What is [...] set of sequences in terms of recombinations and substitutions? The problem is computationally very different from the traditional parsimony problem but still difficult.

The motivation for this investigation is that a large fraction of the sequence data presently determined falls outside the domain of traditional phylogeny reconstruction methods. When traditional methods are applied anyway, they give confusing and conflicting results. Typically a large set of very different phylogenies will appear as equally good. Recombinations are found if the researcher splits the sequences up into regions that are then analyzed separately (e.g., McClure et al. [...]). Frequently this is not done, and an erroneous history of the sequences is the result. Even if the sequences are searched for recombinations, which is laborious, it takes luck to choose the correct regions to analyze. In short, a method that reconstructs the history of sequences allowing for recombinations would be very useful.

There have previously been methods that addressed the problem created by recombinations. They have all been within the framework of compatibility, that is, they do not find the history that minimizes the amount of evolution (parsimony), but find the history that allows the largest number of characters (positions in sequences) to fit the history perfectly. Perfectly originally meant [6] that only one event had occurred in the [...], creating a bipartition. A history would fit with this bipartition if, in the tree for the character, there was an edge that created the same bipartition if cut, thus creating two smaller trees. It then becomes very easy to check if two positions are compatible with the same tree (see Figure 1): If the two bipartitions can be combined to create four nonempty sets, they cannot fit the same tree, that is, they are incompatible; otherwise they are compatible. Sneath et al. [11] generalized the compatibility/incompatibility concept, allowing more than two character states (nucleotides/amino acids) to be observed for each character. The compatibility/incompatibility was then displayed in a lower-triangular matrix with the positions at the two axes. If there was a shift in phylogeny at some point within the sequences, this would decompose the lower-triangular matrix into three regions: two smaller triangular matrices (intraregional) with few incompatibilities and a rectangle (interregional) with many incompatibilities. Traditional tree reconstruction could then be applied to the two regions separately. Stephens [12] proposed a statistical test for runs of informative clusterings. If two such runs were statistically significant and incompatible, a recombination would have to be postulated. Neither method, though useful, tries to reconstruct the history of the sequences, neither asserts anything about which recombinations have taken place, and both demand some subjective decisions from the researcher.

[...] overcomes this and is a natural generalization of [...] to analyze sequences without recombinations. The method uses a metric on trees based on recombination that can be used in isolation to determine which recombinations would have to be postulated to transform one tree into another. It is also shown that recombination partially roots otherwise unrooted trees, and it is thus not possible to postulate any combination of recombinations to explain some data, as the induced partial rootings could prevent the resulting trees from being fully rootable. Last, as the method is fully automated, it would be very suitable in conjunction with computer simulations to evaluate the statistical significance of observed results.

First it must be clarified how the history of a set of sequences that has arisen through both duplications and recombinations can be described properly. Assume that six extant sequences are to be analyzed (see Figure 2). What is the relevant way of describing the duplications and recombinations in their history? It will be described as alleles of a gene (first presented by Hudson [5] for a specific stochastic tree-generating process), but the situation with recombination in a multigene family is analogous. If one focuses on one position, no recombination can have taken place, and the history of that position can be described by a conventional phylogeny. This is obviously true for any position within the molecule, so the issue is, how are the phylogenies describing different positions related?

EVOLUTION AND RECOMBINATION

[Figure 1 panels: site 1; site 1 + 2; site 3]

FIG. 1. Two bipartitions are compatible if when combined they split the sequences into fewer than four groups. There must be a branch that separates s1, s2, s3 from the rest. If another position is compatible with this class of trees, it may be the same bipartition, in which case the combination of the two positions is still a bipartition of the sequences. It can also be as position 2, where it adds more information about which tree is the true one. It will then bipartition one of the classes created by position 1 (here, s4, s5, s6, s7 into s5, s6 versus s4, s7), thus creating three nonempty classes. At the top, seven very short sequences are shown. Position 1 would imply that the tree describing the position has an inner branch separating 1-3 from 4-7. Position 2 is compatible with position 1 and indicates that 5 and 6 are brothers. Position 3, however, is not compatible with position 1 (or position 2), since it indicates that s3 and s4 are brothers, while position 1 indicates they are positioned in different parts of the tree. The tree for position 3 must have s3 and s4 as brothers, which they are not in the trees for positions 1 and 2. In a compatibility method, position 3 would be overruled by 1 and 2 and simply ignored. Restriction map: s1 = +--; s2 = +--; s3 = +-+; s4 = --+; s5 = -+-; s6 = -+-; s7 = ---.
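The pairwise compatibility check described in the caption of Fig. 1 can be sketched in a few lines of Python. This is only an illustration: the character states are read from the restriction map in that caption (taking the entry for s1 as `+--`, which is an assumption about a garbled value), and the names `SEQUENCES`, `column`, and `compatible` are hypothetical helpers, not part of the original method.

```python
# Characters read from the restriction map in the Fig. 1 caption
# (s1 is taken as "+--"; that reading is an assumption).
SEQUENCES = {
    "s1": "+--", "s2": "+--", "s3": "+-+", "s4": "--+",
    "s5": "-+-", "s6": "-+-", "s7": "---",
}


def column(position):
    """States of all sequences at a 1-based position, in a fixed order."""
    return [SEQUENCES[name][position - 1] for name in sorted(SEQUENCES)]


def compatible(col_a, col_b):
    """Two binary characters can fit the same tree if, combined, they
    split the sequences into fewer than four nonempty groups."""
    groups = set(zip(col_a, col_b))
    return len(groups) < 4


# Position 2 refines position 1 (three groups), so the pair is compatible.
assert compatible(column(1), column(2))
# Positions 1 and 3 together produce all four groups, so they are not.
assert not compatible(column(1), column(3))
```

The check only covers two-state characters; the generalization to more states attributed to Sneath et al. is not attempted here.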

[Figure 2: tree diagram with a marked recombination point]

FIG. 2. Five sequences, s1, ..., s5, are given. Top: A tree with one recombination in the ancestor to s4 and s5. The left part of the molecule is related by the tree shown with straight lines, and the tree for the right part of the molecule is described by following the curved line above the recombination point. The branching order is different for the left and right parts. In the [...] version, s3 and s[...] are brothers in the left part, but not in the right part. Bottom: The effect of recombination on the location of ancestral genetic material. Ancestral genetic material that is linked in the immediate ancestor to s4 and s5 has been recombined and was not linked before this event.
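The operation that turns the tree for one segment into the tree for the neighboring segment, as illustrated in Fig. 2, is the transfer of a subtree: cut it off and reattach it elsewhere. A minimal Python sketch follows. It assumes trees are written as rooted nested tuples of tip names, which is a simplification of the unrooted trees used in the paper, and the example topology is hypothetical rather than copied from the figure. All function names are illustrative.

```python
def leaves(tree):
    """List the tip labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def prune(tree, subtree):
    """Remove `subtree`; its sibling takes the place of their parent."""
    if isinstance(tree, str):
        return tree
    left, right = tree
    if left == subtree:
        return right
    if right == subtree:
        return left
    return (prune(left, subtree), prune(right, subtree))


def regraft(tree, target, subtree):
    """Attach `subtree` on the branch leading to `target`."""
    if tree == target:
        return (tree, subtree)
    if isinstance(tree, str):
        return tree
    left, right = tree
    return (regraft(left, target, subtree), regraft(right, target, subtree))


def transfer(tree, subtree, target):
    """One subtree transfer: prune `subtree`, regraft it next to `target`."""
    moved = regraft(prune(tree, subtree), target, subtree)
    if sorted(leaves(moved)) != sorted(leaves(tree)):
        raise ValueError("subtree must occur in the tree and target must lie outside it")
    return moved


# Hypothetical example: the (s4, s5) subtree moves next to s1.
left_tree = ((("s1", "s2"), "s3"), ("s4", "s5"))
right_tree = transfer(left_tree, ("s4", "s5"), "s1")
assert right_tree == ((("s1", ("s4", "s5")), "s2"), "s3")
```

Because the moved subtree keeps its internal structure, the root cannot lie inside it, which is the partial rooting discussed in the text.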

For simplicity, assume that only one recombination has happened in the history of these five sequences. It happened in the middle of the molecule and involved the ancestor to s4 and s5. The lower part of Figure 2 shows a snapshot of the location of ancestral genetic material to s4 and s5 before and after the recombination event. After the recombination, in the tree hanging from the recombination point, the history of the segments on each side of the recombination point is the same, yet the two segments were previously located on two different chromosomes and their histories differed. The history of the molecule of the left segment would coincide with the tree where the curved line is chosen instead of the straight line, but the molecule with right-side material could have a branch connected somewhere else into the tree as shown in Figure 2. This description going backwards in time may seem awkward at first, but it is fully correct.

In general, the history of a set of sequences can be described by a set of recombination points in the sequence defining a series of intervals. Each interval has a history that can be described by one phylogeny. The phylogenies for two neighboring intervals would be closely related. The second can be obtained from the first by choosing a point on the tree where recombination has occurred, erasing the branch above the point attaching it to the main tree, and then choosing another attachment point in the tree.

Due to fluctuations in rates of evolution and variance in the number of mutations that have occurred, parsimony can never expect to regain all information about the phylogeny. In the traditional parsimony problem without recombination, there may be no collinearity between branch lengths and real time. Thus, the root and any interpretation of time direction of events is lost (if no additional assumptions about the process of evolution have been added). The same happens in this more general situation, but with one important and complicating exception: As shown in the above example, the root must be outside the small tree containing s4 and s5. In general, if a subtree has been moved in order to explain the difference between two neighboring trees, the trees become partially rooted; the root must be outside the involved subtree.

THE ALGORITHM

A set of n sequences s1, ..., sn of length l is assumed given. Positions that partition the sequences into groups, where at least two are larger than one, are frequently called informative positions, since the weight of their history is phylogeny-dependent. In this analysis only informative positions are considered. If a tree for a position is known, it is easy to find the most parsimonious history for the position. An algorithm doing this has been worked out by Fitch [1], Hartigan [3], and Sankoff [10] and is thus referred to as the FHS algorithm. The overall strategy in constructing the algorithm is to convert the problem into a shortest-path problem in a graph that is defined as follows.

Nodes. For each informative position, create as many nodes as there are potential trees relating the sequences. Each node is weighted according to how much evolution it would take to explain the position assuming the tree corresponding to the node. (Use the FHS algorithm.)

Edges. Two nodes that correspond to neighboring positions are connected by an edge. The edge is weighted in proportion to the number of

[Figure 3: positions laid out from site 1 through site k-1 and site k to site n]

FIG. 3. The dynamic programming argument. The aim of dynamic programming is to find the most parsimonious (optimal) explanation of the sequence for all n positions. A solution can be described as a path from position 1 to position n, and at each position one compatible history is chosen. If the optimal path passes through a specific history at position k, then it must be an extension of some optimal history for positions 1 to k-1; otherwise a more parsimonious history could be found by choosing an optimal history of the first k-1 positions.
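The dynamic programming argument of Fig. 3 can be sketched as a shortest-path computation in Python. The sketch makes several assumptions that are not from the paper: trees are rooted nested tuples (the substitution count is the same wherever an unrooted tree is rooted), node weights come from a plain Fitch-style count, and the recombination distance between trees is supplied as a precomputed table `rec_dist` rather than computed by the tree metric. The four-tip data set is made up for illustration.

```python
def fitch(tree, states):
    """Return (possible root states, minimum number of substitutions)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    (left_set, left_cost) = fitch(tree[0], states)
    (right_set, right_cost) = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def most_parsimonious_path(trees, columns, rec_dist, rec_weight=1):
    """Choose one tree per position, minimizing substitutions (nodes)
    plus weighted recombinations (edges between neighboring positions)."""
    indices = range(len(trees))
    best = [fitch(trees[t], columns[0])[1] for t in indices]
    back = []
    for col in columns[1:]:
        step, new_best = [], []
        for u in indices:
            # Cheapest way to arrive at tree u from any tree at the previous position.
            prev = min(indices, key=lambda t: best[t] + rec_weight * rec_dist[t][u])
            step.append(prev)
            arrival = best[prev] + rec_weight * rec_dist[prev][u]
            new_best.append(arrival + fitch(trees[u], col)[1])
        back.append(step)
        best = new_best
    end = min(indices, key=lambda t: best[t])
    path = [end]
    for step in reversed(back):
        path.append(step[path[-1]])
    path.reverse()
    return best[end], path


# Hypothetical data: the three topologies for four tips, assumed to be
# one subtree transfer apart, and four informative positions.
TREES = [(("a", "b"), ("c", "d")), (("a", "c"), ("b", "d")), (("a", "d"), ("b", "c"))]
REC_DIST = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
COLUMNS = [dict(a=0, b=0, c=1, d=1)] * 2 + [dict(a=0, b=1, c=0, d=1)] * 2

cost, path = most_parsimonious_path(TREES, COLUMNS, REC_DIST)
assert (cost, path) == (5, [0, 0, 1, 1])  # four substitutions plus one recombination
```

The work grows with the square of the number of candidate trees per position, which is one reason heuristic versions are of interest for larger data sets.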

recombinations needed to transform the first tree into the second tree (or vice versa). The nodes carry the weight of the substitutions, while the edges carry the weight of the recombinations. The most parsimonious history will then correspond to the shortest path starting at the first position and ending in the last. The argument is illustrated in Figure 3.

POSSIBLE TREES AT EACH POSITION

Let T(n) be the set of trees with n labeled tips corresponding to extant sequences and BT(n) the subset of T(n) where all internal nodes have three incident edges. Lowercase letters are used to indicate the sizes of these sets. It is well known [4] that $bt(n) = 1 \times 3 \times \cdots \times (2n-5)$. The trees of relevance in the following are BT(n), since they correspond to the trees that can arise by a series of duplications. From a given tree, two derived types of trees will be encountered. A subtree is the tree that falls off when a branch in the complete tree is cut, and an inner tree is obtained by removing some subtrees. Define the set of partially rooted trees, PRBT(n), to be derived from BT(n) by the refinement that each tree in BT(n) has been split up into a series of cases with information about where the root cannot be, as shown in Figure 4. It is possible to partially root a tree in a more general way by simply giving a subset of all edges where the root must be. The partial rooting introduced by recombinations will be through subtrees that automatically define an inner tree within the complete tree. This will be used to

[Figure 4: three tree diagrams (top, middle, bottom); the middle panel marks a subtree T and cut positions 1 and 2]

FIG. 4. Top: A partially rooted tree that has nine tips. Two thickly drawn subtrees indicate regions where the root cannot be. The corresponding inner tree has only seven tips. Middle: The minimum requirement of a subtree to impose a real restriction on which recombinations can be allowed. The right disconnected subtree is an area where the root cannot be. It is also the smallest subtree to impose a real restriction on which recombinations are allowed. The left subtree, T, is part of the subtree that is to be moved. Assume the subtree to be moved contains T and is defined by a cut at position 1. It has to be moved somewhere within the cut tree with two tips, and there is no possibility to change the topology of the total tree. Assume it has been defined by a cut at position 2. Now a change in topology can be achieved. So a subtree that prohibits this operation is a real restriction. This operation is not forbidden if the subtree is only the subtree defined by 2 and contains thick branches, so it must also contain a branch above this point (2). There is an analogous restriction on the subtree complementary to the subtree with thick branches. The subtree with thick branches must have arisen by being moved in a recombination that has changed the topology of the tree. Thus, the tree containing T must have size at least 3. Bottom: The simplest partially rooted tree that has a genuine restriction on which recombinations can be performed. It is not possible here to move the subtree with tips 1-4 to one of the external edges with position 6 or 7 at its tip.

FIG. 5. The two possible shapes for trees in T(6). The first class contains 15 members, while the second contains 90 members. The sizes of these classes can be calculated relatively easily for small trees [4]. If all labelings of the tips would give different trees, there would be 6! = 720 trees for each shape; [...] overcount. The shape tells us that 6! is a [...] factor 8, yielding 90. In the first shape one can interchange the two brothers in each of the brother pairs, that is, $2^3$, [...] central node, giving an extra factor of 3!. In total, $6!/(3! \times 2^3) = 15$.
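The counts quoted for BT(6) and in the caption of Fig. 5 can be checked with a short Python calculation. The function name `bt` follows the lowercase-size convention of the text; the symmetry factors 3! × 2^3 and 8 are the ones given in the caption, and the variable names are illustrative.

```python
from math import factorial


def bt(n):
    """Number of unrooted binary trees with n labeled tips: 1 * 3 * ... * (2n - 5)."""
    count = 1
    for odd in range(3, 2 * n - 4, 2):  # odd numbers from 3 up to 2n - 5
        count *= odd
    return count


labelings = factorial(6)  # 720 ways to place six labels on the tips

# Shape with three brother pairs around a central node:
# swap within each pair (2^3) and permute the pairs (3!).
first_shape = labelings // (factorial(3) * 2 ** 3)

# Other shape: the caption gives an overcount factor of 8.
second_shape = labelings // 8

assert (first_shape, second_shape) == (15, 90)
assert first_shape + second_shape == bt(6) == 105
```

The two shape classes together account for all 105 trees, which is a useful consistency check on the symmetry factors.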

enumerate the number of partially rooted trees; unfortunately, no closed expression has been found for this number.

First, more notation is needed. The shape of a tree is an equivalence class of trees defined by permutations of the labels, as illustrated in Figure 5. There are 105 trees in BT(6), and they come in two shapes. The sizes of the [...], respectively. Now let Bt(n, s, m) be the number of [...] with n labeled tips of shape s with an inner tree with m tips. Bt(n, s, 2) corresponds to the number of fully rooted trees and has size equal to the number of edges in the tree, 2n - 3. Bt(n, s, 3) is the number of internal nodes in the tree, n - 2. Bt(n, s, 4) is the number of internal edges, n - 3. These first three numbers do not depend on the shape, but the remaining ones do. Bt(n, s, n - 1) is the number of brother pairs in the tree. When 4 [...]

Reconstructing evolution of sequences subject to recombination using parsimony.
