1231

PROGRESSIVE

ALIGNMENT

375

AND PHYLOGENIES

in Fig. 3 by a pair of parallel planes perpendicular to the j-k face. In this case, however, the cells included within the j and k boundaries for each ith plane are not always confined in a rectangular region. Consequently, the tracing process of Algorithm 2 becomes more elaborate and time consuming. Application

Example

As an example of the application of the above three-way algorithm, an alignment of three ATPases produced by ALIGN3 is shown in Fig. 4. The sequences used were the (Y and p subunits of bovine mitochondria*6 and the /I subunit from Escherichia coli. l7 The MDM78 weighting system was used with a constant 8 added to make all weights nonnegative. The CPU time required for the matrix summation was 5 min 4 set and that for the tracing process was 45 set on the DEC VAX 11/750. Acknowledgments This work was supported in part by a grant from the National GM28 I39 (to J. W. Lee).

Institutes of Health,

I6 J. E. Walker, I. M. Feamley, N. J. Gay, B. W. Gibson, F. D. Northrop, S. J. Powell, M. J. Runswick, M. Saraste, and V. L. J. Tybulewicz, J. Mol. Biol. 184, 677 (1985). I7 J. E. Walker, M. Saraste, M. J. Runswick, and N. J. Gay, EMBO J. 1,945 (1982).

[231 Progressive Alignment and Phylogenetic Construction of Protein Sequences

Tree

By DA-FEI FENG and RUSSELL F. DOOLITTLE Introduction The relationship of a set of related protein sequences can be expressed quantitatively in terms of a phylogenetic tree. The topology of the tree gives an indication of how the sequences should be grouped; the branch lengths provide some sense of the true evolutionary distances. The accuracy of the tree naturally depends on the alignment of the sequences. When there are more than two sequences, the problem lies in the gaps which must be introduced to align sequences optimally. For the simple case of three METHODS

IN ENZYMOLOGY,

VOL.

183

Copyright 0 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.

376

ALIGNING

PROTEIN

AND

NUCLEIC

ACID

SEQUENCES

1231

sequences A, B, and C, the locations of the gaps in a three-way alignment may be very different from those between any pair of sequences. The scheme discussed here, the progressive alignment method, produces a multiple alignment for a set of protein sequences by iteratively acting on the sequences.’ The essence of the method is based on the simple rule, “once a gap, always a gap.” Consequently, the order in which the sequences are arranged is crucial. In this regard, an approximate phylogenetic order of the sequences is first determined by a series of pairwise alignments by the Needleman and Wunsch method.2 This provides the most closely related pair, which is the starting point for the progressive alignment procedure. The preliminary set of pairwise measurements also reveals any subclusters that may exist in the set. These subclusters can be treated as units during the alignment process, ensuring that the relative positions of the residues within the cluster will not be altered. Thus, all subclusters are prealigned before the final alignment is undertaken. The progressive alignment method utilizes the minimum mutation matrix of Dayhoff, thereby permitting the matching of not only identical residues but also those that are similar to each other. To further increase the chances of maximizing the matching, the characters from the last added sequence, or set of sequences, are averaged across all previously aligned residues. Methods All the programs described here are written in the C language and run on a VAX computer with the UNIX (Berkeley 4.3) operating system. Other users have run these programs successfully with the VMS operating system, although some minor adjustments are usually necessary. The ensemble of programs dealing with sequence alignment and tree building can be obtained by sending a blank magnetic tape to the authors. Dejnitions. For purpose of description, we distinguish between simple and compound trees. Simple trees are those in which the branching order follows the simple clustering pattern ((((AB)C)D)E), etc., whereas compound trees have subclusters, as in (((AB)(CD))E), etc. Neutral elements are simple characters (Xs or Js) that are filled into sequences when gaps occur. They are neutral in the sense that they are invisible to the scoring system used to establish subsequent alignments, which is to say, when X is I D.-F. Feng and R. F. Doolittle, J. Mol. Evol. 25, 351 (1987). 2 S. B. Needleman and C. D. Wunsch, J. Mol. Biol. 48,443 (1970). 3 M. 0. Dayhoff, “Atlas of Protein Sequence and Structure,” Vol. 5, Suppl. 3, National Biomedical Research Foundation, Washington, D.C., 1978.

1231

PROGRESSIVE

ALIGNMENT

AND

377

PHYLOGENIES

matched with another residue, the value is equal to zero. Negative segments are those internodal connecting distances with negative values that occasionally emerge from Fitch-Margoliash4 trees when data scatter confounds the segment averaging (or least-squares treatment). Percent identity is taken as the number of identities per 100 aligned residues. Pairwise Alignments. The Needleman and Wunsch algorithm2 was used in a three-matrix form5 and utilized the minimum mutation matrix of Dayhoff in its scoring. The similarity scores obtained from the alignments are converted to difference scores by the relationship D = - In S,e X 100 = - ln(S,,

- S&/(&S&,

- S,,)

X 100

where S,, is the alignment score itself, S,, the score obtained with random sequences of the same lengths and compositions, and Siti, the average score of the two sequences being compared when each is aligned with itself. Based on previous comparisons of many different kinds of sequences,‘j a value of 770 determined as an average random score is used for S,, in these initial comparisons. In the event that the binary results indicate a compound tree, the sequences forming the subclusters are first prealigned. Multiple Alignments.The Needleman- Wunsch algorithm2 is used successively on pairs of sequences presented in an appropriate order of relatedness as determined by the preliminary comparisons. Gaps are concurrently filled with neutral elements, and a limited number of nearest alternative sequences is tried in search of better alignments as determined by higher similarity scores. Treebuilding. When the alignment is completed, random scores (Srand) are generated by jumbling the actual sequences under comparison. This is necessary to offset the effect of unusual amino acid combinations that can give high (or low) scores with the minimum mutation matrix.3 S,, values are also calculated for each pair of sequences. Then, using the above equation, all the pairwise similarity scores S,, are converted to a distance matrix. The conventional method of Fitch and Margoliash4 is used to determine the branching order of the sequences. Combining the branching order with the distance matrix, a least-squares approach as described by Klotz and Blanken’ is used to determine the branch lengths. At this point, branches around the shortest interior segments may be interchanged in search of better trees as measured in terms of lower percent standard 4 W. M. Fitch and E. Margoliash, Science 155,279 (1967). 5 M. L. Fredman, Bull. Math. Biol. 46, 553 (1984). 6 D.-F. Feng, M. S. Johnson, and R. F. Doolittle, J. Mol. Evol. 21, 112 (1985). 7 L. C. Klotz and R. L. Blanken, J. Theor. Biol. 91,261 (1981).

378

ALIGNING

PROTEIN

AND

NUCLEIC

TABLE

ACID

SEQUENCES

1231

I

ENSEMBLEOFPROGRAMSFORPROGRESSIVEALIGNMENTANDPHYLOGENETICTREE CONSTRUCTION Objective

Program SCORE PREALIGN TREE BLEN ALIGN FORMAT TREEPLOT MULPUB

Initial paitwise alignment of sequences to be studied, yields a preliminary branching order Align subcluster of sequences Progressive alignment of properly ordered sequences (and subsets of sequences); yields branching order and branch lengths of the phylogenetic tree Determines the branch lengths from a given topology combined with a distance matrix Progressive alignment of properly ordered sequences (and subset of sequences); yields multiple alignment only Converts sequences recorded in other formats to the “Old Atlas” format so that it can be recognized by this set of programs A customized plotting routine designed to display the phylogenetic tree data in the form of a dendrogram on a Zeta plotter Format progressively aligned sequences

deviation. Finally, the phylogenetic tree of the sequences in the form of a dendrogram is plotted on a plotter or, if an automatic plotter is not available, drawn by hand. Outline

of the Progressive

Method

In the time since the original publication of the progressive alignment method,’ a number of improvements regarding execution have been made. Several of the programs have been combined so that they will run consecutively without interruption (Table I). The following describes the current status of the procedure. Initial Branching Order. The program SCORE is written so that the Needleman and Wunsch pairwise alignment procedure* can be applied directly when a set of n sequences is provided in a single file. At this stage, the order in which the sequences are assembled is unimportant; the program SCORE performs all n(n - 1)/2 alignments and sorts out a roughly proper order. Furthermore, SCORE will reveal whether the starting set contains any clusters. If one or more clusters exist, those sequences should be prealigned with the program PREALIGN. As a hypothetical example, assume that the result of the binary alignments conducted by SCORE yields a branching order ((((AB)C)D)(EF)). The one-line branching order

1231

PROGRESSIVE

ALIGNMENT

AND

PHYLOGENIES

379

notation8 shows that sequences A and B represent the most closely related pair. They are followed by C and then D and then a cluster EF. PREALIGN aligns sequences E and F, filling gaps with neutral elements (Js). From here on, E and Fare treated as a unit, EF. All sequences, A through F, are then entered into a single file in the order provided by the SCORE results. This becomes the input file for the programs ALIGN or TREE, depending on whether the user is interested in only getting a multiple alignment or also obtaining a phylogenetic tree. Progressive Alignment. The program TREE (or ALIGN), which is the heart of the procedure, is now used to generate the multiple alignment for sequences A through F. It begins by inserting neutral elements (Xs) in gaps that occur in the aligned pair of A and B, now designated as AB. Next, the first ternary case (AB)C is considered; that is, when C is brought in, a new alignment is made and a score determined. The key to this alignment is that new gaps can be incorporated into either C or B. If one or more gaps are inserted into B, the same gaps must be inserted into A (since the AB alignment must be conserved). The result is compared with the alternative (BA)C. Assuming that the former has a higher score, it will then set the path for the next alignment. When sequence D is brought in, the arrangements ((AB)C)D and ((AB)D)C are compared. Again, assuming that ((AB)C)D wins, the next two configurations to be examined will be (((AB)C)D)(EF) and (((AB)C)(EF))D. This fine tuning is applied only on the nearest neighbors for the simple reason that it is impractical as well as unnecessary to make an exhaustive trial. The procedure is continued iteratively until all sequences have been incorporated. To carry out the above operations, one has a choice of using either the programs ALIGN or TREE. The former will provide only the multiple alignment and therefore takes much less time to complete the task. The latter is more time consuming, since TREE must calculate all the pairwise S,, values determined from shuffled sequences. Each of the S,, values is based on 50 jumbles. Based on the equation noted above, the similarity scores, S,, , are converted to distance values. In line with the method of Fitch and Margoliash,4 the pair of sequences separated by the smallest distance is grouped together first. The matrix is then reduced by one, and all distance values related to this pair are averaged in a new matrix. This procedure is repeated until the matrix is reduced to a single dimension. The end result is a branching order with the associated branch lengths. The branching order may or may not be the same as the one determined from binary alignment alone; the branch lengths may contain some negative 8 W. M. Fitch,

Am. Nat. 111,223

(1977).

380

ALIGNING

PROTEIN

AND

NUCLEIC

ACID

SEQUENCES

1231

values. The program saves the branching order and the distance scores in a file called ALBR which the user can modify so that alternative topologies can be tried when it is used in conjunction with the program BLEN. Negative Branch Lengths. Occasionally, negative branch lengths may occur. When that happens, alternative trees are chosen with the branches on either side of the negative segment reversed, and a new set of branch lengths is calculated. The amended branching order serves as the input file for the BLEN program, and the new tree can be calculated with very little effort. Experience tells us that only a small number of trials need to be undertaken to eliminate the negative segments. In general, only the lengths in the immediate neighborhood of the reversed segments are affected in the recalculation. Even if no negative values appear in the initial tree, it is prudent to explore alternative topologies to see if better trees are generated. Ordinarily, this involves swapping the branches around the shortest interior segments. The criterion for “better” in this case is lower percent standard deviation. Automatic Plotting. The program TREEPLOT is used to generate the dendrogram. TREEPLOT is designed specifically to be run on a Zeta plotter; the library functions it uses are not generally available, and users will have to make their own arrangements for plotting details. A schematic outline of the progressive alignment method from start to finish is presented in Fig. 1. Example In the following example, all the computer execution statements assume that the user is operating in the UNIX environment. When working under VMS or some other operating system, some of these statements may have to be modified. It is essential that all sequences be in the “Old Atlas” format. Sequences written in any other format can be converted to this arrangement by using the program FORMAT. Figure 2 shows an example of such a sequence. Illustration Sequences. Seven globin sequences have been chosen to illustrate progressive alignment and phylogenetic tree construction. They are human hemoglobin cy chain (hbah); human hemoglobin /? chain (hbbh); human hemoglobin y chain (hbgh); haghsh hemoglobin (heha); lamprey hemoglobin (hbrl); human myoglobin (myoh); gastropod myoglobin (mycr). Binary Order. To begin with, one concatenates all the sequences (any order) into a file called “testfl”, cat hbah hbbh hbgh heha hbrl myoh mycr > test0

[231

PROGRESSIVE

ALIGNMENT

AND

381

PHYLOGENIES

pL-j/

TREEPLOT

FIG. 1. Plow chart of progressive alignment procedure. Program names are shown in boxes: the primary programs used for progressive alignment are enclosed in double boxes; other utility programs are shown in single boxes.

(“cat” is a UNIX command). The next step is to use the SCORE program to perform all the pairwise comparisons, score test0 test 1 > test2 At completion, “test2” contains n(n - 1)/2 pairwise alignments. All the results calculated based on binary alignments and, in particular, the THBRL PHBRL PHBRL PHBRL PHBRL PHBRL

HEMOGLQBIN, RIVER LAMPREY (IAMPETRA 1PIVDSGSVAPLSAAEKTKIRSAWAPVYSNY 31ETSGVDILVKFFTSTPAAQEFFPKFKGMTS 61ADQLKKSADVRWHAERIINAVNDAVASMDD 91TEKMSMKLRDLSGKHAKSFQVDPQYFKVLA 12lAVIADTVAAGDAGFEKLMSMICILLRSAY*

FLWIATILIS)

FIG. 2. The “Old Atlas” format for protein sequence data. The first column of every line specifies either T (title) or P (protein data). The next four columns are devoted to a four-letter identification code. The sequence is arranged 30 residues across and terminates with an asterisk. The programs mentioned in the text recognize any sequence in this format.

382

ALIGNING

PROTEIN

AND

NUCLEIC

ACID

SEQUENCES

I231

branching order are in “test1 .” For example, the preliminary branching order of these sequences should be hbgh, hbbh, hbah, myoh, mycr, heha, hbrl, with heha and hbrl forming a subcluster. The output reads Branching HBRL)

order:

(HBGH,HBBH)HBAH)MYOH)MYCR)HEHA)-

Clusters: (HEHA,HBRL) Prealigning “test3,”

Sequences. Concatenate

heha and hbrl into a file called

cat heha hbrl > test3 Then, prealign these two sequences, prealign test 3 test 4 > test5 The write-file “test4 will contain the prealigned sequences in the “Old Atlas” format with neutral elements (Js) occupying all gap positions. The file “test5” contains the equivalent alignment without Js and also general information about the degree of similarity. If more than two sequences are to be prealigned, the same procedure is used. In the event that nesting of clusters occurs, the deeper group should be prealigned first; then systematically work back through the entire cluster. Progressive Alignment. The prealigned sequences in the file “test4 are combined with the other sequences in a single file called “sample,” but the order must now be in tune with the preliminary branching order. The concatenation should therefore be performed in the following order, cat hbgh hbbh hbah myoh mycr test4 > sample If only a multiple alignment is desired, the program ALIGN should be used, since it does not require the detailed distances and is much faster. align sample sample1 > sample2 The file “sample 1” will contain all the aligned sequences with gaps occupied by Xs in the “Old Atlas” format. It can be used as the input file for another program, MULPUB, which reprints the alignment in a closepack format without Xs (Fig. 3). A multiple alignment without Xs, and results such as the similarity scores and percent identity between each pair of sequences after multiple alignment, will be given in the file “sample2.” If a phylogenetic tree is needed as well, then one should use the program TREE, tree sample sample1 > sample3

1231

PROGRESSIVE

ALIGNMENT

AND PHYLOGENIES

383

hbqh hbbh hbah

The file “sample3” will end up not only with the same calculated results as were given in “sample2” (from the ALIGN program), but it will also contain the difference scores, branching order, and branch lengths. The multiple alignment of the seven globin sequences is shown in Fig. 3. At this point, it is useful to draw on paper a tree based on the branching order provided in the file “sample3.” It is important to note how the segments are numbered so that one can identify which branch length in “sample3” corresponds to which segment on the tree (see Fig. 4). As a general rule, tip segments are numbered first, followed by the nearest interconnecting segment; the very first segment is always located at the left-hand side of the closest pair. Applying this rule to the globin case, the appropriately numbered segments start at tip hbgh, then hbbh,then proceed to the nearest internal segment, then hbah,etc. (see Fig. 4). The globin tree that emerged directly from the initial alignment is shown as a dendrogram in Fig. 5 (top). It should be noted that one of the branch lengths is small and negative; this is an indication that an altemative topology should be tried with the branches adjacent to the negative segment reversed. Exploring Alternative Topologies.To explore other topologies, one should note that after the completion of the run using TREE, there is a hidden file called ALBR (see Table II) in the user computer work space. On examination, one will find that it contains at least one line of letters, in alphabetical order, beginning with upper case. If there are more than 26 sequences, the lower case alphabet will be invoked. The file also contains a set of triangularly arranged numbers representing the distance matrix, in the same order as the letters. The letters refer to the sequences; for example, A is hbgh,B is hbbh,C is hbah,D is myoh, E is heha,F is hbrl, and G is

384

ALIGNING

PROTEIN

HBGH

AND NUCLEIC

ACID SEQUENCES

1231

HBBH

FIG. 4. Example showing how segments are numbered so that the calculated branch lengths can be matched with the phylogenetic tree. The segments in this diagram are not drawn proportional to the actual calculated lengths. See the legend to Fig. 3 for four-letter designations.

mycr. When there is more than one line of letters, clusters exist. In “sample3,” the second line is EF representing heha and hbrl. The branching order and the distance matrix are always separated by an asterisk. The most direct way to investigate the negative segment is by interchanging myoh and the cluster (hehahbrl). This involves modifying the ALBR file (see Table II). Let the new file be “sample4 and use it as input file for BLEN, blen sample4 The new topology and the corresponding branch lengths are given in Fig. 5 (bottom). Note that the negative branch segment disappeared and that the goodness of fit, as measured by the percent standard deviation, improved. Discussion

Since the progressive alignment method was first published in 1987, it has been applied to a wide range of sequences with great success. Although the basic method has remained intact, the computing procedures have been significantly updated. In the initial version, a number of steps had to be carried out separately in order to arrive at the multiple alignment and phylogenetic tree for a set of protein sequences. For example, merely to obtain the preliminary binary branching order and prepare the sequences

1231

PROGRESSIVE

ALIGNMENT

AND

385

PHYLOGENIES

60

14 61 -0

/

22

58

3’

60

I

FIG. 5. Phylogenetic trees for the seven hemoglobins determined directly from the progressive alignment method. In the top tree, the segments corresponding to number 7 in Fig. 4 is -0.08. The percent standard deviation (%SD)’ (calculated from the actual distance matrix and the distance matrix derived from the tree) is 3.45. The bottom tree was obtained by switching myoh and the cluster heha- hbrl. The %SD of this tree is 3.18. See the legend to Fig. 3 for four-letter designations.

386

ALIGNING

PROTEIN

AND NUCLEIC

ACID SEQUENCES

1231

TABLE II Two EXAMPLESOFTHEOUTPUTFILE ALBR FROMTHE TREE PROGRAM" ABCDEFG EF * 15.43 57.41 98.72 139.39 55.52 131.07 ABCDEFG EF * 15.43 57.41 98.72 139.39 55.52 131.07

59.63 111.48 115.28 118.09 129.51

104.67 113.38 97.23 121.07

112.51 103.86 111.01

100.49 115.66

107.18

59.63 111.48 115.28 118.09 129.51

104.67 113.38 97.23 121.07

112.51 103.86 111.01

100.49 115.66

107.18

(1The first example gives the file that produces the branch lengths shown in Fig. 5 (top) when used with the program BLEN. Above the asterisk is the branching order and below it is the distance matrix. The capital letters on the first line denote the seven globin sequences. The second line indicates that sequences E and F form a cluster. The distance matrix is arranged in the same oder as the letters on the first line. The entry 129.5 1 refers to the distance between sequences E and G. The second example shows the modified letter order after interchanging sequence D and cluster EF. Note that the distances are not modified. The latter was the input file for BLEN, which gives the branch lengths shown in Fig. 5 (bottom).

for progressive alignment, one had to use three different programs: SCORE, BORD, and PREALIGN. Similarly, in the original progressive mode, four programs, DFALIGN, SHUFFLE, BORD, BLEN, were needed to obtain a phylogenetic tree. Presently, only two programs (SCORE and PREALIGN) are necessary in the preprogressive alignment stage; even better, the four programs previously needed for the progressive mode have been combined into a single program called TREE. Aside from the fact that fewer programs are now involved, the single most important improvement is in the TREE program. Formerly, in order to get the branch lengths of a phylogenetic tree, one had to manually

1231

PROGRESSIVE

ALIGNMENT

AND

PHYLOGENIES

357

construct a connectivity table, which is not only cumbersome but also error prone, especially when working with a large number of sequences. The automation of that step eliminated a great deal of manual work and allows one to start with a set of properly ordered sequences and to obtain the phylogenetic tree in a single operation. Furthermore, it is now a simple matter to try various alternative topologies in order to gain some sense of reliability of the tree. In this respect, it is worth noting how the branching order for the hemoglobin sequences changed during the various operations. The preliminary binary order from SCORE was hbgh, hbbh, hbah, myoh, mycr, heha, hbrl, and the initial progressive changed it to hbgh, hbbh, hbah, myoh, heha, hbrl, mycr. The appearance of a negative segment between myoh and the cluster heha and hbrl prompted a further exploration, and the alternative branching order with myoh and the cluster group reversed yielded a tree with no negative segment and a lower percent standard deviation. The final order is hbgh, hbbh, hbah, heha, hbrl, myoh, mycr, which also makes the most evolutionary sense. Proper sequence alignment is essential to any scheme of phylogenetic tree construction based on sequences. The proper alignment, however, is not necessarily the mathematically optimal alignment, and we have made the point before that it is a historical alignment that warrants the attention of the evolutionist.’ The progressive alignment procedure we have developed puts more stock in the relationship of recently diverged sequences than it does in distant relationships. Thus, it seems to us foolish to omit or move a gap that occurs when two similar sequences are aligned just because an additional match might be made with some very distantly related sequence, as is often the case when mathematically optimized alignments are generated. But if we are going to imply that parsimony is not an inviolate rule, how are we ever to choose the best tree? There are empirical approaches that one can adopt in addition to statistical measures like the lowest percent standard deviation. For example, one can examine sequences from more than one set of gene products from a given set of organisms to see if they all yield the same phylogeny. Or, if only one type of sequence is involved, the sequences can be cut into halves (amino-terminal halves and car-boxy-terminal halves) and two separate trees made. If the data are reliable, the two trees ought to have the same topology. Recently, we developed another check on the reliability of protein sequence-based trees. It is a simple character-based test that we call PAPA (parsimony after progressive alignment). Its features are discussed in another chapter in this volume.iO 9 R. F. Doolittle, “URFs and ORFs: A Primer on How to Analyze Derived Amino Acid Sequences,” Chap. 3. University Science Books, Mill Valley, California, 1986. lo R. F. Doolittle and D.-F. Feng, this volume, [41].

Progressive alignment and phylogenetic tree construction of protein sequences.

1231 PROGRESSIVE ALIGNMENT 375 AND PHYLOGENIES in Fig. 3 by a pair of parallel planes perpendicular to the j-k face. In this case, however, the c...
742KB Sizes 0 Downloads 0 Views