438

ALIGNING

PROTEIN

AND

NUCLEIC

ACID

SEQUENCES

1271

behavior of our measures based on optimal rotations, instead of under the (usually false) hypothesis that we know the original alignment of the two genomes. Normalized rate of intersection and largest common ordered subset measures of divergence through rearrangement should be compared for sensitivity and accuracy to indices based on the least number of rearrangements necessary to convert one genome into another.* Much work is needed to collate the labels for markers used in distantly related genomes. Empirical and theoretical research on fragment transposition distances is needed to produce better models of this phenomenon. Generalizing the term “synteny” used in chromosomal genetics to encompass the notion of proximate fragments in one genome tending to be close together as well on another would give an appropriate label to this field of study. Acknowledgments This work was supported by individual operating grants and to David Sankoff and Robert Cedergren, as well as an infrastructure grant and a CRAY computer time allotment, from the National Sciences and Engineering Research Council of Canada. David Sankoff and Robert Cedergren are Fellows in the Evolutionary Biology Program of the Canadian Institute for Advanced Research. s D. Sankoff, Buf. ht. Sat. Inst. S(3), 46 1 (1989).

1271 Multiple By

DAVID

J.

BACON

Sequence Comparison and

WAYNE

F.

ANDERSON

Introduction One reason for performing amino acid sequence comparisons is to discover structural and/or functional similarities among proteins. Because structural similarity may be present in proteins that do not exhibit a strong sequence similarity (e.g., see Matthews et al.‘), one would like to be able to recognize the structural resemblance even when the sequences are very different. This chapter addresses the problem of finding weak similarities or distant relationships among proteins for which only the sequences are known. Comparing just two sequences at a time by current methods does I B. W. Matthews, M. G. Grutter, W. F. Anderson, and S. J. Remington, Nature (London) 290,334 (1981). 4 METHODS

IN ENZYMOLOGY,

VOL.

183

Copyright 0 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.

[271

MULTIPLE

SEQUENCE

COMPARISON

439

not allow for a sufficiently sensitive test of similarity. Apparent pairwise similarities frequently fail the statistical tests for significance. Simultaneous intercomparison of several sequences, on the other hand, often yields a significantly nonrandom signal that in turn provides a statistical basis for assertions about structure and/or function. Fixed-Length

Sequence

Comparison

The sequence matching method described here has its roots in the procedures of Fitch2 and Cantor and Jukes3 for comparing two protein sequences: every run of k consecutive residues from one sequence is compared with every such run in the other sequence. The score for each run comparison is a sum of k residue similarity scores as looked up from a 20 X 20 table (4 X 4 in the case of polynucleotides). Typically k is chosen to be between 15 and 30 residues. The objective is to find the runs of length k which match best according to the similarity scores in the table. The fact that k is a constant for any given experiment is what gives this procedure the name fixed-length sequence comparison. It is also possible to have a variable-length method in which the search is made for runs of many different lengths” or to allow for insertions or deletions in the sequences.’ Multiple fixed-length sequence comparison (MSC) is conceptually no different from pairwise fixed-length comparison, except that the scoring function is generalized and there are many more ways to form combinations of runs. Given m runs, one run from each sequence, we define the mutual matching score to be the sum of the m(m - 1)/2 pairwise run comparison scores. This reduces to the usual definition when M = 2. The combinations of runs that should theoretically be considered in the search for the best one are described by the following recursive formula: for every run from one sequence chosen from a set of m, put this run with every combination of runs from the remaining m - 1 sequences. This is just the usual set of pairwise combinations when m = 2. The number of combinations of runs grows rather explosively as m increases. For m = 2 sequences, having lengths L, and L, , there are (L, k + l)(L, - k + 1) combinations, or a little fewer than n* for sequences of equal length n. For m = 3, there are nearly n3 such combinations and, in general, on the order of n”. If n = 319, k = 20, and m = 5, this comes to some 2.4 X lOi* combinations. It is clearly not practicable to evaluate them all. * ’ 4 5

W. C. P. S.

M. Fitch, J. Mol. Biol. 16,9 (1966). R. Cantor and T. H. Jukes, Proc. Natl. Acad. Sci. U.S.A. 56, 117 (1966). Argos, J. Mol. Bio!. 193, 385 (1987). B. Needleman and C. D. Wunsch, J. Mol. Biol. 48,443 (1970).

440

ALIGNING

PROTEIN

AND

NUCLEIC

ACID

SEQUENCES

1271

It is also not necessary to evaluate all combinations. In this chapter, we present a “heuristic” search method which, though not absolutely guaranteed to find the best matches, almost always does manage to find them.6 Moreover, it does so in a number of search steps proportional to only nm’, a function that grows much more slowly with respect to the number of sequences, m, than does the nm required for exhaustive evaluation of all combinations. The main use of the heuristic method is in discovering regions of similarity and evaluating them statistically. It does not attempt to construct full sequence alignments in the Needleman and Wunsch” sense and, indeed, is generally quite unnecessary in the cases of strong sequence similarity for which full alignments are likely to be meaningful. It has great value, however, in uncovering distant relationships among special regions in proteins even when evolution has eroded the homology of the remaining portions of the polypeptide chains to the point of virtual unrecognizability. Sensitivity

of Sequence

Comparison

There are many 20 X 20 tables of amino acid similarity scores in common use. One example is the mutation data (MD) matrix;’ another is a structural-functional table which reflects properties such as size, presence of polar or hydrophobic groups, and general preferences for forming different kinds of secondary structure.6 There is growing evidence that, even if there were such a thing as an optimal scoring table by some definition, pairwise sequence comparison would still be too weak on its own to discern many real protein relationships unambiguously. This problem is a statistical one. A run in one sequence may appear to be related to a run in another sequence, but not to an extent that gives one confidence in the relationship being more than that which would be expected by chance in a population of about n* possible run pairs. Yet in many of these cases it does later turn out that there is a structural relationship. Multiple sequence comparison offers a way out of this dilemma. It occurs quite often that there is some third sequence, apparently but not convincingly related to each of the first two, for which the mutual intercomparison of all three sequences gives a highly significant score. One also finds that the addition of an unrelated sequence reduces the significance, reflecting a highly desirable behavior of the statistics of multiple sequence 6 D. J. Bacon and W. F. Anderson, J. Mol. Biol. 191, 153 (1986). ’ B. C. Orcutt, M. 0. Dayhoff, D. A. George, and W. C. Barker, “User’s Guide for the Alignment Score Program of the Protein Identification Resource (PIR),” PIR Report ALI-1284. National Biomedical Research Foundation, Washington, D.C. 1984.

I271

MULTIPLE

SEQUENCE

441

COMPARISON

comparison: the signal-to-noise ratio rises sharply when a postulated mutual relationship is real and drops when it is not. Heuristic

Multiple

Sequence

Comparison

Method

The objective, formally, is to find mutually similar runs of length k among sequences S, , S,, . . . , S, that are of lengths L, , L2, . . . , L,, respectively. The mutual similarity of the runs starting at positions p, , P2, - * . , p,,, , respectively, is defined by the scoring function 5 2 ‘i i-l j-i+1 q-0

T[%Pi

+ 4hsjtPj +

411

where T is the 20 X 20 table of residue similarities. An exhaustive combinatorial algorithm to find the highest matching scores would evaluate this formula for all possible combinations of the starting positions and would note which combinations yield the highest scores. In practice, it is necessary to limit the combinatorial “explosion” that occurs as the number of sequences increases. The approach is to try to guess where the best scores will be in the space mapped out by the run combinations. The guessing game works as follows. In any region of mutual similarity, consisting of a set of m runs, each constituent subset of sequences should also have a reasonable level of mutual similarity. This is true, though to a progressively weaker extent, right down to subsets consisting of 2 runs. So the procedure starts by examining a pair of sequences, and performing the exhaustive search algorithm on them. This is not unreasonably time consuming for only two sequences. The run pairs that yield the top matching scores are saved in a “heap,” denoted H2. The number of run combinations, M, saved in a heap is typically chosen to be about 1000. Then a third sequence, S,, is selected, and the scoring function is evaluated using run triplets made up from pairs in H2 combined with every possible run in S, . Of the M(L, - k + 1) scores that this step produces, only the triplets associated with the highest M scores are saved in a new heap, H3. This process continues, with each new heap Hi being created from heap Hi-1 and sequences Si, until all the sequences are used up. The final heap H,,, should include the region of highest mutual similarity of length k among all the sequences. Clearly, the success of this procedure depends critically on the strength of the matching in the particular subsets used to build up the final heap, and it is thus dependent on the order in which sequences are selected. It turns out that with an adequate heap size, the algorithm is remarkably robust and even works quite well with random sequences.6 Moreover, the success rate of the algorithm improves as the strength of the run matching increases, because the better the match is, the more likely it is to contain

442

ALIGNING

PROTEIN

AND

NUCLEIC

ACID

SEQUENCES

WI

submatches that are strong enough to be in the first two or three heaps. This can be defeated by making the heaps too small, but a constant heap size of 1000 proves to be adequate in all but the most pathological cases. Because the heap size M and the run length k are constants for any given application of the heuristic search procedure, and each new heap Hi is constructed by comparing each of approximately n runs in a new sequence Si against all the M(i - 1) runs in the previous heap Hi-1 , the total number of run comparisons leading to H,,, is proportional to nm2, a function that is only quadratic in the number of sequences. The factor n (average sequence length) tends to be constant to within an order of magnitude in practice. The execution time of the heuristic procedure is much better than that of exhaustive search even for three sequences; the savings are more dramatic for larger groups, where an experiment on five sequences may take minutes instead of months. Statistics The highest score arising from the intercomparison of sequences is statistically significant only if it is higher than what chance alone would be expected to produce. Even this does not guarantee that there is anything interesting about the region of mutual similarity, but it does suggest strongly the possibility of a common polypeptide fold. The major probability model used in the MSC program is based on a “generating functions” model developed for two sequences by McLachlan.8 It yields, for any given score s, the frequency with which a score of s or higher is expected to occur by chance for a set of sequences of lengths L, , L . . . ) L,. Also accounted for in this probability model are the run leigth k, the particular 20 X 20 table of nonnegative integer residue similarity scores in use, and information on the relative abundances of the residues. The population being nonrandomly sampled consists of all the regions that the exhaustive combinatorial search algorithm would have examined, so it is the size of this population that is multiplied by the probabilities to get the expected frequencies. If the observed number of occurrences of a given score of s or higher is much more than the expected number (including the case where the top score occurs once and is expected to occur less than once), then the score is considered significant. As a rule, we take the cumulative frequency of a score to be significant if it is at least 100 times that which the model predicts to occur by chance. It is important not to rely on such crude measures of score quality as “number of standard deviations above the mean” for multiple comparis A. D. Mdachlan,

J. Mol.

Biol. 61,409

(197 1).

WI

MULTIPLE

SEQUENCE

COMPARISON

443

sons. Not only does the search procedure sample the population nonrandomly, but the score frequency distribution of the population itself is not normal when more than two sequences are intercompared. This is because when a region A of one random sequence happens to compare well with a region B in another random sequence and also well with a region C in a third random sequence, the chances of a higher than average score between B and C are distinctly greater than 50%, even though all three sequences are random. Yet all three pairs A-B, A-C, and B-C contribute to the score for the mutual comparison. Even though the three sequences are independent, their pair-wise comparison scores are not. The distribution of the three-way scores for random sequences is higher in the tail regions than a normal distribution with the same mean and standard deviation would be. If a normal distribution were used as a reference instead of the “correct” one given by the MSC probability model, there is the serious danger of overinterpreting high scores. Operation

of MSC Program

The FORTRAN program that implements the sequence intercomparison algorithm requires, in addition to the sequences, a few parameters that control its operation. The run length, k, should be set to around 20 to give the most sensitive sequence comparison. Smaller numbers tend not to produce scores that stand out as well above the background noise of essentially random run comparisons. Larger numbers suffer from the problem that gaps, which might improve local alignments, are not detected by the algorithm. Another MSC parameter is the heap size. It has already been suggested that 1000 is usually an adequate number. Some economy of computer time can be realized with the use of smaller heaps, but pushing this too far can impair the ability of the program to find the best regions of mutual similarity. The only hazard associated with raising the heap size beyond 1000 is that the program will take longer to run and require a little more memory. Whenever more than five sequences or several long sequences are to be intercompared, it is advisable to run the program twice, with the sequences supplied in reverse order the second time, and check to make sure the same top matches are found in both cases. This is to allow for the possibility that the region of best mutual matching happens to be particularly weak in the first two or three sequences supplied to the program. In extreme circumstances, a few more permutations of the sequence order may be tried in order to produce a clear result. This should be necessary only if there are many sequences with little similarity. There are some program parameters that control details of the proba-

444

ALIGNING

PROTEIN

AND

NUCLEIC

ACID

SEQUENCES

1271

bility models to be used by MSC in estimating significance. The user may select the usual probability model based on generating functions, a model which causes the program to intercompare scrambled sequences a number of times to obtain reference frequency data directly, or both models at once. In addition, these may be used with either the residue counts of the input sequences or natural residue abundances as read from a library tie, or both. Thus, up to four ways of estimating significance may be invoked simultaneously. It is worth noting here that the generating functions model in conjunction with the residue counts of the input sequences is rather expensive in terms of computer time when there are more than five sequences to compare. This is the most conservative of the four possible models in that it is the least likely to overestimate significance, but, because of the cost in computer time, the library values are more practicable when more than five sequences are intercompared. Bacon and Anderson6 have illustrated the behavior of the four probability models for all subsets of two or more sequences from a set of five unrelated sequences. All the models are conservative enough that overinterpretation of high scores is unlikely. Finally, there is a parameter to tell the program how many regions of mutual similarity to print from the final heap. These are sorted starting with the region having the highest score. In addition to the above control parameters, MSC requires a library file that specifies the 20 X 20 table of residue similarities, the natural residue abundances, and the basic generating function coefficients corresponding to these abundances. The default library that comes with the program has amino acid abundances derived from the NEWAT database9 and a table of amino acid similarities based on structural and functional considerations.6 There is an auxiliary program for generating a new library when the similarity table or abundances are to be changed, to facilitate experimentation with different amino acid distributions and comparison regimes. A suitable computer configuration for running MSC is a mainframe, minicomputer, or workstation. A small personal computer can even be used to good effect, because the memory requirements of the program are quite modest. Most of the memory is used by the heaps, and at most two of these are actually retained by the program at any given time. However, the program can take a few hours to perform a large sequence intercomparison on a slow computer. To simplify the use of MSC, there is an interactive program to help set up input. This program allows one to supply sequences in virtually any format and ensures that all control parameters are specified.

9 R. F. Doolittle, Science 214, 149 (I 98 1).

1271

MULTIPLE

SEQUENCE

COMPARISON

445

Examples A control experiment with five ribonucleases demonstrated that the significance levels reported by MSC are an accurate indication of the presence or absence of protein relationships.‘j Two pairs in the set of five proteins had strong enough similarity to be detected clearly with pairwise comparison, yet when any larger subset of the five were intercompared, the significance dropped. In a set of five mutually related DNA-binding proteins, there was a very strong tendency for the significance of mutual intercomparisons to rise as proteins were added to the set, even though the similarity of some of the pairs in the set was well below the level normally considered significant. The overall matched region turned out to be the well-known helix- tumhelix DNA-binding motif in all the proteins. The application of MSC to a set of enzymes employing the coenzyme flavin adenine dinucleotide (FAD) yielded a clear similarity when three or more sequences were compared despite uncertain pairwise similarities.6~‘0 Because two of these sequences were from enzymes of known three-dimensional structure, it was possible to associate the similar region with the FAD-binding site. Intercomparison of the sequences of the dimethyl sulfoxide reductase A subunit, biotin sulfoxide reductase, and formate dehydrogenase revealed that these three enzymes, all of which utilize molydopterin cofactors, contained four regions of amino acid sequence similarity.” In this case, overlapping fixed length runs were extracted from the heap to extend the regions of similarity beyond the initial runs of 20 residues. The three-dimensional structures of these enzymes are not known, so it is not yet possible to determine the functions of the four regions. Discussion

The MSC program is very useful in finding regions of similarity among sequences and determining whether the similarity is significant, but in some ways it might appear not to go far enough. For example, it does not account for insertions or deletions of residues in the sequences, and its run length is a fixed parameter for any given experiment. These restrictions are deliberate, and the chief reasons for making them are statistical. If a range of run lengths were automatically sampled, not only would the execution time and memory requirements increase, but significance could be oblo S. T. Cole, K. Eiglmeier, S. Ahmed, N. Honors, M. L. Elmes, W. F. Anderson, and J. H. Weiner, J Bucferiol. 170, 2448 (1988). I’ P. T. Bilous, S. T. Cole, W. F. Anderson, and J. H. Weiner, Mol. Microbial. 2,785 (1988).

446

ALIGNING

PROTEIN

AND

NUCLEIC

ACID SEQUENCES

1271

scured unnecessarily by the larger reference population that would have to be considered. Allowing “gaps” would make this problem much worse. A very real concern with MSC is that the ratio of observed to expected high score frequencies is not really the same thing as statistical significance. What is needed is an accurate model of the variability in the high scores for random sequence intercomparisons. This would allow significance to be more properly quoted as the probability of the highest score occurring by chance. The problem is not that the frequency ratios cannot be trusted, only that they are not an ideal measure of significance. Although it is unclear that full sequence alignments are useful when similarity is weak, full alignments can occasionally be meaningful if the aligned regions are of roughly equal length. There are many possible approaches to this task. Needleman and Wunsch’ mention a multiple sequence generalization of their algorithm which appears to imply an implementation requiring computation time proportional to nm for m sequences of length n and a constant gap penalty or to n*“-’ if the gap penalty is an arbitrary function. We have been experimenting with an implementation of this kind for the purpose of finding best paths between regions of similarity, keeping the computing time down by imposing an upper limit on the extent to which gaps can change the register of the sequences. Muratar* uses a constant gap penalty to constrain the computation time to something proportional to n3 for three sequences. Gotohr3 describes fairly general gapping conditions that allow alignments to be made in time proportional to IZ* for two sequences. The profile analysis method of Gribskov and Eisenbergr4 may be a particularly effective way of dealing rationally with large numbers of sequences. Sequence alignments based on sequence alone, of course, usually do not reflect the best way of aligning the three-dimensional structures unless the homology is well above the levels MSC is typically used to detect.15 On the other hand, a very common reason for aligning protein sequences is to deduce information on the three-dimensional structure of one protein based on the structure of a related protein. The variable gap penalty method of Chothia and Leskr6 may offer a better approximation to a structurally correct alignment than the unmodified Needleman - Wunsch5 procedure, although the tests we have made with this so far do not indicate much improvement. Certainly, the general idea of using as much structural information as possible in order to improve alignments has great merit. I2 M. Murata, this volume, [22]. I3 J. Gotoh, J. Mol. Biol. 162, 705 (1982). I4 M. Gribskov, R. Liithy, and D. Eisenberg, this volume, [9]. IS R. J. Read, G. D. Brayer, L. JurSek, and M. N. G. James, Biochemistry 23,657O (1984). ‘6 A. M. Lesk, M. Levitt, and C. Chothia, Protein Eng. 1, 77 (1986).

[281

SIMULTANEOUS

COMPARISON

OF SEVERAL

SEQUENCES

447

Conclusion Multiple sequence comparison is most useful when sequence similarity is weak. The technique outlined here does not attempt to produce an overall sequence alignment, but it is easy to use, finds results reasonably quickly if they are to be found at all, extends sequence comparison to far more than two or three sequences, and provides a statistical basis for assertions about regions of protein similarity. Acknowledgments This work was supported by the Medical Research Council of Canada through a grant to the MRC Group on Protein Structure and Function, Department of Biochemistry, University of Alberta.

[28] Simultaneous

Comparison By

MAUNO

of Several Sequences

VIHINEN

Introduction Numerous methods have been developed to compare and align two or more nucleic acid or protein sequences. The difference between sequence comparison and alignment is that the former indicates all similarities between the sequences whereas the latter method aligns the matching bases or residues. Sequence alignments are valuable for sequence divergence studies and for computer modeling based on homologous counterparts. The comparisons give overall sequence similarity regardless of alignment. Dot-plot figures, the graphic presentation of sequence comparison, show conserved regions as well as the alignment, which can be seen around the main diagonal. The number of known sequences has increased drastically, and often several sequences having the same function as the sequence under study can be found in databases. This has lead to the need for methods analyzing simultaneously several sequences. Most of these techniques are for aligning several sequences (see other chapters in this volume), but some are for simultaneous comparison of several sequences.L~2 In particular, a new

I M. Vihinen, Comput. Appl. Biosci. 4, 89 (1988). * G. Krishnan, K. K. Rajinder, and P. Jagadeeswaran, Nucleic Acids Rex 14, 543 (1986).

METHODS

IN ENZYMOLOGY,

VOL.

183

Copyright 0 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.

Multiple sequence comparison.

438 ALIGNING PROTEIN AND NUCLEIC ACID SEQUENCES 1271 behavior of our measures based on optimal rotations, instead of under the (usually false)...
677KB Sizes 0 Downloads 0 Views