[28] SIMULTANEOUS COMPARISON OF SEVERAL SEQUENCES 447

Conclusion

Multiple sequence comparison is most useful when sequence similarity is weak. The technique outlined here does not attempt to produce an overall sequence alignment, but it is easy to use, finds results reasonably quickly if they are to be found at all, extends sequence comparison to far more than two or three sequences, and provides a statistical basis for assertions about regions of protein similarity.

Acknowledgments

This work was supported by the Medical Research Council of Canada through a grant to the MRC Group on Protein Structure and Function, Department of Biochemistry, University of Alberta.

[28] Simultaneous Comparison of Several Sequences

By MAUNO VIHINEN

Introduction

Numerous methods have been developed to compare and align two or more nucleic acid or protein sequences. The difference between sequence comparison and alignment is that the former indicates all similarities between the sequences, whereas the latter aligns the matching bases or residues. Sequence alignments are valuable for sequence divergence studies and for computer modeling based on homologous counterparts. Comparisons give overall sequence similarity regardless of alignment. Dot-plot figures, the graphic presentation of sequence comparison, show conserved regions as well as the alignment, which can be seen around the main diagonal.

The number of known sequences has increased drastically, and often several sequences having the same function as the sequence under study can be found in databases. This has led to the need for methods that analyze several sequences simultaneously. Most of these techniques are for aligning several sequences (see other chapters in this volume), but some are for simultaneous comparison of several sequences.1,2 In particular, a new

1 M. Vihinen, Comput. Appl. Biosci. 4, 89 (1988).
2 G. Krishnan, K. K. Rajinder, and P. Jagadeeswaran, Nucleic Acids Res. 14, 543 (1986).

METHODS IN ENZYMOLOGY, VOL. 183
Copyright © 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.

448 ALIGNING PROTEIN AND NUCLEIC ACID SEQUENCES [28]

method to study sequence similarities by comparing one sequence to several others was developed.1 In this approach pairwise comparisons of aligned sequences are superimposed to search for conserved regions of the query sequence. The method can be used to compare DNA, RNA, or protein sequences.

The three-dimensional structures of related proteins are known to be more conserved than their primary sequences. Also, the secondary structures comprising the tertiary structure are conserved. To be able to use these data the algorithm was modified so that it can simultaneously compare predicted secondary structural features of several sequences and thus extend comparisons into a wholly new dimension.

Methods

General Description

The main idea of the algorithm is to compare one sequence X with several other sequences. The classic window/stringency method is used in pairwise comparisons between sequence X and each of the sequences Y1 . . . Yn. The observation arrays from the pairwise comparisons are superimposed, and those points scoring equal to or higher than some predetermined stringency value are recorded (Fig. 1). The same idea is used to compare secondary structural predictions, where numerical values are compared instead of characters.

FIG. 1. Multiple sequence comparison scheme. Observation arrays of pairwise comparisons are superimposed, and values of cells (xi,yj) in each array are summed (a). If the total number of matches for a certain cell is equal to or greater than the stringency S, the point is accepted and drawn on the dot-plot (b). This is repeated for each cell of the observation arrays to get the relationships between the sequence X on the x axis and the sequences Y on the y axis (c).
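The superposition step summarized in the Fig. 1 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the FORTRAN code of the original programs: the function name `superimpose`, the nested-list representation of an observation array, and the toy arrays are all hypothetical. The first index is taken to be the position in sequence X and the second the position in an aligned sequence Y.

```python
from typing import List

Grid = List[List[int]]  # assumed layout: grid[i][j] is the value recorded for cell (xi, yj)


def superimpose(arrays: List[Grid], second_stringency: int) -> List[List[bool]]:
    """Sum equally sized observation arrays cell by cell and keep the
    cells whose summed value reaches the second stringency."""
    rows, cols = len(arrays[0]), len(arrays[0][0])
    accepted = [[False] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = sum(array[i][j] for array in arrays)
            accepted[i][j] = total >= second_stringency
    return accepted


# Three toy 2 x 3 observation arrays (1 = a pairwise match was recorded).
a1 = [[1, 0, 0], [0, 1, 0]]
a2 = [[1, 0, 1], [0, 1, 0]]
a3 = [[0, 0, 1], [0, 1, 0]]
dots = superimpose([a1, a2, a3], second_stringency=2)
```

If the cells hold 0/1 flags, the threshold counts how many sequences Y matched at that cell; if they hold the match counts of the pairwise comparisons, the same function thresholds the total number of matches.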


Sequence Alignment

Before the sequence comparison can be performed, all sequences Y have to be aligned with the sequence X, since the conserved residues can have different spacing even in closely related sequences. There are many methods from which to choose. All the insertions and deletions causing gaps in either of the aligned sequences are taken into account in the sequence Y, because gaps could occur in different places in the sequence X when different Y sequences are compared to it. Gaps appearing in the sequences Y are filled with dots; those in the sequence X lead to deletion of the insertion from the sequence Y. In this way conserved characters have the same spacing and distance in the sequences Y. Dots used to adjust the conserved characters are ignored unless one occurs in the middle of the window, where they are not allowed.

The goodness of the alignments can be tested by doing the usual comparison of the two sequences. If the conserved residues are on the main diagonal, the alignment can be used. If the conserved regions wander around the main diagonal, the parameters of alignment should be adjusted. To avoid excessive gaps, however, the gap weight and the length weight should not be too small. As a rule several alignments should be tested with each sequence Y, and the one with the highest weight values giving a reasonable alignment should be chosen. Doolittle gives guidelines for estimating the number of gaps allowed in related sequences.3
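The gap handling described above (dots for gaps in Y, deletion of Y insertions where X has a gap) can be illustrated with a short Python sketch. It assumes the pairwise alignment is already available as two equal-length strings with "-" marking gaps; the function name `project_onto_query` and the toy alignment are hypothetical and are not taken from the original programs.

```python
def project_onto_query(x_aligned: str, y_aligned: str, gap: str = "-") -> str:
    """Rewrite an aligned sequence Y in the coordinates of the ungapped X.

    Columns where X has a gap are insertions in Y and are dropped;
    columns where Y has a gap are filled with a dot.
    """
    if len(x_aligned) != len(y_aligned):
        raise ValueError("aligned sequences must have equal length")
    projected = []
    for x_char, y_char in zip(x_aligned, y_aligned):
        if x_char == gap:
            continue  # insertion in Y relative to X: deleted
        projected.append("." if y_char == gap else y_char)
    return "".join(projected)


# Toy alignment: Y lacks the C of X and carries an extra K.
projected = project_onto_query("ACD-EFG", "A-DKEFG")  # "A.DEFG"
```

The result always has the length of the ungapped sequence X, so every projected sequence Y shares the same x-axis coordinates.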

Simultaneous Comparison of Several Sequences

The aligned sequences Y are compared pairwise with the sequence X, the query. The value of the window determines how many successive characters from each sequence are compared at a time. A certain number of matches, the stringency, must be reached for a point to be recorded in the middle of the window. The window is slid in steps of 1 through both sequences so that every character in sequence X will be compared to every character in sequence Y. This is repeated for each pair.

Comparison tables that take into account relatedness of amino acid residues give much better results for protein sequence comparisons than a unitary matrix, which is based only on identities. Residues mutate to others with different frequencies. The most often used comparison table, the Dayhoff mutation matrix,4 is based on observed amino acid replacements between similar proteins from closely related organisms.

3 R. F. Doolittle, Science 214, 149 (1981).
4 M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, in "Atlas of Protein Sequence and Structure" (M. O. Dayhoff, ed.), Vol. 5, Suppl. 3, p. 345. National Biomedical Research Foundation, Washington, D.C., 1978.
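The pairwise window/stringency comparison described in this section might be sketched as follows. The sketch makes several assumptions that are not in the original text: the function names are hypothetical, identity scoring stands in for a comparison table (a table such as the Dayhoff matrix could be passed as `score`, but no table values are given here), and padding dots contribute nothing to the score.

```python
from typing import Callable, Dict, Tuple


def identity(a: str, b: str) -> int:
    """Unitary scoring: 1 for identical characters, 0 otherwise."""
    return 1 if a == b else 0


def pairwise_points(x: str, y: str, window: int, stringency: int,
                    score: Callable[[str, str], int] = identity
                    ) -> Dict[Tuple[int, int], int]:
    """Slide a window over every pair of start positions and record the
    window score at the middle of the window when it reaches the stringency."""
    half = window // 2
    points: Dict[Tuple[int, int], int] = {}
    for i in range(len(x) - window + 1):
        for j in range(len(y) - window + 1):
            if y[j + half] == ".":
                continue  # a padding dot is not allowed in the middle of the window
            total = sum(score(x[i + k], y[j + k])
                        for k in range(window) if y[j + k] != ".")
            if total >= stringency:
                points[(i + half, j + half)] = total
    return points


# Toy query and one projected sequence Y (dot = alignment padding).
points = pairwise_points("ACDEFG", "A.DEFG", window=3, stringency=3)
```

Lowering the stringency to 2 in this toy case additionally accepts the window that overlaps the padding dot, which shows how the parameter trades sensitivity against noise.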


The observation arrays containing results of pairwise comparisons are superimposed in the following step. The values of each cell (xi,yj) of all arrays are summed, and only those points having a score equal to or higher than a second stringency are recorded (Fig. 1). The value of the second stringency can refer either to the number of sequences Y having scores equal to or higher than the first stringency or to the total number of matches of the pairwise comparisons. Both implementations can be used without problem, although they can give different results, especially when the first stringency is zero and the total number of matches is used. It seems preferable to use the first stringency even when the second stringency is applied to the total number of matches, because highly homologous sequences can otherwise cause a bias.

Actually, the technique using observation arrays containing the results of all pairwise comparisons was not implemented as described here. Each cell (xi,yj) is calculated simultaneously for all pairwise comparisons, because in this way storage of the observation arrays in memory is avoided. The end result is the same as if the described method, which is easier to understand, were used.

The choice of the window and stringency parameters has a great effect on the results. Too wide a window and too low a stringency give an enormous number of points appearing by chance (noise), which hides the conserved regions. Too high a stringency is also disadvantageous, because important signals are lost. Use of this algorithm, as of any comparison method, requires testing of several parameter sets. The value of the second stringency should probably be low in initial analyses to see whether similarities occur, and be raised when the window and the first stringency have been adjusted.

Comparison of Secondary Structure Predictions

Secondary structure predictions have to be done prior to a comparison. Any of the numerous predictions for α, β, and turn structures, hydropathy, flexibility,5 acrophilicity,6 surface probabilities,7 and other features can be used. All these methods use as input primary amino acid sequences, for which numerical values are calculated residue by residue. Some predictions are made for a fixed number of consecutive residues, whereas the prediction window can vary greatly for others, such as hydropathy. The size of the window used in a prediction should be carefully chosen, because too large a number of residues can hide important peaks by reducing the magnitude of alterations and by smoothing sharp peaks.

5 P. A. Karplus and G. E. Schulz, Naturwissenschaften 72, 212 (1985).
6 T. P. Hopp, in "Synthetic Peptides in Biology and Medicine" (K. Alitalo, P. Partanen, and A. Vaheri, eds.), p. 3. Elsevier, Amsterdam, 1985.
7 J. Janin, S. Wodak, M. Levitt, and M. Maigret, J. Mol. Biol. 125, 357 (1978).

The values obtained from structural predictions are compared by using the window/stringency method. The alignment of the sequences has to be taken into account. The same alignment should be used in both sequence and secondary structure comparisons. Because different sequences having similar secondary structural features can have slightly different predicted values, an extra value, a limit, is introduced to measure similarity of the sequences. The limit serves to grade the degree of similarity in the same way as the comparison tables used in sequence comparisons. The absolute difference between the values for residues in sequence X and sequence Y must be smaller than or equal to the value of the limit to be accepted as a match.

Since all important secondary structural features have a length of at least a few residues, the window is used to increase the signal-to-noise ratio. Similar values can be found all around the sequence, and the noise could prevent finding the conserved regions. Use of this second window is analogous to the use of a window in multiple sequence comparison. Noise can be reduced by concentrating on a certain bandwidth around the main diagonal. However, some important data can then be missed. The observation arrays containing the data for matches in the compared secondary structure predictions are superimposed as described for multiple sequence comparisons.

The method is very sensitive to the parameters used. The value of the limit must be changed for each prediction, because the amplitude of the values varies. The smaller the difference between the highest and lowest peaks in a prediction, the smaller the value of the limit should be. The window should usually be smaller than in sequence comparisons.
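The limit-based matching of predicted profiles described above can be sketched as a numerical variant of the window/stringency comparison. The following Python fragment is a minimal illustration under assumptions of our own: the function name `profile_points`, the use of `None` for alignment padding in the Y profile, and the optional `band` argument (the bandwidth around the main diagonal) are hypothetical choices, not the interface of SSCOMP.

```python
from typing import Dict, List, Optional, Tuple


def profile_points(px: List[float], py: List[Optional[float]],
                   window: int, limit: float, stringency: int,
                   band: Optional[int] = None) -> Dict[Tuple[int, int], int]:
    """Count, within each window, the positions whose predicted values
    differ by at most `limit`; record the count at the window middle
    when it reaches the stringency."""
    half = window // 2
    points: Dict[Tuple[int, int], int] = {}
    for i in range(len(px) - window + 1):
        for j in range(len(py) - window + 1):
            if band is not None and abs(i - j) > band:
                continue  # outside the band around the main diagonal
            matches = 0
            for k in range(window):
                y_value = py[j + k]
                if y_value is not None and abs(px[i + k] - y_value) <= limit:
                    matches += 1
            if matches >= stringency:
                points[(i + half, j + half)] = matches
    return points


# Toy profiles; values chosen so that the differences are exact in binary.
px = [1.0, 2.0, 9.0, 10.0, 3.0]
py = [1.5, 2.5, 5.0, 10.5, 3.0]
diagonal = profile_points(px, py, window=3, limit=0.5, stringency=2, band=0)
```

Because the amplitude of the values differs between prediction methods, `limit` would have to be chosen separately for each kind of profile, as the text notes.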
Implementation

The programs are written in FORTRAN to function with the University of Wisconsin software package environment.8 However, the software package is not required to run the programs. The programs MULTICOMP, for multiple sequence comparisons, and SSCOMP, for secondary structure comparisons, as well as the program that calculates the secondary structural predictions, are interactive. Alignments were done with the Needleman-Wunsch algorithm,9 although other alignments can be used.

8 J. Devereux, P. Haeberli, and O. Smithies, Nucleic Acids Res. 12, 387 (1984).
9 S. B. Needleman and C. D. Wunsch, J. Mol. Biol. 48, 443 (1970).


The programs were run under the VAX/VMS operating system. They are available from the author for a nominal charge.

Results of comparisons are shown in the form of a dot-plot picture, which has on the x axis the sequence X and on the y axis all the sequences Y1 . . . Yn. This kind of picture shows the conserved regions of sequence X in relation to all the other sequences. The dot-plot illustration should not be used to look for conserved sites in any of the sequences Y, since the alignments have changed the spacing in sequences Y by introducing gaps and deleting stretches of sequences. If conserved sites of any of the sequences Y are needed, a new comparison is required with the sequence Y of interest on the x axis. Another presentation can be a line drawing or a list of conserved regions in sequence X, although these do not show the spacing of conserved regions in relation to the main diagonal.

Use of Multiple Sequence Comparisons to Study Saccharifying and Liquefying α-Amylases

α-Amylases have been divided into two categories, saccharifying and liquefying enzymes, according to the extent of hydrolysis of starch.10 Taka-amylase A of Aspergillus oryzae is saccharifying, whereas the Bacillus stearothermophilus enzyme is liquefying. Amylases are known to have some sequence similarity.11-13 As a test of the applicability of multiple sequence analysis, the method was used to find whether liquefying α-amylases could be distinguished from saccharifying ones on the basis of the appearance of similarities in different regions when compared to other starch-hydrolyzing enzymes.

The sequences for Taka-amylase A14 and B. stearothermophilus α-amylase15 were compared with each other (Fig. 2a). The main diagonal indicates significant homology, as previously reported. The similarities in the hydropathy predictions (Fig. 2b) appear in the same regions as sequence similarities. The empty space between about residues 100 and 150 on the y axis is due to an insert in the B. stearothermophilus enzyme.

FIG. 2. Comparison of (a) sequences and (b) hydropathy profiles of B. stearothermophilus and A. oryzae α-amylases. The sequence of the liquefying B. stearothermophilus enzyme is on the x axis and that of the saccharifying Taka-amylase A on the y axis. The comparison window was 20, the first stringency 0, and the second stringency 12 for sequence comparison. The hydropathy profiles were calculated with a window of seven residues by the method of Hopp and Woods.22 The values of the parameters were 6 for the window, 0.15 for the limit, 0 for the first stringency, and 5 for the second stringency.

Both α-amylase sequences were compared to several amylolytic enzymes: Aspergillus niger glucoamylase I,16 Pseudomonas amyloderamosa isoamylase,17 Klebsiella aerogenes pullulanase (α-dextrin endo-1,6-α-glucosidase),18 Bacillus circulans β-amylase,19 and Bacillus macerans cyclodextrin glucosyltransferase (cgtase)20 (Fig. 3). The conserved regions shared by B. stearothermophilus α-amylase and Taka-amylase A are 33-49, 101-112, 220-236, 444-456 and 47-67, 113-126, 194-210, 434-438, respectively, whereas the liquefying enzyme has conserved regions also at 26-30, 208-214, and 469-473 and the saccharifying enzyme at 11-15, 91-109, 227-234, 316-325, and 392-407. Two of the common regions of similarity between B. stearothermophilus and Taka-amylase A were also reported by Svensson.13 The residues thought to be involved in the catalytic site and in substrate binding of Taka-amylase A21 are conserved, as are the corresponding sites in the B. stearothermophilus α-amylase. Similarities were also found in the amino- and carboxy-terminal parts of both enzymes.

FIG. 3. Comparison of (a) Taka-amylase A and (b) B. stearothermophilus α-amylase to glucoamylase, isoamylase, β-amylase, pullulanase, and cgtase sequences. The window was 20, the first stringency 8, and the second stringency 43.

The B. stearothermophilus and A. oryzae α-amylases were used as representatives of their classes. The sequences for liquefying α-amylases in particular are very similar.1 Knowledge about sequence similarities and dissimilarities in saccharifying and liquefying α-amylases is useful for modeling the three-dimensional structures of α-amylases and for studying and modifying properties of the enzymes by site-directed mutagenesis. These kinds of data can be used in the study of phylogenetic relationships, too.

The comparison of secondary structural features of these sequences did not show conserved regions. This is understandable, since the accuracy of the secondary structure predictions varies from about 50 to 70%, and the sequences were only distantly related. The comparisons of secondary structure cannot be any better than the predictions used. However, the secondary structural features can be valuable in cases of more conserved sequences (Fig. 2b).

10 J. Fukumoto, J. Ferment. Technol. 41, 427 (1963).
11 J. C. Rogers, Biochem. Biophys. Res. Commun. 128, 470 (1985).
12 R. M. Mackay, S. Baird, M. J. Dove, J. A. Erratt, M. Gines, F. Moranelli, A. Nasim, G. E. Willick, M. Yaguchi, and V. L. Seligy, Biosystems 18, 279 (1985).
13 B. Svensson, FEBS Lett. 230, 72 (1988).
14 H. Toda, K. Kondo, and K. Narita, Proc. Jpn. Acad. 58B, 208 (1982).
15 I. Suominen, M. Karp, J. Lautamo, J. Knowles, and P. Mäntsälä, in "Extracellular Enzymes of Microorganisms" (J. Chaloupka and V. Krumphanzl, eds.), p. 129. Plenum, New York, 1987.
16 E. Boel, I. Hjort, B. Svensson, F. Norris, K. E. Norris, and N. P. Fiil, EMBO J. 3, 1097 (1984).
17 A. Amemura, R. Chakraborty, M. Fujita, T. Noumi, and M. Futai, J. Biol. Chem. 263, 9271 (1988).
18 N. Katsuragi, N. Takizawa, and Y. Murooka, J. Bacteriol. 169, 2301 (1987).
19 K. W. Siggens, Mol. Microbiol. 1, 86 (1987).
20 T. Takano, M. Fukuda, M. Monma, S. Kobayashi, K. Kainuma, and K. Yamane, J. Bacteriol. 166, 1118 (1986).
21 Y. Matsuura, M. Kusunoki, W. Harada, and M. Kakudo, J. Biochem. 95, 697 (1984).
22 T. P. Hopp and K. Woods, Proc. Natl. Acad. Sci. U.S.A. 78, 3824 (1981).
23 J. Kyte and R. F. Doolittle, J. Mol. Biol. 157, 105 (1982).
24 J. L. Cornette, K. B. Cease, H. Margalit, J. L. Spouge, J. A. Berzofsky, and C. DeLisi, J. Mol. Biol. 195, 659 (1987).
25 T. P. Hopp, in "Proteins: Structure and Function" (J. J. L'Italien, ed.), p. 437. Plenum, New York and London, 1987.

Comparison of Hydropathy Scales

The comparison of secondary structural features can be extended to study predictive methods. Several methods and scales are available, especially for predictions of α-helix, β-sheet, and turn and for hydropathy calculations. Methods to predict secondary structural features can be evaluated and their accuracies estimated by comparison to known three-dimensional structures. Here, two hydropathy scales, that of Hopp and Woods22 and that of Kyte and Doolittle,23 are compared for the B. stearothermophilus α-amylase in Fig. 4. The two methods predict many of the peaks and valleys similarly. The scales were normalized to have the same average of absolute values according to Cornette et al.24

FIG. 4. Comparison of hydropathy profiles of B. stearothermophilus α-amylase calculated by values of Hopp and Woods (x axis) and Kyte and Doolittle (y axis). The window was 6, the limit 0.4, the first stringency 0, and the second stringency 6.

Comparison of structural predictions can be used to study different methods and scales and even to search for relationships between different features of structure. The ends of α and β structures are often hydrophilic,25 and these kinds of data can be easily obtained with the method.
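The preparation of two hydropathy scales for the comparison above can be illustrated with a short Python sketch. Several points are assumptions rather than statements of the original procedure: the scale values are quoted from the commonly cited tables and should be checked against the original papers; the Kyte and Doolittle values are negated here so that both scales measure hydrophilicity, a step the text does not mention; the normalization simply rescales each table to a mean absolute value of 1; and the profile is a plain moving average. All function names and the toy peptide are hypothetical.

```python
from typing import Dict, List

# Hydrophilicity values commonly quoted for the Hopp and Woods scale.
HOPP_WOODS: Dict[str, float] = {
    "R": 3.0, "D": 3.0, "E": 3.0, "K": 3.0, "S": 0.3, "N": 0.2, "Q": 0.2,
    "G": 0.0, "P": 0.0, "T": -0.4, "A": -0.5, "H": -0.5, "C": -1.0,
    "M": -1.3, "V": -1.5, "I": -1.8, "L": -1.8, "Y": -2.3, "F": -2.5,
    "W": -3.4,
}

# Hydropathy values commonly quoted for the Kyte and Doolittle scale.
KYTE_DOOLITTLE: Dict[str, float] = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9,
    "R": -4.5,
}


def normalize(scale: Dict[str, float], target: float = 1.0) -> Dict[str, float]:
    """Rescale a table so that the average of its absolute values equals target."""
    mean_abs = sum(abs(value) for value in scale.values()) / len(scale)
    return {aa: value * target / mean_abs for aa, value in scale.items()}


def profile(sequence: str, scale: Dict[str, float], window: int) -> List[float]:
    """Moving average of the scale values over windows of consecutive residues."""
    return [sum(scale[aa] for aa in sequence[i:i + window]) / window
            for i in range(len(sequence) - window + 1)]


hw_norm = normalize(HOPP_WOODS)
# Assumption: flip the sign so that positive means hydrophilic on both scales.
kd_norm = normalize({aa: -value for aa, value in KYTE_DOOLITTLE.items()})

peptide = "MKRLLAVDEGSTW"  # hypothetical toy sequence
hw_profile = profile(peptide, hw_norm, window=7)
kd_profile = profile(peptide, kd_norm, window=7)
```

Two profiles prepared in this way have comparable amplitudes, so a single limit can be applied when they are compared with the window/stringency method.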


Conclusion

Here, several protein sequences were compared simultaneously with a multiple sequence comparison method to show conserved regions of one sequence. However, the method is applicable for comparing DNA and RNA sequences, too. Pairwise comparisons are superimposed; thus, enormous computer power is not required. This approach is very fruitful because a multiple sequence comparison gives more data than individual pairwise analyses. As a test, sequences for liquefying and saccharifying α-amylases were shown to share some common regions with other amylolytic enzymes, but each shared different conserved regions with others.

Several secondary structural features can be predicted, and their use to compare protein sequences can point to important features possibly invisible in sequence comparisons. However, some caution is necessary in the use of secondary structure predictions, because predictive methods do not always give correct results. In connection with sequence analysis, structure predictions can be valuable in searching for functional sites, e.g., for protein engineering studies.

The relatedness of two hydropathy prediction scales was analyzed with the method. Extended to the analysis of different predictive methods, this approach may have numerous applications in the study of the relationships between different structural features. The method can even be used to analyze three-dimensional structures, either refined or modeled.

Acknowledgments

Antti Euranto and Petri Luostarinen are thanked for implementation of the algorithm. This work was supported by a grant from Neste Oy Foundation.

[29] Hierarchical Method to Align Large Numbers of Biological Sequences

By WILLIAM R. TAYLOR

Introduction

The rapidly increasing determinations of biological sequences have increased the corresponding need for a fast and effective multiple sequence alignment computer program for their analysis. The contents of this volume indicate that this need has not been neglected by those who write
