352

ALIGNING PROTEIN AND NUCLEIC ACID SEQUENCES

[2 11 Sensitivity

Comparison of Protein Sequences

Amino

1211

Acid

By PATRICKARGOS and MARTIN~INGRON Introduction Many procedures have been developed to align protein amino acid sequences; this volume is proof alone. To compare two protein primary structures, a classic technique involves positioning the two sequences in single-letter code and with lengths m and n such that one is vertical along a column (m) and the other horizontal along a row (n). Then, in matrix fashion, scores are inserted at each grid point (m X n) corresponding to every pairwise comparison of the amino acids in each sequence. The scores can simply be 0 (nonidentity) or 1 (identify) on up to the Dayhoff mutation matrix (20 X 20) expressing relative position weights with which amino acids prefer to exchange in aligned sequence families as c cytochromes or globins.’ To achieve the actual alignment, all possible paths are traced through the sequence comparison (search) matrix with the residues consecutively aligned from the amino to the carboxy termini, albeit with deletions and insertions. For each path, the sum of positive Dayhoff values along the trace is determined with gap penalties subtracted each time an insertion/deletion is initiated and extended by one position in length (see Refs. 2,3, and 4 for details and reviews). There are fast ways5 to score every possible path to find the optimal one (maximum). The classic procedure is flawed, especially when alignments result in amino acid identity at 35% or less of the matched positions. The technique relies on extrinsic choices for the (20 X 20) scoring matrix as well as for the values of the initiating and extending gap penalties. Controls using “standard-of-truth” alignments from tertiary structural superposition@ show that optimal matches are achieved for different sequence pairs (e.g., serine

r M. 0. Dayhoff, W. C. Barker, and L. T. Hunt, this series, Vol. 91, p. 524. 2 P. Argos and P. McCaldon, in “Genetic Engineering, Principles and Methods” (J. K. Setlow, ed.), Vol. 10, p. 21. Plenum, New York and London, 1988. 3 G. von Heijne, “Sequence Analysis in Molecular Biology.” Academic Press, New York, 1988. 4 R. F. Doolittle, “Of URFS and ORFS a Primer on How to Analyze Derived Amino Acid Sequences.” University Science Books, Mill Valley, California, 1986. s S. B. Needleman and C. D. Wunsch, J. Mol. Biol. 48,443 (1970). 6 B. W. Matthews and M. G. Rossmann, this series, Vol. 115, p. 397.

METHODS

IN ENZYMOLOGY,

VOL.

183

Copyright 0 1990 by Academic Res, Inc. All rights of reproduction in any form reserved.

r211

SENSITIVE

SEQUENCE

COMPARISON

353

proteases or globins) with different gap penalty choices. Gap penalties should even be varied within regions of the sequence alignment according to the matched secondary structures, such as j? strands or exposed loops. Furthermore, local sequence dependency is not considered, as in comparing sequence fragments (oligopeptides); instead, residues are examined only singly in the search matrix. This chapter describes a procedure for pairwise sequence alignment that attempts to overcome the aforementioned difficulties; the technique is correspondingly described as sensitive. Next, a multiple sequence alignment procedure is described that preserves both sensitivity and speed of execution. The significance or credibility of the resultant alignments is subsequently considered. Finally, computer programs based on the procedures and available from the authors are discussed. Sensitive

Pairwise

Sequence

Comparison

Alignments considered “standards of truth” can be had by spatial superposition of o-carbon backbone folds known from X-ray diffraction analysis.7 The distribution of residue match and conservation is often not uniform. There may be 10 amino acids where half are identical and one is a cysteine essential for function, then an insertion/deletion of 8 residues, the next 30 residues with only 10% identity but strong hydrophobicity conservation to preserve a buried /? sheet, another insertion/deletion, then 12 amino acids with three identically conserved glycine residues essential in turns, then 5 amino acids with no perceivable conservation because their exposure allows extreme variation, and so forth. The variable pattern is a result of ever-changing structural and functional requirements. So, in any sensitive sequence comparison technique, it is essential to consider sequence fragments of varying length as well as the physical and chemical characteristics of the 20 amino acids. Such a procedure has been developed and described in detaiL8 The alignments of several sequences that relied on tertiary structures were collected; the identity level was only about 1896, on the average. The individual alignments were strung together such that over 11,000 amino acids were paired. Correlation coefficients were then calculated using about 100 different amino acid characteristics. For example, two strings of paired hydrophobicity values, each with 11,000 elements, are generated from the extended sequence alignment. The correlation between the two hydrophobicity series is determined. The following five characteristics proved to be ’ M. G. Rossman and P. Argos, J. Mol. Biol. 109, 99 (1977). 8 P. Argos, J. Mol. Biol. 193,385 (1987).

354

ALIGNING

PROTEIN

AND

NUCLEIC

ACID

SEQUENCES

1211

most sensitive: (1) hydrophobicity, (2) residue bulk or shape, (3) preference of a residue to be in the turn structural conformation, (4) antiparallel /?-strand conformational potential, and (5) amino acid refractivity index, which correlates strongly with molecular weight. The importance of these parameters seems obvious in conserving protein tertiary folds: hydrophobicity to maintain the interior core, turn preference to ensure the chain returns to form the core, p preference as strand folds often comprise protein interiors, and side chain size and shape for close-packed structures. A window length (e.g., 25 residues) must be defined. When aligning every 25-residue span of one protein sequence with every such span in the other sequence, two scoring criteria are used. The sum of the Dayhoff mutation values for each of the residue pairs is determined. The correlation coefficients for each of the five aforementioned characteristics are calculated for the two aligned sequence fragments and averaged. Calculating correlations insists that the two fragments show similarly varying characteristic patterns without the necessity of similar absolute characteristic values. Assigning weights a priori to the parameters is next to impossible, because accurate predictions cannot be made of the tertiary structure of the sequence fragment. So, each parameter is given unit weight with the expectation that the locally most important will dominate the average. These processes are repeated for all possible fragment comparisons in the two sequences. The Dayhoff score and mean correlation are then added and scaled such that each contributes equally, on average, to the combined score. Considering only those oligopeptide matches where mutation and structure agree reduces greatly the noise level in the search matrix.* The standard deviation and mean of the combined scores are calculated, and then the combined scores are expressed as the number of standard deviations (u) above the mean. The scores are entered into a search matrix bordered by the two sequences if they are greater than some threshold (usually 3.00). The o values are entered over the entire probe length, and overlaps are resolved by maintaining the largest values. For example, if positions 11 to 20 in one sequence (window length 10) were compared with 41 to 50 in the other sequence with a combined and normalized score of 3.1 D, then this value is placed at matrix sites (11,41), (12,42), . . . , (20,50). Now, when 12 to 21 is compared with 42 to 5 1 at 5.2~1,then 5.2~ is placed at ( 12,42), ( 13,43), . . . , (21,51); 5.2~ overlaps 3.1~ in 9 positions and replaces the lower values. The combined scores for several probe lengths are collected into one search matrix by merely repeating the above procedures for each window length, the highest standard deviation score always resolving any overlap. Usually it is sufficient to use window lengths of 5 to 35 in steps of 1,2, or 3 (depending on available computer time and noting that insertions/dele-

[211

SENSITIVE

SEQUENCE

COMPARISON

355

tions invariably occur after 35 or fewer residues have been consecutively matched). The use of the number of standard deviations as a score allows an equivalent comparison of the results from each probe where the noise or background is taken from the search at a given window length. This approach, along with plotting over the entire probe length, literally “fills in” the homology. An example is given in Fig. 1. Usually only those peaks above a certain threshold are placed in the search matrix (3.0~ for sequences less than 500 residues and 3.5~ for those longer); the remaining positions are set to 0.0~. A computer program has been developed9 that allows the user to select, through a “mouse” device, an alignment path in the search matrix directly on a graphics screen. Gap penalties are thus avoided, and lower valued or suboptimal peaks consistent with the overall alignment trend can be easily selected. Alternatively, a path can be traced automatically by considering all possible paths with an efficient algorithm.5 Plotting over the entire probe length and selection of a minimum v value for display force the automated procedure to maintain a particular high-peak trace not achieved when only single residues are compared. Four approaches to alignment were allowed to compete’ in order to achieve a standard-of-truth alignment between chymotrypsin and cr-lytic protease (176 matched positions, 20.5% identity, oligopeptide GDSGG conserved exactly); lupin and human hemoglobins (139 matches, 16.4% identity); southern bean mosaic and tomato bushy stunt virus amino-terminal region (122 alignment sites, 29.5% identity) as well as a carboxy-terminal region (60 residues, 26.7% identity); and valyl- and leucyl-tRNA synthetases (536 matches, 25.4% identity, oligopeptide TTRPETL shared) for which an unambiguous trace path could be visualized in the sensitive search matrix. Table I shows the number of errors for the five cases over four procedures using different gap penalties when appropriate. The acronyms for the techniques are ISSC (user interaction with the sensitive search matrix);9 AUTO (automated trace path through the sensitive search matrix);9 BESTFIT (automated alignment procedure relying on a modified Dayhoff matrix and the algorithm of Smith and Watermani and found in the UWGCG program package);” and ALIGN (developed by Dayhoff and colleagues using the Dayhoff matrix and automated alignment.)’ It is clear from Table I that the sensitive search matrix used in the interactive or automated (lacking the advantage of various gap penalties)

9 R. Rechid, M. Vingron, and P. Argos, CABOIS 5, 107 (1989). lo T. F. Smith and M. S. Waterman, Adv. Appl. Math. 2,482 (1981). I’ J. Devereux, P. Haeberli, and 0. Smithies, Nucleic Acids Rex 12, 387 (1984).

356

ALIGNING

0 I 0

PROTEIN

46

92

AND

1 139

NUCLEIC

185

ACID

1 231

1211

SEQUENCES

277

324

370

7 416

080 INTEGRASE WINOOW 25 OAYHOFF +5 PARAMETERS FIG. 1. (a) Search matrix, based on the Dayhoff and five parameter scores, for comparison of integrases from bacteriophages P2 and 480. A constant search window length of 25 residues was used. The symbols indicate the number of standard deviations above the matrix mean for the search peak heights (S) according to the following scheme: 3.0~ C S < 3.3~ (thin lines); 3.3~ s S < 3.7~ (thick lines); 3.7~ =GS < 4.0~ (wavy lines); 4.0~ a S s 4.2~ (circles). (b) As (a), except using windows 7 to 25 in steps 1 with symbol ranges of 3.7~ G S < 4.00 (thin lines); 4.0~ C S < 4.2~ (thick lines); 4.2~ G S C 4.6~ (circles). It is clear that using many windows facilitates delineation of the alignment path by literally “filling it in.”

modes yields alignments closest to those resulting from tertiary structural superposition. In the case of the synthetases, where the sensitive search matrix path was unambiguous, ALIGN and BESTFIT were substantially different. Functionally important and exactly conserved oligopeptides are not missed by the sensitive procedures while the other methods may or may not delineate them depending on the gap penalties (see Fig. 2 for an illustration). ALIGN, under all conditions, never discovered the TTREPTL of the synthetases. Furthermore, different gap penalties had to be utilized to achieve optimal alignments and to delineate the well-con-

El1

SENSITIVE

b

SEQUENCE

357

COMPARISON

370-

324-

I / / fl

0

(6

92

139

105 231 277 080 INTEGRASE WINDOWS 7 TO 25 OAYHOFF +S PARAMETERS FIG. lb.

326

370

,

, 416

served oligopeptides for the various examples. Choosing the appropriate values a priori is difficult without knowledge of tertiary structure, which is almost always not the case. Multiple

Sequence

Alignment

Weak homologies may become visible with simultaneous alignments of several related sequences. This is accomplished by a multiple sequence alignment algorithm. One natural approach to the multiple alignment problem is the generalization of the classic procedure described previously. A multidimensional search matrix is created with one sequence along each dimension and the Dayhoff mutation values entered at each matrix coordinate according to the two amino acids compared. To trace all possible paths which include

358

ALIGNING

PROTEIN

AND

ACID SEQUENCES

NUCLEIC

TABLE I ERROR RATES FOR GIVEN SEQUENCE Method ISSC AUTO (GA, GB) 10, 1 BESTFIT (GA, GB) 6.0, 0.4 5.0,0.3 4.0,0.2 3.0,o. 1 2.0,o.o ALIGN (K. GA) +2,0 +2, 10 +2,20 +10,0 + 10, 10 + 10,20 + 20,oo +20, 10 + 20,20

Proteases

Globins

1211

COMPARISONP

Virus amino terminus

Virus amino terminus

tRNA synthetases

Ill(Y)

8

12

12

122(Y)

8

19

60

106(Y)

162(N) 162(N) 123(N) 112(N) 123(Y)

24 24 21 21 21

12 12 12 17 14

60 60 60 60 60

233(N) 168(N) 139(Y) 146(Y) 194(Y)

148(Y) 127(Y) 149(N) 154(N) 139(N) 131(N) 156(N) 139(N) 134(N)

133 19 18 54 22 25 40 22 25

65 15 12 33 16 12 59 12 12

60 60 60 60 60 60 60 60 60

O(Y)

493(N)

2WN) 179(N) 430(N) 474(N) 464(N) 470(N) 468(N) 454(N)

n K is the constant added to the Dayhoff matrix as allowed by the ALIGN program while GA and GB are the gap penalties to initiate and extend, respectively, an insertion/deletion. For proteases and tRNA synthetases, a pentapeptide (GDSGG) and heptapeptide (TTREPTL), respectively, were identical in the two compared sequences. An (N) next to an error rate indicates that the sequence identity was NOT recognized by the alignment technique at the given parameter settings while a (Y) indicates recognition. The error count was determined by subtracting the number of correct matches produced by the alignment scheme from the total number of alignment positions in the standard-of-truth alignment with gap sites not counted.

insertions and deletions becomes practically impossible, even with shortcuts (except for the three-sequence case12), on present-day computers. Another approach builds multiple alignments from only pairwise comparisons of all the sequences. I3 In this case, the alignment of the two most related sequences is accepted using selected gap penalties. A third sequence, known to be most related to the previous sequences from all the I2 M. Murata, J. S. Richardson, and J. L. Sussman, Proc. Natl. Acad. Sci. U.S.A. 82, 3073 (1985). I3 G. J. Barton and M. J. E. Stemberg, J. Mol. Biol. 191, 153 (1987).

1211 LeuSyn ValSynl ValSyn2 ValSyn3 LeuSyn ValSynl ValSyn2 ValSyn3

SENSITIVE

(238) (241) (204) (201)

SEQUENCE

359

COMPARISON

WIGESIGAELVFKVADSKLENLIVFTTRPETLFAV RYKDLIGKYVILPLVNRRIP--1vGDEHADMEKGT LADGAKTADGK--------DYLWATTRPETLLGD RYPLADGA ----KTADGK-DYLWATTRPETLLGD YVALALDHPIVQKYSE CVKITPAHDF-----GVAVNPEDPRYKDLIG GVAVNPEDPRYKDLIG

(5.0,0.3) (5.0,0.2) (4.0,0.2)

FIG. 2. Alignment of particular regions in bacterial leucyl-(LeuSyn) and valyl-tRNA synthetases (ValSyn). The number in parentheses indicates the starting sequence position. The number following ValSyn indicates the respective alignments resulting from the program BESTFIT, with initiating (I) and extending (E) gap penalties given as (I,E) at the end of each line. It is clear that only a slight change in gap penalty yields very different alignments, even to the extent of missing seven consecutive and identical amino acids (TTRPETL).

pairwise comparisons, is aligned to the first two by the profile technique.r4 The set of previously aligned sequences is represented by a matrix, each column of which corresponds to a position in the underlying alignment. The number of rows is the number (20) of amino acids plus one added for the gap possibility. Every column of the profile matrix contains the distribution of the amino acids at the corresponding alignment position. A sequence is aligned to a set of matched sequences by creating a search matrix bounded by the sequence to be aligned and by the profile columns corresponding to the number of alignment positions in the sequence cluster. The score for matching a profile column with an amino acid from the sequence is calculated as the weighted sum over the Dayhoff values for the amino acid of the one sequence versus those occurring in the alignment position to be matched. Every search matrix value is weighted by the frequency of the residue as given in the profile column. All possible paths under given gap penalties can now be traced through the profile versus the one-sequence search matrix and the optimal one found. A new profile matrix is then calculated from the three aligned sequences, and then a fourth sequence can be compared. The procedure is repeated until the multiple alignment is effected. Though the iterative pairwise alignments and profile analyses are clever and useful, they are flawed. Different gap penalty choices once gain result in different alignments, which for weaker homologies are difficult to choose among. The final alignment is dependent on sequence order; for I4 M. Gribskov, A. D. McLachlan, and D. Eisenberg, Proc. Natl. Acad. Sci. U.S.A. 84,4355 (1987).

360

ALIGNING

PROTEIN

AND

NUCLEIC

ACID

SEQUENCES

I211

example, once gaps are introduced from the first few sequences, they can never be eliminated or modified with the addition of further sequences. Furthermore, there is no assurance that the early matched sequences will be correctly aligned in all local regions. The authors have developed a multiple sequence alignment technique that avoids or mitigates some of these problemsi The dipeptides contained in each sequence are distinguished and counted. Those sequences that show a high count of the same 2-mers are clustered; they should be closely related. Within each cluster the sequences are aligned so that the largest number of dipeptides are brought into multiple coincidence for all the clustered subsets with preservation of alignment progression. For example, suppose the sequences MLRKENRKSELAGYFGG, RKRTTELYDGG, and MISTRKEMGGSTELLGFGG are to be aligned. It is clear that dipeptides RK, EL, and GG will allow the best progressive match of the sequences: ML[RK]ENRKS-[EL]AGYF[GG] MI S T[RK]EMGGST[EL]LGF[GG] -RTT[EL]-YD[GG] VWAligning the carboxyl-terminal [GG] of the first sequence with the first-occurring (amino-terminal) one in the second sequence would of course not be performed. The regions bordered by the 2-mers are then matched from longest to shortest (e.g., EMGGST with ENRKS with RTT); this minimizes gaps. The first two (EMGGST and ENRKS) are matched by considering all possible paths in a classic search matrix dependent on Dayhoff mutation values and chosen gap penalties. The third is fitted to the profile of the first two, and so forth. The procedure is repeated for all intervals within a given cluster and then for all clusters. The two clusters sharing the greatest number of dipeptides are joined by aligning each of their profiles. Two profiles are compared by once again calculating a search matrix where the columns correspond to the alignment positions of one sequence set and consist of their amino acid compositional distribution, and the rows are the similar values for the other sequence cluster. The score at each search matrix coordinate is the sum, over all possible amino acid comparisons, of the product of the residue compositions in each sequence cluster, the Dayhoff value for the residues compared, and the weights associated with each sequence in their respective clusters. All possible paths are traced in the resultant search matrix and the two profiles aligned. A new profile is then determined from the alignment of the two previous profiles. The next sequence cluster is added according to the dipeptide count, and its profile is aligned with the just created one. Is M. Vingron

and P. Argos,

CABZOS 5, 115 (1989).

r211

SENSITIVE

SEQUENCE

COMPARISON

361

Then a more extensive profile is calculated from the three clusters and the procedure repeated until all sequences are matched. The process can be performed even if a subset contains only one sequence. Weights are attached to each sequence in a given alignment and can be used when calculating scores in the comparison of two profiles. A comparatively high number of similar sequences would normally bias a profile. A single sequence distantly related to the rest would contribute only marginally. The need to correct for underrepresentation of sequences arises. The total number of mismatches between one sequence and the remaining sequences at each alignment position is a useful indicator. A high number of mismatches means that the sequence is distant from the average sequence of the profile, while mismatch counts similar for all the sequences indicate a homogeneous set. After counting the mismatches for each sequence, weights are determined through division of each of the counts by the smallest one. For instance, if an alignment cluster consisted of two identical sequences and a very different one, the corrected profile would attach weights 1, 1, and 2 for the three sequences. This approach is not so order dependent as previous methods owing to the preclustering of the sequences. Furthermore, aligning profiles is more accurate than aligning single sequences or one sequence to a profile. In a comparison of 9 azurins with 15 plastocyanins, where the standard-oftruth match contained 94 match positions not counting gaps, only 36 errors were encountered in comparing the two profiles, while relating the two single sequences with known tertiary structure yielded, under the same gap penalties, 62 errors. Even if the single sequence comparisons were allowed several possible gap penalties, the best error rate was 43. Profile comparison has also been shown to recognize more easily conserved catalytic regions despite the use of different gap weights. In this case the reinforcement of multiple sequences containing a conserved pattern allows easy recognition by profile fitting between sequence sets. Weighting of the residue count in a given profile column according to the closeness of the sequences also improves sensitivity. For the azurin - plastocyanin comparison, a 30% improvement was observed over the unweighted procedure.i5 Aligning only the intervals between dipeptide anchors increases greatly the speed of the method. Nonetheless, it still suffers from the required setting of gap penalties and certain cut-off scores, such as those used for clustering. Significance

of Alignment

If the identity level is above about 30% or so with uniform distribution over the sequence alignment, and there are no long insertions or deletions, meaningfulness is obvious to the eye and by several mathematical criteria.

362

ALIGNING

PROTEIN

AND

NUCLEIC

ACID

SEQUENCES

1211

For most other cases, significance estimation is not straightforward. Certain regions of the protein sequences may be aligned well, while others can hardly be matched. An alignment may be reasonable over one relatively short stretch (25-50 residues) and yet cannot be extended. Sometimes four or more amino acids show a perfect or near-perfect consecutive match, and yet the remaining regions do not align well. In other cases, the entire sequences can be aligned, but with only 15 - 30% identity. Three approaches have generally been used to assessalignment quality: ( 1) Monte Carlo methods based on random shuffling of the two sequences and matching by the given procedure, (2) analytical formulas derived for particular alignment algorithms to yield expected match quality in shuffled sequences, and (3) use of real but unrelated protein sequences in large reference samples for analysis. In all cases a mean assessment and standard deviation (v) are calculated from the controls. For longer sequence alignments (say, greater than 50 residues), the literature consensus seems to state that greater than 3~ is necessary for consideration of the match as “possible.” Around 5~ or 6v it becomes probable, and above IOU, “certain.” It has been found that scores greater than 6v yield 75% correct alignments in helices and /I strands of known tertiary folds.13 Application of biological knowledge is also important. If the proteins share some function, a sequence relationship is more likely. Of course, no protein would use a sequence of randomly selected letters; careful engineering and bias prevail.16 Thus, significance estimates relying on real sequences are more trustworthy. The authors, along with Rechid,9 have run several unrelated sequence pairs through the sensitive technique, produced search matrices and associated automated alignments using weaker gap penalties to emphasize the use of large search scores, and determined that the average correlation coefficient for five residue properties, over all alignment positions not including gaps, is +0.282 (not 0.000 as random shuffling would yield). This value is subtracted from the mean correlation of the alignment to be tested and the result divided by the standard deviation of the trial scores, which depends on the alignment length. Then, in a strict sense, if the answer is greater than 2v (the 95% confidence level), the alignment may well be significant. If the biological connection is compelling, comparisons at a significance level less than 2v may even be considered. Obviously, the need to be greater than 3v for the typical control test is a result of the artificial shuffling of the sequences. With any questionable homology it is always important to examine visually the search matrix calculated over several window sizes. The authors have encountered cases where the same percent identity and the same I6 P. McCaldon

and P. Argos,

Proteins 4,99 (1988).

[211

SENSITIVE

SEQUENCE

COMPARISON

363

standard deviation significance have been achieved, yet the search matrices do not prove equally convincing. ” With mere credibility scores, the distribution of high peaks along the entire path cannot be seen and the number and size of noise peaks relative to match peak cannot be assessed; the presence of weak peaks that fill a homology path well and that are unlikely to occur by chance must be visualized. The high credibility score may result from only a small fraction of the entire alignment region, leaving most of the matches in doubt. Examination of the search matrix can alleviate these pitfalls. Perhaps the best and only significance test is the observance of a good distribution of peaks collinear with the search matrix diagonal. Despite biological and statistical correlations, mistakes can still be made. Two nucleotide binding domain sequences from alcohol and lactate dehydrogenases were compared.* A reasonable search matrix path resulted. Both sequence regions were known to bind NAD, yet the alignment (Fig. 3) was nearly two-thirds incorrect relative to the tertiary structural superposition, which achieved only 12% identity, half that from the computer results with about the same number of insertions/deletions. Available

Computer

Programs

Program packages executing the sensitive procedure in comparing two sequences and the multiple sequence alignment technique are available from the authors.8,9,15 Interested researchers can write to the authors; send a FAX request (0049-622 l-387306); or relay a computer network message (ARGOS@EMBL for the sensitive procedure and VINGRONEEMBL for the multiple alignment method, both on BITNET). A magnetic tape containing the source code and executable files will be sent to academic institutions. The sensitive protein sequence alignment package9 consists of a main program (in C or Fortran languages) that calculates search matrix values for one or more sequence pairs. The user can choose the sequences, comparison matrix (Dayhoff or identity or whatever), residue characteristics, and probe lengths. The resultant files can be fed to various satellite programs that utilize the search matrix. The user can visually interact with the matrix by choosing peak lines and alignment paths with a “mouse” device. Sequence intervals not matched by an observable path can be automatically aligned. The alignment, its significance assessment, and the size of the matrix peaks utilized are all shown. The interactive programs are written for SUN2,3,4 workstations or microVAXs (II or 3000 series) I7 P. Argos and S. D. Fuller, EMBO J. 7, 819 (1988).

364

ALIGNING

PROTEIN

AND NUCLEIC

ACID SEQUENCES

LADH LDH(1) LDH(2)

QGSTCAVFGLGGVGLSVIMGCKAAG-AARIIGVDI

LADH LDH(1) LDH(2)

NKDK--------

LADH LDH(1) LD,,(2)

IQEVLTEHSNGGVDFSFEVIG-------------AGSK~VV~A~AR~QEG~--S-------------- - _ _ - - _ _ - SA~SKLVVIT~ARQQEGESRLNLVQ

LADH LDH(1) LDH(2)

- - -RLDTHVTALSCCQEAYGVSVIVGVPPDSQNLS - - -

LADH LDH(1) LDH(2)

M-NPMLLLSGRTWKGAI----------

1211

FIG. 3. Alignment of the NAD-binding domain from horse liver alcohol dehydrogenase (LADH) with that from dogfish lactate dehydrogenase (LDH) as determined by the tertiary structural superposition [LDH(2)] and by the sensitive search matrix [LDH( I)]. The bracketed arrows show the initiation site where the two alignments do not agree. Residues are boxed in LDH( I) and LDH(2) if they are conserved or identical with those in LADH. The computer search yielded 18 and 34 identical and conserved residues, respectively, while the structural superposition (“standard of truth”) resulted in only 7 and 16.

which have graphics capability; the computer language used is C. If interaction (recommended) is not possible, then routines exist (variously written in C or Fortran) that provide a laser writer output (postscript format) of the search matrix, an automatic sequence alignment based on the sensitive search matrix and selected gap penalties, and alignment assessment. The main program requires some computer time and should be used to discover weak homologies (< 35% identity), to extend relationships limited in sequence alignment length, or to delineate matches in sequence intervals. The multiple sequence alignment package,ls written in the C language, consists of two main programs. The first simply accepts whatever sequences are given and automatically clusters them (if desired), determines the optimal set of dipeptide anchors for each cluster, then aligns intervals between anchors for each cluster, and finally outputs each of the aligned clusters successively according to their closeness. The output files containing the alignments of the first two sequence sets are presented to the second program, which aligns them by the profile method. The resultant output and the third sequence set are then given to the same program, yielding the three-cluster match. The process is repeated as necessary, even if a single sequence constitutes a cluster. The user may select values for certain

1221

THREE-WAY

ALIGNMENT

ALGORITHM

365

parameters: (1) the two gap penalties for initiation and extension of insertions/deletions; (2) ON/OFF end gap setting if the sequences vary greatly in length; (3) the scoring matrix; (4) the oligopeptide length for determining anchor points and clustering (dipeptides for most cases and tripeptides if the sequences are closely related); (5) threshold oligopeptide match count to determine the clusters; (6) ON/OFF use of the weighting scheme in the profile analysis to avoid compositional bias in closely matched sequences; and (7) maximum difference between sequence positions within a matching set of oligopeptide anchors. The programs work quickly unless the sequences are particularly long and numerous.

1221 Three-Way

NeedlemanBy

MITSUO

Wunsch Algorithm

MURATA

Introduction In 1970 Needleman and Wunsch developed an elegant algorithm for aligning two protein sequences. r The algorithm finds the optimal alignment among all possible alignments that can be generated from the two sequences. Since then, the method has been a valuable tool in molecular biology. Properly aligned sequences, for example, often reveal structurally and functionally important regions of proteins such as catalytic sites and ligand binding sites. To find structurally equivalent residues from the best aligned sequences is the first essential step in predicting the tertiary structure of a protein based on a homologous protein whose three-dimensional structure is known. Obviously, information from sequence comparisons becomes more reliable as we increase the number of sequences to be compared. To align more than two sequences is easy, if they are from closely related proteins; they may be so similar that a good alignment can be obtained visually or by manually adjusting the alignments from pairwise comparisons. When sequence similarity is less obvious, however, the manual approach becomes very tedious and the results more subjective. Such a situation was encountered when three copper-containing proteins, cucumber basic blue protein, stellacyanin, and plastocyanin, were compared. The difficulty of obtaining a consistent alignment of these sequences by pair-wise comparisons moti-

’ S. B. Needleman and C. D. Wunsch, J. Mol. Biol. 48,443 (1970). METHODS

IN ENZYMOLOGY,

VOL.

183

Copyright 0 1990 by Academic Press, Inc. All rights of reproduction in any form resewed.

Sensitivity comparison of protein amino acid sequences.

352 ALIGNING PROTEIN AND NUCLEIC ACID SEQUENCES [2 11 Sensitivity Comparison of Protein Sequences Amino 1211 Acid By PATRICKARGOS and MARTIN~IN...
889KB Sizes 0 Downloads 0 Views