[20] Mutation Data Matrix and Its Uses

By DAVID G. GEORGE, WINONA C. BARKER, and LOIS T. HUNT

Introduction

More than one-half of the chapters in this volume are concerned with methods of comparing protein sequences. Since the 1960s, as sequence data have accumulated, sequence comparison techniques have proved to be extremely effective in studying the structural and functional properties of proteins and how these, and the organisms in which they are expressed, have evolved. It is important to realize at the outset that these methods are approximate and have many inherent limitations. The fundamental problem is to judge whether two or more sequences exhibit similar three-dimensional structure and/or function based on similarities observed in their amino acid sequences. Unfortunately, present biological understanding of the processes involved in determining protein structure and/or function is not sufficient to allow an adequate assessment of the sequence characteristics important for these processes. As a result, all sequence comparison algorithms are limited by the underlying model of sequence similarity and hence are not guaranteed to produce biologically meaningful results. In this sense, the results of these methods must be evaluated in the same manner as results are analyzed in classic comparative biology. No single method is superior to all others; evidence is drawn by comparing the results of several approaches, and reasoning based on other biological knowledge must be incorporated. The interpretation of evolutionary histories derived from comparisons of protein (and/or nucleic acid) sequences suffers similar problems in that evolutionary processes are also not fully understood.

Basic to all sequence comparison is the concept of an alignment that defines the relationship between sequences on a residue-by-residue basis. Aligned residues are presumed to be related in an evolutionary and/or functional sense; residues occupying equivalent positions are believed to share common ancestors and/or to have equivalent biological roles.
There is a wealth of genetic evidence supporting the role of insertions and deletions in the evolution of macromolecules, and it is customary to allow for the presence of unrelated segments reflecting these events in the alignment. A series of dashes is often used to indicate the “gaps” in sequences resulting from these events. Unfortunately, the number of identities or similarities seen between sequences that may not be related can also be greatly increased by allowing gaps in the alignment. Even when it is clear that gaps must be inserted to align related sequences, the proper positioning of the gaps is often not obvious; therefore, great care must be taken not to overinterpret the significance of identities observed after gaps have been inserted. Sequence alignment algorithms attempt to place these gaps based on some model of sequence similarity.

In any sequence comparison application, whether it be database searching, pattern recognition, or sequence alignment, the best solution is selected from all possible solutions based on a set of scoring rules that define the similarity model. The optimization generally involves assigning a score to each possible alignment or subalignment and selecting those with the “best” scores. The results of the analysis are dependent on the choice of scoring rules; thus, it is customary to vary these according to the type of relationship being analyzed. For example, in examining possible evolutionary relationships between sequences, it is useful to base the scoring matrix on observed exchange frequencies, whereas when investigating possible structural similarities in proteins, it may be more appropriate to match residues of similar size and hydrophobicity.

METHODS IN ENZYMOLOGY, VOL. 183. Copyright © 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.

Similarity Scoring Matrices

There are two distinct types of scoring systems: similarity scoring systems and difference scoring systems. Difference scoring systems have been introduced primarily for derivation of evolutionary trees; the evolutionary distance between proteins or nucleic acids can be inferred from differences observed between their sequences. However, inasmuch as one is generally more interested in the similar regions within sequences, it is more natural to use similarity measures for most other sequence comparison purposes.

Sequence comparison methods use a scoring matrix that assigns a value to each possible pair of aligned amino acids. In a similarity scoring matrix, higher values are assigned to more similar pairs and lower values to dissimilar pairs. The simplest similarity matrix, which we call the unitary matrix (UM), assigns values of +1 for identities and 0 for nonidentities. A scoring matrix based on the genetic code (GCM) reflects the maximum number of nucleotides that the codons for two amino acids may have in common, for example, scoring +3 for identities, +2 for amino acids whose codons must differ by at least one nucleotide, +1 for those whose codons must differ by two nucleotides, and 0 for those whose codons cannot have any nucleotides in common.

Similarity scoring matrices can be based on any property of amino acids that can be expressed numerically. A similarity matrix can be computed from the amino acid property values in the following way. A matrix whose elements are the absolute values of the differences between the property values of each pair of amino acids is constructed. The maximum matrix element is determined, and each matrix element is divided by this value. The resultant matrix is a dissimilarity matrix with values between 0 and 1. This matrix can be converted to a similarity matrix with values between 1 and 0 by subtracting each element from 1. The similarity matrix elements are converted to integer values by multiplying each element by a scaling factor and then rounding off to the nearest integer. Table I shows the amino acid hydrophilicity data of Levitt.1 Figure 1 shows the similarity scoring matrix derived from these data; a scaling factor of 10 was used in the derivation.

TABLE I
AMINO ACID HYDROPHILICITY VALUES(a)

Symbol  Amino acid  Hydrophilicity    Symbol  Amino acid  Hydrophilicity
A       Ala         -0.50             M       Met         -1.30
R       Arg          3.00             F       Phe         -2.50
N       Asn          0.20             P       Pro         -1.40
D       Asp          2.50             S       Ser          0.30
C       Cys         -1.00             T       Thr         -0.40
Q       Gln          0.20             W       Trp         -3.40
E       Glu          2.50             Y       Tyr         -2.30
G       Gly          0.00             V       Val         -1.50
H       His         -0.50             B       Asx          1.40
I       Ile         -1.80             Z       Glx          1.40
L       Leu         -1.80             X       Unk         -0.40
K       Lys          3.00

(a) From Levitt.1

In 1971, McLachlan2 published a scoring matrix (AAAM) based on the alternative amino acids found at corresponding positions in alignments of groups of related sequences. Relative substitution frequencies were converted to integers that range from 1 to 6, identities were assigned scores of 8 or 9, and gaps were given a score of zero. A score of 3 represents a neutral substitution.

Often the evolutionary histories of two sequences being compared are not known. Consequently, if at some site the first sequence contains amino acid A and the second amino acid B, it is not known whether the difference has resulted from amino acid A changing to amino acid B or from amino acid B changing to amino acid A. Hence, comparison methods usually require that similarity scoring matrices be symmetric; all of the matrices described above satisfy this criterion.

1 M. Levitt, J. Mol. Biol. 104, 59 (1976).
2 A. D. McLachlan, J. Mol. Biol. 61, 409 (1971).
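The unitary and genetic code matrices described in this section can be sketched in a few lines. The example below is only an illustration: it assumes the standard genetic code (written in the compact TCAG ordering) and scores each pair of amino acids by the largest number of codon positions at which any of their codons agree. The function names are hypothetical and do not come from any program discussed in this chapter.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; codons are enumerated with the first base varying slowest.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid to its list of codons (stop codons, "*", are skipped).
CODONS = {}
for bases, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    if aa != "*":
        CODONS.setdefault(aa, []).append("".join(bases))


def unitary_score(aa1, aa2):
    """Unitary matrix (UM): +1 for identities, 0 for nonidentities."""
    return 1 if aa1 == aa2 else 0


def gcm_score(aa1, aa2):
    """Genetic code matrix (GCM): maximum number of codon positions in common."""
    return max(
        sum(x == y for x, y in zip(codon1, codon2))
        for codon1 in CODONS[aa1]
        for codon2 in CODONS[aa2]
    )


for pair in (("W", "W"), ("F", "L"), ("M", "W"), ("M", "Y")):
    print(pair, gcm_score(*pair))
```

Both functions are symmetric in their arguments, as required of the similarity matrices discussed above.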

[Figure 1: triangular similarity scoring matrix; rows and columns in the order R, K, D, E, B, Z, S, N, Q, G, X, T, H, A, C, M, P, V, L, I, Y, F, W.]

FIG. 1. Hydrophobicity scoring matrix constructed from the hydrophilicity data of Levitt,1 as described in the text. The matrix can be used in comparison of two sequences to generate dot matrix graphic plots, using, for example, a window size (segment length) of 25 and a minimum score in the range of 200 to 210. When using the matrix with the ALIGN program, the matrix bias and the gap penalty should be about 24 rather than the default values of 6 and 6 used with the MDM (see text).
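The derivation of a property-based similarity matrix described in the text can be sketched as follows. This is a minimal illustration, not the program used to prepare Fig. 1: the values are those of Table I, the scaling factor is 10, and "rounding off" is assumed to mean rounding halves upward. The function name and the dictionary layout are hypothetical.

```python
import math

# Hydrophilicity values from Table I (Levitt), including B, Z, and X as tabulated.
HYDROPHILICITY = {
    "A": -0.5, "R": 3.0, "N": 0.2, "D": 2.5, "C": -1.0, "Q": 0.2,
    "E": 2.5, "G": 0.0, "H": -0.5, "I": -1.8, "L": -1.8, "K": 3.0,
    "M": -1.3, "F": -2.5, "P": -1.4, "S": 0.3, "T": -0.4, "W": -3.4,
    "Y": -2.3, "V": -1.5, "B": 1.4, "Z": 1.4, "X": -0.4,
}


def property_similarity_matrix(values, scale=10):
    """Build an integer similarity matrix from one numeric property per residue."""
    residues = sorted(values)
    # Step 1: absolute differences between every pair of property values.
    diff = {(a, b): abs(values[a] - values[b]) for a in residues for b in residues}
    # Step 2: dividing by the largest difference gives a dissimilarity in [0, 1].
    largest = max(diff.values())
    # Steps 3 and 4: similarity = 1 - dissimilarity, then scale and round off
    # (assumption: halves are rounded upward).
    return {
        pair: int(math.floor(scale * (1.0 - d / largest) + 0.5))
        for pair, d in diff.items()
    }


matrix = property_similarity_matrix(HYDROPHILICITY)
print(matrix[("R", "K")], matrix[("R", "D")], matrix[("R", "W")])
```

Because the matrix is built from absolute differences, it is symmetric by construction; the most dissimilar pair (here Arg/Trp) always scores 0 and identical property values always score the full scaling factor.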

Dayhoff Mutation Data Matrix

One of the most widely used similarity measures is the mutation data matrix (MDM) developed by Dayhoff and colleagues. The first MDM, published in 1968, was derived from over 400 accepted point mutations (evolutionary replacements of one amino acid for another at homologous positions) between present-day sequences and inferred ancestral sequences.3 The relative frequency of exposure of each type of amino acid to mutational change and relative mutability were also taken into account, as described below. The MDM used extensively in the 1980s was calculated on the basis of nearly 1600 accepted point mutations in 71 groups of closely related proteins (<15% different).4

3 M. O. Dayhoff and R. V. Eck, in “Atlas of Protein Sequence and Structure” (M. O. Dayhoff and R. V. Eck, eds.), Vol. 3, p. 33. National Biomedical Research Foundation, Silver Spring, Maryland, 1968.
4 M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, in “Atlas of Protein Sequence and Structure” (M. O. Dayhoff, ed.), Vol. 5, Suppl. 3, p. 345. National Biomedical Research Foundation, Washington, D.C., 1979.

The original Dayhoff model of amino acid mutation assumes a Markovian model of amino acid substitution, i.e., the model assumes that the probability of a mutation at any site within a protein is independent of its previous history. For example, the probability of cysteine being transformed to serine by mutational processes is the same for all cysteine residues irrespective of the identity of their immediate ancestors. Within the Markovian model, the MDM is derived from a transition probability matrix in which each matrix element gives the probability that amino acid A will be replaced by amino acid B in one unit of evolutionary change. The diagonal elements give the probabilities that the amino acids will remain unchanged. The sum of the diagonal elements gives the probability that there will be no change during the represented evolutionary interval. In the Dayhoff derivation, the probability matrix was normalized such that this sum corresponded to a chance of 99 out of 100. Thus, the unit of evolution represented by the probability matrix corresponds to one accepted amino acid substitution per hundred sites (1 PAM unit). Note that time is not explicitly taken into account in this derivation. The transition probabilities are assumed to be constant with respect to a unit of amino acid substitution. They are not functions of the length of time required for the substitution to occur, and the time intervals may vary. Hence, within the limitations of this approximation, it is valid to apply this model to proteins that exhibit widely varying rates of evolutionary change. Each element of the mutation probability matrix was calculated as the product of the conditional probability of amino acid A being replaced by amino acid B, given that amino acid A is replaced, and the probability that amino acid A is replaced. 
The conditional probabilities were calculated as the observed frequencies of exchange between the amino acids; these frequencies were derived in the following way. Groups of closely related sequences (< 15% different) were aligned, and, based on a given unrooted evolutionary topology, ancestral sequences were generated for each node in the topology. The data were compiled from groups of closely related sequences to reduce the chance of the observed differences being the result of superimposed mutations, i.e., amino acid A changing to amino acid B, which then changes to amino acid C. The exchanges between each sequence and its immediate ancestor were compiled and summed for all sequence pairs. Exchanges involving positions in the ancestral sequences where an amino acid could not be unambiguously assigned were treated statistically by giving fractional scores to each alternative replacement. In the derivation it was assumed that the exchanges are symmetric; hence, an exchange of amino acid A for amino acid B was also counted as an exchange of amino acid B for amino acid A. This assumption is reasonable because the conditional replacement probability is expected to depend primarily on the chemical and physical similarity of the two amino acids.


The probability of an amino acid being replaced is estimated as its relative mutability, which is calculated as the ratio of the number of observed changes of an amino acid to its total exposure to change. The total number of observed changes was computed as the total number of times the amino acid was observed to change in the sequences examined. For each pair of sequences examined, the exposure to change of each amino acid was calculated as the frequency of occurrence of the amino acid multiplied by the total number of all amino acid changes observed for that sequence pair per hundred sites. The last factor normalizes the values to equal sequence length and equal evolutionary distance. These values were summed for all sequence pairs to compute the total exposure.

In accordance with the Markovian model, mutation probability matrices corresponding to larger intervals of evolutionary distance can be obtained by repeatedly multiplying the original matrix by itself, i.e., the 2-PAM matrix corresponds to the square of the 1-PAM matrix, the 3-PAM to the cube, etc. For each evolutionary distance, the sum of the diagonal elements gives a measure of the expected number of observed amino acid changes. A plot of these data (Fig. 2) clearly shows a saturation phenomenon. As the number of mutations increases, the probability of superimposed mutations at individual sites increases. Inasmuch as these multiple mutations are indistinguishable from single mutations, the observed number of changes asymptotically approaches a constant value. For the 1978 mutation data, this value corresponds to 94% difference.

[Figure 2: plot; horizontal axis, Evolutionary Distance in PAMs.]

FIG. 2. Correspondence of the observed percent difference and the estimated evolutionary distance between two amino acid sequences.

At large evolutionary distances, the matrix values themselves exhibit an asymptotic behavior (this is a general property of Markovian models). The probability of transition to any amino acid becomes equal to its frequency of occurrence irrespective of the identity of the initial amino acid. The equilibrium frequencies are implicitly determined by the data set from which the matrix was derived. This same effect is observed in simulation studies. Repeated applications of the 1-PAM probability matrix will produce sequences that tend toward the equilibrium composition regardless of the composition of the initial sequence.

Although the PAM values represent estimates rather than observed quantities, they provide a more realistic measure of evolutionary distance than do the numbers or percentages of observed differences between pairs of sequences. When calculating topologies by the least-squares matrix method, it has been found that correction of the observed percent difference values for inferred superimposed mutations, using the data depicted in Fig. 2, gives a more realistic reconstruction of the evolutionary topologies.5

For generalized sequence comparisons it is more useful to employ a similarity matrix whose elements reflect the ratio of the probability of an amino acid exchange to the probabilities of the two amino acids occurring at random. These ratios are given by the elements of the relatedness odds matrix. This matrix was derived by dividing the elements of the mutation probability matrix by the normalized frequencies of occurrence of the replacement amino acids. Each element gives the probability of replacement of amino acid A with B per occurrence of A per occurrence of B. The matrix has the desired attribute that it is symmetric. McLachlan’s scoring matrix2 reflects a similar ratio, but it was derived by a different method.
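The construction outlined above (a 1-PAM mutation probability matrix, its powers, the saturation of observed differences, and the relatedness odds matrix) can be sketched on a toy alphabet. Everything numerical below is invented for illustration: the three letters, their frequencies, and the exchange counts are hypothetical, the exposure of each letter is simplified to its frequency of occurrence, and the "99 out of 100" normalization is assumed to apply to the frequency-weighted diagonal. The sketch is not the published derivation.

```python
# Toy three-letter alphabet; every number below is invented for illustration.
FREQ = [0.5, 0.3, 0.2]          # normalized frequencies of occurrence
COUNTS = [[0, 30, 10],          # symmetric counts of accepted exchanges
          [30, 0, 20],
          [10, 20, 0]]


def one_pam_matrix(counts, freq):
    """Return M with M[i][j] = probability that letter j is replaced by letter i."""
    n = len(freq)
    changes = [sum(counts[i][j] for i in range(n)) for j in range(n)]
    # Relative mutability: observed changes over exposure (assumed ~ frequency).
    mutability = [changes[j] / freq[j] for j in range(n)]
    # Scale chosen so that, on average, 99 of 100 residues are unchanged.
    scale = 0.01 / sum(freq[j] * mutability[j] for j in range(n))
    matrix = [[0.0] * n for _ in range(n)]
    for j in range(n):
        replaced = scale * mutability[j]              # P(letter j is replaced)
        for i in range(n):
            if i != j:
                # P(j is replaced) * P(replaced by i | j is replaced)
                matrix[i][j] = replaced * counts[i][j] / changes[j]
        matrix[j][j] = 1.0 - replaced
    return matrix


def multiply(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def n_pam_matrix(one_pam, n_pams):
    """Matrix for n PAMs: the n-th power of the 1-PAM matrix."""
    size = len(one_pam)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n_pams):
        result = multiply(result, one_pam)
    return result


def percent_difference(matrix, freq):
    """Expected observed difference: 100 * (1 - frequency-weighted diagonal)."""
    return 100.0 * (1.0 - sum(f * matrix[j][j] for j, f in enumerate(freq)))


def odds_matrix(matrix, freq):
    """Divide each element by the frequency of the replacement letter i."""
    n = len(freq)
    return [[matrix[i][j] / freq[i] for j in range(n)] for i in range(n)]


m1 = one_pam_matrix(COUNTS, FREQ)
for n_pams in (1, 50, 250, 1000):
    print(n_pams, round(percent_difference(n_pam_matrix(m1, n_pams), FREQ), 1))
```

In this toy model the observed difference levels off well below 100%, and the columns of the matrix for very large distances approach the letter frequencies, which mirrors the saturation and equilibrium behavior described in the text. The odds matrix is symmetric here because the toy mutabilities are derived from the same symmetric counts.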
When one protein is compared with another, position by position, one should multiply the odds for each position to calculate an odds for the comparison of the entire sequences. However, it is more convenient to add the logarithms of the odds. Thus, the MDM scoring matrix contains the logarithms of the elements of the 250-PAM odds matrix; to allow more rapid computation by avoiding floating point operations, the elements of the log odds matrix are multiplied by 10 and rounded to the nearest integer. The neutral score is zero. A score of +10 indicates that a pair of amino acids is expected to occur 10 times as frequently in related sequences (after 250 PAMs of evolutionary change) as would occur by chance. The elements of the matrix range from -8 (for Trp/Cys) to +17 (for Trp/Trp).

Simulation studies have indicated that for distantly related sequences (between 73 and 86% difference) the MDM for 250 PAMs is optimal for distinguishing between related proteins and those whose observed similarity is due to chance.6 Hence, the matrix corresponding to 250 PAMs has become the standard matrix for sequence comparison studies. It has been reported7 that in sequence comparison it is more effective to use a matrix corresponding more closely to the actual evolutionary distance between the sequences being compared, and a computer program that facilitates this approach has been developed.8 These results are not unexpected; however, this approach is limited because it requires a priori knowledge of the approximate evolutionary distance between the sequences being compared.

The MDM (shown in Fig. 3) clearly reflects the physicochemical properties of the amino acids. Figure 3 has been arranged to designate groups of chemically similar amino acids: the aromatic group (tryptophan, tyrosine, phenylalanine); the hydrophobic group (valine, leucine, isoleucine, methionine); the basic group (lysine, arginine, histidine); the acid-acid amide group (glutamine, glutamic acid, aspartic acid, asparagine); the group of amino acids that are small and not strongly hydrophilic or hydrophobic (glycine, alanine, proline, threonine, serine); and cysteine. Some groups overlap: the basic and the acid-acid amide groups tend to replace one another to some extent, and phenylalanine interchanges with the hydrophobic amino acids more often than chance expectation predicts. These patterns are imposed principally by natural selection and only secondarily by constraints of the genetic code; they reflect the similarity of the functions of the amino acids in their interactions with one another in the three-dimensional conformation of proteins. Some of the properties of an amino acid residue that determine these interactions are size, shape, and local concentrations of electric charge; conformation of van der Waals surface; and ability to form salt bonds, hydrophobic bonds, and hydrogen bonds. For specific applications, it is possible to derive scoring matrices that reflect one or another of these particular properties. The values in the MDM correlate particularly well with similarity of hydrophobicity and molecular bulk of the side chains of the amino acids9,10 as well as with secondary structure-forming propensity.9

5 W. C. Barker, L. T. Hunt, and D. G. George, in “Computer Simulation of Carcinogenic Processes” (B. D. Silverman, ed.), p. 1. CRC Press, Boca Raton, Florida, 1988.
6 R. M. Schwartz and M. O. Dayhoff, in “Atlas of Protein Sequence and Structure” (M. O. Dayhoff, ed.), Vol. 5, Suppl. 3, p. 353. National Biomedical Research Foundation, Washington, D.C., 1979.
7 J. F. Collins, A. F. W. Coulson, and A. Lyall, CABIOS 4, 67 (1988).
8 A. H. Reisner and C. A. Bucholtz, Nucleic Acids Res. 14, 233 (1986).
9 S. French and B. Robson, J. Mol. Evol. 19, 171 (1983).
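The conversion from odds to integer log-odds scores, and the replacement of a product of odds by a sum of scores, can be sketched as follows. The odds values and the small score table are hypothetical and chosen only to exercise the arithmetic; they are not elements of the published 250-PAM matrix, and rounding halves upward is an assumption.

```python
import math


def log_odds_score(odds, scale=10):
    """Scaled base-10 logarithm of one odds value, rounded to the nearest integer."""
    return int(math.floor(scale * math.log10(odds) + 0.5))


def alignment_score(seq_a, seq_b, scores):
    """Sum the pair scores over the aligned positions of two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(seq_a, seq_b):
        # The matrix is symmetric, so either ordering of the pair may be stored.
        total += scores[(a, b)] if (a, b) in scores else scores[(b, a)]
    return total


# Hypothetical odds: 10 times chance, neutral, and two extreme values.
for odds in (10.0, 1.0, 50.0, 0.16):
    print(odds, log_odds_score(odds))

# Hypothetical three-entry score table for a toy alignment.
toy_scores = {("A", "A"): 2, ("A", "S"): 1, ("W", "W"): 17}
print(alignment_score("AWA", "SWA", toy_scores))
```

Adding integer scores in this way is equivalent, up to rounding, to multiplying the position-by-position odds, which is the reason given in the text for storing logarithms.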

[Figure 3: triangular matrix of scores; rows and columns in the order Cys, Ser, Thr, Pro, Ala, Gly, Asn, Asp, Glu, Gln, His, Arg, Lys, Met, Ile, Leu, Val, Phe, Tyr, Trp, with groups labeled sulfhydryl, small, hydrophobic, and aromatic.]

FIG. 3. Mutation data matrix for 250 PAMs (amino acid mutations per 100 residues). The amino acids were arranged by assuming that positive values represent evolutionarily conservative replacements; the clusters correspond to groupings based on the physicochemical properties of the amino acids. [Reproduced, with permission, from D. G. George, L. T. Hunt, and W. C. Barker, in “Macromolecular Sequencing and Synthesis” (D. H. Schlesinger, ed.), p. 127. Alan R. Liss, New York, 1988.]

Limitations of Model

It was known at the time of original publication of the matrix that the observed frequencies of amino acid exchanges that require two or more nucleotide replacements were significantly greater than expected, based on the frequencies of amino acid replacements requiring single nucleotide exchanges.3 As was correctly pointed out by Wilbur,11 this is inconsistent with a simple Markovian model of nucleotide replacement. It is important to realize that the Dayhoff model makes no assumptions about the genetic mechanisms involved in point mutations on the DNA level, and non-Markovian behavior on the nucleotide level does not necessarily imply that amino acid exchanges cannot be treated in Markovian fashion. The two-step mutation model described by Wilbur11 implies that at each step (each nucleotide replacement in a multiple nucleotide exchange), the mutation must be fixed in the population before the next step may begin. It has never been demonstrated that this is required, and the data seem to suggest otherwise. Hence, one may justifiably ignore events at the nucleotide level and model amino acid mutation based entirely on observed amino acid exchange frequencies.

More importantly, the model makes the assumptions that there are no correlations in exchange frequencies between neighboring sites and that the exchange frequencies are the same regardless of the position of the site within the protein sequence. Molecular modeling studies have indicated that correlations between neighboring sites play a significant role in structure formation. Moreover, it is well known that different sites within proteins show dramatically different levels of variability. These two approximations limit the utility of this matrix as a measure of biological similarity. These restrictions apply equally to virtually all currently employed sequence comparison methods; therefore, using this matrix in these procedures does not introduce any additional approximations. In general, such methodologies do not provide a true measure of biological similarity.12 Several new approaches to the sequence comparison problem have been reported that attempt to rectify these problems,13,14 but these approaches introduce their own approximations and generally require a priori knowledge concerning the class of proteins in question.

10 W. R. Taylor, J. Theor. Biol. 119, 205 (1986).
11 W. J. Wilbur, Mol. Biol. Evol. 2, 434 (1985).
Three other criticisms of the methodology have repeatedly appeared in the literature.

(1) The derivation of the matrix employs circular logic, i.e., a matrix is required to construct the sequence alignments, and these alignments are used in the derivation of the matrix. All the sequence alignments examined in these studies were between sequences less than 15% different. Within such alignments, very few gaps (insertions/deletions) are observed, and very few decisions are required for their construction. As a result, any realistic method of sequence alignment (including alignment by eye) and any similarity scoring matrix (including the most simple matrix, which assigns one score for matches and one score for mismatches) will virtually always produce the same alignment.

(2) The methodology suffers because it depends heavily on ancestral sequence methods, which are known to have significant limitations. The limitations of these methods involve the problems associated with multiple mutation events. Among closely related sequences, these problems have very little effect.

(3) The methodology depends heavily on the selection of the correct topological relationships among the sequences in each group. The effect of this limitation is more difficult to assess as there is no rigorous measure for defining correct topology. The chief criterion for selecting the topologies to be used was minimum overall length of the topology (total number of mutations). When there was a choice between several nearly minimal trees, the one that was consistent with generally accepted phylogeny or with topologies derived from other sequences was chosen, if such information was available. It is noteworthy that the values of the matrix elements compiled in 1978 did not significantly differ from those compiled in 1968, although during this time there was a doubling of the amount of sequence information examined, new groups of sequences previously unrepresented in the data set were introduced, and as a result the topologies corresponding to many of the previously examined groups substantially changed. Hence, this does not appear to be a severe restriction on the methodology.

There is a valid question concerning how well this matrix represents the currently available sequence data, however. Since 1978, the amount of protein sequence data has increased 10-fold. More importantly, data are now available from groups of sequences that were underrepresented or not represented at all in the 1978 data set. Some of the groups, such as the nonaqueous soluble (hydrophobic) proteins and the viral proteins, exhibit properties dramatically different from those represented in the 1978 data set. Thus, the lack of significant change between the 1968 and 1978 derivations cannot be taken to indicate that no appreciable changes would be observed in a new derivation.

12 D. G. George, L. T. Hunt, and W. C. Barker, in “Macromolecular Sequencing and Synthesis” (D. H. Schlesinger, ed.), p. 127. Alan R. Liss, New York, 1988.
13 M. Gribskov, A. D. McLachlan, and D. Eisenberg, Proc. Natl. Acad. Sci. U.S.A. 84, 4355 (1987).
14 M. Gribskov, M. Homyak, J. Edenfield, and D. Eisenberg, CABIOS 4, 61 (1988).

Computer Applications Using MDM

Many types of sequence comparison and searching applications use scoring matrices in order to enhance their sensitivity. Among these are the dynamic programming approaches to aligning sequences, the sliding-window segment comparison method and its graphic implementations, various database searching methods, and methods for making alignments of three or more related sequences. We6,15 and others have found that incorporating the MDM into such methods markedly increases their sensitivity for illuminating even very distant relationships. In the discussion that follows, we draw examples from several of these methods; there are many other examples, some of which are described in other chapters of this volume.

The ALIGN program12,16 is a global alignment program based on the algorithm of Needleman and Wunsch;17 it is most sensitive in comparing sequences of similar size and architecture that are assumed to be related along their entire lengths. It determines an optimal alignment (including gaps) of a pair of sequences by dynamic programming. To obtain realistic alignments, two types of gap penalty are employed: one is a penalty applied every time a break (gap) is inserted in either sequence (the break penalty), regardless of the length of the break; the other assesses a penalty based on the size of the break. These parameters can be shown to be equivalent to those of the optimal sequence alignment method of Smith and colleagues.18 The ALIGN program exacts the second penalty by giving a bonus to every score for an amino acid matching another amino acid, whereas the score for an amino acid aligned with a gap is always zero. This bonus is called a matrix bias because it is achieved by adding the value to every element in the scoring matrix rather than by modifying the algorithm. We most often set both the break penalty and the matrix bias at 6. This value for the matrix bias ensures that almost all replacements of one amino acid for another receive a higher score than an insertion or deletion. If 8 is used, all replacements except Trp/Cys receive scores higher than 0. Schwartz and Dayhoff6 tested eight pairs of distantly related sequences using ALIGN and four different scoring matrices (Table II).
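The scoring scheme described for ALIGN (a break penalty charged once per gap, plus a matrix bias added to every residue pair while residues opposite a gap score zero) can be sketched with a three-state dynamic program. This is an illustrative reimplementation under stated assumptions, not the ALIGN code: the function name is hypothetical, terminal gaps are assumed to be penalized like any other break, and a unitary matrix stands in for the MDM so that the example is self-contained.

```python
NEG_INF = float("-inf")


def align_like_score(seq_a, seq_b, pair_score, break_penalty=6, matrix_bias=6):
    """Best global score: biased pair scores minus one penalty per break."""
    n, m = len(seq_a), len(seq_b)
    # pair[i][j]:  best score with seq_a[i-1] aligned to seq_b[j-1]
    # gap_a[i][j]: best score with seq_b[j-1] opposite a gap in seq_a
    # gap_b[i][j]: best score with seq_a[i-1] opposite a gap in seq_b
    pair = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    gap_a = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    gap_b = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    pair[0][0] = 0
    for i in range(1, n + 1):
        gap_b[i][0] = -break_penalty      # one leading break, whatever its length
    for j in range(1, m + 1):
        gap_a[0][j] = -break_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = max(pair[i - 1][j - 1], gap_a[i - 1][j - 1], gap_b[i - 1][j - 1])
            pair[i][j] = best_prev + pair_score(seq_a[i - 1], seq_b[j - 1]) + matrix_bias
            # Extending a break costs nothing; opening one costs break_penalty.
            gap_b[i][j] = max(gap_b[i - 1][j],
                              max(pair[i - 1][j], gap_a[i - 1][j]) - break_penalty)
            gap_a[i][j] = max(gap_a[i][j - 1],
                              max(pair[i][j - 1], gap_b[i][j - 1]) - break_penalty)
    return max(pair[n][m], gap_a[n][m], gap_b[n][m])


def unitary(a, b):
    """Stand-in scoring function: +1 for identities, 0 otherwise."""
    return 1 if a == b else 0


print(align_like_score("ACDE", "ACDE", unitary))
print(align_like_score("ACDE", "ADE", unitary))
```

Because gapped positions contribute zero while every aligned pair receives the bias, a longer break forfeits more bias; this is how the bias acts as a size-dependent gap penalty without any change to the recurrence.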
The average score using the MDM was twice as high as that using the unitary matrix (UM), with the genetic code matrix (GCM) and the matrix of McLachlan2 (AAAM) giving intermediate scores. The MDM gave the highest score for six of the eight comparisons, with UM and AAAM each giving the highest score in one case. Similar results were reported by Feng and Doolittle,19 who used a different global alignment program. If the sequences being compared are of markedly different lengths or if only portions of the sequences being compared are related, the ALIGN program may not produce a reasonable alignment or score; in such cases, using a high break penalty may give satisfactory results. Often, however, regions of similarity must be preselected to obtain reasonable results.

TABLE II
SEQUENCE COMPARISON SCORES OBTAINED WITH VARIOUS SCORING MATRICESa

            ALIGN scores                RELATE scores
Matrix      Range (SD)    Mean (SD)     Range (SD)    Mean (SD)
UM          0.1-5.8       3.0           0.5-8.0       4.9
GCM         0.4-9.0       3.9           1.3-9.3       4.6
AAAM        0.4-9.9       4.9           0.7-8.0       4.6
MDM         2.9-12.1      5.9           3.5-15.2      7.4

a Adapted from Schwartz and Dayhoff.6 Eight pairs of distantly related proteins were compared using the program ALIGN; five pairs of distantly related proteins and four proteins with internal duplication were tested with the program RELATE.

15 W. C. Barker and M. O. Dayhoff, in “Atlas of Protein Sequence and Structure” (M. O. Dayhoff, ed.), Vol. 5, p. 101. National Biomedical Research Foundation, Washington, D.C., 1972.
16 M. O. Dayhoff, W. C. Barker, and L. T. Hunt, this series, Vol. 91, p. 524.
17 S. B. Needleman and C. D. Wunsch, J. Mol. Biol. 48, 443 (1970).
18 W. M. Fitch and T. F. Smith, Proc. Natl. Acad. Sci. U.S.A. 80, 1382 (1983).
19 D.-F. Feng and R. F. Doolittle, J. Mol. Evol. 25, 351 (1987).
20 W. M. Fitch, J. Mol. Biol. 16, 9 (1966).

Computer methods that locate and compare subregions within proteins are collectively known as local similarity methods. The RELATE program,16 based on a method first introduced by Fitch,20 was one of the earliest of such methods. It compares all segments of a given length from one sequence with all segments of the same length from the second sequence. This method is useful when the correspondence of segments is not known or when looking for regions of similarity in otherwise unrelated sequences. The segment score is the sum of the scores for each pair of amino acids occupying corresponding positions within the two segments. If the proteins are related, there will be two populations of scores: a large population from unrelated segments and a very small population of higher scores from the related segments. The program calculates the expected number of scores in this second population and computes the average of that same number of top scores from the upper tail of the distribution. The sequences are permuted, and a distribution of average top scores is generated and used to calculate a z value, the RELATE score (in SD units). Using the RELATE program and four scoring matrices (Table II), Schwartz and Dayhoff6 tested five pairs of distantly related sequences and four internally duplicated sequences. The MDM gave the highest score for seven of the nine tests, with UM giving higher scores in two cases.
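The segment comparison and permutation test described for RELATE can be sketched as follows. This is a simplified illustration rather than the RELATE program itself: the number of top scores to average (`n_top`) is supplied by hand instead of being estimated as the program does, the shuffling scheme, the number of permutations, and all function names are assumptions, and the toy `unitary` function stands in for a real scoring matrix.

```python
import random
import statistics


def unitary(x, y):
    """Toy scoring function (1 for identity, 0 otherwise); a stand-in for a real matrix."""
    return 1 if x == y else 0


def segment_scores(a, b, length, score=unitary):
    """Score every segment of `length` in a against every such segment in b."""
    scores = []
    for i in range(len(a) - length + 1):
        for j in range(len(b) - length + 1):
            scores.append(sum(score(a[i + k], b[j + k]) for k in range(length)))
    return scores


def mean_top(scores, n_top):
    """Average of the n_top highest segment scores (the upper tail)."""
    return statistics.mean(sorted(scores, reverse=True)[:n_top])


def relate_like_score(a, b, length, n_top, score=unitary, n_perm=100, seed=0):
    """z value (in SD units) of the observed top-score average against permuted sequences."""
    rng = random.Random(seed)
    observed = mean_top(segment_scores(a, b, length, score), n_top)
    null = []
    for _ in range(n_perm):
        pa, pb = list(a), list(b)
        rng.shuffle(pa)  # permuting keeps composition but destroys residue order
        rng.shuffle(pb)
        null.append(mean_top(segment_scores(pa, pb, length, score), n_top))
    spread = statistics.stdev(null)
    if spread == 0:
        raise ValueError("permuted scores show no spread; use more permutations")
    return (observed - statistics.mean(null)) / spread
```

Every segment pair is scored, so the cost grows with the product of the two sequence lengths times the segment length; that is acceptable for a pair of proteins but is the reason the sketch keeps `n_perm` small by default.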


The DOTMATRIX graphic comparison program12,21 is a convenient method for locating and visualizing regions of similarity between two sequences, especially when the lengths are different, and for finding regions of repetitive structure within a sequence. The plot is a graphic equivalent of the list of highest segment comparison scores obtained with RELATE. It is derived by computing a comparison matrix whose elements reflect the similarity of the segments being compared. The sequences are represented on the axes; dots are placed within a rectangular grid at positions corresponding to the central residues in the segments for which the comparison scores exceed a specified minimum value. Regions of sequence similarity are visualized as diagonal lines composed of contiguous dots; an offset in a diagonal represents a break (insertion or deletion) in one sequence with respect to the other in the region of similarity. Shorter diagonals above and/or below a main diagonal may represent repeats in one or both sequences. Varying the parameters (scoring matrix, window size, minimum score) of the dot matrix comparison allows more subtle similarities to be resolved. A low minimum score brings out weaker internal similarities, although the background noise is much higher. A larger window (segment length) favors longer repeats and may be used to visualize these in a highly periodic sequence.21

Figure 4 shows four DOTMATRIX comparisons of troponin C22 on the horizontal axes and myosin L2 chain23 on the vertical axes; both proteins are from rabbit skeletal muscle. The plot using the MDM (Fig. 4a) has a very prominent main diagonal showing the similarity of the two proteins along their entire lengths; their amino halves have nearly twice as many identities as their carboxyl halves. In addition, three parallel broken diagonals can be seen above and below the main diagonal. These reflect the comparisons of the better conserved regions of myosin L2 with the four homologous calcium-binding domains of troponin C. The plot with the hydrophobicity matrix (Fig. 4b) shows the main diagonal and the homology of both halves of troponin C to the first half of myosin L2, but only some short segments of the off-diagonals can be seen. Many of these are present in the plot made using the unitary matrix and a minimum score of 5 identities in a 25-residue span (Fig. 4c), but the plot is very noisy, as many nonhomologous pairs of segments also show this much identity. Raising the threshold to 7 (Fig. 4d) eliminates both the noise and most of the diagonals representing homologous comparisons.

21 W. C. Barker, L. T. Hunt, and D. G. George, Protein Seq. Data Anal. 1, 363 (1988).
22 J. H. Collins, M. L. Greaser, J. D. Potter, and M. J. Horn, J. Biol. Chem. 252, 6356 (1977).
23 G. Matsuda, T. Maita, Y. Suzuyama, M. Setoguchi, and T. Umegane, Hoppe-Seylers Z. Physiol. Chem. 359, 629 (1978).


FIG. 4. Dot matrix graphic plots of rabbit skeletal muscle troponin C (horizontal axes) and myosin L2 (DTNB) regulatory light chain (vertical axes) with different scoring matrices. The segment length (window size) is 25 in all plots. (a) MDM with minimum score of 15; (b) hydrophobicity matrix with minimum score of 195; (c) unitary matrix with minimum score of 5; (d) unitary matrix with minimum score of 7.
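The construction of a dot matrix plot, with its three adjustable parameters (scoring matrix, window size, minimum score), can be sketched as follows. This is a minimal illustration and not the DOTMATRIX program: the function names and the text rendering are assumptions, the toy `unitary` function stands in for a matrix such as the MDM, and a dot is placed when the window score is greater than or equal to the minimum, which is one reading of "minimum score".

```python
def unitary(x, y):
    """Toy scoring function (1 for identity, 0 otherwise); a stand-in for a real matrix."""
    return 1 if x == y else 0


def dot_matrix(a, b, window, min_score, score=unitary):
    """Return (x, y) positions of window centers whose segment score reaches min_score.

    x indexes sequence a (horizontal axis); y indexes sequence b (vertical axis).
    """
    half = window // 2
    dots = []
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            s = sum(score(a[i + k], b[j + k]) for k in range(window))
            if s >= min_score:
                dots.append((i + half, j + half))  # dot at the central residues
    return dots


def render(len_a, len_b, dots):
    """Draw the plot as text rows, one row per residue of the vertical sequence."""
    marked = set(dots)
    return ["".join("*" if (x, y) in marked else "." for x in range(len_a))
            for y in range(len_b)]
```

For example, `"\n".join(render(5, 5, dot_matrix("ABCDE", "ABCDE", 3, 3)))` draws a short main diagonal. Lowering `min_score` adds weaker matches and more background dots, which is the trade-off between panels (c) and (d) of Fig. 4.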

Several other approaches to the local similarity problem, in which the MDM was among the scoring matrices employed, have recently been reported. Hall and Myers developed a program24 that can find locally optimal alignments between two sequences even when they are separated by long nonsimilar regions or when the sequences have little overall similarity. Lawrence and Goldman employed an alternative approach25 to dynamic programming for detecting and evaluating similarities. Based on a nonlinear similarity score and their modification of the DD algorithm of Altschul and Erickson,26,27 and using cost matrices such as one they derive from the MDM, this method finds the boundaries of so-called homology domains. An aggregate score for all such domains found in a comparison can be calculated by a method similar to the derivation of an ALIGN score.

By far the most effective method for large-scale database searching, which was introduced by Wilbur and Lipman,28 uses a fast algorithm for locating the maximum diagonals in a dot matrix-type comparison of two sequences. Lipman and Pearson29 developed FASTP, a version of the program specifically designed for rapid searching of protein sequence databases and incorporating the MDM into the scoring procedure. As in the original method, a maximum diagonal is selected to represent each sequence comparison. In this implementation, however, the five best scoring diagonals from each sequence comparison are first selected, based on the relative number of matches versus mismatches. The selected diagonals are then rescored using the MDM, and, based on these scores, one representative diagonal is selected. The program then proceeds as before except that the optimized score is computed using the MDM as the similarity matrix. This implementation incorporates other changes, including an improvement to the methodology used for constructing the maximum diagonals, that result in an overall increase in speed of the program, but the major advance is the dramatic increase in sensitivity resulting from the introduction of a similarity scoring matrix. Pearson and Lipman30 have recently introduced a refinement of this method. In the new program, FASTA, the unknown sequence is compared to database sequences in a four-step process that can combine multiple sequential regions of similarity to produce a longer diagonal than in FASTP (in which only one diagonal is selected) for final comparison of each pair of sequences. LFASTA is a modification for detecting local similarities, including repetitive structure; it saves and displays, as alignments or dot matrix plots, all diagonals (not just the top 10) with similarity scores above a threshold value. FASTP and FASTA are not as effective as some other searching methods, however, such as the SEARCH program of the PIR, in finding similar segments of short length (less than 30-40 residues) or all copies of repetitive segments. The SEARCH program12,16,31 compares a sequence of specified length with all other sequences of the same length in the database and computes the score based on a specified scoring matrix. Typically, for a 25-residue piece, the corresponding segments from all closely related sequences (>50% identical overall), and many from more distantly related sequences, will appear above the distribution of segments from unrelated proteins. When very little sequence data are available, we frequently perform the search using both MDM and UM. The former is usually more effective in locating segments from more distantly related proteins; the latter may be more effective when several residues must be absolutely conserved to retain an activity or to locate possible cross-reactive epitopes in unrelated sequences.

24 J. D. Hall and E. W. Myers, CABIOS 4, 35 (1988).
25 C. B. Lawrence and D. A. Goldman, CABIOS 4, 25 (1988).
26 S. F. Altschul and B. W. Erickson, Bull. Math. Biol. 48, 617 (1986).
27 S. F. Altschul and B. W. Erickson, Bull. Math. Biol. 48, 633 (1986).
28 W. J. Wilbur and D. J. Lipman, Proc. Natl. Acad. Sci. U.S.A. 80, 726 (1983).
29 D. J. Lipman and W. R. Pearson, Science 227, 1435 (1985).
30 W. R. Pearson and D. J. Lipman, Proc. Natl. Acad. Sci. U.S.A. 85, 2444 (1988).
31 W. C. Barker, D. G. George, and L. T. Hunt, in “Computer Analysis for Life Science” (C. Kawabata and A. R. Bishop, eds.), p. 194. OHMSHA Ltd., Tokyo, 1986.

The MDM has been found to be useful in several methods of making consensus sequence alignments. These methods allow one to base comparisons on genetic and other data in addition to sequence similarities. If the sequence alignments can be improved, then the evolutionary trees derived from them will have more accurate branch lengths and branching order. Furthermore, it may be possible to detect and establish extremely distant relationships, thus extending the information on protein and species evolution. The ALIGN program is capable of revealing similarity (and suggesting possible relationships) between sequences 65-80% different, whereas profile analysis may allow correspondence to be established between sequences 80-90% different. Profile analysis is a two-step, two-program method that defines a consensus sequence, or profile, from an alignment and matches this profile against other sequences. First an alignment of structurally and functionally similar sequences, either entire proteins or domains, is used to construct a sequence position-specific scoring matrix. Using a comparison table based on the MDM, the first program (PROFMAKE) then generates a consensus sequence or profile, in which position-dependent gap penalties are included.
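The two steps of profile analysis, building a position-specific scoring table from an alignment and matching it against a sequence, can be sketched as follows. This is a rough illustration under stated assumptions, not the PROFMAKE or PROFANAL code: each column's score for a residue is taken as the comparison-table score averaged over the residues observed in that column (one common way to form such a table), position-dependent gap penalties are omitted so the match is ungapped, and the function names and the toy `unitary` table are hypothetical.

```python
from collections import Counter


def unitary(x, y):
    """Toy comparison table (1 for identity, 0 otherwise); a stand-in for the MDM."""
    return 1 if x == y else 0


def make_profile(alignment, alphabet, score=unitary):
    """Build one {residue: score} row per alignment column.

    alignment: equal-length aligned sequences, with '-' marking gaps.
    """
    profile = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        counts = Counter(residues)
        total = len(residues)
        row = {}
        for a in alphabet:
            # Frequency-weighted average of the comparison score against this column.
            row[a] = sum(n / total * score(a, b) for b, n in counts.items()) if total else 0.0
        profile.append(row)
    return profile


def best_ungapped_match(profile, sequence):
    """Slide the profile along the sequence; return (best score, start) or None."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        s = sum(profile[k].get(sequence[start + k], 0.0) for k in range(width))
        if best is None or s > best[0]:
            best = (s, start)
    return best
```

With the toy table, a profile built from `["ACD", "ACE"]` scores the window `ACD` in `"EEACDA"` highest, because the third column gives partial credit to either `D` or `E`. A fuller treatment would add the position-dependent gap penalties and align by dynamic programming.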
The profile is thus a pattern that can simultaneously embody the aggregate of structural, functional, and genetic information about a specific group of proteins. In the second step of the analysis, the profile is compared either with a database to find any similar sequences (PROFANALDB) or with a single sequence (PROFANAL) to learn if it belongs to a known family (or superfamily) and, if so, to obtain the best alignments. Once a library of such profiles has been accumulated, it can be used as a database against which a target sequence can be searched; Gribskov et al. have developed the program PROFILESCAN, which incorporates a modified form of the MDM, to do this.

Several other approaches that employ a consensus sequence concept (reviewed by Taylor32,33) have used the MDM as the scoring matrix at some point. The consensus sequence is derived, as in profile analysis, from alignments of known related sequences. Usually the alignments are constructed by beginning with the most similar or recently diverged sequences (two to several), deriving a consensus, then adding more sequences in repeated pairwise comparisons of new sequence to consensus sequence.19,33,34 A variation is to derive a consensus from each of several groups of closely related sequences and then to compare the consensus sequences. The MDM, or a modification of it, is the similarity scoring matrix most often used in the sequence comparisons,19,33,34 as the results appear to fit best, compared with global optimization methods, with other biological information.19 Again, all available information is used to establish gaps; the progressive alignment procedure of Feng and Doolittle retains any gaps once established, as more sequences are added.19 Patthy was particularly concerned with deriving patterns (of identities, similarities, variable regions, gaps) in order to identify very distantly related sequences and employed both a modified MDM and the unitary matrix in the protocol.34

Taylor10 and Risler et al.,35 among others, have examined various scoring matrices, including the MDM, to determine which properties are important in conservation of amino acids in structurally related proteins. Taylor10 briefly reviewed previous efforts to apply multidimensional scaling techniques to the MDM; he developed two-dimensional Venn diagrams to represent properties of the amino acids. Size and hydrophobicity were the properties found to correlate with the values in the MDM. He also pointed out a type of situation for which the MDM may not be the best measure of conservation. Risler et al.35 developed a matrix based on superposition of three-dimensional structures of related proteins, using observed replacements where α-carbon atoms at corresponding positions are less than 1.2 Å apart. Comparison of several scoring matrices in aligning sequences indicates that the MDM may weight identities too heavily. It is also pointed out20,32,35 that the MDM may be inadequate for optimal alignment of very distant sequences because it does not give sufficient information about the local structural environment.

32 W. R. Taylor, Protein Eng. 2, 77 (1988).
33 W. R. Taylor, J. Mol. Evol. 28, 161 (1988).

New Similarity Matrices

We are in the process of deriving new matrices based on much more data than were previously available. Although the MDM derived in 1978 showed little change from the matrix published in 1969,36 the amount of sequence data had only doubled and most of the groups of proteins present in the 1978 data were also represented in the 1969 data. There has been a nearly 10-fold increase of protein sequences in the family and subfamily categories since the MDM was last compiled. Nearly 70% of the entries in the current sequence database contain protein sequences obtained by inference from the elucidation of the corresponding nucleic acid sequences. Prior to 1978 the database was strongly biased toward soluble proteins that occur naturally in large quantities, as these types can be most easily sequenced by classic methods. Entire classes of proteins were previously absent, the most prominent being the membrane proteins, whose hydrophobic nature markedly distinguishes them from other groups. In addition, many of the groups of proteins now represented in the database previously contained either no sequences or too few sequences to make any contribution to the compiled amino acid replacement data. Thus, even if the new derivation does not yield a different set of matrix values, the reliability of these values will be dramatically increased, and this recompilation will provide a more sensitive probe for the understanding of protein relationships and will have a direct influence on the investigation of many evolutionary questions.

34 L. Patthy, J. Mol. Biol. 198, 567 (1987).
35 J. L. Risler, M. O. Delorme, H. Delacroix, and A. Henaut, J. Mol. Biol. 204, 1019 (1988).

As more sequence data have become available, it has become increasingly apparent that sequence similarity between proteins has profound biochemical implications irrespective of evolutionary relationships and that general evolutionary constraints may differ from those imposed by purely functional requirements. Furthermore, it is likely that acceptable replacement frequencies may vary among functionally distinct classes of proteins. Replacement matrices derived explicitly from specific functional groups of proteins are likely to be much more sensitive for examining questions of functional relatedness.
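How replacement data compiled from one chosen group of aligned proteins might be turned into a scoring matrix can be sketched as follows. This is a deliberately simplified illustration and not the Dayhoff procedure, which works from inferred ancestral sequences and a mutation probability matrix extrapolated to a chosen evolutionary distance; here aligned residue pairs are simply counted and converted to log-odds values. The function names, the symmetric counting, the 10 × log10 scaling, and the omission of pseudocounts for unobserved pairs are all assumptions of the sketch.

```python
import math
from collections import Counter


def replacement_counts(aligned_pairs):
    """Count aligned residue pairs symmetrically, skipping columns that contain a gap."""
    counts = Counter()
    for s, t in aligned_pairs:
        for x, y in zip(s, t):
            if "-" in (x, y):
                continue
            counts[(x, y)] += 1
            counts[(y, x)] += 1  # the direction of change is not known
    return counts


def log_odds(counts):
    """Score each observed pair as 10 * log10(observed / expected-by-chance frequency)."""
    total = sum(counts.values())
    freq = Counter()
    for (x, _), n in counts.items():
        freq[x] += n / total  # marginal frequency of each residue
    return {(x, y): 10 * math.log10((n / total) / (freq[x] * freq[y]))
            for (x, y), n in counts.items()}
```

Running the same two functions on alignments drawn from different functional classes (for example, membrane proteins versus soluble enzymes) would yield class-specific matrices, which is the kind of comparison the text anticipates. Pairs never observed get no entry here, so a real compilation would need much more data or some smoothing.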
Clearly, a generalization of the methods introduced by Dayhoff that allows the compilation of amino acid replacement data, independent of evolutionary considerations, is warranted. As more sequence data become available from specific classes of proteins, the techniques that we are now developing will lead to a wide variety of empirical matrix probes designed to study specific biological questions.

Acknowledgments

This work was supported by National Institutes of Health Grant GM37273 from the National Institute of General Medical Sciences. We wish to thank James K. Bair for editorial and technical support in the preparation of the manuscript and illustrations.

36 M. O. Dayhoff, R. V. Eck, and C. M. Park, in “Atlas of Protein Sequence and Structure” (M. O. Dayhoff, ed.), Vol. 4, p. 75. National Biomedical Research Foundation, Silver Spring, Maryland, 1969.
