Annu. Rev. Biophys. Biophys. Chem. 1991. 20:175-203

STATISTICAL METHODS AND INSIGHTS FOR PROTEIN AND DNA SEQUENCES¹

Samuel Karlin,² Philipp Bucher, and Volker Brendel
Department of Mathematics, Stanford University, Stanford, California 94305

Stephen F. Altschul
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894

KEY WORDS: sequence analysis, sequence statistics, sequence alignments, charge distribution in protein sequences, residue associations

CONTENTS

PERSPECTIVES AND OVERVIEW
    Five Major Sequence Problems
    Charge Distribution in Protein Sequences
SEQUENCE CONCEPTS AND STATISTICAL SIGNIFICANCE
    Alphabets (Nucleotide and Amino Acid Classifications)
    Sequence Statistics and Word Relations
    Statistical Significance
SEQUENCE COMPARISONS AND SEARCHES
    Global Sequence Similarity
    Local Sequence Comparisons
    Patterns in Multiple Sequences
EVALUATION OF CLUSTERING IN PROTEIN SEQUENCES
    Clusters of Certain Residue Types
    Runs of Certain Residue Types
    Case Study: G Protein-Coupled Receptors
COMPARATIVE COMPOSITIONAL ANALYSIS OF PROTEIN SEQUENCES
    Residue Usage
    Residue Associations
UNUSUAL SPACINGS BETWEEN SEQUENCE LETTERS OR WORDS
    The Leucine Zipper
    Extremal Spacings
PROSPECTS AND LIMITATIONS

¹ The US Government has the right to retain a nonexclusive, royalty-free license in and to any copyright covering this paper.
² Supported in part by NIH Grants GM39907-02, GM10452-26, and NSF Grant DMS8606244.

PERSPECTIVES AND OVERVIEW

Among the major objectives of nucleic acid and protein sequence analysis is the discovery of significant patterns related to gene expression, to protein folding and function, and to the evolutionary development of these patterns. The ability to distinguish what is likely from what is unlikely to occur by chance allows one to identify such patterns and target them for possible experimental study. For the most part, gross average assessments have guided interpretation of molecular sequence data, and researchers have paid little attention to statistical fluctuations. For example, when studying a physical map of restriction sites where adjacent sites are separated on average by 64 kb (kilobase pairs), as for Not I, one might interpret the observation of five sites within 150 kb as excessive clustering. However, assuming sites are distributed randomly over the genome, what is the probability that five or more such sites are seen in any 150-kb stretch of DNA? A complementary question concerns the probability that no such restriction site is seen in a 500-kb stretch. Similarly, consider the locations in a protein of a given amino acid type (e.g. cysteines, acidic residues). How does one assess anomalies in the distribution of the locations of this type (in either primary or tertiary structures), i.e. excessive clustering, gaps, or regularity? We discuss this problem of heterogeneity in DNA and protein sequences as well as its possible biological implications below. This article focuses on the statistics of protein sequences and the insights they can provide to structure, function, and phylogenetic relatedness.
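As a rough illustration of the restriction-site questions above, the following sketch treats site positions as a Poisson process with one site per 64 kb on average. That model is an assumption made here for illustration, not a method given in the text. The sketch evaluates both probabilities for a single, prespecified window. The probability of seeing five or more sites in any 150-kb stretch of a long genome is larger, because many overlapping windows are scanned, and it requires a scan-statistic argument that is not attempted here.

```python
import math

def poisson_tail(lam: float, k: int) -> float:
    """Return P(X >= k) for X ~ Poisson(lam)."""
    cdf_below_k = sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(k))
    return 1.0 - cdf_below_k

MEAN_SPACING_KB = 64.0  # average distance between adjacent sites, from the text

# Five or more sites in one fixed 150-kb window (expected count 150/64).
p_cluster = poisson_tail(150.0 / MEAN_SPACING_KB, 5)

# No site at all in one fixed 500-kb window (expected count 500/64).
p_gap = math.exp(-500.0 / MEAN_SPACING_KB)

print(f"P(>=5 sites in a fixed 150-kb window) = {p_cluster:.4f}")
print(f"P(no site in a fixed 500-kb window)   = {p_gap:.6f}")
```

Under these assumptions, neither event is vanishingly rare for a fixed window once the number of windows examined in a whole genome is taken into account, which is the point of the cautionary example.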

Five Major Sequence Problems

PROTEIN SEQUENCE TAXONOMY One objective has been to catalogue the protein universe by grouping related protein subregions. The essential features of related sequences might then be described by templates or generalized consensus sequences (16, 22, 85, 97). Ideally, the number of groups should be small, but the groups should be sufficiently rich that each protein sequence can be decomposed into disjoint regions, each matched to some template of the catalogue. This classification problem is formidable with respect to both primary and tertiary structure. Secondary and tertiary structure prediction based on statistical principles is at present unreliable, in considerable part because of the small sample size: only about 80 to 100 distinct high-resolution three-dimensional crystal structures (from about 400 actual crystal structures) are available, mostly for globular proteins. These structures may not be a representative sample of the protein universe, and new structures are being determined at the slow rate of only 10 to 20 per year (78). Currently, NMR methods are used mainly for small proteins (101). At present, finding correlations between amino acid sequence features and protein function without knowledge of the mediating molecular structures may yield more promising results.

QUERY SEQUENCE SEARCHES A common procedure is to compare a newly sequenced protein or translated mRNA with all protein sequences of known properties (2, 69). The premise is that sufficient sequence similarity corresponds to structural and/or functional similarity. In this approach to molecular sequence data, a fundamental concern is the detection of patterns stronger than those one would find purely by chance. Given the accelerating growth of protein and nucleic acid data bases, this problem of distinguishing meaningful patterns from random noise will become ever more forbidding. In addition to similarities, sequence features that vary distinctively between the compared sequences may be of importance.

SEQUENCE MOTIFS Sequence motifs for a class of proteins of common structure or function (e.g. methylases, aminoacyl-tRNA synthetases) or for similar proteins within and among species (multigene families) have been sought that are diagnostic for the given protein class (8, 21, 71, 80, 84, 90). The motifs should distinguish the given function class from the universe of other proteins, as well as from randomly constructed sequences of similar amino acid composition.

INVERSE STATISTICAL SEQUENCE PROBLEMS The foregoing problem starts with protein collections defined by biological and/or phylogenetic criteria, and statistically significant descriptions are sought that correlate with structure and function. An inverse statistical approach begins by defining statistically significant sequence patterns of many kinds. The idea is to use these features to delineate protein classes and attendant associations with structure and function (40). Such sequence features might include anomalous distributions of charged residues (charge clusters and runs), distinctive hydropathy arrangements (signal peptides, multiple membrane segments), special amino acid structural patterns (cysteine kringles, EGF-like domains), extremes in amino acid usage (histidine, hydrophobic residues), repetitive structures (doublets, higher-order patterns), periodicities (leucine heptad repeats, amphipathic strands), and unusual spacings of certain amino acid types.


STATISTICS OF PROTEIN SEQUENCES Summary statistics of amino acid usage provide useful information for comparing and contrasting protein classes. Some immediate examples are: (a) Quantile tables for amino acid usage: the quantile distribution Q(x) for a residue type for a given set of proteins describes the percent of proteins in which that residue type occurs with frequency less than x%. For example, the average cysteine occurrence is approximately 1% among the unicellular Escherichia coli and yeast proteins, with narrow quantile distributions, but is over 2% on average among mammalian proteins, with a broad quantile distribution. Quantile distributions may be calculated for different alphabets (amino acid classifications). (b) Diresidue and higher-order associations: for example, the percentages of positively and negatively charged residues in protein sequences are positively correlated (with coefficient about 0.3) and have equal medians of about 11.5%.
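The quantile distribution Q(x) described above can be computed directly from a protein set. The sketch below is illustrative only: the function names are hypothetical, and the four short sequences are made-up toy strings, not real proteins.

```python
def residue_frequency(seq: str, residues: str) -> float:
    """Percent of positions in `seq` occupied by any letter in `residues`."""
    return 100.0 * sum(seq.count(r) for r in residues) / len(seq)

def quantile_distribution(proteins, residues: str, x: float) -> float:
    """Q(x): percent of proteins in which `residues` occur with frequency below x percent."""
    freqs = [residue_frequency(p, residues) for p in proteins]
    return 100.0 * sum(f < x for f in freqs) / len(freqs)

# Toy sequences for illustration only (assumption: not real data).
toy = ["MKCAAAAAAA", "MKAAAAAAAA", "CCKDEAAAAA", "MDEKRAAAAA"]

q_cys = quantile_distribution(toy, "C", 5.0)    # cysteine usage below 5%
q_pos = quantile_distribution(toy, "KR", 15.0)  # K + R usage below 15%
print(q_cys, q_pos)
```

Passing a group of letters such as "KR" rather than a single residue gives the quantile distribution in a reduced alphabet, as the text notes.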

Charge Distribution in Protein Sequences

The charge of amino acids is an important characteristic in several of the sequence problems outlined above. We describe here several types of charge configuration of possible interest. A charge cluster is a short protein segment (25-75 residues) with significantly high charge content relative to the charge composition of the whole protein. In particular, a positive (or negative) charge cluster is a segment with high positive (or negative) net charge, and a mixed charge cluster is a segment high in charged residues of both signs. A positive-, negative-, or mixed-charge run is a succession of charged residues of the appropriate sign with few intermittent errors. Periodic charge patterns are repetitive patterns of charged and uncharged residues. Charge clusters are often associated with transcriptional activation, developmental control, and membrane receptor regulatory activities (13, 32, 38, 65). Short positive-charge runs are found in nuclear location signals (36), and long acidic or mixed-charge runs are prominent among nuclear autoantigens (12). The charge pattern $(+,0,0)_{5-8}$ is conspicuously conserved in voltage-gated ion channel proteins (15). Methods to evaluate the statistical significance of charge clusters, runs, and periodic patterns are described elsewhere (40, 41). A study of more than 2500 sequences revealed that approximately 16% of all mammalian proteins contain at least one significant charge cluster and less than 4% involve multiple charge clusters (38). One can ask whether proteins that have multiple charge clusters, or exceptionally long charge runs, tend to have common function, be localized to specific regions of the cell, or be associated with specific species, or whether there is any difference between the charge structures of DNA and RNA viral proteins or of prokaryotic and eukaryotic proteins (41, 43).
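To make the notion of a charge run concrete, the sketch below recodes a protein into a three-letter charge alphabet and finds the longest run of a chosen sign that tolerates a fixed number of intermittent errors. It is a minimal illustration, not the significance method of references 40 and 41. Treating histidine as uncharged, requiring a run to begin and end on a charged residue of the wanted sign, and the function names are all assumptions made here.

```python
# K and R positive, D and E negative; H is treated as uncharged (assumption).
CHARGE = {"K": "+", "R": "+", "D": "-", "E": "-"}

def to_charge_alphabet(protein: str) -> str:
    """Recode a one-letter protein sequence into the letters '+', '-', '0'."""
    return "".join(CHARGE.get(aa, "0") for aa in protein)

def longest_run(charges: str, wanted: str, max_errors: int = 0):
    """Longest stretch that starts and ends on a letter in `wanted` and
    contains at most `max_errors` other letters. Returns (start, length)."""
    best_start, best_len = 0, 0
    left = errors = 0
    for right, c in enumerate(charges):
        if c not in wanted:
            errors += 1
        # Shrink from the left while over budget or while the window starts on an error.
        while left <= right and (errors > max_errors or charges[left] not in wanted):
            if charges[left] not in wanted:
                errors -= 1
            left += 1
        if c in wanted and right - left + 1 > best_len:
            best_start, best_len = left, right - left + 1
    return best_start, best_len

charges = to_charge_alphabet("AKRKRAKDEDEEG")   # toy sequence, illustration only
print(charges)
print(longest_run(charges, "+"))                 # positive run, no errors
print(longest_run(charges, "+", max_errors=1))   # positive run, one error allowed
print(longest_run(charges, "+-", max_errors=1))  # mixed-charge run
```

Whether a run of a given length is unusual still has to be judged against the charge composition of the whole protein, which this sketch does not attempt.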


In this article, we first review pertinent sequence statistical concepts and means for assessing statistical significance, highlighting unusual sequence features defined by the assignment of appropriate scores to nucleotide and amino acid types. We then offer a brief critique of various sequence alignment schemes. The second half of this article presents several examples and case studies that demonstrate the utility of statistical methodology, including significant charge configurations, homologous genes that are very different, the over-reporting of leucine zippers, unusual word spacings, and the value of compositional quantiles and diresidue associations. Among important molecular sequence problems not dealt with in this review are (a) phylogenetic reconstruction (9, 24), (b) characterization of DNA and RNA regulatory elements (promoters, terminators, splicing signals) (93), (c) protein and RNA secondary and tertiary structure prediction (34, 81), and (d) codon bias (33, 42).

SEQUENCE CONCEPTS AND STATISTICAL SIGNIFICANCE

Alphabets (Nucleotide and Amino Acid Classifications)

A useful concept applicable to all sequence statistics is the grouping of letters of one alphabet to form natural new alphabets. For example, it is meaningful to group the nucleotides to form the two-letter purine vs pyrimidine classification, distinguished by chemical type and size. Amino acid classifications have been based on criteria such as physicochemical properties, charge, and minimal base differences between codons (16, 19, 28, 49-51, 61, 63). The following are examples of alphabets.

1. DNA alphabets include: (a) purines vs pyrimidines (R, Y); (b) strong (S = {C, G}) vs weak (W = {A, T}) hydrogen bonding; (c) {T} vs {C, A, G} because of the special properties of T (thymine dimer, stop codons); (d) 61-codon alphabet.
2. Amino acid alphabets (we use the one-letter code) include: (a) standard 20-letter alphabet; (b) 3-letter structure alphabets (19): external (R, K, H, D, E, N, Q), internal (L, V, I, F, M), or ambivalent (A, C, G, P, S, T, W, Y); (c) 8-letter chemical alphabet (61): acidic (D, E), aliphatic (A, G, I, L, V), amide (N, Q), aromatic (F, W, Y), basic (K, R, H), hydroxyl (S, T), imino (P), sulfur (C, M); (d) 3-letter charge alphabets: positive, negative, and uncharged.
3. Some other amino acid alphabets are: (a) statistical (empirical) substitutability alphabet (16, 62); (b) an alphabet based on size and shape; (c) groupings related to secondary structures (α-helices, β-sheets, turns); (d) random alphabets; (e) optimal alphabets (49).
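Recoding a sequence into one of the alphabets listed above amounts to a letter-to-class lookup. The sketch below shows this for the purine-pyrimidine alphabet and the 8-letter chemical alphabet; the dictionary and function names are hypothetical and chosen for illustration.

```python
# DNA: purine (R) vs pyrimidine (Y).
PURINE_PYRIMIDINE = {"A": "R", "G": "R", "C": "Y", "T": "Y"}

# 8-letter chemical alphabet for amino acids, keyed by class name.
CHEMICAL_CLASSES = {
    "acidic": "DE", "aliphatic": "AGILV", "amide": "NQ", "aromatic": "FWY",
    "basic": "KRH", "hydroxyl": "ST", "imino": "P", "sulfur": "CM",
}

def class_map(classes: dict) -> dict:
    """Invert {class_name: letters} into {letter: class_name}."""
    return {letter: name for name, letters in classes.items() for letter in letters}

def recode(seq: str, mapping: dict) -> list:
    """Replace every letter of `seq` by its class label in `mapping`."""
    return [mapping[letter] for letter in seq]

CHEMICAL = class_map(CHEMICAL_CLASSES)

print("".join(recode("GATTACA", PURINE_PYRIMIDINE)))
print(recode("DKS", CHEMICAL))
```

Any statistic defined on words or runs can then be computed on the recoded sequence without modification, which is what makes comparisons across alphabets straightforward.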


Multiple alphabet comparisons can expose significant regions that are conserved in some, but not in other, alphabets (49, 51). Such results may suggest contrasting functional or structural properties for different regions of a sequence. Significance of the same positions in many alphabets enhances their possible functional importance (the criteria for significance will vary from one alphabet to another).

Sequence Statistics and Word Relations

We list several useful sequence patterns and word relations and appropriate statistics computable for them (37, 57). Examples of word relations (in any alphabet) can be: (a) matched (aligned or in unrestricted locations); (b) matched with a prescribed number or percentage of errors or with systematic errors (5, 6, 55); (c) dyad words; (d) matched according to structure rather than content (37, 47). Examples of sequence statistics (in any alphabet) are: (a) length of longest (second longest, etc) repeat (with errors); (b) counts and spacings of moderate or long repeats; (c) repeats of special composition or structure (37, 49); (d) any of the above for word relationships (e.g. dyad symmetries, complementary words in the charge alphabet); (e) long common words among many sequences; (f) long runs of specific letter types or patterned runs; (g) count occurrence distributions of words with a defined relationship (37, 49, 57).
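As one example of the statistics listed above, the count occurrence distribution of k-words records how many distinct words of length k occur exactly v times in a sequence. The sketch below computes it for exact matches. Treating the sequence as linear rather than circular is an assumption made here, and the function name is hypothetical.

```python
from collections import Counter

def count_occurrence_distribution(seq: str, k: int) -> dict:
    """Map v -> number of distinct k-words that occur exactly v times in `seq`.
    The sequence is treated as linear (assumption), so it has len(seq) - k + 1 words."""
    word_counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    occurrences = Counter(word_counts.values())
    return dict(sorted(occurrences.items()))

dist = count_occurrence_distribution("ACACACGT", 2)  # toy sequence
print(dist)

# Sanity check: the occurrences add up to the number of k-word positions.
assert sum(v * n for v, n in dist.items()) == len("ACACACGT") - 2 + 1
```

The same function applies unchanged to a sequence recoded into a reduced alphabet.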

Statistical Significance

Assessments of statistical significance of sequence patterns can be based on theoretical models and on permutation procedures.

PERMUTATION METHODS These methods are widely practiced in many areas of data analysis (23, 72). Doolittle (20) has reviewed protein sequence permutation procedures (see also 1, 26). We advocate a spectrum of permutation procedures that selectively shuffle the sequence letters. Each permutation model is identified by the class of letters to be shuffled; for example, purine-only shufflings permute randomly all the A + G positions while all C and T letters are left in place. To assess their variability under each permutation class, sequence statistics are computed for 100 or more permutations of the data. The value of the original statistic relative to the collection of values for the permutations helps in dissecting the nonrandom characteristics of the original data. Insights can also be gained from comparisons of different permutation models. Useful permutation classes include: (a) complete shuffling of a single or pooled set of sequences; (b) intradomain shuffling (exons, introns, flanks); (c) shuffling of purines among themselves and of pyrimidines among themselves; (d) shuffling of amino acids with similar properties; (e) shuffling of only site 3 of codons.

THEORETICAL SEQUENCE MODELS We use a random model appropriate to the data as a standard for studying the distributional properties of various data statistics. In the independence random model, the successive letters of the vth sequence are sampled independently such that the ith letter type occurs with probability $p_i^{(v)}$ (the $p_i^{(v)}$ are commonly chosen to be the actual base frequencies in the vth observed sequence). More complex random models accommodate short- and/or long-range neighbor dependencies (see 55, 56). For these models, distributional properties have been obtained for a variety of statistics including the length of the longest common word present in at least r out of s sequences, the length of the longest run or repetitive pattern of certain letter types, and the length of the longest word satisfying a prescribed relationship (5, 6, 55, 56). We illustrate the use of two statistics and their interpretation.
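The class-restricted permutation procedure described above (for example, purine-only shuffling) can be sketched as follows. The statistic used here, the longest run of a letter, and all function names are illustrative choices, and the default of 100 permutations follows the figure quoted in the text.

```python
import random

def shuffle_class(seq: str, letters: str, rng: random.Random) -> str:
    """Randomly permute the positions holding `letters`; leave all other positions in place."""
    positions = [i for i, c in enumerate(seq) if c in letters]
    pool = [seq[i] for i in positions]
    rng.shuffle(pool)
    out = list(seq)
    for i, c in zip(positions, pool):
        out[i] = c
    return "".join(out)

def longest_run_length(seq: str, letters: str) -> int:
    """Length of the longest uninterrupted run of letters from `letters`."""
    best = cur = 0
    for c in seq:
        cur = cur + 1 if c in letters else 0
        best = max(best, cur)
    return best

def permutation_tail(seq, statistic, letters, n_perm=100, seed=0):
    """Observed statistic and the fraction of shuffles whose statistic is at least as large."""
    rng = random.Random(seed)
    observed = statistic(seq)
    hits = sum(statistic(shuffle_class(seq, letters, rng)) >= observed
               for _ in range(n_perm))
    return observed, hits / n_perm

toy = "AAGGCTAGCTAAAAGGCT"  # toy sequence, illustration only
observed, tail = permutation_tail(toy, lambda s: longest_run_length(s, "A"), "AG")
print(observed, tail)
```

Running the same statistic under several shuffling classes (complete, purine-only, pyrimidine-only) shows which constraint, if any, accounts for an apparently unusual value.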

Maximal common word in multiple random sequences For s random sequences (independence model) with letter probabilities $\{p_i^{(v)}\}$, $v = 1, \ldots, s$, and lengths $N_v$, the order growth of the length, $L^*$, of the longest common word has the asymptotic distribution

[displayed equation not recoverable from the source]

depending on the local match probability parameter

$$\lambda = \sum_{i=1}^{m} \prod_{v=1}^{s} p_i^{(v)}$$

and provided the sequences are not drastically different in composition (56, 57). The β-globin family serves as an example. We compared the family of human β-like globin genomic sequences (β, δ, γ, ε) (each composed of exons, introns, and flanks, with average aggregate length 2 kb) in both the standard DNA alphabet and the R-Y alphabet. Using the formula above for these sequences, one finds that a common word (oligonucleotide) in all four sequences is statistically significant at the 0.01 level (has probability ≤ 0.01), provided the word's length is at least 10 bp. The longest oligonucleotides common to all four human globin genes are 20 and 19 bp separated by a single mismatch, starting at 181 5' in E2 (i.e. the 181st bp from the 5' end of exon 2) and 202 5' in E2, respectively. This matching region expands to the longest word identity, 58 bp, between β and δ, overlapping the second intron by 7 bp, and to a 48-bp block identity between γ and ε, overlapping the second intron by 6 bp. Other significant word identities common to the four genes (lengths 11, 10, and 11 bp) retain perfect alignment and are all located in E2. The conserved DNA segments code for 8 of the 18 amino acids relating to the heme region and 7 of the 12 amino acids engaged in those contacts between α and β globins that are involved in the change of conformation between the oxygenated and deoxygenated states. These facts support the suggestion that in these regions tertiary and quaternary structures and possibly codon usage are decisive. Of note, the strongest DNA conservation is not at these amino acid sites but is concentrated at the splice junctions (47).

With respect to the R-Y alphabet, the human β-globin genes reveal a remarkable extent of similarity. DNA (R-Y) preservation around the splice junctions of the genes is prominent. The longest R-Y word identity common to all four sequences is 61 bp, starting at bp 171 5' in E2 and extending 9 bp into the following intron. The extent of R-Y word identities, both in coding and noncoding regions, demonstrates that DNA transitions are better tolerated than transversions. Of course, coding regions subject to functional constraints are expected to show a bias toward transitions. A similar analysis shows that the longest statistically significant common word to the human, mouse, rabbit, and chicken genomic β-globin DNA sequences (each about 2 kb long) is 17 bp long and includes the interface between the second exon and its following intron. The second-longest common word is 14 bp, which covers the interface between the first exon and its following intron (37, 49). Similar results are obtained from the comparison of immunoglobulin kappa genes from mammalian species (37). Accuracy in RNA splicing apparently requires conservation of definite sequences spanning the splice junctions of the same gene in similar species, mostly into the exon regions, perhaps indicating that the control of splicing at the donor site resides largely in the coding region.
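The two ingredients of the analysis above, the local match probability computed from observed letter frequencies and the longest word common to all sequences, can be sketched as follows. The function names are hypothetical and the short sequences are toy strings. The sketch finds the observed longest common word only; it does not evaluate its significance, which requires the asymptotic distribution cited in the text.

```python
from collections import Counter

def letter_frequencies(seq: str, alphabet: str) -> list:
    """Observed frequency of each letter of `alphabet` in `seq`."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in alphabet]

def local_match_probability(seqs, alphabet: str = "ACGT") -> float:
    """Sum over letters of the product over sequences of that letter's frequency."""
    freqs = [letter_frequencies(s, alphabet) for s in seqs]
    total = 0.0
    for i in range(len(alphabet)):
        product = 1.0
        for f in freqs:
            product *= f[i]
        total += product
    return total

def longest_common_word(seqs) -> str:
    """Longest word present in every sequence (alphabetically first on ties)."""
    def common_words(k):
        word_sets = [{s[i:i + k] for i in range(len(s) - k + 1)} for s in seqs]
        return set.intersection(*word_sets)

    # Binary search is valid: a common k-word implies a common (k-1)-word.
    lo, hi = 0, min(len(s) for s in seqs)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if common_words(mid):
            lo = mid
        else:
            hi = mid - 1
    return min(common_words(lo)) if lo else ""

toy = ["TTGACGTAA", "CCGACGTCC", "AGACGTGG"]  # toy sequences, illustration only
print(longest_common_word(toy))
print(local_match_probability(["ACGT", "ACGT"]))
```

Applying `recode`-style letter grouping before calling these functions gives the corresponding R-Y analysis.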

Count occurrence distributions For a given word length k and sequence S, we define the frequency distribution $f_k(v)$, $v = 0, 1, 2, \ldots$, to be the count of k-words occurring v times. In Table 1, we present the count occurrence distributions of six-words for the three papovavirus genomes, Simian virus-40 (SV40; N = 5243 bp), polyoma (N = 5293), and human BKV Dunlop strain (N = 5153). Polyoma is distinguished from both SV40 and BKV by its reduced number of highly repeated words. This comparison holds for all repeat occurrence distributions $f_k(v)$, $4 \le k \le 15$ (data not shown). These observations suggest a greater number of duplication or homogenization events, a more recent divergence, or a lower mutation rate during the evolution of SV40 and BKV when compared to the polyoma virus. The count-occurrence distributions in all the papovaviruses, even polyoma, show an excess of long repeated words when compared to the count-occurrence distributions of the same sequences with the letters randomly permuted.

Table 1 Repeat-occurrence distribution of oligonucleotides of length 6 (a)

f6(v)     v = 1     2     3     4    5    6    7    8    9   10   11   12   13   14   18
BKV-Dun     898   628   361   175   87   57   30   14    7    1    0    0    2    0    1
SV40        943   627   363   183   90   47   29   11    7    1    0    3    5    2    0
Polyoma    1174   740   421   198   78   22    5    2    1    0    0    0    0    0    0
Random     1391   899   423   132   42   11    3    ?    0    0    0    0    0    0    0

(a) The entries indicate the number of distinct six-words that occur exactly v times in the specified sequences. The row labeled Random is the minimum of the cumulative distribution of 20 random permutations of the polyoma sequence.

SEQUENCE FEATURES USING GENERAL SCORING SCHEMES Consider a scoring scheme (for examples, see below) associated with the alphabet used. Statistical theory is available for high aggregate segment scores and for the distribution of the number of separate segments of high score (18, 39, 45). Formulas have also been established that describe the letter composition of high-scoring segments, which in certain contexts provides a method and rationale for choosing scores (see below). One may calculate the asymptotic probability that some segment from a random sequence has a score greater than any given value. In particular, one can tell when a high score is in the high 1% tail of all segment scores. These results have applications in at least two important contexts: (a) analysis of a single protein sequence to identify segments with statistically significant high scores of, for example, hydrophobicity, charge, secondary structure proclivity, or leader sequence motifs and (b) multiple sequence comparison to establish homology or locate protein segments with common function.

Scoring schemes may be based on a variety of considerations (39). For example, scores can be based on charge. For the positively charged amino acids K and R, score s = +2; for the negatively charged amino acids D and E, s = -2; for H (histidine), s = 0 (at pH 7.2 in blood serum) or s = +1 (at pH 6.1 in muscle cells); for other amino acids, s = -1. Scores can also be derived from target frequencies. In a random sequence, the letters are sampled with probabilities $\{p_1, \ldots, p_r\}$, respectively. Let $\{q_1, q_2, \ldots, q_r\}$ be a set of desirable target frequencies of the letter types.
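The charge-based scoring scheme just described can be applied to a protein and the highest-scoring segment located in a single pass. The sketch below uses the standard maximum-subarray scan (Kadane's algorithm), which is a technique chosen here for illustration rather than one named in the text. The function names and the toy sequence are likewise assumptions.

```python
def charge_score(aa: str, histidine: int = 0) -> int:
    """Charge scores from the text: K, R = +2; D, E = -2; H = 0 or +1; others = -1."""
    if aa in "KR":
        return 2
    if aa in "DE":
        return -2
    if aa == "H":
        return histidine
    return -1

def maximal_segment(scores):
    """Return (best_score, start, end_exclusive) of the maximal-scoring segment.
    An all-nonpositive input yields the empty segment (0, 0, 0)."""
    best, best_start, best_end = 0, 0, 0
    cur, cur_start = 0, 0
    for i, s in enumerate(scores):
        if cur <= 0:
            cur, cur_start = s, i   # restart the candidate segment here
        else:
            cur += s
        if cur > best:
            best, best_start, best_end = cur, cur_start, i + 1
    return best, best_start, best_end

protein = "GASKRKKRAGDEL"  # toy sequence, illustration only
scores = [charge_score(aa) for aa in protein]
best, start, end = maximal_segment(scores)
print(best, protein[start:end])
```

Because uncharged and negatively charged residues score below zero, the scan isolates a compact positive-charge segment instead of extending across the whole sequence.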
It was proved (18, see also 7) that in a high-scoring segment of a random sequence, letter i tends to occur with the target frequency $q_i = p_i \exp(\lambda^* s_i)$, where $\lambda^*$ is the unique positive solution of the equation $\sum_{i=1}^{r} p_i \exp(\lambda^* s_i) = 1$. Therefore, apart from a scale factor, $s_i$ can be expressed as a log likelihood ratio $s_i = \log(q_i/p_i)$. From this perspective, in order to construct the appropriate set of scores, we need merely characterize the letter distributions for the type of region we seek to identify (39). For example, based on a SWISS-PROT collection of 512 sequences annotated as containing transmembrane domains, we have determined target ($q_i$) and background ($p_i$) frequencies and derived the following associated scores for the 20 amino acids characteristic of membrane-spanning domains: A, 0.466; C, 0.196; D, -2.691; E, -2.969; F, 0.697; G, 0.149; H, -0.807; I, 0.874; K, -3.123; L, 0.776; M, 0.401; N, -1.213; P, -0.836; Q, -1.672; R, -2.165; S, -0.349; T, -0.120; V, 0.725; W, 0.378; Y, 0.041.

In the simplest model, the random sequence consists of letters drawn independently from an alphabet. Associated with each letter $a_i$ is a score $s_i$. We are interested in the segment of the sequence with maximal additive score, or the top-scoring segments. For a sequence of length n, let M(n) denote the maximal segment score. Subject to some natural restrictions on the scoring schemes, it can be proved that

$$\operatorname{Prob}\left\{ M(n) > \frac{\ln n}{\lambda^*} + x \right\} \approx 1 - \exp\left\{ -K^* e^{-\lambda^* x} \right\}$$

with explicit formulas available for $K^*$ and $\lambda^*$ (18, 39, 45). This determination allows one to calculate the probability that the maximal scoring segment has a score exceeding any given value.

SEQUENCE COMPARISONS AND SEARCHES

Global Sequence Similarity

In studying evolutionary and functional relationships (homology, analogy), a helpful procedure is to compare relevant molecular sequences and extract a numerical measure of their similarity. However, given exon shuffling, gene duplication, divergence, conversion processes, recombination, translocation, and inversion events, to summarize sequence similarity by a single number may be an oversimplification. A multidimensional vector measure able to describe significantly similar as well as significantly dissimilar aspects of two sequences may be more revealing.

ALIGNMENT-BASED SIMILARITY A widely used method for defining the global similarity of two sequences is based upon alignments that have gaps. A scoring scheme is defined for aligned letters and gaps; for example, +1 for matches, -α for mismatches, and -β for letters aligned with nulls. The aggregate score for any alignment is calculated and the similarity of two sequences is commonly taken as the maximal score over all possible alignments. Some simulation results are available for such measures (76). For α = β = 0, this similarity value is simply the length of the longest common subsequence (17, 79). The similarity of two sequences, as defined above, can be found using dynamic programming, a well-established technique of operations research. This method was introduced to molecular sequence comparisons by Needleman & Wunsch (66, cf. 82), and many technical advances have followed (for review, see 89). One problem with dynamic programming is the computation time it requires. A comparison of two sequences of length n, using either affine or concave gap costs (affine costs are linear in the length of a gap plus an extra penalty per gap), requires order $n^2$ execution time (89). A simultaneous comparison of r sequences generally requires order $n^r$ time.

However, some essential drawbacks to these global, alignment-based similarity measures are independent of algorithmic considerations: (a) Few (perhaps prohibitively few) theoretical results are available to evaluate the statistical significance of global similarity measures. Given two random sequences of length n, let $l_n$ be the length of their longest common subsequence. As n grows, $l_n/n$ approaches a limiting constant c. However, no method is known for calculating these constants; only rough upper and lower bounds are available (17). Furthermore, meager analytic results are available concerning the variance of $l_n$ (92) and no distributional properties are known. (b) Optimal alignments are highly sensitive to the specific scoring scheme (error and gap costs). More seriously, because no biological rationale is available for choosing gap costs, the costs remain largely arbitrary. (c) Global alignment frequently has little biological meaning. The similarity of two sequences may be confined to small regions, and the sort of similarity evident may vary from region to region.
(d) Pairwise alignments are not necessarily transitive and can engender inconsistencies when many sequences are compared. Many methods of integrating pairwise alignments into a multiple alignment have been proposed, but none has yet gained general acceptance. For r sequences, a common practice is to perform all r(r - 1)/2 pairwise comparisons and to attempt to resolve (combine or classify) the pairwise alignments. The methods for the last step often seem arbitrary. As stated above, rigorous multiple alignment quickly becomes infeasible. Alternative methods that concentrate on local alignments have good theoretical, empirical, and biological rationale (see below).

WORD USAGE-BASED SIMILARITY Another measure of global similarity is based upon the relative frequencies with which short words appear in various sequences. A sequence is summarized by a count of all words of, say, length 4 that the sequence contains. This count is a vector representation of a sequence, and the distance between two sequences can be measured, for instance, by the angle they form (11). Similarities based on such vector measures are easy to compute and are statistically tractable (9). Unfortunately, these measurements discard most of the order information contained in a sequence.
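The two global measures discussed in this section can be sketched side by side. The first function is a plain dynamic-programming evaluation of the maximal alignment score with +1 for a match, -α for a mismatch, and -β for each letter aligned with a null (a linear gap cost, which is a simplification of the affine and concave costs mentioned above). The second computes the angle between two word-count vectors. Function names and default parameter values are assumptions for illustration, and the word-usage function assumes both sequences are at least k letters long.

```python
import math
from collections import Counter

def global_similarity(a: str, b: str, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Maximal global alignment score; runs in time proportional to len(a) * len(b)."""
    prev = [-beta * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [-beta * i]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (1.0 if a[i - 1] == b[j - 1] else -alpha)
            cur.append(max(diagonal, prev[j] - beta, cur[j - 1] - beta))
        prev = cur
    return prev[-1]

def word_usage_angle(a: str, b: str, k: int = 4) -> float:
    """Angle in radians between the k-word count vectors of two sequences."""
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return math.acos(min(1.0, dot / norm))

# With alpha = beta = 0 the score is the longest common subsequence length.
print(global_similarity("AGCAT", "GAC", alpha=0.0, beta=0.0))
print(global_similarity("ACGT", "AGT"))
print(word_usage_angle("ACGTACGTAC", "ACGTACGTAC"))
```

The alignment score depends on the order of letters and on the chosen costs, whereas the word-usage angle depends only on word counts, which is the trade-off the text describes.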

Local Sequence Comparisons

Because protein or nucleic acid sequences may share only isolated regions of similarity, e.g. in the vicinity of an active site, local measures of sequence similarity are generally preferable to global measures. The basic idea is to consider only relatively conserved subsequences; dissimilar regions do not contribute to the measure. Several approaches to local similarity are described below.

MATCH RUNS WITH A FIXED NUMBER OR PROPORTION OF ERRORS Among the simplest measures of local similarity between sequences is their longest common word. For long sequences of similar composition and approximately equal length, the distribution of the longest word present in r out of s random sequences can be calculated, as can the distribution for the longest common word allowing for k errors, where either a mismatch or an insertion/deletion is considered an error (50, 56, 57). Order growth results are available for the longest common word with a specified proportion of mismatches (6).
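For two sequences, the longest common word allowing a fixed number of errors can be found by scanning every diagonal of the comparison with a sliding window. The sketch below counts mismatches only; insertions and deletions, which the text also treats as errors, are not handled, and that restriction is an assumption made to keep the example short. The function name is hypothetical.

```python
def longest_match_with_mismatches(a: str, b: str, k: int = 0):
    """Longest aligned stretch a[i:i+L], b[j:j+L] containing at most k mismatches.
    Returns (L, i, j); mismatches only, no insertions or deletions."""
    best = (0, 0, 0)
    for offset in range(-(len(b) - 1), len(a)):   # offset = i - j along one diagonal
        i0, j0 = max(offset, 0), max(-offset, 0)
        n = min(len(a) - i0, len(b) - j0)
        left = mismatches = 0
        for right in range(n):
            if a[i0 + right] != b[j0 + right]:
                mismatches += 1
            while mismatches > k:                 # shrink until within the error budget
                if a[i0 + left] != b[j0 + left]:
                    mismatches -= 1
                left += 1
            length = right - left + 1
            if length > best[0]:
                best = (length, i0 + left, j0 + left)
    return best

a, b = "TTACGTACGG", "CCACGAACGA"  # toy sequences, illustration only
print(longest_match_with_mismatches(a, b, k=0))
print(longest_match_with_mismatches(a, b, k=1))
```

With k = 0 this reduces to the longest exact common word of the two sequences.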

The longest-matching-word measure of local similarity can be extended by uniting matching words that occur near one another into alignment groups. Methods that identify statistically significant matching segments (allowing errors) in multiple sequences can effectively highlight regions of conservation. The segment-matching protocol outlined below need not produce a single measure of similarity; it finds instead alignment groups with varying degrees of similarity (53, 54, 88). First, we consider repeats in a single sequence. A repeat segment is an aggregate of exactly repeated words, each of length ≥ K and separated by error blocks, each of length ≤ e letters. The number K should be chosen such that m" � n � m'
