1331

STATISTICAL

ESTIMATION

[33] Statistical By

TAKASHI

General

OF

SEQUENCE

Methods for Estimating Divergence

GOJOBORI,

ETSUKO

N.

and

MORIYAMA,

531

DIVERGENCE

Sequence MOTOO

KIMURA

Principle

Methods for estimating the number of nucleotide base substitutions are crucial for studies of molecular evolution. Knowledge of the number of base substitutions is particularly important for computing the evolutionary rate and constructing phylogenetic trees at the DNA level. From the perspective of population genetics, nucleotide and amino acid substitutions can both be treated as stochastic processes.’ Methods for estimating the number of DNA base substitutions, however, are different from those for estimating the number of amino acid substitutions. Because only four kinds of nucleotide bases (usually denoted A, T, C, and G) exist, multiple and superimposed nucleotide substitutions at the same site may occur undetected, especially when sequence divergence is great. For example, when two comparable DNA sequences have different bases, say A and G, at the corresponding (i.e., homologous) site, the changes of A + T + G for one sequence and no change for the other may have occurred at the site (Fig. 1). The observed number of nucleotide differences between the two DNA sequences is thus frequently different from the total number of nucleotide substitutions that have actually occurred during their divergence. Statistical methods for estimating the number of nucleotide substitutions are therefore required for comparative studies of DNA sequences. In this chapter, we first describe various methods for estimating the number of nucleotide substitutions and then discuss the advantages and disadvantages of these methods. Methods

for Estimating

One-Parameter

Total

Number

of Nucleotide

Substitutions

Method

In order to derive an estimation formula for the number of nucleotide substitutions, we must assume a specific model for the pattern of base substitutions among the four kinds of nucleotide bases. We start with the ’ M. Kimura, “The Neutral Theory of Molecular Evolution.” Cambridge, England, 1983.

METHODS

IN ENZYMOLOGY,

VOL.

183

Cambridge University Press,

Copyright 0 1990 by Academic Press, Inc. All tights of reproduction in any form reserved.

532

ESTIMATING

a A A

SEQUENCE

DIVERGENCE

1331

b A

JP 1G T A

A G FIG. 1. Examples of nucleotide base changes in the course of evolution, where the ancestral base at a given site is assumed to be A. In these examples, two descendant DNA sequences compared t years later have nucleotides A and G at this site. In example a, one nucleotide substitution (A -+ G) has occurred, while in b two nucleotide substitutions (A + T and T + G) have occurred.

one-parameter model, in which the rate of substitution is assumed to be equal between any pair of nucleotides. 2*3Thus, the process of substitution is described by a single parameter (Y, which stands for the rate of substitution of one particular base (say, A) for another (say, T), and this is assumed to be constant over evolutionary time (Table I). The estimation formula based on this model may be derived as follows. Consider two homologous DNA sequences which diverged from a common ancestor t years ago. We denote by I(t) the probability that two nucleotide bases at corresponding (homologous) sites at time t are identical to each other. At the two corresponding sites, two bases are either identical to or different from each other. Thus, their probabilities are, respectively, I(t) and 1 - I(t). Let us derive a recurrence formula for I(t). We first consider the probability that two homologous nucleotide sites at time t are identical to each other and are also identical at time t + 1. This consists of two mutually exclusive events: One represents the case in which both nucleotide bases change into two other identical bases, and the other represents the case in which both remain unchanged. The probability of the former event is 3012and that of the latter is (1 - 3a)Z. Therefore, the probability that the two nucleotide bases remain identical at the site at time t + 1 when they are identical at time t is given by [( 1 - 3~4~ + 3d]I(t). Next, we consider the probability that bases at two corresponding sites at time t are different from each other, but that they become identical at time t + 1. This also consists of two mutually exclusive events. The first is the case in which a change occurs at one of the two corresponding sites but the other site remains unchanged. This probability is 2cy(1 - 3a). The second is the case in which both nucleotide bases change into two other identical bases simultaneously. This probability is 2o?. Therefore, the * T. H. Jukes and C. R. Cantor, in “Mammalian Protein Metabolism ed.), p. 21, Academic Press, New York, 1969. 3 M. Kimura and T. Ohta, J. Mol. Evol. 2,87 (1972).

III” (H. N. Munro,

1331

STATISTICAL

ESTIMATION

OF SEQUENCE

TABLE PATTERN

DIVERGENCE

533

I

OFNUCLEOTIDE

SUBSTITUTION'

Substituted nucleotide Original nucleotide One-parameter A T C G

model

Two-parameter A T C G

model

Three-parameter A T C G Four-parameter A T C G

A

T

C

G

model

model

Six-parameter model A T C G a Each element represents the rate of substitution between a given pair of nucleotides per site per year.

probability that the two nucleotide sites become identical at time t + 1 when they are different from each other at time t is [2cr( 1 - 30) + 2a2][ 1 - I(t)]. Thus, we obtain I(t + 1) = [( 1 - 3a)2 + 3d]1(t)

+ [24 1 - 3a) + 2arz][ 1 - I(t)]

(1)

The solution of this equation which satisfies the initial condition I(0) = 1 is I(t) = [ 1 + 3( 1 - 8~x + 16&)l]/4

(2)

534

ESTIMATING

SEQUENCE

1331

DIVERGENCE

Sincecyis generallyvery small (e.g.,cr = 10m9),we may neglectterms of cu2 so that we get I(t) = [ 1 + 3(1 - 8@‘]/4 (3) This may alsobe expressedas I(t) = 1 - $( 1 - es8”) (4) The same result can also be obtained by solving a differential equation derivedby substitutingdl(t)/dt for I(t + 1)- I(t) in Eq. (1).4 The mean number of nucleotidesubstitutionsaccumulatedper site at time t is given by 2 X 3at, wherethe factor 2 comesfrom the fact that the divergenceof two sequencesalways involves two evolutionary lines each having the time length c (Fig. 1). Denoting the evolutionary distancein terms of accumulatedchanges(i.e., 2 X 3~) by K, we have K= -$ln(l -$Fn) (5) whereFD = 1 - 1(t). In Eq. (5),FD standsfor the fraction of different sites betweenthe two DNA sequencescompared.Equation (5), first presented by Jukesand Canto? (seealso Kimura and Ohta3),is not only simple but also very useful becauseit gives a reasonableestimate when sequence divergenceis small (note that K = F,, when FD < 1). Moreover, the standarderror (aK)of this estimate(K) is given by a,=

@DU

- FD)

~-$FD

wheren is the total number of sitescompared.3 Two-ParameterMethod The molecular structuresof nucleotidebasesA and G (purines) are similar, and thoseof C and T (pyrimidines) are also similar. Evolutionary basesubstitutionsbetweenpurines (A -B G or G + A) and betweenpyrimidines (C --+T or T - C) may be called transition-typesubstitutions, while thosebetweenpurities and pyrimidines are called transversion-type substitutions.It has beenshown that for many nuclear genesthe rate of transition-type changesis much higher than that of transversion-type ones.sIt hasalsobeenreportedthat more than 90%of nucleotidesubstitutions in mitochondrial DNAs of primates such as humans are of the 4 M. Nei, “Molecular

Population

Genetics and Evolution.”

North-Holland,

1975.

s T. Gojobori, W.-H. Li, and D. Grant, .I. Mol. Evol. 18,360

(1982).

Amsterdam,

1331

STATISTICAL

ESTIMATION

OF SEQUENCE

DIVERGENCE

535

FIG. 2. Two-parameter model of base substitutions. The rates of transition and transversion substitutions are represented by a and p, respectively.

transition type. 6*7 To deal with such a situation, Kimura* developed a two-parameter method for estimating the number of nucleotide substitutions as follows. As shown in Fig. 2 (also see Table I), (Y and j3 represent the rates of transition and transversion substitutions (per site per year), respectively. Let P and Q be, respectively, the fractions of nucleotide sites showing transition and transversion differences between the two sequences compared [e.g., P is the fraction of sites showing AG (or GA) and CT (or TC) nucleotide pairs when the two DNA sequences are compared]. A set of differential equations for P and Q at time t are -dP(t) = 2a - 4(a + j?)P(t) - 2(a - j?)Q(t) dt ffy

(7)

= 4p - 8/3Q(t)

For details concerning the derivation of Eqs. (7) and (8), see Kimura.6 Under the initial conditions P(0) = Q(0) = 0, the relevant solutions are P(t) = $ - +e-4(a+B)r + se-w (9 Q(t) = f - +e--*B

(10)

The rate (k) of nucleotide substitution, which is the number of nucleotide substitutions per site per year, is given by k = (Y + 28. This is based on the following consideration. Suppose that a given site is occupied by a particular base, say, A. It changes to one of the other three bases (T, C, or G). In this case, either A changes to G at the rate (Yor A changes to T or C at the rate of 2j.I. Thus, we have k = a + 2j? per year. Therefore, the total number (K) of nucleotide substitutions per site between the two sequences which diverged from the common ancestor t 6 W. M. Brown, E. M. Prager, A. Wang, and A. C. Wilson, J. Mol. Evol. 18,225 ’ C. F. Aquadro and B. D. Greenberg, Genetics 103,287 (1983). 8 M. Kimura, J. Mol. Evol. 16, 111 (1980).

(1982).

536

ESTIMATING

SEQUENCE

DIVERGENCE

1331

years ago is given by K = 2kt = 2crt + 4/3t. From Eqs.(9) and (lo), we have 8pt = -In[ 1 - 2Q(t)]

(11)

4at = -ln[ 1 - 2P(t) - Q(t)] + f ln[ I - 2Q(t)]

(12)

Thus, K=-fln[(l -2P-Q)m] (13) In Eq. (13)the letter t in P(t) and Q(t) is droppedfor simplicity. From Eqs.(9) and (lo), we note that when (Y= j?,we haveP = (1/2)Q. SubstitutingP = (1/2)Q into Eq. ( 13)and noting that P + Q is equalto the fraction of siteshaving different basesbetweentwo DNA sequences(i.e., F,), Eq. (13) reducesto Eq. (5), the equationof the one-parametermodel. The standarderror of the estimateK obtainedby usingEq. (13)is given by a, = $

J(u2P + bzQ) - (aP + be)2 (14) n wherea = l/( 1 - 2P - Q), b = [l/( 1 - 2P - Q) + l/( 1 - 2Q)]/2, and n is the total number of sitescompared.* Three-Parameter

Method

Kimurag also studieda method assumingthree parametersin order to estimatethe number of nucleotidesubstitutions.He derivedthe estimation formula in the sameway as for the two-parametermethod. He originally called this model “the 3ST model” (three substitution-typemodel) (see Table I). We denoteby P(t) the fraction of siteshaving TC or AG nucleotide pairs at time t, by e(t) the fraction of siteshaving TA or CG nucleotide pairs,and by&t) the fraction of siteshaving TG or CA nucleotidepairs. Then, P(t), Q(t), and R(t) may be expressedas follows: e-W+Yk

+

e-W+Y"]/J

(15)

Q(t) = [ 1 - e-W+B)t+

e-W+Y)l

-

e-W+YP]/4

(16)

R(t)

e-W+Y)r

-

e-W+YN]/4

P(t)=[l

=

-e-

[1 +

W+B)I -

e+a+BI

-

(17) As shown in Table I, the distance or the total number of nucleotide substitutionsper site is given by K = 2(a + j?+ y)t. From Eqs. (15)-( 17), we have 4((1!+ P)t = -In{ 1 - 2[P(t) + Q(t)]> (18) 9 M. Kimura, Proc. Nat/. Acad. Sci. U.S.A. 78,454 (1981).

r331

STATISTICAL

ESTIMATION

OF SEQUENCE

DIVERGENCE

4(a + y)t = -In{ 1 - 2[P(t) + R(l)]} 4(/l + y)t = -In{ 1 - 2[?&) + R(t)])

537 (19) (20)

Thus, the distance (K) may be estimated by K=-Sln[(l-2P-2QX1-2P-2R)(l-2~-2R)]

(21)

where the argument t is omitted from P(t), a(t), and R(t) to simplify the presentation. Equation (21) is the formula for the three-parameter model. Note that the two-parameter model is a special case in which y = /3, and therefore Q of the two-parameter method is given by Q = 2Q = 2R. Under these conditions, it is clear that Eq. (21) reduces to Eq. (13). The standard error of K estimated by Eq. (2 1) is given by

oK= $ d(a2P+ b2a + c2R)- (UP+ ba + cR)~

(22)

where a = (Cl2 + C&2, b_=(Cl2 + C2,)/2, and c = (Cl3 + C2,)/2, in which C,2= l/(1 -2P-2Q), C,3= l/(1 -2P-2R), and Cz3= l/(1 -

2Q - 2R).9 Four-ParameterMethod Takahata and Kimural proposed a method for estimating the number of substitutions based on a four-parameter model. The model is shown in Table I. The total number (K) of nucleotide substitutions may be estimated by K=-+ln

(sn -

~,>

- W-W/212

0(1-o) 1 - 2,‘:-“o)l”““-“-1}

(23)

where o stands for the fraction of A + T in the two DNA sequences compared (i.e., A + T content). In this formula, & represents the fraction of sites hav&g AA 0’ TT nucleotide pairs, S,, represents that of CC or GG pairs, and Qr and Q,, respectively, denote the proportions of sites having AT and GC pairs. Finally, P stands for the fraction of sites having CT or AG nucleotide pairs, and R stands for that of GT or AC pairs. Because the formula for the standard error of K, as estimated by Eq. (23), is complicated, a computer is required for actual use. A different type of four-parameter method was proposed by Tajima lo N. Takahata and M. Kimura, Genetics 98,641 (1981).

538

ESTIMATING

SEQUENCE

1331

DIVERGENCE

and Neil ’ to derive a formula similar to Eq. (5), assuming a so-called equal input model. In this model, the rate of substitution to the ith nucleotide is assumed to be the same, regardless of the original bases, i.e., Ati = 1, = ilci = rlGi = (Y~,excluding the CASTS which correspond to 12ib where L, (i, j = A, T, C, and G) is the element of the matrix representing the pattern of base substitution as shown in Table I. Although this model appears to be rather artificial, it leads to better estimates than the one-parameter model.

Six-ParameterMethod The six-parameter model is based on the model originally proposed by Kimura: who called it “the 2FC model” (two frequency class model). Its exact formulation was given by Gojobori et al.,‘* who showed that the number of nucleotide substitutions per site between two sequences compared is given by K=--ln(2)-Fln[&(Fiz-&-%)I

FM-B,+%

(24)

1

In Eq. (24), qA, qT, q,-, and qo stand, respectively, for the contents of A, T, C, and G in the DNA sequences compared. Also, we use the following

notations:p = qA+ qT, q = qc + qG,B, = pq - (xAc + xAG+ xTC+ xTG), 42

=

hq

-

XAC

-

xAGkfl

-

XTC

-

XTG),

EN

=

(qcp

-

xAC

-

XTChGP

-

-p2+3qAq,,andF,=Xcc+X,xcG - q2 •k 3qcqo, where Xii represents the fraction of sites having the same base pairs i, and 2x0 (i #i) represents the fraction of sites having different base pairs i and j (i, j = A, T, C, and G). In the special case in which all six parameters are equal to each other (a = cyi = cy2= p = pi = fi2), Eq. (24) reduces to Eq. (5) of the one-parameter method. In fact, in this case, we can show that qi = a, Xii = xM, xo=x,+T (i#j), 4Xm+ 12XAT= 1, P=q=f, B,=$-4x,,, E,*= Es4= (i - 2XAT)*, and F12= FM = 2~~ - XAT - &. By substitution, Eq. (24) reduces to Eq. (5). The formula for the standard error of K in this case is complicated, so a computer program is required for actual use. Taking into account the fact that the base content varies from one DNA sequence to another, Hasegawa et al.” modified Kimura’s twoparameter model. They considered not only transition and transversion substitutions separately but also difference of contents of the four nucleoxAG

-

XTG

),1;12

=

XAA

+

h-r

-

xAT

‘I F. Tajima and M. Nei, Mol. Biol. Evol. 1,269 (1984). I* T. Gojobori, K. Ishii, and M. Nei, J. Mol. Evol. 18,414 (1982). I3 M. Hasegawa, H. Kishino, and T. Yano, J. Mol. Evol. 22, 160 (1985).

1331

STATISTICAL

ESTIMATION

OF SEQUENCE

539

DIVERGENCE

TABLE II Formulas for Estimating Number of Nucleotide Substitutions0 Method One-parameter Two-parameter Three-parameter

Estimating formula K=-jln(1 -jFo) K=-fIn[(l -2P-Q)m] K = -$ In[ 1 - 2P - 2G)(l - 2P - 2R)(1 - 2Q - 2R)]

Four-parameter

P+R

w'--0)--l

1 I

Six-parameter

D Definitions of each parameter are given in Table III.

tide bases. Thus, their method may be regarded as a six-parameter method. Since their method is quite intricate mathematically, we recommend that interested readers consult Kishino and Hasegawa ([34], this volume). Analyzing the nucleotide substitution patterns of mammalian mitochondrial genes, Lanave et al. I4 developed a method for estimating the evolutionary distance between gene sequences compared. In their method, all parameters representing the pattern of base substitution (see Table I) are estimated from actual data. Therefore, their method may be called the 12-parameter method. Readers who are interested in their method are invited to consult Saccone et al. ([35], this volume).

Application Various formulas for estimating the divergence of DNA sequences in terms of the number of base substitutions per site (i.e., K) are summarized in Table II. The definitions of parameters appearing in these formulas are listed in Table III. The application of these methods to actual data is demonstrated below. For this purpose, let us compare DNA sequences of the hemoglobin j? chain gene between mouse and rabbit (data from Konkel et ~1.‘~ and Hardison et ~1.‘~). In Fig. 3, the two sequences are aligned codon by codon with respect to the coding region of this gene. We estimate the distance in terms of the number of nucleotide substitutions at the third codon position I* C. Lanave, G. Preparata, C. Saccone, and G. Serio, J. Mol. Evol. 20,86 (1984). I5 D. A. Konkel, J. V. Maize], Jr., and P. Leder, Cell 18, 865 (1979). I6 R. C. Hardison, E. T. Butler III, E. Lacy, T. Maniatis, N. Rosenthal, and A. Efstratiadis, Cell 18, 1285 (1979).

x,=

A+A,l

0.178

x,=

0.007

x,=

TABLE

III

0.226

+s,=o.500x,= 0.274

G-G,40

A+G,9

2x&-= 0.062

+P=O.l99+ +P=O.l99--r +P=0.199--, 2Xx= 0.137

,

T-C,20

Transitional C+G,5

Fu=O.315

SUBSTITUTIONS

A-,C,2

2-L

= 0.014

+R = 0.055 + +R=0.055-, zu,, = 0.041

Q=O.l16

T-G,6

Transversional

-p=o.O61+ iz = a = 0.027 0.034 2X== %A = 0.027 0.034

-

T-A,4

Type of base pair and number of sites”

QUANTITIES REQUIRED FOR ESTIMATING NUMBER OF NUCLEOTIDE FROM SEQUENCE COMPARISON SHOWN IN FIG. 3

C-‘C,33

Identical

OF VARIOUS

+ S,, = 0.185 +

T-T,26

VALUES

*

l

K

0.564’

0.426 0.426 0.463b

0.409

0 The total number of sites considered is 146. b w = q* + qT = 0.34. c qA = 0.059, qT = 0.281, qc = 0.318, q. = 0.342, p = 0.34, q = 0.66, B, = 0.0974, E,, = 9.1 X BY5, E,, = 0.0021, FL2 = 0.1056, F, = 0.3737.

Six-parameter

One-parameter Two-parameter Three-parameter Four-parameter

Method

OBSERVED

1331

STATISTICAL

1

ESTIMATION

OF

SEQUENCE

541

DIVERGENCE

10 7.0 CAC CTG ACT GAT GCT GAG AAG GCT GCT GTC TCT TGC CTG TGG GGA AAG GTG AAC TCC l

l

.

I.

l

.*

.

.

.*

.

.

.*.

CAT CTG TCC AGT GAG GAG AAG TCT GCG GTC ACT GCC CTG TGG GGC AAG GTG AAT GTG

U&O”%?

30 40 GAT GAA GTT GGT GGT GAG GCC CTG GGC AGG CTG CTG GTT GTC TAC CCT TGG ACC CAG CGG

rabbit

GAA GAA GTT GGT GGT GAG GCC CTG GGC AGG CTG CTG GTT GTC TAC CCA TGG ACC CAG AGG

mouse

TAC TTT

rabbit

TTC TTC GAG TCC TTT

l

.

mouse rabbit

UlO”SCZ rabbit

illO"Se rabbit

.

l

l

GAT AGC TTT .

.

.

50 60 GGA GAC CTA TCC TCT GCC TCT GCT ATC ATG GGT AAT GCC AAA GTG .

l

.

l

.

.

.

l

.

.

.

.

.

GGG GAC CTG TCC TCT GCA AAT GCT GTT ATG AAC AAT CCT AAG GTG

70 80 AAG GCC CAT GGC AAG AAG GTG ATA ACT GCC TTT AAC GAT GGC CTG AAT CAC TTG GAC AGC . l l . l l * . l l . l AAG GCT CAT GGC AAG AAG GTG CTG GCT GCC TTC AGT GAG GGT CTG AGT CAC CTG GAC AAC

CTC AAG GGC ACC TTT . CTC AAA GGC ACC TTT

90 100 GCC AGC CTC AGT GAG CTC CAC TGT GAC AAG CTG CAT GTG GAT CCT l

l

.

l

.

.

l

GCT AAG CTG AGT GAA CTG CAC TGT GAC AAG CTG CAC GTG GAT CCT

110 GAG AAC TTC AGG CTC CTG GGC AAT ATG ATC GTG ATT ." . . . GAG AAC TTC AGG CTC CTG GGC AAC GTG CTG GTT ATT

120 GTG CTG GGC CAC CAC CTT GGC AAG . ..* . . . GTG CTG TCT CAT CAT TTT GGC AAA

rabbit

130 140 GAT TTC ACC CCC GCT GCA CAG GCT GCC TTC CAG AAG GTG GTG GCT GGA GTG GCC ACT GCC . . l l . . l . .* l l GAA TTC ACT CCT CAG GTG CAG GCT GCC TAT CAG AAG GTG GTG GCT GGT GTG GCC AAT GCC

rabbit

146 TTG GCT CAC AAG TAC CAC TAA . . CTG GCT CAC AAA TAC CAC TGA

UlO"S-3

1.

FIG. 3. Comparison of nucleotide sequences of hemoglobin p chain genes of mouse and rabbit. The two sequences are aligned with each other codon by codon. The coding region used in the present analysis is bracketed, and asterisks indicate the sites where the two sequences have different bases.

between these two sequences. The initiation and termination codons are excluded from the comparison, because the former is usually invariant and changes of the latter are quite restrictive. Thus, the total number of sites compared is 146. In Fig. 3, the sites where two sequences differ are indicated by asterisks. Table III lists observed values of various quantities required for each method. One-Parameter Method. Because nucleotide bases are different at 46 of 146 sites compared at the third codon positions, the fraction of different sites is F,, = & = 0.3 15. Putting this value into Eq. (5), we obtain K=-$ln(l

-$X0.315)=0.409

namely, the estimated number of substitutions per site is 0.409. Note that the estimated value (K) is much larger than the observed value (F,, =

542

ESTIMATING

SEQUENCE

DIVERGENCE

r331

0.3 15). In general, the more the value of FD increases, the more the difference between K and FD increases. This is the reason why statistical methods for estimating the number of nucleotide substitutions are required, particularly when the divergence between the DNA sequences compared is large. Note that the estimation formula breaks down when the observed value FD happens to exceed $, because the argument of the logarithmic function in Eq. (5) becomes negative. This is due to the fact that the expected maximum value of FD is 4 at the equilibrium state of random substitutions. Similar problems exist more or less in other methods, too, because the estimation formulas similarly contain logarithmic functions which come from correction of multiple changes in the Poisson process. Two-Parameter Method. In the /I hemoglobin example, 29 of 46 different sites show transition-type differences whereas the remaining 17 show transversion-type differences. Thus, we have P = & = 0.199 and Q = & = 0.116. Plugging these values into Eq. (13), we obtain K = 0.426. Three-Parameter Method. Of 17 sites having transversion-type differences, 9 have TA or CG nuclztide pairs and the remaining 8 have TG or AC pairs. This means that Q = & = 0.061, R = 0.055, and P = 0.199. Substituting these values into Eq. (2 I), we get K = 0.426, which happens to be the same value as that obtained from the two-parameter method. Four-Parameter Method. As shown in TableIII, Q ofthe three-parameter method is separated into two parameters, Q, and Q,. The number of sites showing TA nucleotide @airs is 4, and the numbe_r of sites showing CG pairs is 5. Thus, we have Q, = & = 0.027 and Q, = & = 0.034. The number of sites having TT or AA nucleotide pairs is 27 and that having CC or GG pairs is 73. This means S,s = & = 0.185 and S, = & = 0.5. We have already obtained P = 0.199 and R = 0.055 (see the three-parameter method). Also, we find that the A + T content of two DNA sequences is o = 0.34 (see footnote b in Table III). Therefore, we obtain K= 0.463 from Eq. (23). Six-Parameter Method. The six-parameter method requires the numbers of all types of nucleotide pairs, as shown in Table III. The frequencies are xrr=&= 0.178, xAA=&= 0.007, . . ., 2xro=T$= 0.041, and = & = 0.0 14. Then, the contents of various bases (q,, , q-r, q,-, and w4c qG) of the two sequences are +

qAEXAA

XAT

+

XAC

+

XAG

=

xAT

+X,+x~+x~~=o.281

qC =

xAC

+x~~+xW+X,=O.~~~

xAG

+

qT

qG

=

XTG

+

Xa

+

Xm

= 0.059

= 0.342

I331

STATISTICAL

ESTIMATION

OF SEQUENCE

DIVERGENCE

543

Also, we compute the A + T content p = q* + qT = 0.34, and the C + G content q = qc + qG = 0.66. In addition, we have B, =pq - (x, + XAG+ xx + XTG) = 0.0974, & = (q,& - x*C - x,&(qTq - x, - xx) = 9.1 x lo+, EN = (&p - XAc - xTc)(qGp - XAG - XTG) = 0.0021, F,2 = - p2 + 3&q, = 0.1056, and Fw = xc + x, - x, xAA+XIT-XAT q2 + 3q,q, = 0.3737. Putting all these values into Eq. (24), we obtain K = 0.564. The rate of nucleotide substitution per site per year (k) may be computed by k = K/(2t), where K represents the number of nucleotide substitutions that have occurred between the two DNA sequences which diverged t years ago. In this example, the estimated rate of nucleotide substitution turns out to be k = 0.564/(2 X 80 X 106) = 3.53 X 10m9/site/ year, where we assumed t = 80 X lo6 as the divergence time of mice and rabbits from their common ancestor. Advantages and Disadvantages

of Various Methods

As shown by the above treatments of the example given in Table III, the estimated value of K(0.409) by the one-parameter method is much smaller than that of the six-parameter method (K = 0.564). This is due to the fact that the rate of nucleotide substitution is much different between different base pairs. In fact, as shown in Table III, the number of sites showing base pair AC is only 2 while the number showing base pair TC is 20. When the numbers of nucleotide differences are quite large between pairs of nucleotides, the one-parameter method tends to give underestimates. However, studies based on computer simulations show that when the value of K is much less than 1.O, the difference between the estimated and true values is not very large. *0,i2 As mentioned earlier, it is known that for the mitochondrial genomes of mammals, particularly of primates, more than 90% of nucleotide substitutions are transition-type while less than 10% are transversion-type changes. In such cases, the two-parameter method may be suitable, and, as shown in Table III, the three-parameter method will lead to results not much different from those of the two-parameter method. When the number of differences varies depending on the types of base pairs, and furthermore when the value of K is expected to become more than 1.O, the four- and six-parameter methods are more suitable than other methods. In particular, the estimate obtained by the six-parameter method is often close to the true value even though the value of K becomes much larger. The formula for the six-parameter method, however, frequently tends to become inapplicable owing to sampling and stochastic errors unless the DNA sequences compared are sufficiently long.

544

ESTIMATING

SEQUENCE

Methods for Estimating Numbers Nonsynonymous Substitutions

c331

DIVERGENCE

of Synonymous

and

When we compare two DNA sequences containing a protein-coding region, it is of particular interest to distinguish the rate of silent or, more precisely, synonymous substitutions (not causing amino acid changes) from those that lead to amino acid alterations (nonsynonymous substitutions). If the number of nucleotide differences between two DNA sequences is very small, the number of synonymous and nonsynonymous substitutions can be obtained simply by counting synonymous and nonsynonymous nucleotide differences. However, when two or three nucleotide differences exist between corresponding codons of the two sequences, the distinction between synonymous and nonsynonymous substitutions must be inferred using appropriate statistical methods. Figure 4 illustrates an example of an underlying or hidden process of change in the sequences compared. When there is only one nucleotide difference between two codons compared, we can immediately decide whether the difference is synonymous or nonsynonymous. When we observe two nucleotide differences between the two codons being compared,

(SeqUHlCe

1) ..’

(Sequence

2)

a Phe

Leu Leu Ser Set- Leu Leu CTT CTT TCT TCT TTA TTA II II II CTA CCT TCA CCT TCA CTA Leu Pro Ser Pro Ser Leu

Pro 4. Example illustrating synonymous and nonsynonymous nucleotide differences when homologous codons are compared. Thick and thin lines, respectively, indicate synonymous and nonsynonymous nucleotide differences. For details, see text. FIG.

1331

STATISTICAL

ESTIMATION

OF SEQUENCE

DIVERGENCE

545

there are two possible pathways that lead to the observed differences. On the other hand, when there are three nucleotide differences between the two corresponding codons, six underlying pathways are possible between the two codons (see Fig. 4). As an example, consider the comparison between the codon TTT in Sequence 1 and the codon CCA in Sequence 2 shown in Fig. 4. There exist six different possible pathways to change from one to the other. The numbers of synonymous and nonsynonymous changes in the codons can be inferred from these possible evolutionary pathways. The thick lines represent synonymous changes that do not cause amino acid changes, while the thin lines represent nonsynonymous (i.e., amino acid altering) nucleotide changes. For the purpose of estimating the substitution numbers, we have two different types of methods available: the “unweighted pathway method” and the “weighted pathway method.” The main difference between these two types of methods is that in the former an equal weight is given to two or more evolutionary pathways, whereas in the latter a greater weight is given to an evolutionary pathway involving synonymous substitutions than to one involving nonsynonymous substitutions. In this section, we describe briefly the essence of several methods proposed for estimating the numbers of synonymous and nonsynonymous substitutions. Readers interested in the detailed procedures and whether computer programs are available should consult the original papers cited.

&weighted PathwayMethods Method of Perler et al. Perler et al.I7 developed an unweighted pathway method for estimating synonymous substitutions. In their method, nucleotide sites are separated into groups of synonymous and nonsynonymous sites, and each group consists of three different categories. These categories are based on the numbers (1,2, and 3) of possible nucleotide changes from the nucleotide present at a site. Nucleotide differences are also classified into synonymous and nonsynonymous differences, each with three categories, making comparison quite complicated. Moreover, the method tends to give serious underestimates of synonymous substitutions. Using their own method, Perler et al.l7 claimed that, in the evolution of preproinsulin and globin genes, the synonymous changes accumulate about 7 times more rapidly than nonsynonymous changes, but the accumulation saturates in 85- 100 million years. However, Kimura’p9 pointed out that such a “saturation” phenomenon of synonymous substitutions is a statistical artifact, and then the computer simulation of GojoI7 F. Perler, A. Efstratiadis, P. Lomedico, W. Gilbert, R. Kolcdner, and J. Jhdgson, Cell 20, 555 (1980).

546

ESTIMATING

SEQUENCE

DIVERGENCE

r331

bori’* showed that the method of Perler et al. gives serious underestimates at the synonymous sites, particularly when K > 0.6. Method of Nei and Gojobori. The method of Nei and Gojobori19 (N-G method) is essentially an unweighted version of the method of Miyata and Yasunaga.*O The N-G method is much simpler than that of Perler et al. but gives a better estimate. In the N-G method, the proportion of synonymous changes at the ith position of a codon is denoted by fi’ (i = 1, 2, and 3). The number of synonymous sites for this codon is given by s = ZQ, J and that of nonsynonymous sites by n = 3 - s. For example, consider codon TTA (Leu). For this codon, we havef, = +,fi = 0, andf, = +, as shown by changing bases at the first, second, and third positions. Thus, s = 3 and n = 3. For a DNA sequence consisting of r codons, the total numbers of synonymous and nonsynonymous sites are, therefore, given by S = E&, Sj and N = 3r - S, where Sj is the value of s for the fib codon. We now explain how to compute the numbers of synonymous and nonsynonymous differences between two nucleotide sequences. Let us denote the numbers of synonymous and nonsynonymous differences per codon by s, and &, , respectively. For example, if the codons compared are GAT (Asp) and GAC (Asp), sd = 1 and nd = 0 [see Fig. 4 (top, right)]. The comparison of TTT (Phe) and CCA (Pro) [Fig. 4 (top, middle)] is more complicated. There are six pathways by which to reach one codon from the other. Of the six, five contain one synonymous and two nonsynonymous nucleotide differences in each pathway, whereas for the remaining pathway, three nonsynonymous nucleotide differences are involved. Taking the average over these pathways (giving equal weight to each), we have &, = (lX5-i-OX1)/6=$andn,=(2X5+3X1)/6=~(seeFig.4).Similarly, for the case of two nucleotide differences, sd and & can be computed. The total numbers of synonymous and nonsynonymous nucleotide differences can be obtained by summing these values over all codons. That is, Sd = 2-i and Nd = E&i ndi, where and ndj are the numbers of synonymous and nonsynonymous differences for the&h codon and r is the number of codons compared. Note that Sd + Nd is equal to the total number of nucleotide differences between the two DNA sequences compared. Thus, we estimate the fractions of synonymous (Fsn) and nonsynonymous (&&) differences by F,, = Sd/S and &, = N,/N, where S and N are the average numbers of synonymous and nonsynonymous sites in the sdj

sdj

lsT. Gojobori, Genetics 105, 101 I (1983). I9 M. Nei and T. Gojoboxi, Mol. Biol. Evol. 3,418 (1986). z” T. Miyata and T. Yasunaga, .I Mol. Evol. 16,23 (1980).

r331

STATISTICAL

ESTIMATION

OF

SEQUENCE

DIVERGENCE

547

two sequences compared. Finally, in order to estimate the numbers of synonymous (Ks) and nonsynonymous (&) substitutions per site, we use the one-parameter method and replace F,, by F,, and Fm, respectively, in IQ. (5). This is done to correct for multiple substitutions. Weighted Pathway Methods Method of Miyata and Yasunaga. In molecular evolution, pathways containing more synonymous nucleotide differences tend to occur with higher probability than those having fewer synonymous nucleotide differences. Moreover, with respect to nonsynonymous (amino acid altering) nucleotide differences, we note that the replacement of similar amino acids tends to occur more frequently than that of dissimilar amino acids. Taking such observations into account, Miyata and YasunagaZo developed a weighted pathway method (M-Y method) for estimating the numbers of synonymous and nonsynonymous substitutions. In the M-Y method, the weight given for each nucleotide difference depends on the biochemical similarity of the amino acid replacement involved. The estimated numbers of synonymous and nonsynonymous sites may also be obtained by suitably weighting different pathways. Then, using the estimated fractions of synonymous and nonsynonymous changes, we can compute the numbers of evolutionary substitutions by the formula for the one-parameter method. Although the M-Y method appears to give better estimates of synonymous and nonsynonymous substitutions than the N-G method, computer simulations have shown that the two methods give very similar estimates and that the estimates obtained by the former method are not substantially better than those obtained by the latter. The reason for this is that the genetic code seems to have such a property that the weights for different pathways should be quite similar.‘g~21 Method of Li et al. Li et al. 22 developed another weighted pathway method. They used a different weighting scheme based on the expected and observed frequencies of nucleotide substitutions. They also considered nondegenerate, 2-fold degenerate, and 4-fold degenerate sites separately. However, their method again leads to essentially the same estimates of K, and KN as obtained by the N-G method. Application, Advantages, and Disadvantages of Each Method. Although we can rely on hand calculations to obtain estimates by methods such as that of Perler et al. and the N-G method, in general, computer programs are required or preferable when we use the above methods. In the following, we show, using the sequence comparison given in Fig. 3, the estimated *I M. Nei, “Molecular Evolutionary Genetics.” Columbia University Press, New York, 1987. 22 W.-H. Li, C.-I Wu, and C.-C. Luo, MoI. Biol. Evol. 2, 150 (1985).

548

ESTIMATING

SEQUENCE

DIVERGENCE

1331

numbers of synonymous and nonsynonymous substitutions as obtained by various methods. If we use the N-G unweighted pathway method, the estimated total numbers of synonymous and nonsynonymous sites are S= 104.3 and N = 333.7, and the total numbers of synonymous and nonsynonymous nucleotide differences are S,, = 4 1.8 and Nd = 42.2, respectively. Thus, the fractions of synonymous and nonsynonymous differences are given by Far, = S,/S= 0.401 and FND = NJN = 0.126, respectively. Substituting these values into Eq. (5), we get KS = 0.574 for synonymous substitutions and KN = 0.138 for nonsynonymous substitutions. If we use the M-Y weighted pathway method, we obtain S = 104.4 and N = 333.6, and also S, = 42.3 and Nd = 41.7. Thus, the proportions of synonymous and nonsynonymous differences are given by Fs,, = &/S = 0.405 and Fm = NJN = 0.125, respectively. Putting these values into Eq. (5), we have KS = 0.582 for synonymous substitution and KN = 0.137 for nonsynonymous substitution. Through comparison of the sets of K values obtained by the two methods (N-G and M-Y methods), it is clear that they give very similar results. Therefore, we recommend the N-G method rather than the M-Y method, because the former is much simpler to use than the latter. The method of Perler et al. is not recommended since it tends to give serious underestimates, particularly for the number of synonymous substitutions. For example, when the nucleotide sequence of the Drosophila pupal cuticle protein gene is compared with that of the Drosophila larval cuticle protein gene (data from Henikoff et ~1.~~and Snyder et al.“), the method of Perler et al. gives KS = 1.227 whereas the N-G method gives KS = 1.5 18, the difference in KS thus being quite significant. The method of Li et al. may be useful when one wants to know the numbers of synonymous substitutions at 4- and 2-fold degeneracy sites separately. Based on consideration of the set of synonymous codons and the variance of the estimated frequencies of the different synonymous codons, Lewontin2s proposed a method for estimating the number of synonymous substitutions per codon. This method, however, is quite impractical because it is very sensitive to small differences in codon usage of the DNA sequences compared.

23 S. Henikoff, M. A. Keene, J. S. Sloan, J. Bleskan, R. Ha&, and D. Patterson, hoc. Nutf. Acad. Sci. U.S.A. 83, 720 (1986). 24M. Snyder, M. Hunkapiller, D. Yuen, D. Silvert, J. Fristrome, and N. Davidson, Cell 29, 1027 (1982). 2sR. C. Lewontin, Mol. Biol. Evol. 6, 15 (1989).

r331

STATISTICAL

ESTIMATION

OF SEQUENCE

DIVERGENCE

549

Discussion When the distance or the number of evolutionary base substitutions that separate the two DNA sequences is rather small, various methods as explained above (the one-, two-, and three-parameter, etc., methods) give similar results, with similar estimates of the K values. In such cases, more elaborate methods do not necessarily have much advantage over a simpler one. As the real distance increases, however, and particularly when the real value of K is larger than unity, exact estimation of the actual number of base substitutions becomes very difficult. This is because, as the number of multiple substitutions increases, it becomes increasingly difficult to estimate superimposed or hidden substitutions. It is in such a situation that the real merit of the more exact method becomes apparent. Actually, already at the range of intermediate K values, the advantage of the six-parameter method over the one-parameter method becomes evident as revealed by the examples in Table III, in which one-parameter method gives K = 0.409 whereas six-parameter method gives K = 0.564. One important but unsolved problem inherent in the estimation of K, when K is large, is how to treat “inapplicable cases.” All the methods so far developed treat DNA sequences as if they were infinite in length. In other words, they disregard the sampling effect which arises from the fact that all actual sequences are finite in length. This causes trouble when the two sequences compared have a large distance. In such a case, the probability that the observed number of base differences happens to exceed the theoretical “saturation level” increases, and this makes the estimation formula inapplicable. For example, in the one-parameter method, if the observed fraction of difference (Fo) happens to exceed 3, the argument of the logarithmic term becomes negative, and Eq. (5) can no longer be applied. Such inapplicable cases increase as the true K value increases, and discarding inapplicable cases leads to serious underestimates, as pointed out by Kimura.9 There has been some invalid criticism that the estimation methods constructed under the above models would not apply to actual data because the real patterns of nucleotide substitutions may be different from the model. Similarly, LewontitP claims that the above methods cannot be applied to codons in the protein-coding region, because amino acid changes are generally much constrained and synonymous and nonsynonymous changes belong to different classes. We cannot accept such an extreme view. Furthermore, the method of LewontinZ5 itself contains some problems, and, even for estimating the number of synonymous changes, this method cannot be recommended. Needless to say, it is important to

550

ESTIMATING

SEQUENCE

DIVERGENCE

1341

know the real pattern of nucleotide substitutions, and, taking this into account, to develop better methods for estimating the number of nucleotide substitutions. In fact, some authors have attempted to estimate the pattern of nucleotide substitutions from actual data.‘2*22 We should also add that some of the widely used methods, particularly the method of Perler et al.,” lead to serious underestimates and therefore cannot be recommended. In applying various estimation methods, we note that as long as we know the statistical behavior and level of the estimated value of K, aswell as the application range of the formulas, the methods reviewed above are sufficiently useful for our purposes. However, considering the great importance of the estimation of sequence divergence for studies of molecular evolution, further efforts are needed to develop better statistical methods. Also, in order to assess the validity of various statistical methods already available, extensive simulation experiments will have to be conducted. In other words, we should perform a systematic investigation using the Monte Carlo method and assuming diverse patterns of evolutionary mutant substitutions with known parameters and with known outcomes of multiple substitutions to see how well these statistical methods are able to estimate the true values. In fact, much work remains to be done in this field.

1341 Converting

Distance to Time: Evolution

By HIROHISA

KISHINO

and

Application

MASAMI

to Human

HASEGAWA

Introduction DNA sequence data provide us with a good source of information on the evolutionary history of organisms. Since nucleotide substitution in evolution is best regarded as stochastic, statistical methods based on probabilistic models are required for data analysis. Among these, the maximum likelihood method has a firm basis in probability theory. It requires an explicit probabilistic model of nucleotide substitution to calculate the likelihood function and helps to make the assumptions clear. In molecular phylogenetics, we want to solve two problems: (1) inference of branching order, and (2) estimation of branching dates. This chapter is concerned mainly with the latter problem, but the two are interrelated. The estimation of branching dates has been based on an approximate constancy of the molecular evolutionary rate that is called the METHODS

IN ENZYMOLOGY.

VOL.

183

Copyright 0 1990 by Academic Ress, Inc. All rights of reproduction in any form reserved.

Statistical methods for estimating sequence divergence.

1331 STATISTICAL ESTIMATION [33] Statistical By TAKASHI General OF SEQUENCE Methods for Estimating Divergence GOJOBORI, ETSUKO N. and MOR...
1MB Sizes 0 Downloads 0 Views