J. Mol. Evol. 13, 57-72 (1979)

Journal of Molecular Evolution © by Springer-Verlag. 1979

Evolutionary Changes in Protein Composition - Evidence for an Optimal Strategy

R. Coutelle 1, G.L. Hofacker 1, and R.D. Levine 2

1 Lehrstuhl für Theoretische Chemie, Technische Universität München, D-8046 Garching, Federal Republic of Germany
2 Department of Physical Chemistry, The Hebrew University, Jerusalem, Israel

Summary. The information contained in the composition of different proteins of the same family is analyzed. It is found that within each family the gain in information per amino acid replacement is constant. This finding is interpreted to imply that evolutionary changes in proteins follow an "optimal" path in the sense that they maximize the number of potentially functional sequences that can be generated by T accepted point mutations from a given protein, subject to restrictions due to biological function.

Key words: Information gain by accepted point mutations - Protein evolution - Protein compositions.

Introduction

There is considerable current discussion as to whether evolutionary trends can be detected in the composition of protein molecules (Kimura, 1968; cf. King and Jukes, 1969; Smith, 1969; Ohta and Kimura, 1971a and b; Vogel and Zuckerkandl, 1971; Reichert et al., 1973; Gatlin, 1972, 1974, 1976; Chirpich, 1975; Holmquist, 1975a and b; Hasegawa and Yano, 1975; Vogel, 1975). Statistical tests have tended to support a non-Darwinian point of view holding that most evolutionary changes in proteins may be due to "neutral" mutations (King and Jukes, 1969; Ohta and Kimura, 1971a and b; Gatlin, 1974; Holmquist, 1975a and b). On the other hand, despite seemingly composition- and sequence-independent probabilities of accepted point mutations (Dayhoff et al., 1972), the correlations between composition or sequence and function are known to call for a selective process at the level of the proteins. The question arises whether the results of mutation and selection, as preserved in natural proteins at various stages of evolutionary development, follow a law describing gross characteristics of evolution, in a similar sense as the second law of thermodynamics provides a macroscopic description of the net results of processes at the molecular level. To test this conjecture, we employ basic concepts of information theory which, in our opinion, are most suited to analyze detailed empirical data with regard to the inherent complexity of the observed process.

0022-2844/79/0013/0057/$ 03.20


R. Coutelle et al.

The present study considers only the evolutionary trend as reflected in the compositional changes of the proteins. The evolutionary distance is measured in terms of the replacements in the composition (by analogy to the PAMs of Dayhoff et al. discussed further in a later section). Several different families of proteins are examined. Within each family we are able to determine the (constant) increment in the compositional information content (defined later) per accepted amino acid replacement. It is shown in some detail that a random process will not operate in this fashion and that deviations from the normal distribution are only partly due to finite lengths of the proteins. There is a wide possible range of variation in the compositional information content per replacement. Yet the actual compositions of natural proteins (of a given family) appear to follow a unique trend.

In judging the information content of a protein by its composition we ignore the information contained in the sequential amino acid correlations. There is indication that this part may not be too large (Finkelstein and Ptitsyn, 1971; Hopfinger, 1972; Nagano, 1973), so that our quantitative analysis of compositional changes should disclose the most salient features of protein evolution.

The Standard Composition

To analyze a given protein composition in terms of probability measures it is necessary to have a reference (or 'standard') composition with which it can be compared. Such a standard has been provided by the work of Dayhoff et al. (1972). They have studied compositional changes in the evolution of proteins as a stochastic process characterized by a replacement probability matrix for the amino acids. Under repeated replacements all proteins, regardless of their initial composition, in the absence of constraints, tend on the average towards the standard composition, which is reproduced in Table 1.
Table 1. Amino acid frequencies (in percentages) in the standard composition (adapted from Figure 9-6 of Dayhoff et al., 1972)

Ala   Gly   Lys   Leu   Val
9.6   9.0   8.5   8.4   7.8

Thr   Ser   Asp   Glu   Phe
6.2   5.7   5.3   5.3   4.5

Asn   Pro   Ile   His   Arg
4.2   4.1   3.5   3.4   3.4

Gln   Tyr   Cys   Met   Trp
3.2   3.0   2.5   1.2   1.2

Since we are interested in the characterization of evolutionary trends, this amino acid distribution, remaining stationary under evolutionary random processes, has the unique property of exhibiting, by comparison, the effects of specific selection on protein compositions. The standard composition in Table 1 is somewhat similar to that based on the degeneracy of the genetic code, but avoids the assumptions implicit in the establishment of the latter standard composition (e.g., equiprobable codons); it also bears out the fact that studies of protein evolution are based on observed mutations, i.e., those accepted by natural selection. Dayhoff's replacement matrix implies averaging over different protein families with different biological functions. It will therefore represent the most prevalent evolutionary trends but will not account for the selective constraints which hold for special proteins and protein families. This is borne out by the observed protein compositions showing considerable deviations from the standard composition.

We therefore envisage evolutionary progress via accepted point mutations as a random process, alike for all proteins according to Dayhoff, with superimposed specific constraints for individual proteins and protein families. It seems to be the latter kind of selective processes by which evolutionary pressure on a particular protein manifests itself. Consequently, the deviance of a protein composition from the standard one is due to functional requirements set by evolutionary pressures. It is fair to assume that most amino acid sequences corresponding to given compositions make functionally unacceptable proteins. Conversely, one may assume that the chance of finding a protein functionally acceptable under some selective pressure may be rather uniform (except for the very small number of extreme sequences and compositions). If so, it is clear that protein evolution starting from standard composition proteins will not take an arbitrary path. The crucial problem, by which measure to judge deviations from the standard composition, will be discussed subsequently.

A Measure of Evolutionary Distance

We require a measure for the evolutionary distance between two related proteins which can be computed in terms of their composition alone. Thus, to change a protein of standard composition (frequencies f_i0) to another protein of a different composition (frequencies f_i) we need at least

$$\Delta x = 100 \sum_i |f_i - f_{i0}| / 2 \tag{1}$$

replacements for a protein composed of 100 amino acids. This corresponds to the "Q" measure introduced by Holmquist (1974), who used a reference composition proportional to the multiplicity of the genetic code. This is the minimal number of replacements and does not include those mutations that leave the composition unchanged. To account for these and other (e.g., Ala → Gly → Ala) replacements which are not reflected in compositional changes, Dayhoff et al. (1972) have introduced the concept of the PAMs (= accepted point mutations) as the most probable number of mutations connecting two sequences (of length 100 amino acids). Using their replacement matrix they have provided a correlation between the minimal number of replacements and the PAMs values. Naturally, the correlation is nearly linear at low PAMs values. In general, the PAMs exceed the minimal number of replacements (Table 29 of Dayhoff et al., 1976). For each protein we have computed (from its known composition) the minimal number of replacements [using (1)] and obtained the corresponding PAM value by the formula¹

$$\mathrm{PAM} = -74.55 \cdot \ln(1 - \Delta x / 74.55),$$

which is a very good approximation, for the interesting range of PAMs ~ 50, to the Table of Dayhoff cited above.

¹ We gratefully acknowledge the communication of this useful formula by an anonymous referee.
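The two distance measures of this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: the function names `min_replacements` and `pam_from_replacements` are our own, compositions are assumed to be given in percent as in Table 1, and the guard on the argument of the logarithm is an added assumption, since the approximation formula is undefined once Δx reaches 74.55.

```python
import math

# Standard composition in percent, transcribed from Table 1.
STANDARD = {
    "Ala": 9.6, "Gly": 9.0, "Lys": 8.5, "Leu": 8.4, "Val": 7.8,
    "Thr": 6.2, "Ser": 5.7, "Asp": 5.3, "Glu": 5.3, "Phe": 4.5,
    "Asn": 4.2, "Pro": 4.1, "Ile": 3.5, "His": 3.4, "Arg": 3.4,
    "Gln": 3.2, "Tyr": 3.0, "Cys": 2.5, "Met": 1.2, "Trp": 1.2,
}

PAM_SCALE = 74.55  # constant appearing in the approximation formula


def min_replacements(freqs, standard=STANDARD):
    """Eq. (1): minimal number of replacements per 100 residues.

    Both compositions are in percent, so 100 * sum|f_i - f_i0| / 2
    (with fractional f) reduces to sum|percent difference| / 2.
    """
    return sum(abs(freqs.get(aa, 0.0) - standard[aa]) for aa in standard) / 2.0


def pam_from_replacements(dx):
    """PAM = -74.55 * ln(1 - dx / 74.55), defined only for 0 <= dx < 74.55."""
    if not 0.0 <= dx < PAM_SCALE:
        raise ValueError("dx must lie in [0, 74.55)")
    return -PAM_SCALE * math.log(1.0 - dx / PAM_SCALE)
```

As a usage example with a made-up composition, moving five percentage points from Ala to Trp (`comp = dict(STANDARD); comp["Ala"] -= 5; comp["Trp"] += 5`) gives a minimal distance of 5 replacements, and `pam_from_replacements` returns a slightly larger value, in line with the statement that the PAMs exceed the minimal number of replacements.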


The Compositional Information Content

A proper measure of the deviance of the actual composition of a protein from the standard one can be provided by standard information theoretic arguments. We consider all possible sequences of N amino acids and ask how many of these sequences have the specified composition. A protein with a given sequence, being selected from the total number of sequences of the same composition for its functional properties, will then be judged by the rarity or uniqueness of these properties. Assuming that all other proteins of that composition have different properties, the uniqueness of a specific protein will then be measured by the total number of sequences with that composition. Since the pioneering work of Hartley (1928) it is known that such a judgement of uniqueness should be made on the basis of a logarithmic measure. Quite generally their information content (Shannon, 1949).

For a sequence of N amino acids let N_i be the number of times that amino acid i appears, f_i = N_i/N, Σ_i f_i = 1. The probability of generating a particular sequence of composition N_1, N_2, ..., N_20 in a process with corresponding probabilities f_i0 is

$$\prod_{i=1}^{20} f_{i0}^{N_i} \tag{2}$$

An equivalent route to the same result is as follows. Consider all possible sequences of proteins (of length N) that have evolved according to the Dayhoff et al. replacement probabilities. After an initial induction period the average (over all sequences) frequency of the i'th amino acid is the standard one, f_i0. Expression (2) is then the probability of a particular sequence having the stated composition. There are

$$N! \Big/ \prod_{i=1}^{20} N_i! \tag{3}$$

distinct sequences of the same composition. Hence the fraction of all sequences (of length N) with a specified composition is

$$P(N) = \Big( N! \Big/ \prod_{i=1}^{20} N_i! \Big) \prod_{i=1}^{20} f_{i0}^{N_i} = \Big( N! \Big/ \prod_{i=1}^{20} (N f_i)! \Big) \prod_{i=1}^{20} f_{i0}^{N_i} \tag{4}$$

The notation P(N) means that P is a function of all the N_i's. One can now verify that the average (over all sequences) composition is indeed the standard one:

$$\sum_{N_1} \sum_{N_2} \cdots \sum_{N_{20}} N_i \, P(N) = N f_{i0}, \qquad \sum_i N_i = N. \tag{5}$$


In (5) summation is over all possible values of N_i (0 to N) subject to the restriction that Σ_i N_i = N. Using Stirling's approximation in both numerator and denominator of (4) we obtain after some simple cancellations

$$P(N) = \prod_i (f_{i0}/f_i)^{N_i} \tag{6}$$

or

$$-\ln P(N) = N \sum_i f_i \ln(f_i/f_{i0}). \tag{7}$$

In terms of the information content, ΔS, defined by

$$\Delta S = \sum_i f_i \ln(f_i/f_{i0}), \tag{8}$$

we can write (6) as

$$P(N) = \exp(-N \Delta S). \tag{9}$$
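The "simple cancellations" leading from (4) to (7) can be spelled out as follows. This is a sketch that keeps only the leading Stirling terms, ln n! ≈ n ln n − n, and drops the sub-leading logarithmic corrections, which is an assumption about the level of approximation intended in the text:

```latex
\begin{aligned}
\ln P(N) &= \ln N! - \sum_{i=1}^{20} \ln N_i! + \sum_{i=1}^{20} N_i \ln f_{i0} \\
         &\approx (N \ln N - N) - \sum_i (N_i \ln N_i - N_i) + \sum_i N_i \ln f_{i0} \\
         &= -\sum_i N_i \ln\frac{N_i}{N} + \sum_i N_i \ln f_{i0}
            \qquad \Big(\text{using } \sum_i N_i = N\Big) \\
         &= -N \sum_i f_i \ln\frac{f_i}{f_{i0}} = -N \, \Delta S .
\end{aligned}
```

The terms linear in N and N_i cancel because the N_i sum to N, and writing N ln N as Σ_i N_i ln N turns the remaining logarithms into ratios N_i/N = f_i.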

The information content has been extensively studied in information theory (Kullback, 1959; Renyi, 1962) and its applications (Schlögl, 1975; Levine and Bernstein, 1974; Ben-Shaul and Hofacker, 1976; Levine and Ben-Shaul, 1977).

The Surprisal of a Protein Composition

The result P(N) = exp(−NΔS) (where ΔS ≥ 0 and vanishes only for a sequence of standard composition) demonstrates that in the absence of constraints, significant deviations from the standard composition are unlikely; that is, we would be quite surprised to observe proteins with frequencies which are quite different from the standard ones. Moreover, the longer the sequence, the more likely it is to have a standard composition. It is appropriate to adopt I(N),

$$I(N) = -\ln P(N) = N \Delta S, \tag{10}$$

as a measure of our surprise at a given composition. If it is standard, then ΔS = 0 and we are not surprised. Any non-standard composition has a surprisal value which is significant inasmuch as the positive value of ΔS is larger than the one expected for a random process. As in several other studies of systems in disequilibrium (e.g., Levine and Ben-Shaul, 1977), the surprisal affords a convenient way to explore the mechanisms which maintain a system away from its most probable state. This analysis is carried out for the evolutionary changes in protein composition for several families.
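To make Eqs. (8) and (10) concrete, the following Python sketch computes ΔS and the surprisal I(N) from residue counts and, for comparison, the exact value of −ln P(N) from the multinomial expression (4). The function names are illustrative assumptions, and terms with N_i = 0 are taken to contribute zero to ΔS. Note that the exact value retains the sub-leading Stirling terms, so it is not expected to coincide with NΔS for short sequences.

```python
import math

# Standard composition in percent, transcribed from Table 1.
STANDARD = {
    "Ala": 9.6, "Gly": 9.0, "Lys": 8.5, "Leu": 8.4, "Val": 7.8,
    "Thr": 6.2, "Ser": 5.7, "Asp": 5.3, "Glu": 5.3, "Phe": 4.5,
    "Asn": 4.2, "Pro": 4.1, "Ile": 3.5, "His": 3.4, "Arg": 3.4,
    "Gln": 3.2, "Tyr": 3.0, "Cys": 2.5, "Met": 1.2, "Trp": 1.2,
}


def delta_s(counts, standard=STANDARD):
    """Eq. (8): sum_i f_i ln(f_i / f_i0), with f_i = N_i / N."""
    n = sum(counts.values())
    total = 0.0
    for aa, n_i in counts.items():
        if n_i > 0:  # a vanishing f_i contributes nothing
            f_i = n_i / n
            total += f_i * math.log(f_i / (standard[aa] / 100.0))
    return total


def surprisal(counts, standard=STANDARD):
    """Eq. (10): I(N) = N * Delta S."""
    return sum(counts.values()) * delta_s(counts, standard)


def exact_neg_log_p(counts, standard=STANDARD):
    """-ln P(N) from the exact multinomial expression, Eq. (4)."""
    n = sum(counts.values())
    log_p = math.lgamma(n + 1)  # ln N!
    for aa, n_i in counts.items():
        log_p += n_i * math.log(standard[aa] / 100.0) - math.lgamma(n_i + 1)
    return -log_p
```

For a hypothetical 100-residue protein with five copies of each amino acid, `delta_s` is positive, while counts proportional to Table 1 (for example 96 Ala, 90 Gly, ... in 1000 residues) give ΔS equal to zero up to rounding, as stated in the text.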

[Figure 1, panels a-i: plots of ΔS against PAMs. Panel titles and fitted lines as legible in the figure:
(a) Cytochrome c: ΔS = 0.0073·(PAMs) − 0.056, r = 0.96
(b) Ferredoxin: ΔS = 0.013·(PAMs) − 0.16, r = 0.96
(c) Hemoglobin α: ΔS = 0.0090·(PAMs) − 0.068, r = 0.97
(d) Hemoglobin β: ΔS = 0.0065·(PAMs) − 0.026, r = 0.68
(e) Myoglobin: ΔS = 0.0060·(PAMs) + 0.031, r = 0.85
(f) Ribonuclease: ΔS = 0.0085·(PAMs) − 0.064, r = 0.87
(g) Lysozyme: ΔS = 0.0104·(PAMs) − 0.097, r = 0.98
(h) α-Crystallin A: ΔS = 0.0081·(PAMs) − 0.072, r = 0.97
(i) Range of possible ΔS values vs. PAMs (shaded region)]

Fig. 1a-i. Compositional information content ΔS (in units as defined by Eq. (8)) vs. the evolutionary distance (in PAMs); (a) Cytochrome c, (b) Ferredoxin, (c) and (d) Hemoglobin α and β, (e) Myoglobin, (f) Ribonuclease, (g) Lysozyme and (h) α-Crystallin A. r is the computed correlation coefficient for the family. In addition, the range of possible values of the information content vs. PAMs is shown in panel (i) as a shaded region. These values were obtained by calculating minimal and maximal ΔS values of a given evolutionary distance. To show the relationship between classes of species and ΔS-PAMs values, the different species are identified. All compositions not specified can be found in the Protein Atlas Vol. V and its supplements.

(a): Cytochrome c: a: rape (Brassica napus); B: bullfrog (Rana catesbiana); b: bonito (Katsuwonus vagrans); C: chicken (Gallus gallus); c: cotton (Gossypium barbadense); D: puget sound dogfish (Squalus sucklii); d: Debaryomyces kloeckeri; E: Eisenia foetida (Lyddiat and Boulter, 1976a); e: elder (Sambucus nigra); F: fruit fly (Drosophila melanogaster); f: sunflower (Helianthus annuus); G: california gray whale (Rachianectes glaucus); g: Euglena gracilis; H: human (Homo sapiens); h: horse (Equus caballus); i: spinach (Spinacia oleracea); K: gray kangaroo (Macropus giganteus); L: pacific lamprey (Entosphenus tridentatus); l: Humicola lanuginosa; M: silkworm moth (Samia cynthia); N: Neurospora crassa; n: Abutilon theophrasti; O: Crithidia oncopelti; o: locust (Schistocerca gregaria) (Lyddiat and Boulter, 1977); P: king penguin (Aptenodytes patagonica); p: potato (Solanum tuberosum) (Martinez et al., 1974); Q: tobacco horn worm moth (Manduca sexta); R: rhesus monkey (Macaca mulatta); r: rabbit (Oryctolagus cuniculus); S: rattle snake (Crotalus adamanteus); s: starfish (Asterias rubens) (Lyddiat and Boulter, 1976b); T: snapping turtle (Chelydra serpentina); t: tomato (Lycopersicum esculentum); U: Candida krusei; u: rust fungus (Ustilago sphaerogena); w: wheat (Triticum aestivum); x: boxelder (Acer negundo); y: baker's yeast (Saccharomyces oviformis).

(b): Ferredoxin: A: alfalfa (Medicago sativa); B: Clostridium butyricum; C: Chromatium; D: Desulfovibrio gigas; E: Scenedesmus; F: Bacillus stearothermophilus (Hase et al., 1976a); G: Megasphaera elsdenii; H: Halobacterium halobium (Hase et al., 1977); I: Chlorobium limicola I; J: Chlorobium limicola II (Tanaka et al., 1975); L: Leucaena glauca; M: Micrococcus aerogenes; N: Nostoc muscorum (Hase et al., 1976b); P: Clostridium pasteurianum; R: Spirulina platensis (Wada et al., 1975); S: spinach (Spinacia oleracea); T: taro (Colocasia esculenta); U: Clostridium acidi-urici; V: Clostridium tartarivorum; X: Spirulina maxima.

(c): Hemoglobin α: A: savannah monkey (Cercopithecus aethiops) (Matsuda et al., 1973); B: bovine (Bos taurus); C: chicken A 1 (Gallus gallus) (Takei et al., 1975); c: chicken A 2 (Gallus gallus); D: dog (Canis familiaris); E: marmoset (Sanguinus fuscicollis) (Lin et al., 1976); G: desert sucker (Catostomus clarkii); H: human (Homo sapiens); h: horse (Equus caballus); K: gray kangaroo (Macropus giganteus); L: slow loris (Nycticebus coucang); M: mouse (C57 BL9); N: newt (Taricha granulosa); O: opossum (Didelphis marsupialis) (Stenzel, 1974); P: carp (Cyprinus carpio); R: rhesus monkey (Macaca mulatta); r: rabbit (Oryctolagus cuniculus); S: sheep (Ovis aries); U: capuchin monkey (Cebus apella); V: viper (Vipera aspis); x: echidna 2A; Y: echidna 1B (Tachyglossus aculeatus aculeatus).

(d): Hemoglobin β: B: bovine (Bos taurus); C: chicken (Gallus gallus); D: dog (Canis familiaris); E: echidna (Tachyglossus aculeatus aculeatus); F: frog (Rana esculenta); H: human (Homo sapiens); K: gray kangaroo (Macropus giganteus); M: mouse (C57 BL); P: potoroo (Potorous tridactylus); R: rhesus monkey (Macaca mulatta); r: rabbit (Oryctolagus cuniculus); S: spider monkey (Ateles geoffroyi); U: capuchin monkey (Cebus apella).

(e): Myoglobin: A: arctic minke whale (Balaenoptera acutorostrata) (Lehman et al., 1977); B: bovine (Bos taurus); b: badger (Meles meles); c: chicken (Gallus gallus) (Deconink et al., 1975); D: bottlenosed dolphin (Tursiops truncatus) (Jones et al., 1976); d: dog (Canis familiaris) (Dumur et al., 1976); F: dwarf sperm whale (Kogia simus) (Dwulet et al., 1977); G: galago (Galago crassicaudatus); g: hedgehog (Erinaceus europaeus) (Romero-Herrera et al., 1975a); H: human (Homo sapiens); h: horse (Equus caballus); J: harbor porpoise (Phocaena phocaena); K: red kangaroo (Macropus rufus); k: killer whale (Orcinus orca) (Castillo and Lehmann, 1977); L: lemur (Lepilemur mustelinus); l: sea lion (Zalophus californianus); M: common marmoset (Callithrix jacchus); N: slow loris (Nycticebus coucang) (Romero-Herrera et al., 1976a); O: olive baboon (Papio anubis); o: opossum (Didelphis marsupialis) (Romero-Herrera and Lehmann, 1975c); P: pig (Sus scrofa) (Rousseaux et al., 1976); p: potto (Perodicticus potto edwarsi) (Romero-Herrera and Lehmann, 1975b); Q: gastropod mollusc (Aplysia limacina) (Tentori et al., 1973); R: amazon river dolphin (Inia geoffrensis) (Dwulet et al., 1975); r: rabbit (Oryctolagus cuniculus) (Romero-Herrera et al., 1976b); S: sheep (Ovis aries); s: harbor seal (Phoca vitulina); T: treeshrew (Tupaia glis belangeri) (Romero-Herrera and Lehmann, 1974); W: sperm whale (Physeter catodon); w: california gray whale (Eschrichtius gibbosus) (Bogardt et al., 1976).

(f): Ribonuclease: B: bovine (Bos taurus); C: chinchilla (Chinchilla brevicaudata) (van den Berg et al., 1976); D: dromedary (Camelus dromedarius) (Welling et al., 1975); E: red deer (Cervus elaphus); F: fallow deer (Dama dama); G: giraffe (Giraffa camelopardalis); M: moose (Alces alces); P: pig (Sus scrofa); q, Q: guinea pig A and B (Cavia porcellus) (van den Berg et al., 1977); R: rat (Rattus rattus); S: sheep (Ovis aries); T: reindeer (Rangifer tarandus); U: gnu (Connochaetes taurinus) (Groen et al., 1975); W: lesser rorqual (Balaenoptera acutorostrata) (Emmens et al., 1976); Y: coypu (Myocastor coypus) (van den Berg and Beintema, 1975); Z: horse (Equus caballus).

(g): Lysozyme: A: chachalaca (Ortalis vetula) (Jollès et al., 1976); B: baboon; C: chicken (Gallus gallus); D: duck (Anas platyrhynchus); G: guinea hen (Numida meleagri); H: human (Homo sapiens); Q: bobwhite quail (Colinus virgianus); T: Phage T4.

(h): α-Crystallin A: B: ox (Bos taurus); D: dog (Canis familiaris); E: elephant (Loxodonta africana); H: human (Homo sapiens); h: horse (Equus caballus); K: red kangaroo (Macropus rufus); M: rhesus monkey (Macaca mulatta); O: opossum (Didelphis marsupialis); P: pig (Sus scrofa); R: rhinoceros (Ceratotherium simum); r: rabbit (Oryctolagus cuniculus); T: rat (Rattus norvegicus); W: whale (Balaenoptera acutorostrata); X: cape hyrax (Procavia capensis). All α-crystallins can be found in the paper of de Jong et al., 1977.


Results

Typical results of the analysis are shown in Figure 1a-h as a surprisal vs. PAMs plot. They exhibit some remarkable characteristics:

(i) Within each protein family there is a fixed increment in information per accepted point mutation.

(ii) The ΔS values found at a given PAM value are in the very low range. (The range of ΔS vs. PAMs for all possible sequences is given in Fig. 1i.) From Eq. (9) it follows that P(N) is very high. The natural proteins have a composition which can be realized by very many different sequences.

(iii) A higher deviance from the standard composition corresponds to stronger constraints due to biological function. The plots for cytochrome c and hemoglobin indicate that proteins of more evolved species tend to have a greater evolutionary distance from the standard composition and correspondingly have a higher ΔS value. The case of ferredoxin shows that the plant proteins and the ferredoxin of Halobacterium halobium, which bind only two iron and sulfur atoms, have the lowest evolutionary distance, whereas the proteins in the middle of the range bind 4-6 and the bacterial ferredoxins with the highest evolutionary distances bind 7-8 iron and sulfur atoms. In the other cases shown either the range of investigated species (ribonuclease and α-crystallin A) or the range of observed evolutionary distances (myoglobin and hemoglobin β) is too short to reveal a similar trend; for lysozyme not enough compositions are known to be tested against the hypothesis stated above. But for α-crystallin A, the marsupials kangaroo and opossum are found clearly separated from the other mammals (including man).

(iv) By construction ΔS → 0 as PAMs → 0. Hence the linear relations shown in Figures 1a-h should not be linearly extrapolated to zero PAMs. The reason is obvious from Figure 1i. When the compositions correspond to the regime of high P(N) (i.e., low ΔS), then near the standard composition the minimal ΔS cannot be linear in relation to the PAM values. Moreover, due to the finite length of the proteins one is unlikely to find ΔS values near zero (see the following section).

Effects of Finite Length

Proteins generated by a random process following the Dayhoff replacement matrix would also have finite PAMs and ΔS values. This question was first investigated by Gatlin (1974), and it was noted by Holmquist and Moise (1975) and Vogel (1975) that observed protein compositions differ significantly from randomly generated ones based on the multiplicity of the genetic code. The influence of finite length on an analysis such as that given in the "Results" section can readily be tested by calculating ΔS vs. PAM contours for proteins of various lengths in a procedure by which amino acids are chosen according to their frequencies f_i0. The significant deviations of natural proteins from the standard distribution can then be recognized without further statistical detours (Fig. 2a-d). As expected, the randomly generated ΔS-PAM contours are strongly peaked and shift to smaller PAMs and ΔS values with increasing N. Figure 2a-d shows that the probability for nearly all of these proteins to owe the location of their ΔS-PAMs point to finite length is at best of the order 1%.² Since the absolute values of PAMs and ΔS are of no concern to us, we see no need to change to ΔS-PAMs scales shifted by the most probable finite corrections.

A random process with mutation probabilities given by Dayhoff's matrix for proteins of finite length would change the composition to a ΔS-PAMs region of high probability. There it would exhibit random walk characteristics. This is shown in Figure 3a for a simulation of the evolution of cytochrome c and in Figure 3b for hemoglobin α. As Figure 1a-c shows, there exists a uniform trend between the level of evolution and the distance from the standard composition. This can only be observed if a suitable standard composition is chosen. If, e.g., the composition of chicken cytochrome c were chosen as a standard, cytochrome c compositions for plants and mammals would have similar distances from this standard composition.

² The extremely deviating values for ribonuclease (Fig. 2b) are mainly due to the richness in the rare amino acids asn, cys, gln and met. α-Crystallin A, on the other hand, owes its extreme ΔS-PAMs points mainly to the abundance of ser (Fig. 2d).

[Figure 2, panels a-d: contour plots of ΔS vs. PAMs for random proteins of (a) 100 AA, (b) 125 AA, (c) 150 AA and (d) 175 AA, with the ranges of the natural families overlaid: cytochrome c (102-112 AA) in a; ribonuclease (123-128 AA) and lysozyme (129-130 AA) in b; hemoglobin α (141-142 AA), hemoglobin β (140-146 AA) and myoglobin (153 AA) in c; α-crystallin A (172-173 AA) in d.]

Fig. 2a-d. 100,000 random proteins were generated to build each of a-d. The random proteins of each panel have a different number of amino acids: those of a consist of 100 amino acids, the proteins in b-d have 125, 150, and 175 amino acids respectively. All random proteins were generated by drawing random numbers from the interval [0, 1], which was divided into 20 subintervals of length f_i0, the standard frequency for amino acid i given in the manuscript. Thus each random number ∈ [0, 1] was associated with an amino acid. In this way, N amino acids for a random protein of length N were generated. From the composition of this random protein the corresponding ΔS and PAMs values were calculated. The ΔS-PAMs area shown in a-d was divided into 60 × 120 subareas, and the pairs of ΔS and PAMs values of the random proteins falling in one subarea were counted. Contour lines are given. They indicate the border lines between which more than 10, 50, 200, 500, or 1000 random proteins are collected in one subarea. In the contour plots the range of the linear relation found between ΔS and PAMs for natural proteins of one family is drawn. The range for natural proteins of length L is drawn in the contour plot for random proteins with the most similar length.

[Figure 3, panels a and b: ΔS vs. PAMs trajectories. (a) Simulation of cytochrome c evolution, start: dogfish, r = 0.834. (b) Simulation of hemoglobin α evolution, start: sheep, r = 0.801.]

Fig. 3a and b. In a and b a simulation of natural evolution of protein compositions by a random process is given. In a first step the composition of a natural protein with its PAMs and ΔS values lying in the middle of the ΔS-PAMs range of the corresponding proteins was chosen. For cytochrome c this is the composition of puget sound dogfish cytochrome c and for hemoglobin α the composition of sheep hemoglobin α. The evolution matrix (a_ij) contains the probability that an amino acid of kind j is replaced by an amino acid of kind i in a mutation process. Thus the following random process can be performed: in a protein with composition [f_i] for the amino acids i, the probability that an amino acid of kind j is replaced by an amino acid of kind i is given by f_j a_ij. The interval [0, 1] was divided into 400 subintervals of length f_j a_ij (i = 1, ..., 20; j = 1, ..., 20). If a random number ∈ [0, 1] falls in the interval f_j a_ij, a mutation from amino acid j to amino acid i was associated. With every mutation the protein composition is modified to [f_i'] and the interval [0, 1] was divided again into 400 subintervals, now with length f_j' a_ij. This process was continued until on average 10 mutations were generated. For the protein composition so constructed, PAMs and ΔS values were computed, and this protein composition was chosen as a starting point to generate a new protein composition in the same way. This process was continued until 25 proteins were generated. Typical results are shown in a and b; the PAMs and ΔS values for the proteins 1, 2, 3, ..., 25 are connected by a line, always starting with the values corresponding to the dogfish or sheep composition. In each case, the correlation coefficient for the pairs of ΔS and PAMs values is given.
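The two random procedures described in the captions of Figs. 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' program. The Dayhoff replacement matrix is not reproduced in this paper, so the matrix `A` below is a hypothetical stand-in (off-diagonal entries proportional to f_i0, chosen only because it leaves the standard composition stationary). The diagonal "no change" entries are skipped, so every draw is an actual replacement and each generation has exactly, rather than on average, 10 mutations. All function names are our own.

```python
import math
import random

# Standard composition in percent, transcribed from Table 1.
STANDARD = {
    "Ala": 9.6, "Gly": 9.0, "Lys": 8.5, "Leu": 8.4, "Val": 7.8,
    "Thr": 6.2, "Ser": 5.7, "Asp": 5.3, "Glu": 5.3, "Phe": 4.5,
    "Asn": 4.2, "Pro": 4.1, "Ile": 3.5, "His": 3.4, "Arg": 3.4,
    "Gln": 3.2, "Tyr": 3.0, "Cys": 2.5, "Met": 1.2, "Trp": 1.2,
}
NAMES = list(STANDARD)
F0 = {aa: STANDARD[aa] / 100.0 for aa in NAMES}


def delta_s_and_pam(counts):
    """Return (Delta S, PAMs) for a composition given as residue counts."""
    n = sum(counts.values())
    ds = 0.0
    dx = 0.0
    for aa in NAMES:
        f_i = counts[aa] / n
        dx += abs(f_i - F0[aa])
        if f_i > 0.0:
            ds += f_i * math.log(f_i / F0[aa])  # Eq. (8)
    dx *= 100.0 / 2.0  # Eq. (1)
    # PAM approximation from the text; undefined once dx reaches 74.55.
    pam = -74.55 * math.log(1.0 - dx / 74.55) if dx < 74.55 else math.inf
    return ds, pam


def random_protein(length, rng):
    """Fig. 2 procedure: draw each residue independently with probability f_i0."""
    counts = dict.fromkeys(NAMES, 0)
    for aa in rng.choices(NAMES, weights=[F0[a] for a in NAMES], k=length):
        counts[aa] += 1
    return counts


# HYPOTHETICAL replacement matrix, NOT Dayhoff's published values:
# A[i][j] is the probability that a residue of kind j is replaced by kind i.
RATE = 0.01
A = {i: {j: RATE * F0[i] for j in NAMES if j != i} for i in NAMES}


def mutate_once(counts, rng):
    """Fig. 3 procedure: choose the replacement j -> i with weight f_j * a_ij."""
    n = sum(counts.values())
    pairs = [(j, i) for j in NAMES if counts[j] for i in NAMES if i != j]
    weights = [counts[j] / n * A[i][j] for j, i in pairs]
    j, i = rng.choices(pairs, weights=weights, k=1)[0]
    counts[j] -= 1
    counts[i] += 1


def trajectory(start, n_proteins=25, mutations=10, seed=0):
    """Chain of compositions, each `mutations` replacements after the last."""
    rng = random.Random(seed)
    counts = dict(start)
    points = [delta_s_and_pam(counts)]
    for _ in range(n_proteins):
        for _ in range(mutations):
            mutate_once(counts, rng)
        points.append(delta_s_and_pam(counts))
    return points
```

Calling `delta_s_and_pam(random_protein(N, rng))` repeatedly for N = 100, 125, 150 and 175 yields the point clouds that the contour plots summarize; binning them on a 60 × 120 grid, as in the caption, is left out for brevity. `trajectory` needs a starting composition, which would be the dogfish or sheep protein in the original simulation; those counts are not listed here, so any start used with this sketch is a placeholder.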

Information Gained by Selecting a Protein of Given Composition After T PAMs

In the "information content" section we counted the number of sequences of a given composition and introduced ΔS as a measure of the fraction P(N), P(N) = exp(−NΔS), of such sequences. The analysis of the compositions of natural proteins then led to the observation that ΔS is linear as a function of the PAM values. Furthermore, by examining a ΔS vs. PAM plot (Fig. 1i) for all possible sequences, we concluded that for the natural proteins P(N) is not far from maximal. The natural families thus have compositions that allow for a large number of possible sequences. To interpret this finding we have to imagine the total number of successor proteins which can be generated from any selected protein by T PAMs according to "neutral" replacements based on the Dayhoff matrix. Among these, one will be selected with specified sequence and composition. The information gained by this selection we would then consider a true measure of evolutionary progress.

Our first step is to estimate the number of sequences M(T) which can be generated by T replacements from a given protein. This could be done on the basis of the Dayhoff matrix but would require a rather elaborate mathematical presentation. Since, for the purpose of this paper, it will be sufficient to show that for large enough T, ln M(T) ∝ T, we simplify the Dayhoff replacement matrix by letting a_ij be equal to unity whenever the replacement of amino acid i by amino acid j is acceptable, and a_ij = 0 otherwise. The symmetric matrix [a_ij] is thus generated by replacing all the non-vanishing elements of the mutation probability matrix (e.g., Fig. 9-7 of Dayhoff et al., 1972) by unity.

We now consider all possible proteins obtained by T + 1 mutations such that first amino acid i is replaced by others. Let there be M_i(T + 1) such proteins. In the first step, amino acid i is replaced by any one of the acceptable amino acids j (i.e., by any amino acid j for which a_ij is non-zero). Hence the set of possible proteins could also be constructed on the basis of T mutations starting with amino acid j, summed over all possible j's. Mathematically,

$$M_i(T+1) = \sum_j a_{ij} M_j(T).$$

To find M_i(T) we anticipate a solution of the form M_i(T) = M_i μ^T and substitute to obtain M_i μ^(T+1) = Σ_j a_ij M_j μ^T, or M_i μ = Σ_j a_ij M_j. μ is thus an eigenvalue and the M_j's are an eigenvector of the symmetric real matrix [a_ij]. As T increases, only the largest eigenvalue will matter (but at small T's, M_i(T) will not necessarily have a simple exponential dependence on T). The number M(T) of all possible proteins after T acceptable mutations, M(T) ≤ Σ_i M_i(T), increases exponentially with T. ln M(T)/T thus tends to a definite limit as T increases. The same exponential behavior will result in the more complicated case of T replacements following the Dayhoff matrix.
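The approach of ln M(T)/T to a definite limit can be illustrated numerically by iterating the recursion M_i(T+1) = Σ_j a_ij M_j(T). The sketch below uses a hypothetical four-letter alphabet with an arbitrary symmetric 0/1 acceptability matrix, since the 20 × 20 pattern derived from the Dayhoff matrix is not reproduced in this paper; the starting values M_i(0) = 1 are likewise an assumption.

```python
import math

# HYPOTHETICAL acceptability matrix for a 4-letter alphabet (symmetric, 0/1):
# A[i][j] = 1 if replacing letter i by letter j is acceptable.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]


def step(m, a=A):
    """One application of M_i(T+1) = sum_j a_ij * M_j(T)."""
    return [sum(a_ij * m_j for a_ij, m_j in zip(row, m)) for row in a]


def log_growth_rates(t_max, a=A):
    """Return ln(sum_i M_i(T)) / T for T = 1..t_max, starting from M_i(0) = 1."""
    m = [1] * len(a)
    rates = []
    for t in range(1, t_max + 1):
        m = step(m, a)  # exact integer arithmetic, no overflow in Python
        rates.append(math.log(sum(m)) / t)
    return rates
```

For this toy matrix the values returned by `log_growth_rates(200)` settle towards the logarithm of the largest eigenvalue of `A`, which is the behavior the argument in the text relies on; the sum over i is the upper bound on M(T) quoted there.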
Consider now the restriction that the generated sequences have a specified composition. Then after $T$ replacements the number of possible sequences of length $N$ is

$$M(N,T) = P(N)\,M(T) = \exp(-N\,\Delta S)\,M(T) \qquad (11)$$

The information gained by selecting a successor protein of given composition after $T$ PAMs is then $\ln M(N,T)$. $M(N,T)$ may be considered the total number of proteins of given composition that a source (mutations which generate different proteins) could produce in $T$ steps. The capability of the source to generate information would then (according to Shannon and Weaver, 1949) be measured by the quantity information per PAM, i.e.,

$$\lim_{T\to\infty} \frac{1}{T} \ln M(N,T) = \lim_{T\to\infty} \left( -\frac{N}{T}\,\Delta S + \ln\mu \right),$$

called channel capacity in information theory. Hence, if $M(N,T)$ is to increase exponentially with $T$ it is necessary that $\Delta S$ is at most a linearly increasing function of $T$. A linear increment in the compositional information content per PAM is the highest possible gain consistent with an exponential growth in the number of successor


sequences. Any higher information gain per PAM (e.g., $\Delta S \propto T^2$) will eventually cut down the number of successor sequences, since then

$$\lim_{T\to\infty} \frac{M(N,T)}{T} = 0.$$

In other words, by choosing some very special compositions (with high $\Delta S$ values) one very soon runs the risk of dealing with very rare sequences.

Conclusions

The composition of naturally occurring proteins has been used in an attempt to infer a seemingly universal feature of molecular evolution. It is suggested that evolution (as reflected in protein composition) follows a path determined by two counteracting factors. The first is to maximize the (compositional) information gain per amino acid replacement with a tendency for more specific compositions. Counteracting this tendency is the fact that very specific compositions can be realized by too few sequences. Proteins generated along less efficient evolutionary pathways thus are not likely to be found in nature since at some point of such a pathway the chance of a PAM to suit some selective pressure on this protein will be very small, resulting in extinction or excessive delay in the evolutionary progress of the whole species.

In a technical, information theoretic language the evolutionary process apparently tends to maximize its capacity (i.e., the number of sequences to select from), subject to restrictions on the composition. These restrictions are undoubtedly related to the functional requirements of the protein and may differ from one family to another (as borne out by the different increments of the compositional information content per PAM of the different families). It thus appears that it is the inbuilt strategy of evolution to keep the number of potentially significant sequences per mutational step maximal, subject to compositional restraints due to biological function.
We have demonstrated that despite these restraints there is still an exponential range of choices at each stage of evolutionary development and that the measure of this range (the entropy deficiency $\Delta S$ per PAM, cf. Fig. 1) is characteristic of each family of proteins. It should, however, be borne in mind that no conclusion can be drawn regarding the nature of the selective processes. In particular, it is impossible to say whether selection has taken place after each PAM or only after a certain number of PAMs. Our findings make it appear possible that the gross tendencies of molecular evolution for the natural proteins can be stated as follows: a naturally occurring protein is most likely to be produced via an evolutionary path characterized by maximal channel capacity.
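The growth-rate argument behind this statement can be illustrated numerically. The following is a minimal sketch and not part of the original analysis: it assumes the large-$T$ form $\ln M(N,T) \approx -N\,\Delta S(T) + T \ln\mu$, and the chain length `N`, the value `LN_MU`, and the coefficients in `linear` and `quadratic` are hypothetical numbers chosen only for illustration.

```python
import math

N = 100                  # hypothetical chain length
LN_MU = math.log(2.5)    # hypothetical value of ln(mu)

def rate_per_pam(t, delta_s):
    """Return (1/t) ln M(N,t), assuming ln M(N,t) = -N*delta_s(t) + t*ln(mu)."""
    return (-N * delta_s(t) + t * LN_MU) / t

def linear(t):
    """Entropy deficiency growing linearly with the number of PAMs."""
    return 1e-3 * t

def quadratic(t):
    """Entropy deficiency growing quadratically with the number of PAMs."""
    return 1e-5 * t * t

for t in (10, 100, 1000, 10000):
    print(t, round(rate_per_pam(t, linear), 4), round(rate_per_pam(t, quadratic), 4))
```

Under these assumptions the linear case gives the same information per PAM at every $T$, whereas the quadratic case gives a rate that falls steadily and eventually turns negative, i.e., the number of successor sequences shrinks.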

Acknowledgement. This work was supported by the Stiftung Volkswagenwerk and the Deutsche Forschungsgemeinschaft. One of us (G.L.H.) is indebted to Professor M. Eigen for a fruitful discussion on this topic. We profited greatly from a very intense dialogue and many useful suggestions rendered by the editor of this journal.


References

Ben-Shaul, A., Hofacker, G.L. (1976). Statistical and Dynamical Models of Population Inversion. In: Handbook of Chemical Lasers, R.W.F. Gross and J.F. Bott, eds., pp. 579-617. New York: Wiley
Berg, A. van den, Beintema, J.J. (1975). Nature 253, 207-210
Berg, A. van den, Hende-Timmer, L. van den, Beintema, J.J. (1976). BBA 453, 400-409
Berg, A. van den, Hende-Timmer, L. van den, Hofsteenge, J., Gaastra, W., Beintema, J.J. (1977). Eur. J. Biochem. 75, 91-100
Bogardt, R.A., Dwulet, F.E., Lehman, L.D., Jones, B.N., Gurd, F.R.N. (1976). Biochemistry 15, 2597-2602
Castillo, O., Lehmann, H. (1977). BBA 492, 232-236
Chirpich, T.P. (1975). Science 188, 1022-1023
Dayhoff, M.O., Dayhoff, R.E., Hunt, L.T. (1976). Compositions of Proteins. In: Atlas of Protein Sequence and Structure, M.O. Dayhoff, ed., Vol. 5, Suppl. 2, p. 301. Silver Spring, Maryland: National Biomedical Research Society
Dayhoff, M.O., Eck, E.V., Park, C.M. (1972). A Model of Evolutionary Change in Proteins. In: Atlas of Protein Sequence and Structure, M.O. Dayhoff, ed., Vol. 5, pp. 89-99. Silver Spring, Maryland: National Biomedical Research Foundation
Deconinck, M., Peiffer, S., Depreter, J., Paul, C., Schnek, A.G., Leonis, J. (1975). BBA 386, 567-575
Dumur, V., Dautrevaux, M., Hart, K. (1976). BBA 427, 759-761
Dwulet, F.E., Bogardt, R.A., Jones, B.N., Lehman, L.D. (1975). Biochemistry 14, 5336-5443
Dwulet, F.E., Jones, B.N., Lehman, L.D., Gurd, F.R.N. (1977). Biochemistry 16, 873-876
Emmens, M., Welling, G.W., Beintema, J.J. (1976). Biochem. J. 157, 317-323
Finkelstein, A.V., Ptitsyn, O.B. (1971). J. Mol. Biol. 62, 613-624
Gatlin, L.L. (1972). Information Theory and the Living System. New York: Columbia Univ. Press
Gatlin, L.L. (1974). J. Mol. Evol. 3, 189-208
Gatlin, L.L. (1976). J. Mol. Evol. 7, 185-195
Groen, G., Welling, G.W., Beintema, J.J. (1975). FEBS Lett. 60, 300-304
Hartley, R.V. (1928). Bell System Techn. J. 7, 535-563
Hase, T., Ohmiya, M., Matsubara, H., Mullinger, R.N., Rao, K.K., Hall, D.O. (1976a). Biochem. J. 159, 55-63
Hase, T., Wada, K., Ohmiya, M., Matsubara, H. (1976b). J. Biochem. 80, 993-999
Hase, T., Wakabayshi, S., Matsubara, H., Kerscher, L., Oesterhelt, D., Rao, K.K., Hall, D.O. (1977). FEBS Lett. 77, 308-310
Hasegawa, M., Yano, T.A. (1975). Origins of Life 6, 219-227
Hermann, J., Jollès, J., Buss, D.H., Jollès, P. (1973). J. Mol. Biol. 79, 587-595
Holmquist, R. (1975). J. Mol. Evol. 4, 277-306
Holmquist, R., Moise, H. (1975). J. Mol. Evol. 6, 1-14
Hopfinger, A.J. (1972). Currents in Mod. Biol. Biosystems 5, 38-42
Jollès, J., Schoentgen, F., Jollès, P., Prager, E.M., Wilson, A.C. (1976). J. Mol. Evol. 8, 59-78


Jones, B.N., Vigna, R.A., Dwulet, F.E., Bogardt, R.A., Lehman, L.D., Gurd, F.R.N. (1976). Biochemistry 15, 4418-4422
Jong, W.W. de, Gleaves, J.T., Boulter, D. (1977). J. Mol. Evol. 10, 123-135
Kimura, M. (1968). Nature 217, 624-626
King, J.L., Jukes, T.H. (1969). Science 164, 788-798
Kullback, S. (1959). Information Theory and Statistics. New York: Wiley
Lehman, L.D., Dwulet, F.E., Bogardt, R.A., Jones, B.N., Gurd, F.R.N. (1977). Biochemistry 16, 706-709
Levine, R.D., Ben-Shaul, A. (1977). In: Chemical and Biochemical Applications of Lasers, C.B. Moore, ed., Vol. II, pp. 145-197. London: Academic Press
Levine, R.D., Bernstein, R.B. (1974). Accts. Chem. Res. 7, 393-400
Lin, K.D., Kim, V.K., Chernoff, A.J. (1976). Biochem. Genet. 14, 427-440
Lyddiat, A., Boulter, D. (1976a). FEBS Lett. 62, 85-88
Lyddiat, A., Boulter, D. (1976b). FEBS Lett. 67, 331-334
Lyddiat, A., Boulter, D. (1977). Biochem. J. 163, 333-338
Martinez, G., Rochat, H., Ducet, G. (1974). FEBS Lett. 47, 212-217
Matsuda, G., Maita, T., Watanabe, B., Araya, A., Morokuma, K., Goodman, M., Prychodko, W. (1973). Hoppe-Seyler's Z. Physiol. Chem. 354, 1153-1155
Nagano, K. (1973). J. Mol. Biol. 75, 401-420
Ohta, T., Kimura, M. (1971a). J. Mol. Evol. 1, 18-25
Ohta, T., Kimura, M. (1971b). Science 174, 150-153
Reichert, T.A., Cohen, D.N., Wong, A.K.C. (1973). J. Theor. Biol. 42, 245-261
Renyi, A. (1962). Wahrscheinlichkeitsrechnung. Berlin-Ost: Deutscher Verlag der Wissenschaften (in English 1970: Probability Theory, Amsterdam: North-Holland)
Romero-Herrera, A.E., Lehmann, H. (1974). BBA 359, 236-241
Romero-Herrera, A.E., Lehmann, H., Fakes, W. (1975a). BBA 379, 13-21
Romero-Herrera, A.E., Lehmann, H. (1975b). BBA 393, 205-214
Romero-Herrera, A.E., Lehmann, H. (1975c). BBA 400, 387-398
Romero-Herrera, A.E., Lehmann, H., Castillo, O. (1976a). BBA 420, 387-396
Romero-Herrera, A.E., Lehmann, H., Castillo, O. (1976b). BBA 439, 51-54
Rousseaux, J., Dautrevaux, M., Han, K. (1976). BBA 439, 55-62
Schlögl, F. (1975). Z. Physik B 20, 177-184
Shannon, C.E., Weaver, W. (1949). The Mathematical Theory of Communication. Urbana: Univ. of Illinois Press
Smith, T.F. (1969). Math. Biosciences 4, 179-187
Stenzel, P. (1974). Nature 252, 62-63
Takei, H., Ota, Y., Wu, K.C., Kiyohara, T., Matsuda, G. (1975). J. Biochem. 77, 1345-1347
Tanaka, M., Haniu, M., Yasunobu, K.T., Evans, M.C.W., Rao, K.K. (1975). Biochemistry 14, 1938-1943
Tentori, L., Vivaldi, G., Carta, S., Marinucci, M., Massa, A., Antonini, N.E., Brunori, M. (1973). I. J. Pept. Prot. Res. 5, 187-200
Vogel, H. (1975). J. Mol. Evol. 6, 271-283
Vogel, H., Zuckerkandl, E. (1971). In: Molecular Evolution, E. Schofeniels, ed., Vol. II, pp. 352-365
Wada, K., Hase, T., Tokunaga, H., Matsubara, H. (1975). FEBS Lett. 55, 102-104


Welling, G.W., Groen, G., Beintema, J.J. (1975). Biochem. J. 147, 505-511

Received August 10, 1978
