J. Mol. Evol. 10, 309--317 (1978)

Journal of Molecular Evolution © by Springer-Verlag 1978

The Evolution of Histones

Gerald R. Reeck¹, Eric Swanson² and David C. Teller²

¹ Department of Biochemistry, Kansas State University, Manhattan, Kansas 66506, USA
² Department of Biochemistry, University of Washington, Seattle, Washington 98195, USA

Summary. The amino acid sequences of bovine histones H2A, H2B, H3, and H4 and the first 107 residues of rabbit thymus histone H1 were examined using newly developed procedures designed to detect and evaluate weak similarities (de Hahn et al., 1976). Using the McLachlan scoring system, regions of statistically significant similarity were found between several pairs of the four smallest histones. The probability that this set of similarities could result simply from chance was estimated to be less than $10^{-5}$. No similarity was found between the H1 sequence and the other histones. The results are interpreted to indicate that at least the C-terminal portions of the core histones evolved from a common ancestral protein.

Key words: Histones -- Chromosome structure -- Monte Carlo methods -- Sequence homology

Introduction

The five major histones share several fundamental functional and structural properties: as the major structural proteins of the eukaryotic chromosome, they function in concert to compact DNA to a mass/length ratio seven times that of protein-free DNA (Griffith, 1975); four of the histones have rather similar, low molecular weights of 11,200 (H4), 13,800 (H2B), 14,000 (H2A), and 15,300 (H3); between 20 and 30% of the amino acid residues in each histone are lysine or arginine; and all the histones lack tryptophan. Because of these similarities, one might reasonably suspect that these proteins have evolved from a common ancestral protein by a process of gene duplication and divergent evolution, as has now been proposed for many families of proteins (Dayhoff, 1972).

A common evolutionary origin of the histones might still be reflected in similarities in their amino acid sequences, but simple visual inspection of these sequences reveals no long segments of clear-cut similarity. Thus, searching for sequence homology and assessing the statistical significance of any proposed alignments between histones is clearly a task requiring the computational power and objectivity of a computer. To begin to understand the evolution of these proteins we have conducted thorough and rigorous comparisons of the amino acid sequences of calf thymus histones H2A, H2B, H3 and H4, and the first 107 residues of rabbit thymus histone H1.

Experimental Procedures

Sequence Data

We have used the amino acid sequences of bovine histones H2A, H2B, H3, and H4 as determined by Yeoman et al. (1972), Iwai et al. (1975), DeLange et al. (1972), and DeLange et al. (1969), respectively, and the sequence of the first 107 residues of rabbit thymus histone H1 as reported by Jones et al. (1974).

Search Procedures

In this study we have relied in large part on the methods developed and applied by de Hahn et al. (1976) in an examination of other distantly related protein sequences. To quantitate comparisons between sequences we used the scoring systems of McLachlan (1971) and Dayhoff (1972, page 103), which are based on the frequencies with which amino acid replacements have occurred in proteins known to be homologous.

The starting point for the comparison of two sequences of lengths $K$ and $L$ is the computation of the comparison matrix, $M$, of dimensions $K \times L$. Each entry $M_{i,j}$ is the score for the comparison of the $i$th amino acid of one sequence with the $j$th amino acid of the second sequence (McLachlan, 1971). A (partial) sequence homology shows up in the matrix as a series of unusually high scores along a diagonal, but because of deletion or insertion mutations a complete alignment may consist of portions of several different diagonals. The value of a diagonal stretch or alignment is defined as the sum of all entries along the stretch(es).

To find possible sequence homologies, we used the initial search procedure described by de Hahn et al. (1976). This procedure examines all diagonal stretches of specified length (9 to 30 residues), produces a distribution histogram of the roughly 10,000 values calculated from McLachlan scores, and identifies the 20 stretches having the highest values. In the comparison of any two sequences, we selected for further examination the alignments implied by the diagonal stretches with the highest values. The extent of similarity between histone sequences was not adequate to produce convincing deviations from the random distributions of values obtained by comparison of unrelated sequences. We therefore relied on the Monte Carlo methods described below to assess the statistical significance of alignments between histones.

In several sequence comparisons, portions of more than one diagonal were identified in the initial search procedure as having possibly significant values, thus requiring the introduction of gaps into the alignments. In such cases we have found the optimum alignment between sequences using an adaptation of Sankoff's matrix algorithm (Sankoff, 1972) as applied by de Hahn et al. (1976) in protein sequence comparisons. The Sankoff procedure finds the alignment of maximum value while operating under a specified constraint on the number of gaps, and thereby provides the value of an alignment as a function of the allowed number of gaps.

In addition to examining all pairs of histone sequences for similarities, each individual sequence was examined for possible internal homology by comparison with itself, using the initial search procedure and excluding the identity alignment.
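
The comparison matrix and the scan over diagonal stretches can be illustrated with a minimal sketch. It is not the authors' program: the function names are hypothetical, and `identity_score` is a toy stand-in (1 for identical residues, 0 otherwise) for the McLachlan or Dayhoff tables, which are not reproduced here. Gapped alignments (the Sankoff step) are not covered.

```python
from typing import Callable, List, Tuple


def identity_score(a: str, b: str) -> int:
    """Toy substitution score (assumption): NOT the McLachlan or Dayhoff table."""
    return 1 if a == b else 0


def comparison_matrix(
    s1: str, s2: str, score: Callable[[str, str], int] = identity_score
) -> List[List[int]]:
    """K x L matrix whose entry [i][j] scores residue i of s1 against residue j of s2."""
    return [[score(a, b) for b in s2] for a in s1]


def diagonal_stretches(m: List[List[int]], length: int) -> List[Tuple[int, int, int]]:
    """Return (value, i, j) for every ungapped diagonal stretch of the given length.

    The value is the sum of the matrix entries along the stretch; i and j are
    the 0-based start positions in the first and second sequence.
    """
    k, l = len(m), len(m[0])
    out = []
    for i in range(k - length + 1):
        for j in range(l - length + 1):
            value = sum(m[i + t][j + t] for t in range(length))
            out.append((value, i, j))
    return out


def top_stretches(
    s1: str,
    s2: str,
    length: int,
    n_top: int = 20,
    score: Callable[[str, str], int] = identity_score,
) -> List[Tuple[int, int, int]]:
    """The n_top highest-valued diagonal stretches, best first."""
    m = comparison_matrix(s1, s2, score)
    return sorted(diagonal_stretches(m, length), key=lambda t: -t[0])[:n_top]
```

For example, `top_stretches("ABCDEFG", "XCDEFXX", 4, n_top=1)` picks out the stretch pairing `CDEF` in both toy sequences. Substituting a real substitution table for `identity_score` would be the natural next step.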

Statistical Tests

In estimating the statistical significance of alignments, we have used Monte Carlo techniques, the essence of which is the comparison of each alignment's value with values calculated from computer-generated randomized sequences having the same amino acid compositions as the histones in which the proposed alignment occurs. As a screening procedure, the value of each alignment selected from the initial search was compared to the values of all possible alignments of equal length from 20 pairs of randomized sequences. The distributions of all values and of the extreme values from the randomized sequences consistently showed positive skew and kurtosis. These deviations from Gaussian behavior precluded calculation of probabilities (corresponding to values of actual alignments) from the means and standard deviations of values from randomizations.

If the value of a proposed alignment was exceeded by the values of no more than 10 of the alignments from the randomized sequences, the significance of the proposed alignment was rigorously assessed with a second Monte Carlo approach (Sankoff and Cedergren, 1973) that also determined the optimum number of gaps. For each proposed alignment we tested the null hypothesis that the value of the alignment resulted from chance similarity between the sequences. In each test, 100 pairs of random sequences were generated. These sequences were of the same length and amino acid composition as the histones from which the alignment to be tested was taken. For each pair of randomized sequences, the value of the optimal alignment (having the same length as the alignment from the actual sequences) was calculated using both McLachlan and Dayhoff scores. Thus, for each pair of sequences and for each allowed number of gaps (in those alignments requiring gaps) a set of 100 values for each scoring system was obtained from the 100 pairs of randomized sequences.
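
The randomization test described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the names are hypothetical, `identity_score` is a toy replacement for the McLachlan and Dayhoff tables, and only ungapped alignments are considered, whereas the original procedure also optimized over gaps.

```python
import random
from typing import Callable


def identity_score(a: str, b: str) -> int:
    """Toy score (assumption): 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0


def best_stretch_value(
    s1: str, s2: str, length: int, score: Callable[[str, str], int] = identity_score
) -> int:
    """Highest value among all ungapped diagonal stretches of the given length."""
    if length > min(len(s1), len(s2)):
        raise ValueError("stretch length exceeds a sequence length")
    return max(
        sum(score(s1[i + t], s2[j + t]) for t in range(length))
        for i in range(len(s1) - length + 1)
        for j in range(len(s2) - length + 1)
    )


def shuffled(seq: str, rng: random.Random) -> str:
    """Random permutation of seq: same length and same amino acid composition."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def count_exceeding(
    s1: str,
    s2: str,
    observed: int,
    length: int,
    n_random: int = 100,
    seed: int = 0,
    score: Callable[[str, str], int] = identity_score,
) -> int:
    """Number of randomized pairs whose best stretch value exceeds `observed`."""
    rng = random.Random(seed)
    n = 0
    for _ in range(n_random):
        value = best_stretch_value(shuffled(s1, rng), shuffled(s2, rng), length, score)
        if value > observed:
            n += 1
    return n
```

Shuffling each sequence independently preserves its length and composition, which is the property the null hypothesis relies on. The count returned by `count_exceeding` corresponds to the number of randomizations that beat the actual alignment.
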
Using the result of Birnbaum (1973) we have calculated $\alpha$, the level of significance of $n$, the number of values from the 100 randomizations that exceeded the value of the proposed alignment. $\alpha$ is simply the probability of finding $n$ or fewer values from the randomized sequences that are greater than the value of the actual sequences. Hence, the smaller $\alpha$ is, the less likely it is that the similarity observed between the actual sequences is due to chance. The optimum number of gaps is that number for which an alignment has the lowest level of significance.

To evaluate the significance of a proposed internal alignment in H2A that appeared to reflect a gene duplication, we have devised an appropriate Monte Carlo test that differs slightly from a test of an alignment between two separate proteins. The initial search procedure suggested an alignment of the first 60 positions with residues 61-120, leaving residues 121-129 unaligned. For each of 100 sequence randomizations, the value of the best alignment was found with the constraint that 60 positions must align with 60 nonoverlapping positions. No gaps were allowed within each 60-residue segment, but unaligned residues between the segments were allowed. For this 129-residue sequence there are 55 such internal alignments. The value of the proposed alignment was compared with the set of 100 highest values from the randomized sequences to determine $\alpha$.
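
Two quantities in this passage can be checked with a short sketch. The enumeration of internal alignments follows directly from the constraints stated above. The significance formula, by contrast, is an assumption: the exact expression from Birnbaum (1973) is not given in the text, so the common rank-based estimate $(n + 1)/(N + 1)$ is used here as a stand-in. Function names are hypothetical.

```python
from typing import List, Tuple


def significance_level(n_exceeding: int, n_random: int) -> float:
    """Rank-based Monte Carlo estimate (n + 1) / (N + 1).

    Assumption: this is a standard estimate, not necessarily the exact
    formula taken from Birnbaum (1973).
    """
    return (n_exceeding + 1) / (n_random + 1)


def internal_alignments(seq_len: int, seg_len: int) -> List[Tuple[int, int]]:
    """All (start1, start2) pairs, 1-based, of two nonoverlapping ungapped
    segments of length seg_len within a sequence of length seq_len."""
    last_start = seq_len - seg_len + 1
    return [
        (i, j)
        for i in range(1, last_start + 1)
        for j in range(i + seg_len, last_start + 1)
    ]


if __name__ == "__main__":
    pairs = internal_alignments(129, 60)
    print(len(pairs))  # number of candidate internal alignments
    print(round(significance_level(0, 100), 2))
```

With a 129-residue sequence and 60-residue segments the enumeration yields 55 candidates, matching the count in the text, and the pair `(1, 61)` is the proposed alignment of residues 1-60 with 61-120. Under the assumed formula, zero exceedances in 100 randomizations gives a level of about 0.01.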

Results

Our examination of all pairs of histone sequences revealed four alignments in the C-terminal regions that are statistically significant when examined with the McLachlan scoring system (Table 1). For three of these alignments (H3--H4, H2B--H3, H2A--H2B) we obtained the rather dramatic result that not one of the roughly one million alignments examined in 100 pairs of randomized sequences had a value that exceeded that of the actual alignment. It should be pointed out that the probability of 0.01 assigned to each of these alignments is the lowest possible level of significance when 100 randomizations are used (Birnbaum, 1973); if more randomizations had been performed it is possible that a substantially lower level of significance would have been observed.

The search for sequence similarities in the N-terminal regions is complicated by the unusual amino acid compositions of these portions of the histones. Between 10 and 15 of the first 40 residues of each protein are lysine or arginine, and glycine and alanine also occur with unusually high frequencies. Olson et al. (1972) have raised the possibility that sequence similarities in the N-terminal regions of histones could be indicative of divergence from a common ancestral sequence; but, as these authors recognized, another distinct possibility is that the DNA-binding function of histones requires an abundance of basic residues in the amino-terminal regions and that the histones may have converged from unrelated proteins to the current arginine- and lysine-rich sequences. Extreme compositional similarity in restricted regions of larger sequences will necessarily result in sequence similarity in those regions. Because of the plausibility of convergence to compositional similarity, sequence similarities in the N-terminal regions of histones cannot be safely ascribed to divergence from a common ancestral sequence.

Of course, the unusual compositions do not argue against a possible common ancestry; they just obscure the interpretation of the results of a search. An additional result of the unusual compositions of the N-terminal regions of histones is that several inconsistent alignments frequently appear in the initial searches. The most convincing long stretch of similarity we found in the N-terminal regions was between H2A and H4 (Table 1). Dayhoff (1973) has reported that the amino-terminal 60-residue segments of these two proteins are distantly related, but did not propose an alignment. Temussi (1975) presented an alignment (without statistical analysis) that begins with residue 4 in H4 paired with residue 1 of H2A and that extends without gaps to the end of H4. In agreement with Dayhoff (1973) we find that the only possibly significant similarity is in the N-terminal halves of these two proteins. After taking advantage of the six identities resulting from pairing the first seven residues of each protein, our alignment agrees with that of Temussi (1975) but extends only through residue 58 of H4. It is interesting to note that the C-terminal extension of Temussi's alignment, although not statistically significant, is implied by alignments in Table 1. For two reasons, we view the alignment of the N-terminal halves of H2A and H4 with some reservation: first, it has a level of significance of only 0.06, and second, there are hazards, discussed above, in claiming significance for alignments in the N-terminal regions of histones.

Each histone sequence was examined for possible internal homology that might have resulted from gene duplication in a manner analogous to that proposed for
haptoglobin, ferredoxin, and immunoglobulins (Dayhoff, 1972). Only in the case of H2A was any internal similarity found (Table 1). The Sankoff procedure demonstrated that the optimal alignment contained no gaps and paired residues 1-60 with residues 61-120. Using McLachlan scores and the randomization procedure described in Methods, we found the level of significance of this alignment to be 0.05, which is low enough to warrant serious consideration of its having resulted from gene duplication but not low enough to establish this unequivocally. The same alignment of the trout sequence (Bailey and Dixon, 1973) shows one additional identity, and in those positions not included in the alignment (121-129) the trout and bovine sequences are altogether different.

In Figure 1 we have aligned the C-terminal portions of H2A, H2B, H3, and H4 as dictated by the results given in Table 1. Several highly conserved positions are evident in the figure, which also includes the first 60 residues of H2A aligned with residues 61-120 of the same protein. The likelihood that the similarity between the two halves of H2A is not simply due to chance is increased by the fact that several of the identities in the internal alignment extend into the C-terminal sequences of the other histones.

The values of all of the alignments of Table 1 were exceeded by more values from randomizations when Dayhoff scores were used instead of McLachlan scores. Indeed, for 3 of the C-terminal alignments the similarities did not appear statistically significant when the Dayhoff system was used. There are, however, good reasons, which we discuss below, to believe that McLachlan scores are more appropriate than Dayhoff scores for the comparison of histone sequences.

Discussion

In the initial part of the discussion we will confine our attention to the core histones, H2A, H2B, H3, and H4. Our analysis of these sequences with McLachlan scores has shown that several pairs of C-terminal segments of the four proteins exhibit statistically significant similarities. A Monte Carlo estimation of the probability that the set of four C-terminal alignments of Table 1 could simply be a result of chance would require an unreasonable amount of computer time, and this joint probability, which we will call $P_4$, cannot be rigorously calculated. It is nevertheless easy to see that $P_4$ is orders of magnitude lower than 0.01.

Chance similarities in H3--H4 and H2A--H2B would be independent in a statistical sense. We can therefore calculate the probability, $P_2$, of both these alignments being due to chance as: $P_2 = (3)(0.01)(0.01) = 3 \times 10^{-4}$, where the factor of 3 reflects the fact that there are 3 ways of forming two pairs of independent alignments among the four core histones. Clearly $P_4 < P_2$, since the calculation of $P_2$ ignores the H2B--H3 and H2A--H3 similarities.

Although chance similarities between H2B and H3 would not be strictly independent from an H3--H4 alignment, they would probably behave nearly so. We can therefore approximate $P_3$, the probability of the H3--H4, H2A--H2B, and H3--H2B alignments all being due to chance, as: $P_3 \approx (15)(0.01)(0.01)(0.01) = 1.5 \times 10^{-5}$, where the factor of 15 is included because there are 15 ways of forming 3 pairs of sequences, in any triple of which no pair is composed of 2 proteins that are both paired with a common protein. (For example, the three pairs, H3--H2B, H3--H2A, and H2A--H2B, violate this condition.) Given similarities in H3--H4, H2A--H2B, and H2B--H3, a chance similarity in H2A--H3 would clearly not be independent, since both H3 and H2A would show similarity to H2B. This prohibits us from estimating $P_4$ directly. Nonetheless, $P_4$
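
The arithmetic behind $P_2$ and $P_3$ can be reproduced with a small sketch. The enumeration confirms the factor of 3 (the ways of splitting four histones into two disjoint pairs). The factor of 15 is taken from the text as given and is not derived here. The function names are hypothetical, and 0.01 is the per-alignment significance level reported in the Results.

```python
from itertools import combinations
from typing import List, Tuple

CORE_HISTONES = ["H2A", "H2B", "H3", "H4"]
Pair = Tuple[str, str]


def disjoint_pairings(names: List[str]) -> List[Tuple[Pair, Pair]]:
    """All ways of choosing two pairs of sequences that share no member."""
    pairs = list(combinations(names, 2))
    return [(p, q) for p, q in combinations(pairs, 2) if not set(p) & set(q)]


def joint_probability(n_ways: int, alpha: float, n_alignments: int) -> float:
    """n_ways * alpha ** n_alignments, treating the alignments as independent."""
    return n_ways * alpha ** n_alignments


if __name__ == "__main__":
    ways = disjoint_pairings(CORE_HISTONES)
    p2 = joint_probability(len(ways), 0.01, 2)
    # The factor 15 is quoted from the text, not computed (assumption).
    p3 = joint_probability(15, 0.01, 3)
    print(len(ways), p2, p3)
```

One of the three pairings returned is (H2A, H2B) with (H3, H4), the combination used in the text for $P_2$.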
