Cell, Vol. 18, 875-882, November 1979, Copyright © 1979 by MIT

The Complete Sequence of a Chromosomal Mouse α-Globin Gene Reveals Elements Conserved throughout Vertebrate Evolution

Yutaka Nishioka and Philip Leder
Laboratory of Molecular Genetics
National Institute of Child Health and Human Development
Bethesda, Maryland 20205

Summary

The mammalian α- and β-globin genes are thought to have evolved from a common ancestral sequence by a duplication event that occurred over 500 million years ago. We have now determined the entire nucleotide sequence of a cloned mouse α-globin gene, including regions that flank and interrupt the coding sequence, and have compared this sequence with the sequences of the two mouse β-globin genes (Konkel, Tilghman and Leder, 1978; Konkel, Maizel and Leder, 1979). Like the two β genes, the α gene is interrupted by two intervening sequences at precisely homologous positions, suggesting that these interruptions were present and have been preserved throughout vertebrate evolution. While the α and β genes conserve considerable (~55%) sequence homology in their coding regions, this homology, with certain interesting exceptions, is lost in the highly divergent flanking and intervening sequences. These exceptions are short preserved sequences positioned in such a way that they might encode signals for transcriptional initiation, poly(A) addition and RNA splicing. Furthermore, a comparison of the recently diverged β genes and the long separate α gene allows us to distinguish two clearly different modes of nucleotide sequence change in evolution: a fast mode which is characterized by drastic sequence alterations involving deletions and insertions, and a slow mode which preserves sequence homology to a large extent and involves mainly point mutations.

Introduction

The BALB/c mouse has at least seven nonallelic globin genes that serve well as a paradigm for the study of gene evolution and regulation at the molecular level. These genes (two adult α, two adult β and three embryonic) were apparently derived from a common ancestral sequence by a series of duplication events that occurred early in vertebrate evolution (Dayhoff, 1972).

The two mouse β-globin genes have been cloned (Tilghman et al., 1977; Leder et al., 1978; Tiemeier et al., 1978) and sequenced (Konkel et al., 1978, 1979). They diverged late in vertebrate evolution, about 50 million years ago, and have preserved a certain degree of homology in their organization and in their coding, flanking and intervening sequences. An α-globin gene has also been cloned (Leder et al., 1978), and electron microscopic analysis has revealed that it, like the β genes, is also interrupted by two intervening sequences of DNA. Since the α and β genes diverged over 500 million years ago, they provide us with an opportunity to compare genes that have been physically separate since the earliest period of vertebrate evolution. The fact that the α and β genes are expressed coordinately and have probably faced similar selective constraints provides us with an opportunity to identify those sequences that might regulate their expression. We have now determined the entire sequence of an α-globin gene, including its flanking and intervening sequences. The sequence data confirm our prediction regarding the preserved location of intervening sequences in vertebrate globin genes (Leder et al., 1978) and allow us to identify regions that might be involved in transcription initiation, poly(A) addition and RNA splicing. A comparison of the sequence with that of the β-globin genes also provides a basis for speculation regarding fast and slow modes of gene divergence.

Results and Discussion

DNA Sequencing

The 9.7 kb Eco RI fragment containing the α-globin gene was isolated from BALB/c mouse embryonic DNA and cloned in the EK2 vector λgtWES·λB as described (Leder et al., 1978). The Eco RI DNA fragment was further subcloned into the plasmid pBR322 (Bolivar et al., 1977) for sequencing, and a 3 kb Sac I fragment containing the gene was used for the construction of a restriction map (Figure 1). The DNA sequence was determined by the method of Maxam and Gilbert (1977) according to the strategy shown in Figure 1. The accuracy of the sequencing method, using thin gels and sequencing in both directions, was estimated to be >99% (Konkel et al., 1978). One of the complications accompanying this chemical degradation method is the aberrant reactivity and migration of certain modified bases in acrylamide gels. For example, the low reactivity of 5-methylcytosine to hydrazine (Ohmori, Tomizawa and Maxam, 1978) required us to predict Eco RII sites at one-base gaps that appeared on the sequencing gels and to confirm these by sequencing the opposite strand. There were seven Eco RII sites in the 1441 nucleotides determined. Because little is known about the behavior of other modified bases, and because neighboring bases apparently influence the reactivity of certain bases (Y. Nishioka, unpublished observation), sequencing in both directions appears to be essential when the region is of particular interest. As shown in Figure 1, most parts of the sequence were determined in both directions. Furthermore, wherever possible,


Figure 1. Strategy for Sequencing a Mouse α-Globin Gene

[Restriction map; scales: nucleotide position (0 to 1400) and amino acid position]

The gene fragment was sequenced by the technique of Maxam and Gilbert (1977) using the restriction sites indicated on the diagrammatic map of the mouse α-globin gene. The scale (in nucleotides) is shown on the top line. The solid area of the map represents coding sequences, and IVS1 and 2 represent the first and second intervening sequences (5’ to 3’). The kinased end of each fragment is indicated by the filled circle, and the extent and direction of sequencing are indicated by the arrows.

efforts were made to read the sequence in 20% gels, which gave better resolution and therefore a more reliable estimation of the spacing than did lower percentage gels. Only one base, at position 329 and scored as G in Figure 2, remains questionable. This base is so modified as to react as both G and C when sequenced in the anti-sense strand. It also resides in a large palindrome extending from position 296 to 340 (see Figure 4), making it very difficult to sequence in the sense strand.

Sequence and General Organization of the Mouse α-Globin Gene

The entire 1441 nucleotide sequence determined is shown in Figure 2. This sequence includes a 372 nucleotide flanking region to the 5’ side of the capping site (presumably the point of transcriptional initiation) and a 353 nucleotide flanking region to the 3’ side of the putative poly(A) addition site at nucleotide position 1190. The sequence identifies this gene as that corresponding to the adult α subunit since it encodes Ser at amino acid position 68 [the other α chain contains a Thr at this position (Popp, 1967)]. There is a minor discrepancy between the gene sequence and the published amino acid sequence. The gene encodes an Ala-Ala-Gly peptide at codon positions 69-71, whereas the published amino acid sequence is Ala-Gly-Ala (Popp, 1967). The gene sequence is otherwise in complete agreement with the amino acid sequence.

The most striking feature of the structure is that it is interrupted by two intervening sequences that divide the gene into three coding blocks. The sequence confirms both the interpretation of earlier electron microscopic analysis and the prediction that vertebrate globin genes would be interrupted by two intervening sequences that occur at exactly homologous positions within the coding sequence (Leder et al., 1978) (see Figure 6 for α/β alignment). Rabbit (van den Berg et al., 1978) and human (Lawn et al., 1978) β-globin genes conform to this rule and, since α and β genes are probably derived from a common, interrupted ancestral gene, it is reasonable to expect that all active vertebrate globin genes will be interrupted in these two positions.

As discussed below, putative transcriptional initiation sites and poly(A) addition sites can be identified within the sequence (at positions 373 and 1190, respectively). These lie 827 nucleotides from one another and fit well within the length of the mouse α-globin mRNA precursor (Ross and Knecht, 1978), suggesting that the two intervening sequences are transcribed and that processing involves their deletion and the covalent joining of internal segments of RNA. Ample support for such a mechanism comes from studies of the mouse β gene (Smith and Lingrel, 1978; Kinniburgh, Mertz and Ross, 1978; Tilghman et al., 1978).

The sequence and its interruptions contradict certain early generalizations regarding the position and length of intervening sequences. Unlike certain viral genes (Berget, Moore and Sharp, 1977; Kitchingman, Lai and Westphal, 1977; Chow et al., 1977) and the chicken ovalbumin gene (Breathnach et al., 1978; Dugaiczyk et al., 1978), there is no interruption in the 5’ leader sequence. It has also been noted that intervening sequences are, on the average, longer than the coding sequences they interrupt (Crick, 1979). The mouse α gene differs in that the length of the coding sequence is almost twice that of the sum of its two small intervening sequences (453 bp versus 256 bp). In fact, the second intervening sequence, which is over 625 bp in both adult mouse β-globin genes (Konkel et al., 1978, 1979), is only 135 bp in the α gene. Thus neither 5’ leader interruption nor length of intervening sequences (nor primary sequence; see below) seems to be an absolute requirement for expression of the globin genes.

In addition to the striking preservation of the location of the intervening sequences in the α- and β-globin genes, one other general feature of their structure is preserved. The three α coding segments have a relatively high and uniform GC content (55, 59 and 59%, respectively, 5’ to 3’). On the other hand, the first and second intervening sequences differ greatly in their


20 30 40 10 50 60 70 80 90 I I I I I I I I I GTAAGCAGGTTGTGGTTGAGAAAGGAAAGTGTGAdACAGGGACCCAGAGGGAGAGGTGGGGGGATGGCGClGCTCAGTTTGGTTTGhGGGACTTGCTTCT I I I I I I I I I CTGACCAAGSTAGGAGGATACTAACTTCTTCCC~AACTGCCATCACTG~AG~CATAGTAAGGGGTAAGAAAGTGTGTCCGGGCAACT~ATAAGGATTCCC I I I I I I I I I TGC,~CCTAGGGGAAGCACAACCCAGCCCCAGAATCTCA~GGGCCCTAAC.~AGTTTTACTGGGTAGAGCAAGCACAA~CCAGCCAAlGAGlA~ClGCTCCA I I I tataag I I I C’P I I I TGCTACTTGCTGCPGGTCCA4GACACTTCTGATTCTGACAGACTCAGGAAGA AGGGCGTGTCCACCCTGCCTGGAGGACAGCCCTTGGAGGGCATArAfG


AACCATGGTGCTCTCTGGGGAAGACAAAAGC.~ACATCAAGGCTGCCTGG~GGAAGATTGGTGGCCATGGTGCTGAATATGGAGCTGAAGCCCTGGAAAGG INIVa~L@uSerGly~luAspLysSerAsnIleLy~AlaAlaTrp~lyLy~Il~G~~~GlyHi~Gl 4laGluTyr~lyAlaGluA!aLeuGluArq Y I TGAGAAC~,GGACCTTGATCTGTAAGGATCACAGGATCCAATATGGACCTGGCACTCGCTC~GTGGGCAGCTlCTAACTATGCTTTTCTGTGACCTCAACT - 31 I I I I I I I I I TCTCTTCTCTCCTTCTCCCAGGATGTTTGCTAGCTTCCCCACCACCAAG~CCTACTTTCCTCACTTTGATGTA~GCCACGGCTCTGCCCAGGTC4AGGGT iZ-MetPheAtaSerPhePr~Tt~rThrTyrPhePiOHi sPhe4syValSsrHi ~~l~~SerAldG~~~\‘~lLy~Gly I CACGGCAAGAAGGTCGCCGATGCGCTGGCCAGTGCTGCAGGCCACCTCGATGACCTCCCCGGTGCCTTGlCTGCTCTGAGCGACCTGCATGCCCACAAGC H~sGlyLysLysValAlansp4laLeuAlaSerAlaAlaGlyHisLeuAspAspLeuProGlyAlaLeu~~rAlaLeuSerA~pLeuH~sAlaH~sLy~L I I I I I I I I I TGCGTGTGGATCCCGTCAACTTCAAGGTATGCGCTGGGACCTGGCAGGCGGCATCTGGGACCCCTAGGAAGGGCTTGGGGGTCCTCGTGCCCAAGGCAGG euArgValAspProValAsnPheLys99 GAACATAGTGGTCCCAGGAAGGGGAGCAGAGGCATCAGGGTGTCCACTTTGTCTCCGCAGCTCCTGAGCCACTGCCTGCTGGTGACCTTGGCTAGCCACC 100-LeuLeuSerHi sC\,sLeuLeuValTtlrLeuPlaSerHi ACCCTGCCGATTTCACCCCCGCGGTACATGCCTCTCTGGACAAATTCCTTGCCTCTGTGAGCACCGTGCTGACCTCCAAGTACCGTTAAGCTGCCTTCTG isProAlaAspPheThrProAlaValHisAlaSerLe~~AspLysPheLeu~laSerValSerTl~rValLeuThrSerLysTyrArqTtR-l4Z

I aat. I IpA.. I I I I I I CGGGGCTTGCCTTCTGGCCATGCCCTTCTTCTCTCCCTTGCACCTGTACCTCTTGGTCTTTGAATAA~GCCTGAGTAGG~AGA~GCCTGCATGCCTGGTT I I I I I I I I 1 CTCTGCGTCTGCAAAGGTG;CATGTTTAGTGTGGGG4TGGGGTGCCGCAGCTCATTTGCCATGGGGCAGTAAAGACAAGGTTC~GAGCAAAAAGCATAATTGGAT


GCCTACACACACACACATATGTCTTCTGAGTCTGGGCAGCAGTCCCTCCCAAGCCCTCCACTGACAGCCATGTGTCTTCTCCTCGAGCCAAAGAAGCCAA
AGATCGTCTTTGGAGGGTCCTTATCACAGGACCTCTGAGGG

Figure 2. Complete Sequence of the Mouse α-Globin Gene

The complete nucleotide sequence of the gene is shown, 100 nucleotides per line, scored at each tenth nucleotide. The amino acid translation of each coding block is indicated below each coding sequence. The amino acid codon borders of each coding region are numbered. The overscored sequences tataag, CAP, aataaa and pA refer to a potential transcriptional initiation signal, the cap site, a potential poly(A) addition site signal and site, respectively. A diagrammatic map is shown at the bottom of the figure.
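The GC-content comparison between coding blocks and intervening sequences discussed in the text can be reproduced with a few lines of code. The following is a minimal sketch and not the authors' procedure; the function name `gc_content` is hypothetical, and the example fragment is simply a short stretch of sequence taken from Figure 3, used here for illustration only.

```python
def gc_content(seq: str) -> float:
    """Return the percentage of G and C residues in a DNA sequence."""
    seq = seq.upper()
    # Count only unambiguous bases; gaps and other symbols are ignored.
    bases = [b for b in seq if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


if __name__ == "__main__":
    # Illustrative fragment only; a real analysis would use each coding
    # block and each intervening sequence in turn.
    fragment = "TGGAGGGCATATAAGTGCTAC"
    print(f"GC content: {gc_content(fragment):.1f}%")
```

Applied separately to the three coding blocks and the two intervening sequences, a function of this kind would yield the per-segment percentages that the text compares.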

GC content (47 and 65%, respectively). Analysis of the β major gene (Konkel et al., 1979) suggested that coding sequences represented areas of relatively high GC content, especially as compared to flanking and intervening sequences.

Recognition of a Possible Transcription Initiation Site and Signals

The 5’ portion of the α gene sequence is of obvious interest in that comparison with the β gene might reveal preserved sequences involved in the coordinate expression of these genes (Figure 3). Fortunately, the site at which the α-globin mRNA capping occurs can be identified readily within the gene sequence, since it is in perfect agreement with the mRNA sequence determined by Baralle and Brownlee (1978). The cap is added to the dinucleotide AC at positions 373-374. In the 16-20 nucleotides that follow the cap site there is a region of rough homology between α and β genes (shown in Figure 3) that gives way to a random association until the coding regions appear approximately 32 nucleotides in the 3’ direction. The “capping box” sequence (GTTGCTCCTCAC) noted previously in association with the cap site of the β-globin genes [and the cap site of a major adenovirus mRNA (Ziff and Evans, 1978; Akusjärvi and Pettersson, 1979)] is not well preserved in the α sequence, although a displaced pentanucleotide, TTGCT, is retained in each sequence (Figure 3).

Perhaps the most interesting feature of the sequence in this region is the hexanucleotide TATAAG that begins exactly 30 nucleotides from the capping (or putative transcriptional initiation) site of the globin mRNA sequence. This spacing assures the occurrence of the cap site and the hexanucleotide in the same orientation with respect to the double helical structure of the DNA. They are three helical turns removed from one another. That is, when the cap structure is accessible through the major groove of the helix, the hexanucleotide will also be accessible through the major groove on the same face of the helix. Obviously much more than this “Pribnow box”-like structure (Pribnow, 1975) may be involved in transcriptional initiation, but similar sequences have been noted in the Drosophila histone genes (M. Goldberg and D. Hogness, personal communication), in silk fibroin genes (Tsujimoto and Suzuki, 1979), in the chicken ovalbumin gene (Gannon et al., 1979) and in certain adenovirus initiation sites (Ziff and Evans, 1978; Akusjärvi and Pettersson, 1979). It seems reasonable to suggest that they have a role in RNA polymerase recognition and may be a critical feature of a promoter site. It should also be noted that there are two prominent


Figure 3. Comparison of the Proposed Transcription Initiation Sites of the Mouse Globin Genes

ALPHA  TGGAGGGCATATAAGTGCTAC (overscored: tataag)
BETA   CATTGGGTATATAAAGCTGAGCAGGGTCA (overscored: tataaa)

The 5’ portions of the α, β major (Konkel et al., 1978) and β minor (Konkel et al., 1979) genes are compared. Conserved homologies of the α and β sequences are indicated by vertical lines, and the regions of potential interest as transcription initiation signals are overwritten in lower case letters. The capped nucleotide is the (A) in the (CAP...) symbol. The preserved “capping box” pentanucleotide TTGCT is boxed.
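The spacing argument made in the text (a TATAA-type hexanucleotide beginning 30 nucleotides upstream of the cap site, that is, three helical turns away) can be illustrated with a short search routine. This is only a sketch under stated assumptions: the function names, the 40 bp search window, the motif pattern `TATAA[AG]` and the toy sequence are all hypothetical, and the helical repeat of 10 bp per turn is the value implied by the "three helical turns" statement rather than a measured quantity.

```python
import re

# Assumption: about 10 bp per helical turn, as implied by the text.
HELICAL_REPEAT_BP = 10


def find_upstream_motif(seq, cap_index, motif=r"TATAA[AG]", window=40):
    """Find motif occurrences in the `window` bp 5' of the cap site.

    Returns a list of (motif_start, distance_to_cap) tuples using
    0-based indices; the distance is measured from the first base of
    the motif to the capped nucleotide.
    """
    start = max(0, cap_index - window)
    upstream = seq[start:cap_index].upper()
    hits = []
    for match in re.finditer(motif, upstream):
        motif_start = start + match.start()
        hits.append((motif_start, cap_index - motif_start))
    return hits


def helical_turns(distance_bp, repeat=HELICAL_REPEAT_BP):
    """Convert a distance in base pairs to a number of helical turns."""
    return distance_bp / repeat


if __name__ == "__main__":
    # Hypothetical toy sequence (not the mouse gene): a TATAAG motif is
    # placed so that it begins 30 bp upstream of the capped "A".
    toy = "GC" * 5 + "TATAAG" + "C" * 24 + "ACTTCTG"
    cap = toy.index("ACTTCTG")
    for pos, dist in find_upstream_motif(toy, cap):
        print(f"motif at {pos}, {dist} bp from cap, "
              f"{helical_turns(dist):.1f} turns")
```

A whole number of turns places the motif and the cap site on the same face of the helix, which is the geometric point the text makes.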

palindromes close to the 5’ capping site (Figure 4). The smaller one, composed of 26 bases, occurs between the capping site and translation initiation site, and has been observed in many mRNAs including rabbit α- and β-globin (Efstratiadis, Kafatos and Maniatis, 1977; Heindell et al., 1978) and human α- and β-globin (Baralle, 1977; Chang et al., 1977). The larger palindrome occurs on the 5’ side of the TATAAG sequence and is 43 bases in length (positions 296 to 338) including the questionable base at position 329. Imperfect palindromes have been found in the 5’ control regions of many procaryotic genes such as phage λ (Maniatis et al., 1975), the lactose operon (Gilbert and Maxam, 1973), the galactose operon (Musso et al., 1977), the tryptophan operon (Bennett et al., 1976) and the biotin operon (Otsuka and Abelson, 1978). Notice, however, that the longer palindrome is not present in either β major or β minor genes (Konkel et al., 1978, 1979).

Preservation of Divergent Intervening Sequences and Splice Signals

As noted above, preservation of two intervening sequences interrupting both the α and β genes is consistent with the occurrence of these interruptions in the common ancestral globin gene. The continued presence of intervening sequences throughout vertebrate evolution suggests further that they confer some advantage upon the organism. While this is possible, it has also been pointed out that losing an intervening sequence requires a deletion of such precision that it might be a very rare event (Crick, 1979). Still, early evidence obtained using globin genes encoded in SV40 viral hybrids suggests that at least one such intervening sequence is required for the accumulation of stable mRNA (Hamer and Leder, 1979). This finding certainly suggests a critical role for these sequences in gene expression and suggests that they have been preserved for reasons other than their being difficult to eliminate.

Aside from occurring in the same relative position within the coding sequences, the α and β intervening sequences have retained very little (if any) extensive homology during their long separation. Both splice regions obey the GT/AG rule noted among the intervening sequences of ovalbumin (and other) genes (Breathnach et al., 1978; see Figure 2). This rule specifies the deletion of an intervening sequence beginning with a GT dinucleotide at the 5’ end and concluding with an AG dinucleotide at the 3’ end. While the rule appears to hold in most instances, it does not seem to provide a complete signal for RNA splicing (obviously such dinucleotides occur very frequently within all portions of the gene). Somewhat larger canonical sequences have been identified (Seif, Khoury and Dhar, 1979; Lewin, 1980), but while these might be necessary they are also not likely to be sufficient splicing signals.

Aside from the few nucleotides at the borders of each intervening sequence, no major homologies are present within the intervening sequences of the two genes. In addition, computer-assisted searches of both sequences have failed to reveal any self-complementary structures that would allow the formation of stem-like structures at the base of each intervening sequence. These include searches of the entire region sequenced. Indeed, the major regions of preserved homology between the α and β genes are within the coding sequence (Figure 6), especially (but not exclusively) near the 5’ border of each intervening sequence. The role that this region might have in signaling a splicing event has already been tested by cloning a portion of the β-globin gene within an SV40 hybrid (Hamer and Leder, 1979). Such studies, using portions of the mouse β major gene, indicate that splicing is accomplished in cloned segments that preserve no more than 18 coding nucleotides on the 5’ side of the larger intervening sequence.

Untranslated Region and a Possible Poly(A) Addition Site

Since the sequence of the 3’ noncoding portion of mouse α-globin mRNA has not been determined, the poly(A) addition site must, for the time being, be inferred from a comparison to globin mRNA sequences determined in other species. The sequences of this region of human and rabbit α-globin mRNAs are compared to the mouse gene sequence in Figure 5. The regions of homology can be aligned by compensating for differences that probably arose by the deletion of 24 or 14 base sequences from the rabbit and mouse sequences, respectively. In addition to the obvious homology in the coding sequence, a major region of homology (23 nucleotides) has been preserved surrounding the hexanucleotide AATAAA. This hexanucleotide is found to precede the poly(A) addition site of many mRNAs by roughly 25 nucleotides


Figure 4. Palindromic Sequences Located Close to Capping Sites of the α-Globin Gene Sequence

[Hairpin diagram; labels recovered from the drawing: 296, 329, 338, 374, 399, INI, CAP]

The sequence on the 5’ portion of the α-globin gene has been drawn as a hypothetical self-complementary structure. The initiation codon (INI) and the conserved hexanucleotide tataag are noted.
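A palindrome in the sense used here is a stretch of DNA whose 5’ arm can base-pair with its own 3’ arm, so that the segment could in principle fold back on itself. The sketch below shows one simple way to score such a segment; it is an illustration, not the computer program used by the authors. The function names are hypothetical, and the scoring (fraction of positions, counted inward from the two ends, that form Watson-Crick pairs) is an assumption chosen for brevity; it does not model loops or bulges.

```python
# Translation table for Watson-Crick complements.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def pairing_fraction(seq: str) -> float:
    """Fraction of end-to-end positions that can base-pair.

    Position i (from the 5' end) is compared with position i from the
    3' end. A perfect palindrome scores 1.0; imperfect palindromes
    score lower.
    """
    seq = seq.upper()
    half = len(seq) // 2
    if half == 0:
        return 0.0
    rc = reverse_complement(seq)
    # seq[i] == rc[i] means seq[i] is complementary to seq[-1 - i].
    paired = sum(1 for i in range(half) if seq[i] == rc[i])
    return paired / half


if __name__ == "__main__":
    for candidate in ("GAATTC", "GAATTA"):
        print(candidate, reverse_complement(candidate),
              round(pairing_fraction(candidate), 2))
```

Sliding such a score along a sequence in windows of varying length is one way to look for the kind of imperfect palindromes described in the text.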

(Proudfoot and Brownlee, 1976) and is likely to be similarly aligned in the mouse α-globin gene. The precise point of poly(A) addition is more difficult to predict. In the mouse α gene, three GC dinucleotides are present within 25 nucleotides of the 3’ side of the AATAAA, any one of which might be the poly(A) acceptor. We have arbitrarily placed this site at nucleotide position 1190 (Figure 2).

Modes and Rates of Evolution of the Globin Genes

In the accompanying paper by Konkel et al. (1979) we have shown how the two closely related β-globin genes have diverged from one another. Calculations suggest that this process has occurred over a period of approximately 50 million years (Dayhoff, 1972). The principal mode of rapid sequence divergence used within the intervening sequence of these genes appears to be the insertion, deletion and duplication of fairly large segments of DNA. The two β-globin coding sequences tolerated no such changes. These sequences differ from one another in only 17 of 438 positions. After such a relatively short period of time, however, the flanking sequences beyond a region of ~100 bp on either side of the transcribed sequences have diverged so widely that their relationship can no longer be reconstructed. In fact, we have argued (Tiemeier et al., 1978) that this rapid loss of homology in flanking and intervening sequences might have a role in the stabilization of duplicate genes under appropriate circumstances.

A comparison of the α and β genes provides the opportunity to examine this process of divergence after 500 rather than 50 million years (Figure 6). The flanking and intervening sequences are now so different that their relationship and the mechanism by which they diverged (presumably first involving insertions and deletions) can no longer be ascertained. The coding sequences have retained a strong and obvious relationship (Figure 6). The principal differences between these coding sequences clearly arise more slowly as a result of point mutations. Seven insertions (or deletions) have occurred, each consisting of a triplet or some multiple of three bases. Since deletions and insertions can influence the reading frame of the sequence, they must be selected against in the coding region. On the other hand, the coding sequences have diverged by the accumulation of a large number of point mutations. Indeed, the amino acid sequence divergence that results from these point mutations (about 50% of the amino acids compared) could have been brought about by far fewer base changes (the sequences differ in 44% of their nucleotide positions).

We can thus imagine two modes for bringing about change in a chromosomal sequence: a rapid one involving insertions and deletions and a slower one involving point mutations. Both modes probably operate over the entire genome, but the deletion-insertion mode is apparently selected against in the coding regions of essential genes, while operating to change large portions of gene-flanking and intervening sequences more rapidly.

Despite the many point mutations scattered throughout all three coding blocks, several isolated regions of close homology have been preserved within the two sequences. Such regions are close to the 5’ borders of both intervening sequences and in the central portions of the outside two coding blocks. Whether these regions serve the requirements of globin structure or the requirements of some feature of the gene transcript or the gene itself is not yet known.

Role of the Intervening Sequences

There has been considerable speculation regarding the role that intervening sequences might have during evolution. Gilbert (1978), for example, has argued


Human   GCCUCGGUAGCNGUNCCNCCNGC~~G~UGGGCCCAACGGGCCCUCCUCC
Rabbit  ACCUCCAAAUAUCGUUAAGCU GCCU-GGGAGCCG-GCCU------------------------GCCCUCCGCC

Figure 5. Comparison of the 3’ Untranslated Portions of Human, Rabbit and Mouse α-Globin Genes

The 3’ portions of the human and rabbit α-globin mRNAs (Proudfoot et al., 1977) are compared with the mouse gene sequence. Conserved regions of homology are boxed.
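The inference of a poly(A) addition site described in the text (locate the AATAAA hexanucleotide, then look for GC dinucleotides within about 25 nucleotides on its 3’ side) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, its parameters and the toy sequence are hypothetical, and treating every GC dinucleotide in the window as an equally plausible acceptor mirrors the text's statement that the choice among them is arbitrary.

```python
def candidate_polya_sites(seq, signal="AATAAA", acceptor="GC",
                          max_distance=25):
    """List candidate poly(A) acceptor positions 3' of each signal.

    For every occurrence of `signal`, returns a tuple
    (signal_start, [acceptor_starts]) where each acceptor dinucleotide
    lies entirely within `max_distance` nucleotides of the 3' end of
    the signal. All indices are 0-based.
    """
    seq = seq.upper()
    results = []
    pos = seq.find(signal)
    while pos != -1:
        end = pos + len(signal)
        window = seq[end:end + max_distance]
        sites = [end + i
                 for i in range(len(window) - len(acceptor) + 1)
                 if window.startswith(acceptor, i)]
        results.append((pos, sites))
        pos = seq.find(signal, pos + 1)
    return results


if __name__ == "__main__":
    # Hypothetical toy sequence, not the mouse gene.
    toy = "CCCAATAAATTTGCTTTTGCAAAA"
    for signal_start, sites in candidate_polya_sites(toy):
        print(f"AATAAA at {signal_start}; candidate GC sites at {sites}")
```

Without an mRNA sequence to fix the true 3’ end, a search of this kind can only narrow the site to a short list of candidates.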

INlVal LeuSerGlyGluAspLysSerAsnIleLysAlaAlaTrpGlyLyslleGlyGlyHisGlyAlaGluTyrGlyAlaGluAlaLeu . . . AACCATGGTG ClCTCTGGGGAAGACAAAAGCAACATCAAGGCTGCCTGGGGGAAGATTGGTGGCCATGGTGCTGAATATGGAGCTGAAGCCCTG I IIIIIII II III I II II II lllll III I I IIII III I III . . . CATCATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCTGTGGGGAAAGGTG AACTCCGATGAAGTTGGTGGTGAGGCCCTG lNlValHisLeuThrAspAlaGluLysAlaAlaValSerCysLeulrpGlyLysVal AsnSerAspGluValGlyGlyGluAlaLeu GAGAACA... I I TGGTATC...

‘4etPheAlaSerPheProThrThrLysThrTyrPheProHisPhe . . . CTCCCAGGATGTTTGCTAGCTTCCCCACCACCAAGACCTACTTTCCTCACTTT


. . .TTTTTAGGCTGCTGGTTGTCTACCCTTGGACCCAGCGGTACTTTGATAGCTTTGGAGACCTATCCTCTGCCTC LeuLeuValValTyrProTrpThrGlnArgTyrPheAspSerPheGlyAspLeuSerSerAlaSe

AspValSerHisGlySe GATGTAAGCCACGGCTC


AlaGlnValLysGlyHisGlyLysLysValAlaAspAlaLeuAlaSerAlaAlaGlyHisLeuAspAspLeuProGlyAlaLeu GCCCAGGTCAAGGGTCACGGCAAGAAGGTCGCCGCCGATGCGCTGGCCAGTGCTGCAGGCCACCTCGATGACCTGCCCGGTGCCTTG


TGCTATCATGGGTAATGCCAAAGTGAAGGCCCATGGCAAGAAGGTGATAACTGCCTTTAACGATGGCCTGAATCACTTGGACAGCCTCAAGGGCACCTTT rAlalleMetGlyAsnAlaLysValLysAlaHisGlyLysAlaHisGlyLysLysVallleThrAlaPheAsnAspGlyLeuAsnHisLeuAspSerLeuLysGlyThrPhe SerAlaLeuSerAspLeuHisAlaHisLysLeuArqValAspProValAsnPheLy~ TCTGCTCTGAGCGACCTGCATGCCCACAAGCTGCGTGTGGATCCCGTCAACTTCAAGGTATGCGC...


GCCAGCCTCAGTGAGCTCCACTGTGACAAGCTGCATGTGGATCCTGAGAACTTCAGGGTGAGTCT... AlaSerLeuSerGluLeuHisCysAspLysLeuHisValAspProGluAsnPheAr~ sL@uLeuValThrLeuAlaSerHisHisProAla CCTGCTEGTGACCTTCGCTAGCCACCACCCTGCC


GATCGTGATTGTGCTGGGC tlleVallleValLeuGly

AspPheThrProAlaValHisAlaSerLeuAspLysPheLeuAlaSerValSerThrValLeu GATTTCACCCCCGCGGTACATGCCTCTCTGGACAAATTCCTTGCCTCTGTGAGCACCGTGCTG

CACCACCTTGGCAAGGATTTCACCCCCGCTGCACAGGCTGCCTTCCAGAGG HisHisLeuGlyLysAspPheThrProAlaAlaGlnAlaAlaPheGlnLys

GTGGCTGGAGTGGCCACTGCCTTG ValAlaGlyValAlaThrAlaLeu

Figure 6. Comparison of the Coding Portions of Mouse α- and β-Globin Genes

The coding portions of the sequences of the mouse α (upper) and β major (lower) (Konkel et al., 1978) are compared. Base homologies are indicated by horizontal lines. IVS1 and 2 refer to the first and second (5’ to 3’) intervening sequences. The coding sequences are boxed.
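The two quantities the text draws from an alignment such as this one (the percentage of compared positions that differ, and whether every insertion or deletion is a multiple of three bases and so preserves the reading frame) can be computed from a pair of gapped sequences. The sketch below is illustrative only and is not the comparison program acknowledged by the authors; the function name, the use of "-" as the gap symbol and the toy sequences are assumptions. Adjacent gaps in the two sequences are counted as a single run, a simplification made for brevity.

```python
def compare_aligned(a: str, b: str, gap: str = "-") -> dict:
    """Summarize differences between two pre-aligned sequences.

    Returns the number of compared (ungapped) columns, the percentage
    of those columns that differ, the lengths of consecutive gap runs,
    and whether every gap run is a multiple of three.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = 0
    gap_runs = []
    run = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            run += 1  # extend the current insertion/deletion
            continue
        if run:
            gap_runs.append(run)
            run = 0
        if x == y:
            matches += 1
        else:
            mismatches += 1
    if run:
        gap_runs.append(run)
    compared = matches + mismatches
    percent = 100.0 * mismatches / compared if compared else 0.0
    return {
        "compared": compared,
        "percent_difference": percent,
        "gap_runs": gap_runs,
        "frame_preserving": all(r % 3 == 0 for r in gap_runs),
    }


if __name__ == "__main__":
    # Hypothetical toy alignment with one triplet deletion.
    print(compare_aligned("ATGGTGCTCTCT", "ATGGTG---ACT"))
```

By the same arithmetic, the 17 differences in 438 positions quoted for the two β coding sequences correspond to about 3.9% divergence, against the 44% quoted for the α and β coding sequences.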

that such sequences might facilitate evolution by allowing segments of genes to undergo illegitimate recombination and thereby shuffle coding blocks. This model is consistent with the apparent mode of domain evolution of immunoglobulin genes. Its application to the globin and ovalbumin genes is less apparent, although as we have previously pointed out the major heme binding segment of globin is encoded in the central coding block, while weaker contacts occur in residues encoded in the other two coding blocks.

We have previously argued for another and not necessarily contradictory role for intervening sequences (Tiemeier et al., 1978). That is, under certain circumstances (as in the globin genes) these sequences might diverge (as the larger one does) and reduce regions of homology between closely linked genes, thereby stabilizing them by preventing gene loss or amplification through unequal crossing over. In other genetic systems, such as the immunoglobulin light chain variable region genes, flanking and intervening sequences appear to be preserved. This should facilitate homologous recombination and immunoglobulin gene diversification, which would accrue to the advantage of the organism (Seidman et al., 1978).

Apart from the elusive role these sequences might have in evolution, it is at least possible to design experiments in which the physiological role of these sequences in gene expression can be tested. It is clear, for example, that the intervening sequences and alternative splice mechanisms can serve to generate additional genetic information from a gene sequence, as in SV40 where small t is encoded in the intervening sequence of large T (Berk and Sharp, 1978; Fiers et al., 1978; Reddy et al., 1978). It is possible that such information compaction is the special requirement of a relatively inflexible viral genome.

Intervening sequences also seem to have some part in the production of stable cytoplasmic mRNA. Several experiments, including some involving cloned elements of the mouse β-globin gene, indicate that transcripts synthesized following SV40 infection of host


cells do not accumulate in stable form unless they contain an authentic intervening sequence (Hamer and Leder, 1979; Gruss et al., 1979). In view of such results and the apparent preservation of intervening sequences in the globin genes, it is probable that intervening sequences, in addition to whatever role they have in evolution, also have a critical function in the expression of the genes they interrupt.

Experimental Procedures

Chemicals and Enzymes

Restriction endonucleases, T4 DNA ligase and polynucleotide kinase were purchased from New England Biolabs (Beverly, Massachusetts). Bacterial alkaline phosphatase was from Worthington Biochemicals (Wilmington, Delaware). γ-32P-ATP (spec. act. ~3000 Ci/mmole) was from Amersham (Chicago, Illinois) or New England Nuclear (Boston, Massachusetts).

DNA Purification

The E. coli strain LE392 carrying the hybrid plasmid was grown under P3 conditions in Difco Brain Heart Infusion, and the plasmid was purified according to the method of Clewell and Helinski (1969) without NaOH treatment. DNA fragments generated by restriction endonuclease treatments were separated from each other by polyacrylamide gel electrophoresis in 50 mM Tris-borate (pH 6.3) supplemented with 2.5 mM EDTA. The fragments were cut, inserted into dialysis bags and eluted electrophoretically from the gel in the same buffer.

DNA Sequencing

The detailed procedure for DNA sequencing is that described by Maxam and Gilbert (1977). Our minor modifications have been described elsewhere (Konkel et al., 1978).

Acknowledgments
We are grateful to Dr. Jacob V. Maizel, Jr., for the development of computer programs used in the display and comparison of our sequence data. We are also grateful to Ms. Terri Broderick for her expert assistance in the preparation of this manuscript. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

Received August 23, 1979

References
Akusjärvi, G. and Pettersson, U. (1979). J. Mol. Biol., in press.
Baralle, F. E. (1977). Cell 12, 1085-1095.
Baralle, F. E. and Brownlee, G. G. (1978). Nature 274, 84-87.
Bennett, G. N., Schweingruber, M. E., Brown, K. D., Squires, C. and Yanofsky, C. (1976). Proc. Nat. Acad. Sci. USA 73, 2351-2355.
Berget, S. M., Moore, C. and Sharp, P. A. (1977). Proc. Nat. Acad. Sci. USA 74, 3171-3175.
Berk, A. J. and Sharp, P. A. (1978). Proc. Nat. Acad. Sci. USA 75, 1274-1278.
Bolivar, F., Rodriguez, R. L., Greene, P. J., Betlach, M. C., Heyneker, H. L., Boyer, H. W., Crosa, J. H. and Falkow, S. (1977). Gene 2, 95-113.
Breathnach, R., Benoist, C., O’Hare, K., Gannon, F. and Chambon, P. (1978). Proc. Nat. Acad. Sci. USA 75, 4853-4857.
Chang, J. C., Temple, G. F., Poon, R., Neumann, K. H. and Kan, Y. W. (1977). Proc. Nat. Acad. Sci. USA 74, 5145-5149.
Chow, L. T., Gelinas, R. E., Broker, T. R. and Roberts, R. J. (1977). Cell 12, 1-8.
Clewell, D. and Helinski, D. (1969). Proc. Nat. Acad. Sci. USA 62, 1159-1166.
Crick, F. (1979). Science 204, 264-271.
Dayhoff, M. O., ed. (1972). Atlas of Protein Sequence and Structure (Washington, D.C.: National Biomedical Research Foundation).
Dugaiczyk, A., Woo, S. L. C., Lai, E. C., Mace, M. L., Jr., McReynolds, L. and O’Malley, B. W. (1978). Nature 274, 328-333.
Efstratiadis, A., Kafatos, F. C. and Maniatis, T. (1977). Cell 10, 571-585.
Fiers, W., Contreras, R., Haegeman, G., Rogiers, R., van de Voorde, A., van Heuverswyn, H., Van Herreweghe, J., Volckaert, G. and Ysebaert, M. (1978). Nature 273, 113-120.
Gannon, F., O’Hare, K., Perrin, F., LePennec, J. P., Benoist, C., Cochet, M., Breathnach, R., Royal, A., Garapin, A., Cami, B. and Chambon, P. (1979). Nature 278, 428-434.
Gilbert, W. (1978). Nature 271, 501.
Gilbert, W. and Maxam, A. (1973). Proc. Nat. Acad. Sci. USA 70, 3581-3584.
Gruss, P., Lai, C.-J., Dhar, R. and Khoury, G. (1979). Proc. Nat. Acad. Sci. USA 76, 4317-4321.
Hamer, D. H. and Leder, P. (1979). Cell 17, 737-747.
Heindell, H. C., Liu, A., Paddock, G. V., Studnicka, G. M. and Salser, W. A. (1978). Cell 15, 43-54.
Kinniburgh, A. J., Mertz, J. E. and Ross, J. (1978). Cell 14, 681-693.
Kitchingman, G. R., Lai, S.-P. and Westphal, H. (1977). Proc. Nat. Acad. Sci. USA 74, 4392-4395.
Konkel, D. A., Tilghman, S. M. and Leder, P. (1978). Cell 15, 1125-1132.
Konkel, D. A., Maizel, J. V., Jr. and Leder, P. (1979). Cell 18, 865-873.
Lawn, R. M., Fritsch, E. F., Parker, R. C., Blake, G. and Maniatis, T. (1978). Cell 15, 1157-1174.
Leder, A., Miller, H. I., Hamer, D. H., Seidman, J. G., Norman, B., Sullivan, M. and Leder, P. (1978). Proc. Nat. Acad. Sci. USA 75, 6187-6191.
Lewin, B. (1980). Gene Expression, 2, Eucaryotic Chromosomes (New York: John Wiley & Sons), chapter 20.
Maniatis, T., Ptashne, M., Backman, K., Kleid, D., Flashman, S., Jeffrey, A. and Maurer, R. (1975). Cell 5, 109-113.
Maxam, A. M. and Gilbert, W. (1977). Proc. Nat. Acad. Sci. USA 74, 560-564.
Musso, R., DiLaura, R., Rosenberg, M. and de Crombrugghe, B. (1977). Proc. Nat. Acad. Sci. USA 74, 106-110.
Ohmori, H., Tomizawa, J.-I. and Maxam, A. M. (1978). Nucl. Acids Res. 5, 1479-1485.
Otsuka, A. and Abelson, J. (1978). Nature 276, 689-694.
Popp, R. A. (1967). J. Mol. Biol. 27, 9-16.
Pribnow, D. (1975). J. Mol. Biol. 99, 419-443.
Proudfoot, N. J. and Brownlee, G. G. (1976). Nature 263, 211-214.
Proudfoot, N. J., Gillam, S., Smith, M. and Longley, J. I. (1977). Cell 11, 807-818.
Reddy, V. B., Thimmappaya, B., Dhar, R., Subramanian, K. N., Zain, B. S., Pan, J., Ghosh, P. K., Celma, M. L. and Weissman, S. M. (1978). Science 200, 494-502.
Ross, J. and Knecht, D. A. (1978). J. Mol. Biol. 119, 1-20.
Seidman, J. G., Leder, A., Nau, M., Norman, B. and Leder, P. (1978). Science 202, 11-17.
Seif, I., Khoury, G. and Dhar, R. (1979). Nucl. Acids Res. 6, 3387-3398.
Smith, K. and Lingrel, J. B. (1978). Nucl. Acids Res. 5, 3295-3301.
Tiemeier, D. C., Tilghman, S. M., Polsky, F. I., Seidman, J. G., Leder, A., Edgell, M. H. and Leder, P. (1978). Cell 14, 237-245.
Tilghman, S. M., Tiemeier, D. C., Polsky, F. I., Edgell, M. H., Seidman, J. G., Leder, A., Enquist, L. W., Norman, B. and Leder, P. (1977). Proc. Nat. Acad. Sci. USA 74, 4406-4410.
Tilghman, S. M., Curtis, P. J., Tiemeier, D. C., Leder, P. and Weissmann, C. (1978). Proc. Nat. Acad. Sci. USA 75, 1309-1313.
Tsujimoto, Y. and Suzuki, Y. (1979). Cell 16, 425-436.
van den Berg, J., van Ooyen, A., Mantei, N., Schamböck, A., Grosveld, G., Flavell, R. A. and Weissmann, C. (1978). Nature 276, 37-44.
Ziff, E. B. and Evans, R. M. (1978). Cell 15, 1463-1475.

Note Added in Proof

The single ambiguous base assignment at position 329 referred to on p. 876 and in Figures 2 and 4 has been resolved by additional sequencing. Instead of a single G as indicated, there is a CG dinucleotide at this position: thus the large palindrome noted in Figure 4 close to the cap position is actually perfectly complementary.
