GENOMICS 14, 948-958 (1992)

Junctions between Genes in the Haptoglobin Gene Cluster of Primates

LAURIE M. ERICKSON, HYUNGSUK KIM, AND NOBUYO MAEDA

Department of Pathology and Curriculum of Genetics, University of North Carolina, Chapel Hill, North Carolina 27599-7525

Received January 28, 1992; revised June 25, 1992

To investigate the nature of the recombination that generated the haptoglobin three-gene cluster in Old World primates, we sequenced the region between the second gene (HPR) and the third gene (HPP) in chimpanzees (15 kb), as well as the region 3' to the cluster in humans (14 kb). Comparison to the previously sequenced human haptoglobin (HP) and HPR genes showed that the junction point between HP and HPR in humans (junction 1) was not identical to the junction point between the HPR and HPP genes of the chimpanzee (junction 2). An Alu sequence was found at each junction, but both Alu sequences lacked short direct repeats of the flanking genomic DNA. The lack of direct repeats implies that both junction Alu sequences are the products of recombination between different Alu elements. In addition, other insertion and deletion events are clustered in the regions near the junction Alu sequences. The observation that Alu sequences define the junctions between genes in the haptoglobin gene cluster emphasizes the importance of Alu sequences in the evolution of multigene families. © 1992 Academic Press, Inc.

INTRODUCTION

The first step in the evolution of a multigene family is the rare and unpredictable generation of a duplicated gene from a single ancestral copy. Understanding the mechanism of the initial event requires detailed knowledge of the boundaries of the duplication unit and of the junctions between duplicated genes. The analysis is often difficult and sometimes impossible to perform in contemporary multigene families, because most families were formed so long ago that subsequent mutations have masked the original duplication event.

The haptoglobin gene cluster is a relatively young gene family. Maeda (1985) calculated the age of the duplication to be at least 30 million years, based on molecular differences between the human HP and HPR genes. The phylogenetic evidence also supports this age. The cluster consists of three genes (HP, HPR, and HPP) in chimpanzees, gorillas, orangutans, and rhesus monkeys but only two (HP and HPR) in humans; an unequal crossover in the human lineage fused the ancestral HPR and HPP genes, reducing the number of haptoglobin genes to two (McEvoy and Maeda, 1988). In contrast, the New World spider monkeys, whose lineage separated from that of Old World primates at least 40 million years ago (Gingerich, 1984), have only one haptoglobin gene (McEvoy and Maeda, 1988). Thus, the haptoglobin multigene family in Old World primates was created between 30 and 40 million years ago.

The youth of the haptoglobin multigene family makes it suitable for the study of the evolution of multigene families, and especially for analyzing the initial events that created the family. An earlier study (Maeda, 1985) began the investigation of haptoglobin evolution by comparing the sequences of the region 5' to the HP gene, the junction between the HP and HPR genes, and the region 3' to the HPR gene in the human gene cluster. The comparison showed that the junction between the HP and HPR genes contained approximately 600 bp of DNA that were not found elsewhere in the gene cluster. The 3' half of this 600 bp was a member of the Alu family of repeated sequences but lacked the flanking direct repeats characteristic of Alu repeats. The origin of the 600-bp fragment could not be determined at that time, nor could an explanation be found for the lack of flanking repeats in the Alu sequence.

In the current paper, we investigate the three-gene cluster of the chimpanzee. The simplest explanation of the formation of the triplicated gene cluster is that it arose from duplicated genes by a homologous but unequal crossover event. If this is the case, the ancestral junction between the duplicated genes should be replicated exactly in the triplicated cluster and both of the resulting junctions should be identical to the ancestral junction.

We therefore analyzed the junction between the HPR and HPP genes in chimpanzees (junction 2), to explain the HP-HPR junction in humans (junction 1) and the origin of the haptoglobin gene family. Comparisons of the junctions show that both junctions are formed as a result of Alu-Alu recombination but the Alu sequences at the two junctions are different from each other. The occurrence of Alu-Alu recombination at both junctions documents the frequent involvement of Alu sequences in the evolution of gene families.

0888-7543/92 $5.00 Copyright © 1992 by Academic Press, Inc. All rights of reproduction in any form reserved.


EVOLUTION OF THE PRIMATE HAPTOGLOBIN GENE CLUSTER

[Figure 1: schematic maps. (a) Gene units with the bracketed regions 5'HP, junction 1, junction 2, and 3'HPP. (b) Chimpanzee cluster, exons numbered 1-5, with insertions RTVL-Ia (9 kb), RTVL-Ib (9 kb), ARPP (0.5 kb), and RTVL-Ic (6 kb). (c) Human cluster, exons numbered 1-5, with insertions RTVL-Ia (9 kb) and RTVL-Ic (6 kb).]

FIG. 1. Haptoglobin gene cluster in human and chimpanzee. (a) The three gene units and their junctions and flanking regions are indicated by shaded boxes at the top of the figure. HP is shown with left diagonal shading; HPR with a spotted pattern; HPP with right diagonal shading. Open boxes are regions flanking the gene cluster. Bracketed regions are the regions that we have examined closely. The organizations of the human and chimpanzee haptoglobin gene clusters are shown in (b) and (c). Numbered bars are exons. Triangles represent Alu sequences and their orientations. The dashed lines show the deletion in the human cluster that was caused by a recombination between the ancestral HPR and HPP genes. Boxes below the line are insertions (not to scale). RTVL, retrovirus-like sequence; ARPP, acidic ribosomal phosphoprotein.

MATERIALS AND METHODS

The human HPR 3' flanking sequence was obtained from a clone containing the 3' portion of the RTVL-Ic insertion (Maeda and Kim, 1990). The chimpanzee HPR-HPP junction was obtained from clone 12.4B of McEvoy and Maeda (1988). The sequences of 5'HP and of junction 1 have already been determined (Maeda, 1985). Nucleotide sequences were determined by the methods of Maxam and Gilbert (1977) and of Sanger et al. (1977) using Sequenase (United States Biochemical, Cleveland, OH). About 95% of the sequence was determined on both strands. The regions sequenced on only one strand were confirmed by sequencing from at least two restriction enzyme sites. The sequences have been submitted to Genbank/EMBL (Accession Nos. M69197, M84462, and M84463). Sequences were analyzed using software provided by the University of Wisconsin Genetics Computer Group (Devereux et al., 1984).

RESULTS AND DISCUSSION

Gene Organization

The organization of the haptoglobin gene cluster and our scheme for referring to its different parts are illustrated in Fig. 1. The full length cluster contains the genes HP, HPR, and HPP, junction regions 1 and 2 connecting the genes, and flanking 5'HP and 3'HPP regions (Fig. 1a). The haptoglobin cluster in chimpanzees contains all three genes, HP, HPR, and HPP (Fig. 1b), whereas the cluster in humans contains two genes, HP and HPR (Fig. 1c). Because the HPR gene in humans is a fusion gene resulting from a homologous recombination between the ancestral HPR and HPP genes in exon 5 (McEvoy and Maeda, 1988), the 3' end of the human HPR gene is equivalent to the 3' end of the chimpanzee HPP gene. We sequenced a 14-kb region 3' to the cluster in humans and a 15-kb region between HPR and HPP in chimpanzees. Three types of insertions occur in the cluster: Alu sequences, retrovirus-like sequences (Maeda and Kim, 1990), and a processed pseudogene of an acidic ribosomal phosphoprotein gene (ARPP, see below). These insertions are indicated in Figs. 1b and c. The HP and HPR genes in human and chimpanzee have an Alu sequence in intron 1, but the chimpanzee HPP gene has no corresponding Alu sequence. In the previous report of the mapping of the chimpanzee haptoglobin genes (Fig. 1 of McEvoy and Maeda, 1988) the location of the Alu sequence near the promoter of the HPP gene was misassigned. The minimum length of the duplicated haptoglobin unit was estimated to be 10.7 kb, which is the distance between equivalent exons of HPR and HPP after disregarding the inserted elements described above. This distance is much longer than the distance between the equivalent exons of HP and HPR (6.5 kb).

Identifying the Junction Points

We aligned and compared the nucleotide sequences of the four bracketed regions from Fig. 1a. Each of the two junction points could then be identified as the point at which, in traversing the aligned regions in a 5' to 3' direction, the junction region switched from being the 3' end of one gene to being the 5' beginning of the next gene. The junction region sequence therefore changed from being homologous to 3'HPP to being homologous to 5'HP. In Fig. 2, we present the nucleotide sequences of the four regions bracketed in Fig. 1a. The regions are the flanking 5'HP region (which contains some 5'-flanking sequence as well as the 5' end of the HP gene unit); the junction 1 region (HP-HPR junction); the junction 2 region (HPR-HPP junction); and the flanking 3'HPP region (which contains the 3' end of the HPP gene unit as well as some 3'-flanking sequence). The junction 1 sequence (J1) is from humans and the junction 2 sequence (J2) is from chimpanzees. J1 and J2 begin 1.3 kb 3' to exon 5 of their respective genes and end at intron 1 of the next gene. 5'HP begins 1050 bp upstream of exon 1 of human HP and ends at intron 1. 3'HPP begins 1.3 kb downstream of exon 5 of human HPR (equivalent to chimpanzee HPP) and ends 3.5 kb further downstream. All sequences are aligned with respect to J2 and large insertions have been abbreviated for convenience. Note that we have not hesitated to use human or chimpanzee sequences, as available, in the alignment. Human and chimpanzee sequences are interchangeable in this context because of their high similarity [98% identical (McEvoy and Maeda, 1988)] and because of the age of the original gene duplication. The original gene duplication occurred more than 30 million years ago (Maeda, 1985), prior to the divergence of humans and chimpanzees, which occurred about 6 to 8 million years ago (Sibley and Ahlquist, 1987).

ERICKSON, KIM, AND MAEDA

[Figure 2: alignment of the 5'HP, J1, J2, and 3'HPP nucleotide sequences (alignment not shown). Labeled features: RTVL-Ib, RTVL-Ic, Alu2a, Alu2b, Alu5a, Alu5b, Alu9, and ARPP.]

FIG. 2. Sequences of the junction regions of the haptoglobin gene cluster. The DNA sequences of the bracketed regions from Fig. 1a are aligned, with gaps introduced to improve the alignment. J1 = junction 1 (HP-HPR). J2 = junction 2 (HPR-HPP). Numbers 17846-22510 of junction 2 correspond to numbers 12412-17074 of the chimpanzee HPR-HPP gene segment (M84463) from Genbank/EMBL. Each sequence is aligned with respect to junction 2 and the vertical lines denote matches between each sequence and junction 2. Alu sequences at the junction points are capitalized. The junction 1 Alu sequence is overlined with a beaded line; the junction 2 Alu sequence is overlined with a slashed line. Direct repeats flanking an insertion are underlined, but presumed direct repeats on one side of an insertion are marked with a dashed underline. Exon 1 is marked with asterisks. Parentheses indicate insertions whose sequences are not given here: RTVL-Ib and RTVL-Ic, retrovirus-like sequences Ib and Ic (Maeda and Kim, 1990); Alu9, Alu sequence from 3' to the human HPR; ARPP, acidic ribosomal phosphoprotein (see Fig. 4).

Junction 2 Is Defined by a Fused Alu Sequence in Reverse Orientation

The junction 2 region includes the 3' end of the HPR gene followed by the beginning of the HPP gene. The J2 sequence is homologous to 3'HPP (with the exception of a 300-bp gap in 3'HPP) up to and including the Alu sequence marked with a slashed line (20964-21210), which


begins 4.6 kb downstream of exon 5 of HPR. On the 3' side of this Alu sequence, J2 is homologous to 5'HP, beginning 800 bp 5' to exon 1 of HPP. Thus the junction point between HPR and HPP must have been within the region now covered by this Alu sequence. The junction Alu sequence in J2 is in reverse orientation compared to the consensus Alu sequence (Deininger et al., 1981) and it is not a complete Alu element; it contains a 60-bp gap in its left arm. In addition, the junction Alu sequence does not have the short direct repeats (5-20 bp) usually found flanking an Alu insertion. The 12-bp sequence immediately upstream of the junction 2 Alu sequence, TTAAAACCCTTT, is homologous to 3'HPP, but the 12-bp sequence immediately downstream of the junction 2 Alu sequence is AAGAATCACTAA, which is homologous to 5'HP. Although both these flanking sequences are AT-rich, they are obviously not direct repeats. Direct repeats flanking Alu sequences reflect the mechanism of transposition and insertion of such sequences (Haynes et al., 1981; Daniels and Deininger, 1985). The lack of direct repeats implies that the junction Alu sequence in J2 was formed as the result of a fusion of two ancestral Alu elements. The 12-bp flanking sequences are presumed to be the remaining single copies of the direct repeats flanking the two Alu elements prior to their recombination. The 5' ancestral Alu element had direct repeats corresponding to the sequence upstream of the junction 2 Alu sequence, TTAAAACCCTTT, whereas the 3' ancestral Alu element had direct repeats corresponding to the sequence downstream of the junction 2 Alu sequence, AAGAATCACTAA. We have named the two parts of the junction Alu sequence separately (Alu5a, 176 bp long, and Alu5b, 72 bp long), dividing them at the 60-bp gap, to indicate that the junction 2 Alu sequence is derived from two separate Alu elements. Both Alu5a and Alu5b are marked with a slashed overline in Fig. 2.
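The comparison of the two 12-bp flanking sequences can be illustrated with a minimal Python sketch. It counts position-by-position matches between the upstream and downstream flanks quoted above. The function names and the 10-of-12 match threshold for calling two flanks direct repeats are illustrative assumptions, not criteria stated in this paper.

```python
# Illustrative sketch: compare the 12-bp sequences flanking the junction 2 Alu.
# The two sequences are those quoted in the text; the function names and the
# match threshold are hypothetical choices made for this example only.

UPSTREAM_FLANK = "TTAAAACCCTTT"    # homologous to 3'HPP
DOWNSTREAM_FLANK = "AAGAATCACTAA"  # homologous to 5'HP


def count_matches(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences carry the same base."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a.upper(), b.upper()) if x == y)


def looks_like_direct_repeat(a: str, b: str, min_matches: int = 10) -> bool:
    """Assumed rule of thumb: treat flanks as direct repeats only if nearly identical."""
    return count_matches(a, b) >= min_matches


if __name__ == "__main__":
    n = count_matches(UPSTREAM_FLANK, DOWNSTREAM_FLANK)
    print(f"{n} of {len(UPSTREAM_FLANK)} positions match")
    print("direct repeats?", looks_like_direct_repeat(UPSTREAM_FLANK, DOWNSTREAM_FLANK))
```

In this ungapped comparison fewer than half of the positions agree, which is consistent with the statement that the two flanks are not direct repeats of each other.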
The ancestor of Alu5a can be deduced from the homologous Alu sequence in 3'HPP. The Alu sequence in 3'HPP is a full-length Alu sequence possessing both direct repeats TAAAAYACTTT, which are the same as the presumed direct repeat upstream of Alu5a in J2. We infer from this fact that Alu5a in J2 is related to Alu5a in 3'HPP by duplication (that is, by being part of the region when it duplicated). We can also infer that a full-length Alu sequence with direct repeats of TAAAAYACTTT was present at that location in the two-gene cluster prior to the triplication. However, there is no Alu sequence comparable to Alu5b in the 5' region of HP. The exact crossover point between Alu5a and Alu5b cannot be determined. We have chosen to divide the junction 2 Alu sequence at the 60-bp gap for convenience, though we are aware that the crossover could have occurred at any location in the Alu sequence, with a separate event being responsible for the 60-bp gap. The Alu sequences of J2 and 3'HPP upstream of the gap are more alike than the sequences downstream of the gap. The 176-bp regions upstream of the gap differ by 13%, whereas the 72-bp regions downstream of the gap differ by 19%. In the haptoglobin gene cluster, the average difference between Alu sequences related by duplication is 15%, whereas the average difference between Alu sequences independently inserted in the cluster is 19%. The 13% difference upstream of the gap therefore implies that the 176-bp regions of the Alu5a sequences in J2 and 3'HPP are most likely related by gene duplication, while the 72-bp regions are probably derived from two separate Alu sequences.

Junction 1 Is Defined by a Fused Alu Sequence in Forward Orientation

The J1 sequence is homologous to J2 and to 3'HPP (except for the 300-bp deletion in 3'HPP) up to a point 1.9 kb downstream of exon 5 at which an Alu sequence occurs. Beyond the insertion point of this Alu sequence is a 3.6-kb gap. After this gap, at a location 220 bp 5' to exon 1 of HPR, homology to 5'HP begins. The junction 1 Alu sequence is a full-length Alu sequence in the forward orientation, compared to the consensus Alu sequence of Deininger et al. (1981). Like the junction 2 Alu sequence, the junction 1 Alu sequence does not have the direct repeats usually found flanking an Alu insertion. The presumed direct repeats are CCTCACGAGGGG and AAAAGAACACCA, which are obviously not homologous. Therefore, following the logic used in discussing the junction 2 Alu sequence, we postulate that the junction 1 Alu sequence is the result of recombination between two ancestral Alu elements. We therefore divided the junction 1 Alu sequence into two parts; the 5' part is placed at the 3' end of HP and named Alu2a (18365-18499), and the 3' part is placed at the 5' end of HPR and named Alu2b (21616-21781). Both Alu2a and Alu2b are marked with a heavy beaded overline in Fig. 2. Because there are no Alu sequences at the equivalent positions in any other haptoglobin gene units, it is most likely that the insertions occurred between HP and HPR after the triplication event.
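The comparison of Alu divergence with the cluster-wide averages, used above to assign the two parts of the junction 2 Alu sequence, can be sketched in Python as follows. The percent-difference function works on any pair of aligned sequences. The nearest-average rule and all function names are assumptions made for illustration and are not the authors' stated procedure; only the 15% and 19% averages and the 13% and 19% observations come from the text.

```python
# Illustrative sketch: percent difference between two aligned sequences, and a
# simple comparison of an observed difference with the two averages quoted in
# the text. The nearest-average rule is an assumption for illustration only.

DUPLICATION_AVG = 15.0   # average % difference, Alu sequences related by duplication
INDEPENDENT_AVG = 19.0   # average % difference, independently inserted Alu sequences


def percent_difference(a: str, b: str, gap: str = "-") -> float:
    """Percent of compared columns that differ; columns with a gap in either row are skipped."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    mismatches = sum(1 for x, y in pairs if x != y)
    return 100.0 * mismatches / len(pairs)


def closer_average(diff: float) -> str:
    """Report which quoted average the observed difference lies nearer to."""
    to_dup = abs(diff - DUPLICATION_AVG)
    to_ind = abs(diff - INDEPENDENT_AVG)
    if to_dup < to_ind:
        return "duplication"
    if to_ind < to_dup:
        return "independent"
    return "ambiguous"


if __name__ == "__main__":
    # Observed values quoted in the text for the regions upstream and downstream of the gap.
    for label, diff in [("176-bp region (Alu5a)", 13.0), ("72-bp region (Alu5b)", 19.0)]:
        print(label, "->", closer_average(diff))
```

Such a rule only restates the qualitative argument; with regions as short as 72 bp, a difference of a few percentage points corresponds to very few substitutions, so the assignment should be read as suggestive rather than decisive.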
[Figure 3: schematic alignment of the HP, HPR, and HPP gene units, showing (TG)n repeats, inverted repeats (IR), a kb scale, and zones A, B, and C.]

FIG. 3. Alignment of the three haptoglobin gene units. The genes are continuous in the order HP-HPR-HPP in the genome, but have been separated into their gene units for this figure. The symbols are the same as in Fig. 1 except that exons are not numbered. The data from exon 5 of HPR to exon 5 of HPP are from the chimpanzee; the remaining data are from the human. Alu sequences are numbered. Triangles that are half-shaded represent junction Alu sequences divided into two parts. Gaps with parentheses indicate deletions. (TG)n represents simple sequence TG repeats of varying lengths. IR, inverted repeat. Other abbreviations as in Fig. 1. Zones A, B, and C are boxed and shaded.

Homologous recombination between Alu2a and Alu2b would result in a full-length Alu sequence without direct repeats and in the deletion of at least 3.6 kb of DNA between HP and HPR. The 3.6-kb sequence which was deleted included the region in J1 equivalent to the junction point in J2. The current J1 junction point is consequently a vestige of Alu-Alu recombination and does not reflect the sequence present at the time of either the original duplication or the later triplication. DNA sequence comparisons also provide an explanation for the 600 bp of DNA between the human HP and HPR genes that could not be accounted for by homology with the flanking sequences of the human haptoglobin cluster (Maeda, 1985). The 5' half of this 600 bp of DNA (J1, 18028-18357) is now identified by its homology to the chimpanzee HPR gene (J2, 18028-18357) as part of the original haptoglobin gene sequence, while the 3' half is the junction 1 Alu sequence. As a consequence of a deletion event in the human HPR gene that removed 300 bp of genomic DNA together with one-third of the retrovirus-like element (RTVL-Ic), the sequence corresponding to the 5' half of the 600 bp is no longer present in the human HPR gene (3'HPP sequence), although it is still present in the chimpanzee HPR gene.
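The accounting of the approximately 600 bp between the human HP and HPR genes can be checked with a short Python sketch that converts the quoted alignment coordinates into lengths. The coordinate ranges are read from the text, with the Alu2a range taken here as 18365-18499. Because the coordinates are positions in the Fig. 2 alignment, the lengths are approximate wherever the alignment contains gaps. The helper and dictionary names are hypothetical.

```python
# Illustrative sketch: lengths implied by the alignment coordinates quoted in the text.
# Coordinates are positions in the Fig. 2 alignment (numbered along junction 2),
# so the derived lengths are approximate wherever the alignment contains gaps.
# All names below are hypothetical and chosen for this example only.


def span_length(start: int, end: int) -> int:
    """Inclusive length of a coordinate range."""
    if end < start:
        raise ValueError("end must not precede start")
    return end - start + 1


REGIONS = {
    "5' half of the ~600 bp (J1)": (18028, 18357),
    "Alu2a": (18365, 18499),   # range as read from the text (assumed 18365-18499)
    "Alu2b": (21616, 21781),
}

lengths = {name: span_length(*coords) for name, coords in REGIONS.items()}
alu_total = lengths["Alu2a"] + lengths["Alu2b"]
total = lengths["5' half of the ~600 bp (J1)"] + alu_total

if __name__ == "__main__":
    for name, length in lengths.items():
        print(f"{name}: {length} positions")
    print(f"junction 1 Alu (Alu2a + Alu2b): {alu_total} positions")
    print(f"5' half + junction 1 Alu: {total} positions")
```

Under these assumptions the two Alu parts together span about 300 alignment positions, roughly the length of a full Alu element, and the sum with the 5' half is of the same order as the approximately 600 bp reported for the region.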

Many Deletions and Insertions Are Clustered Inspection of the general organization of the haptoglobin gene cluster presented in Fig. 1 and of the sequence data presented in Fig. 2 suggested the possibility t hat insertions and deletions occurring during the haptoglobin evolution were not randomly distributed. T o explore this possibility in more detail, we aligned the nucleotide sequences of the three gene units such t ha t the locations of various features along each gene unit could be compared (Fig. 3). Features such as exons and Alu sequences t h at occur at exactly the same locations in at least two of the genes are related by duplication. Features t hat occur in only one gene unit are the result of separate insettions, deletions, or other recombinational events. Using this reasoning, we identified three regions of the gene unit, zones A, B, and C, t h a t are associated with multiple insertion and deletion events. Within the 800 bp of zone A, two Alu elements, Alu2b and Alu5b, were inserted, then subsequently involved in recombinations to form junctions 1 and 2, respectively. T h e insertion points of Alu2b and Alu5b sequences are about 560 bp apart. Downstream of the Alu5b insertion point, there is a unique 480-bp insertion at a location 180 bp 5' to exon 1 in H P P (see Fig. 2). T he insertion has a poly(A) tail and is flanked by 13-bp direct repeats. Comparison of the 480-bp DNA with all sequences listed in G e n b a n k / E M B L demonstrated a 90% similarity with the human acidic ribosomal phosphoprotein P1 cDNA [ARPP (Rich and Steitz, 1987)]. P1 codes for a protein

that is present in multiple copies on the ribosome and required for translation. The similarity to the cDNA indicates that the 480-bp sequence in the chimpanzee HPP gene is a processed pseudogene of an acidic ribosomal phosphoprotein gene. A comparison of this chimpanzee ARPP pseudogene to P1 cDNA is presented in Fig. 4 and reveals that 15 of the 23 CpG dinucleotides in the P1 cDNA have been replaced in the inserted element in HPP. The mutation rate of CpG dinucleotides in pseudogenes has been reported to be 10-fold greater than that of other nucleotide pairs (Bulmer, 1986).

Zone B, 800 bp long, contains the insertion point of the Alu2a sequence and of two retrovirus-like sequences, RTVL-Ib and RTVL-Ic. The insertion points of the two RTVL-I sequences are about 540 bp apart, but the insertion point of RTVL-Ic is only 7 bp away from the insertion point of Alu2a (see Fig. 2). The 7-bp sequence between the two insertion points (CGAGGGG in J1 and GGAGGAA in 3'HPP) is not known to be recombinogenic, but the overlapping sequence CAGGAGGAAGA in J2 contains the CAGG motif of a recombination hotspot (CAGG)n in the major histocompatibility complex (Kobori et al., 1986). Zone B also contains a deletion in HPP of 3 kb which includes about 300 bp of genomic DNA together with the 5' third of RTVL-Ic (Maeda and Kim, 1990). The 5' end of the deletion occurs in zone B, as does the insertion point of RTVL-Ic.

In the 400-bp region of zone C are the insertion point of Alu5a and the 60-bp gap in Alu5a of HPR (part of junction 2). This 60-bp gap is located in the left monomer of the consensus Alu sequence where Alu-Alu recombination most frequently occurs (Lehrman et al., 1987a). The homologous Alu5a sequence in HPP has no gap, but is remarkable in having another Alu sequence (Alu9) inserted into it at the same site or within a few basepairs of

[Figure 4: nucleotide alignment of the chimpanzee ARPP pseudogene (Ch ARPP) with the human P1 cDNA, numbered to position 556; the aligned sequences are not reproduced here.]

FIG. 4. Acidic ribosomal phosphoprotein pseudogene. Comparison of the ARPP pseudogene insertion in chimpanzee HPP with the P1 cDNA sequence of Rich and Steitz (1987). Gaps are introduced to improve the alignment. The direct repeats of genomic DNA in chimpanzee HPP are underlined. CpG dinucleotides in the P1 cDNA are underlined. The ATG initiation codon and the TAA stop codon in P1 cDNA are marked with asterisks.
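The CpG comparison reported for this alignment (15 of the 23 CpG dinucleotides of the P1 cDNA replaced, about 65%) amounts to a simple tally over two aligned sequences. The following Python sketch shows one way such a tally could be made. It is an illustration only: the function name `cpg_replaced` and the toy sequences are hypothetical, and it assumes two gapped rows of equal length from a single pairwise alignment.

```python
def cpg_replaced(ref_aln, query_aln):
    """Return (replaced, total) for CpG dinucleotides of the reference.

    Both arguments are rows of one pairwise alignment ('-' marks a gap).
    A reference CpG counts as replaced when the query does not also show
    C followed by G in the same two alignment columns. A reference CpG
    that is split by a gap column is not counted in this simple sketch.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    ref, query = ref_aln.upper(), query_aln.upper()
    total = replaced = 0
    for i in range(len(ref) - 1):
        if ref[i:i + 2] == "CG":
            total += 1
            if query[i:i + 2] != "CG":
                replaced += 1
    return replaced, total


# Toy alignment (hypothetical, not the sequences of Fig. 4).
reference = "ACGTTCGACGA"
pseudogene = "ATGTTCGACAA"
replaced, total = cpg_replaced(reference, pseudogene)
print(f"{replaced} of {total} reference CpG sites replaced")

# Fraction corresponding to the counts reported in the text (15 of 23).
print(f"reported fraction: {15 / 23:.2f}")
```

Counting only the CpG sites of the cDNA, rather than of the pseudogene, matches the direction of the comparison in the text, where the cDNA serves as the reference.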

the site of the 60-bp gap in the Alu5a sequence of HPR. Alu9 is a complete Alu element in the same orientation as Alu5a and has flanking direct repeats of CATGCCTGGCTAA (Fig. 2). The insertion of Alu sequences into other Alu sequences has been noted before (Chen et al., 1991; Stoppa-Lyonnet et al., 1990), and has been attributed to the AT-rich composition of an Alu sequence (Daniels and Deininger, 1985).

The majority (8/12) of the inserted elements identified in the human and chimpanzee haptoglobin gene clusters are thus confined to rather small zones at the ends of the duplication units. Deletions are also clustered in the same zones. Only one other small region in the haptoglobin duplication unit contains more than one insertion: the insertion point of RTVL-Ia in HPR and the insertion point of the Alu1 sequence are within 400 bp of each other in intron 1.
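A rough sense of how unusual this clustering is can be obtained from a back-of-the-envelope calculation. The Python sketch below is not an analysis performed in this study. It assumes that each of the 12 insertions lands independently and uniformly along a duplication unit of 10.7 kb, and that zones A, B, and C together cover 800 + 800 + 400 = 2000 bp of that unit. Under those assumptions it computes the binomial probability that 8 or more of the 12 insertions fall inside the zones. The function name `binomial_tail` is illustrative.

```python
from math import comb


def binomial_tail(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))


ZONE_BP = 800 + 800 + 400   # zones A, B, and C
UNIT_BP = 10_700            # duplication unit length (assumed denominator)
p_zone = ZONE_BP / UNIT_BP  # chance that one uniformly placed insertion hits a zone

tail = binomial_tail(8, 12, p_zone)
print(f"fraction of unit covered by zones: {p_zone:.3f}")
print(f"P(at least 8 of 12 insertions in zones): {tail:.1e}")
```

Under these assumptions the zones cover roughly 19% of the unit, whereas 8 of 12 insertions (67%) lie within them. The uniform model is crude, since integration preferences and differences among the three gene units are ignored, so the result is best read as an order-of-magnitude indication.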

Characteristics of the Hot Spot Regions

The clustering of insertions and deletions in zones A, B, and C encouraged us to search the nucleotide sequences of these regions for any unique features that might predispose the zones toward recombination. Stretches of AT-rich DNA can denature more easily, possibly allowing access to recombinogenic enzymes. Zone A is AT-rich (61%) in comparison to the entire HP gene (54%), but zone B is similar to the overall junction 2 region (55% AT for zone B, 58% AT for junction 2) and zone C is less AT-rich than the overall 3'HPP region (55% AT for zone C, 61% AT for 3'HPP).

Topoisomerase I and II sites have been observed at or near the crossover points of many nonhomologous recombinations (Bae et al., 1988; Konopka, 1988). Because the exact crossover points could not be determined in our examples, we examined the entire zone for clusters of topoisomerase I sites CAT, RAT, CTY, and GTY (Konopka, 1988). In each zone, the frequency of topoisomerase I sites was consistent with a random distribution and no clustering was observed. No sites corresponding to the topoisomerase II consensus sites (Osheroff, 1989; Spitzner and Muller, 1988) were found in zones A, B, or C. One chi site (Smith, 1983) was present about 40 bp downstream of the Alu2a insertion point.

The most noteworthy features present in the haptoglobin gene cluster were three stretches of simple sequence (TG)n. (TG)n repetitive sequences can form left-handed Z-DNA and occur on average once every 30 kb in the human genome (Stallings et al., 1991). The three (TG)n stretches in the haptoglobin gene cluster, however, are found within the 10.7 kb of the duplication unit. Although the (TG)n stretches occur outside of the three hot spot zones, they could nevertheless affect recombination in the hot spot zones. For example, pairing between homologous molecules can be initiated in Z-DNA sequences (Kmiec and Holloman, 1986) even though the resolution of the recombination occurs elsewhere. Wahls et al. (1990) have noted that (TG)n sequences can stimulate homologous recombination as far as 1269 bp away from the (TG)n sequence.

We also looked for sequence features that may affect recombination in 3.8 kb of DNA downstream of zone C in the HPP gene unit (human HPR gene, data not shown, GenBank Accession No. M69197). We found no noteworthy features except a pair of 86-bp inverted repeats (IR) that are 93% identical. These inverted repeats are separated by 500 bp of DNA. We searched for homology of the IR and the connecting 500 bp of DNA to known sequences listed in GenBank/EMBL, but no significant matches were found. The significance of the inverted repeats remains to be determined.
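The composition and motif checks described in this section are simple to express in code. The following Python sketch is an illustration under stated assumptions, not the procedure used in this study. It computes the AT content of a sequence and locates topoisomerase I sites (CAT, RAT, CTY, and GTY, with the IUPAC codes R = A or G and Y = C or T) on the strand supplied. The function names and the toy sequence are hypothetical, and the expected count assumes that all four bases are equally likely.

```python
import re

# IUPAC codes needed for the motifs below: R = purine, Y = pyrimidine.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]"}
TOPO_I_MOTIFS = ("CAT", "RAT", "CTY", "GTY")  # sites as listed in the text


def at_fraction(seq):
    """Fraction of A and T among the unambiguous bases of seq."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G, or T")
    return sum(b in "AT" for b in bases) / len(bases)


def site_positions(seq, motifs=TOPO_I_MOTIFS):
    """0-based start positions of motif matches on the given strand."""
    alternatives = "|".join("".join(IUPAC[c] for c in m) for m in motifs)
    # A lookahead keeps matches zero-width so overlapping sites are all found.
    pattern = re.compile(f"(?=(?:{alternatives}))")
    return [m.start() for m in pattern.finditer(seq.upper())]


def expected_sites_uniform(length, motif_len=3, matching=7):
    """Expected site count if all four bases were equally likely.

    The four motifs expand to 7 of the 64 possible trinucleotides
    (CAT, AAT, GAT, CTC, CTT, GTC, GTT).
    """
    return max(length - motif_len + 1, 0) * matching / 4**motif_len


toy = "CATGATCTTGTC"  # hypothetical sequence, not from the haptoglobin cluster
print(f"AT content: {at_fraction(toy):.0%}")
print("site start positions:", site_positions(toy))
print("expected under uniform composition:", expected_sites_uniform(len(toy)))
```

Comparing the observed count in each zone with such an expectation, ideally adjusted for the actual base composition and for both strands, is one way to judge whether sites are clustered or consistent with a random distribution.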

[Figure 5: schematic of two models for the evolution of the haptoglobin gene cluster. Step labels in Model 1: Duplication; Triplication; Insertions of Alu2a, Alu2b, and Alu5b; Recombination between Alu elements at junctions 1 and 2. Step labels in Model 2: Duplication by recombination between Alu elements 5a and 5b; Triplication; Insertions of Alu2a and Alu2b; Recombination between Alu elements at junction 1. Key entries: Alu sequence; original junction point.]

FIG. 5. Evolution of the haptoglobin gene family. Model 1 shows an evolutionary history in which both of the original junction points are lost. Model 2 shows a history in which the original junction point is seen at junction 2. Solid boxes are ancestral haptoglobin genes. The open circle represents the recombination points for the original gene duplication and the original junction point between the duplicated genes. Triangles are the Alu sequences involved in the junctions, numbered according to Fig. 3. Alu sequences 2ab and 5ab are the fused Alu sequences at junctions 1 and 2, respectively. The HP, HPR, and HPP genes are shown as boxes with left diagonal shading, spotted shading, and right diagonal shading, respectively. Dashed lines indicate deletions caused by recombination between Alu sequences.

Formation of the Primate Haptoglobin Gene Cluster

To determine the origin of the haptoglobin multigene family, we have analyzed both junctions between the genes and compared them to the flanking ends of the gene cluster. We expected junction 1 (HP-HPR junction) and junction 2 (HPR-HPP junction) to be identical, but discovered that although both junctions were defined by Alu sequences, the Alu sequences had different orientations and different histories. The Alu sequence at junction 1 is the result of homologous Alu-Alu recombination that deleted the original junction point; it therefore cannot be used to analyze the origin of the haptoglobin gene family. The Alu sequence at junction 2 is also the result of Alu-Alu recombination, but at least one of the Alu sequences involved predates the triplication and therefore might reflect the origin of the haptoglobin gene family.

The simplest explanation of the Alu sequence at junction 2 is that it is the result of an Alu-Alu recombination that deleted the original junction point, as in junction 1. This explanation is illustrated in model 1 of the evolution of the haptoglobin gene family (Fig. 5). In model 1, the ancestral gene contained the complete Alu5a sequence and the original junction point was located 3' to the Alu5a sequence. The mechanism of the original gene duplication event could not be determined. When the gene family expanded to three genes, the original junction point was replicated. Three independent insertions of Alu sequences occurred: Alu5b in the 5'HPP region, Alu2a 3' to HP, and Alu2b 5' to HPR. The contemporary

junction 2 is the result of recombination between Alu5b and the already present Alu5a. The contemporary junction 1 is the result of a homologous recombination between the inserted Alu2a and Alu2b sequences.

An alternative explanation is that junction 2 reflects the junction point of the original duplication. Model 2 (Fig. 5) assumes that the original duplication event was caused by recombination between Alu elements flanking the ancestral gene (corresponding to Alu5b in the 5'-flanking region and Alu5a in the 3'-flanking region). The original junction point was formed by a fusion of Alu5a and Alu5b (Alu5ab). It was replicated when the gene family expanded to three genes. The insertion of two independent Alu sequences (Alu2a and Alu2b) at junction 1 and their subsequent recombination caused the loss of one copy of the original junction point.

For model 2 to be valid, the lack of an Alu element equivalent to Alu5b in the region 5' to the HP gene must be explained. If the Alu5a and Alu5b elements from junction 2 were involved in the original duplication, we expect to find them represented in at least two genes. (Because the corresponding sequence from junction 1 has been deleted, there is no information on the presence of Alu5a or Alu5b in that region.) Alu5a occurs in the 3' regions of both HPR and HPP, as expected. The other partner in the recombination, Alu5b, is found only once in the cluster, as part of the 5' region of HPP. The single occurrence of Alu5b can be explained either by a deletion that removed the ancestral Alu5b from the 5'HP region or by an insertion of extraneous sequence in the 5'HP


region such that the extent of our sequence has not yet reached Alu5b. It can also be explained by polymorphism. If the ancestral haptoglobin gene had two allelic forms, one with Alu5b inserted 5' to the gene and the other without an Alu5b, recombination between the Alu5b in one allele and the Alu5a in the other allele could have led to a duplicated gene with a fused Alu5ab at its junction. The hypothesis of polymorphism is supported by the observation that the Alu1 sequence in the first intron of the HP and HPR genes is not present in intron 1 of the HPP gene. The Alu1 sequence is present in the single haptoglobin gene in spider monkeys (Erickson et al., unpublished), indicating that the Alu1 insertion predates the gene duplication event. Either Alu1 has been precisely excised from HPP or Alu1 was present in only one of the two ancestral alleles that were involved in the recombination.

The sequence information obtained from the human and chimpanzee haptoglobin genes is insufficient to allow us to choose between model 1 and model 2. The analysis of the haptoglobin genes of another Old World primate will be necessary to decide the issue. The detection of the complete Alu5a or Alu5b sequences (including direct repeats) at junction 2 and a different junction point between them would support model 1. The ascertainment of an Alu sequence equivalent to Alu5b in the 5' region of the first gene or evidence of a deletion or insertion in that region would support model 2.

The explanation of the junction 1 Alu sequence is the same in both models. The junction 1 Alu sequence was generated as a consequence of homologous recombination between one Alu element (Alu2a) inserted into the 3' part of the HP region and another (Alu2b) inserted into the 5' part of the HPR region. The amount of DNA deleted by the Alu-Alu recombination was at least 3.6 kb, including the sequences corresponding to the original junction between the duplicated genes.
The 60-bp gap in the junction 2 Alu sequence also has the same explanation in both models. The recombination between Alu5a and Alu5b was either a nonhomologous recombination resulting in an incomplete Alu sequence or a homologous recombination followed by a separate event that produced the 60-bp deletion. The nonhomologous recombination is the simpler explanation because it requires only one event to explain the junction 2 Alu sequence, while the explanation of homologous recombination requires an additional event (deletion) at the same position. Therefore, we refer to the recombination that created the contemporary junction 2 Alu sequence as a nonhomologous recombination.

Several investigators have suggested that Alu sequences can be classified into subfamilies based on diagnostic mutations and that these subfamilies are formed by amplification of a small number of master genes at different times during primate evolution (Willard et al., 1987; Britten et al., 1988; Jurka and Milosavljevic, 1991). A search for these diagnostic mutations in the Alu sequences involved in junctions 1 and 2 indicates that Alu2a, Alu5a, and Alu5b belong to the primate-specific

(PS) subfamily (Shen et al., 1991), the oldest subfamily found in all primates. The Alu2b sequence is too short to allow a definite assignment, but it belongs either to the PS subfamily or to the anthropoid-specific (AS) subfamily, found in both Old World and New World primates. The AS master genes appear to have replaced the PS master genes before the separation of the Old World from New World primate lineages (Shen et al., 1991). This factor argues that Alu2a, Alu2b, Alu5a, and Alu5b were inserted prior to the separation of Old World from New World primate lineages, a conclusion that somewhat contradicts our models in which certain Alu sequences were inserted into the genome after the formation of the triplicated gene cluster in the Old World primate lineage. It is possible that both the PS and AS master genes remained active for a short time after the divergence of Old World and New World primate lineages. It is also possible that the alleles involved in the gene duplication events carried Alu sequences at different positions. In either case, the data suggest that the gene duplication/triplication and the formation of the two junctions occurred close to the time when the New World and Old World primates diverged.

Recombination between Repetitive Elements during the Evolution of Multigene Families

The consequences of Alu-Alu recombination in the genome are often deleterious. Various clinical disorders have been caused by deletions resulting from recombination between Alu sequences in the LDL receptor gene (Lehrman et al., 1985, 1987a,b; Miyake et al., 1989), the C1 inhibitor gene (Ariga et al., 1990; Stoppa-Lyonnet et al., 1990), the α-globin genes (Nicholls et al., 1987), and the adenosine deaminase gene (Markert et al., 1988; Berkvens et al., 1990). A sex chromosome translocation in humans has been identified as resulting from Alu-Alu recombination between X and Y chromosomes (Rouyer et al., 1987). Duplications caused by Alu-Alu recombination can also be deleterious. Certain cases of muscular dystrophy and of familial hypercholesterolemia have been shown to be due to partial duplications of the muscular dystrophy gene (Hu et al., 1991) and the LDL receptor gene (Lehrman et al., 1985) produced by homologous recombination between two Alu sequences within the genes.

The duplication of an entire gene, however, can be highly advantageous as a foundation for future evolutionary modifications. Multiple copies of a gene allow greater opportunities for changes in expression or function of genes, and protect the organism from lethal mutations in one copy. Crossovers between repetitive elements flanking a gene can be important mechanisms for the gene duplication which is the crucial first step in the generation of a multigene family from an existing single gene. Jeffreys and Harris (1982), emphasizing the prevalence of families of short repetitive elements in mammalian genomes, pointed out their possible importance during the evolution of multigene families. The few known examples of the origins of gene duplication support their prediction: homologous recombinations between L1 repeats flanking a human fetal globin progenitor gene (Shen et al., 1981; Maeda and Smithies, 1986; Fitch et al., 1991) and between B2 repeats flanking a mouse lysozyme progenitor gene (Cross and Renkawitz, 1990) have been implicated as initial events leading to the duplication of these genes. The human growth hormone gene cluster was also proposed to have been formed by several duplications involving homologous recombinations between Alu sequences (Barsh et al., 1983; Chen et al., 1989).

The human haptoglobin gene cluster with its two tandem genes has an inherently complex history, much of which is affected by Alu-Alu recombination. Within the last 40 million years, the region containing the haptoglobin gene was duplicated and then triplicated. Various elements (three retrovirus-like sequences of a single family, at least seven Alu elements, and a processed pseudogene) were inserted into the gene cluster and several of these elements have been involved in subsequent recombinations. Two Alu elements between the HP and HPR genes crossed over homologously, which led to the deletion of at least 3.6 kb of DNA including the original junction point between the HP and HPR genes. Finally, after humans diverged from other primate lineages, the human HPR gene was formed by a deletion involving a homologous unequal crossover between the HPR and HPP genes. This deletion eliminated not only one of the haptoglobin genes, but also one of the retrovirus-like sequences, the junction between the HPR and HPP genes, and a processed pseudogene.

CONCLUDING REMARKS

Our analysis emphasizes the importance of Alu sequences during the evolution of the haptoglobin gene cluster. Although the nature of the original haptoglobin gene duplication cannot be determined with certainty, the presence of Alu sequences at the ends of the duplication unit suggests either that the original duplication occurred by homologous unequal recombination between Alu sequences flanking the ancestral haptoglobin gene or that both of the original junction points between the duplicated genes have been deleted by Alu-Alu recombination. The possibility that both of the original junction points were deleted by Alu-Alu recombination suggests a new function for Alu sequences in evolution.
Deletion of DNA sequences by Alu-Alu recombination might be advantageous if the sequences between the two Alu elements were detrimental. In the haptoglobin gene family, the junction regions between the genes are associated with an apparently unusual number of insertions and deletions. Perhaps the original junction point was detrimental or created a particularly strong recombination hot spot. In such circumstances, the loss of the original junction sequences could have stabilized the gene family. Thus, in general, the presence of many repetitive elements in the mammalian genome allows for multiple and varied recombinational events, which can have important effects on the evolution of the complex genome structure.

ACKNOWLEDGMENTS

The authors thank Ms. Susan McEvoy and Sylvia Hiller for technical help, and Dr. Oliver Smithies, Dr. Marshall Edgell, and Dr. Emily Reisner for critical reading of our manuscript. L.M.E. is a dissertator at the Department of Genetics in the University of Wisconsin-Madison. This work was supported by National Institutes of Health Grant GM-34357.

REFERENCES

Ariga, T., Carter, P. E., and Davis, A. E., III. (1990). Recombinations between Alu repeat sequences that result in partial deletions within the C1 inhibitor gene. Genomics 8: 607-613.
Bae, Y. S., Kawasaki, I., Ikeda, H., and Liu, L. F. (1988). Illegitimate recombination mediated by calf thymus DNA topoisomerase II in vitro. Proc. Natl. Acad. Sci. USA 85: 2076-2080.
Barsh, G. S., Seeburg, P. H., and Gelinas, R. E. (1983). The human growth hormone gene family: Structure and evolution of the chromosomal locus. Nucleic Acids Res. 11: 3939-3958.
Berkvens, T. M., Van Ormondt, H., Gerritsen, E. J. A., Khan, P. M., and Van Der Eb, A. J. (1990). Identical 3250-bp deletion between two AluI repeats in the ADA genes of unrelated ADA-SCID patients. Genomics 7: 486-490.
Britten, R. J., Baron, W. F., Stout, D. B., and Davidson, E. H. (1988). Sources and evolution of human Alu repeated sequences. Proc. Natl. Acad. Sci. USA 85: 4770-4774.
Bulmer, M. (1986). Neighboring base effects on substitution rates in pseudogenes. Mol. Biol. Evol. 3: 322-329.
Chen, E. Y., Cheng, A., Lee, A., et al. (1991). Sequence of human glucose-6-phosphate dehydrogenase cloned in plasmids and a yeast artificial chromosome. Genomics 10: 792-800.
Chen, E. Y., Liao, Y. C., Smith, D. H., Barrera-Saldaña, H. A., Gelinas, R. E., and Seeburg, P. H. (1989). The human growth hormone locus: Nucleotide sequence, biology, and evolution. Genomics 4: 479-497.
Cross, M., and Renkawitz, R. (1990). Repetitive sequence involvement in the duplication and divergence of mouse lysozyme genes. EMBO J. 9: 1283-1288.
Daniels, G. R., and Deininger, P. L. (1985). Integration site preferences of the Alu family and similar repetitive DNA sequences. Nucleic Acids Res. 13: 8939-8954.
Deininger, P. L., Jolly, D. J., Rubin, C. M., Friedmann, T., and Schmid, C. W. (1981). Base sequences of 300 nucleotide renatured repeated human DNA clones. J. Mol. Biol. 151: 17-33.
Devereux, J., Haeberli, P., and Smithies, O. (1984). A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 12: 387-395.
Fitch, D. H. A., Bailey, W. J., Tagle, D. A., Goodman, M., Sieu, L., and Slightom, J. L. (1991). Duplication of the gamma-globin gene mediated by L1 long interspersed repetitive elements in an early ancestor of simian primates. Proc. Natl. Acad. Sci. USA 88: 7396-7400.
Gingerich, P. D. (1984). Primate evolution: Evidence from the fossil record, comparative morphology, and molecular biology. Yearb. Phys. Anthropol. 27: 57-72.
Haynes, S. R., Toomey, T. P., Leinwand, L., and Jelinek, W. R. (1981). The Chinese hamster Alu-equivalent sequence: A conserved, highly repetitious, interspersed deoxyribonucleic acid sequence in mammals has a structure suggestive of a transposable element. Mol. Cell. Biol. 1: 573-583.
Hu, X., Ray, P. N., and Worton, R. G. (1991). Mechanisms of tandem duplication in the Duchenne muscular dystrophy gene include both homologous and nonhomologous intrachromosomal recombination. EMBO J. 10: 2471-2477.
Jeffreys, A. J., and Harris, S. (1982). Processes of gene duplication. Nature 296: 9-10.
Jurka, J., and Milosavljevic, A. (1991). Reconstruction and analysis of human Alu genes. J. Mol. Evol. 32: 105-121.
Kmiec, E. B., and Holloman, W. K. (1986). Homologous pairing of DNA molecules by Ustilago rec1 protein is promoted by sequences of Z-DNA. Cell 44: 545-554.
Kobori, J. A., Strauss, E., Minard, K., and Hood, L. (1986). Molecular analysis of the hotspot of recombination in the murine major histocompatibility complex. Science 234: 173-179.
Konopka, A. K. (1988). Compilation of DNA strand exchange sites for non-homologous recombination in somatic cells. Nucleic Acids Res. 16: 1739-1758.
Lehrman, M. A., Goldstein, J. L., Russell, D. W., and Brown, M. S. (1987a). Duplication of seven exons in LDL receptor gene caused by Alu-Alu recombination in a subject with familial hypercholesterolemia. Cell 48: 827-835.
Lehrman, M. A., Russell, D. W., Goldstein, J. L., and Brown, M. S. (1987b). Alu-Alu recombination deletes splice acceptor sites and produces secreted low density lipoprotein receptor in a subject with familial hypercholesterolemia. J. Biol. Chem. 262: 3354-3361.
Lehrman, M. A., Schneider, W. J., Sudhof, T. C., Brown, M. S., Goldstein, J. L., and Russell, D. W. (1985). Mutation in LDL receptor: Alu-Alu recombination deletes exons encoding transmembrane and cytoplasmic domains. Science 227: 140-146.
Maeda, N. (1985). Nucleotide sequence of the haptoglobin and haptoglobin-related gene pair: The haptoglobin-related gene contains a retrovirus-like element. J. Biol. Chem. 260: 6698-6709.
Maeda, N., and Kim, H. S. (1990). Three independent insertions of retrovirus-like sequences in the haptoglobin gene cluster of primates. Genomics 8: 671-683.
Maeda, N., and Smithies, O. (1986). The evolution of multigene families: Human haptoglobin genes. Annu. Rev. Genet. 20: 81-108.
Markert, M. L., Hutton, J. J., Wiginton, D. A., States, J. C., and Kaufman, R. E. (1988). Adenosine deaminase (ADA) deficiency due to deletion of the ADA gene promoter and first exon by homologous recombination between two Alu elements. J. Clin. Invest. 81: 1323-1327.
Maxam, A. M., and Gilbert, W. (1977). A new method for sequencing DNA. Proc. Natl. Acad. Sci. USA 74: 560-564.
McEvoy, S. M., and Maeda, N. (1988). Complex events in the evolution of the haptoglobin gene cluster in primates. J. Biol. Chem. 263: 15740-15747.
Miyake, Y., Tajima, S., Funahashi, T., and Yamamoto, A. (1989). Analysis of a recycling-impaired mutant of low density lipoprotein receptor in familial hypercholesterolemia. J. Biol. Chem. 264: 16584-16590.
Nicholls, R. D., Fischel-Ghodsian, N., and Higgs, D. R. (1987). Recombination at the human alpha-globin gene cluster: Sequence features and topological constraints. Cell 49: 369-378.
Osheroff, N. (1989). Biochemical basis for the interactions of type 1 and type 2 topoisomerases with DNA. Pharmacol. Ther. 41: 223-241.
Rich, B. E., and Steitz, J. A. (1987). Human acidic ribosomal phosphoproteins P0, P1, and P2: Analysis of cDNA clones, in vitro synthesis, and assembly. Mol. Cell. Biol. 7: 4065-4074.
Rouyer, F., Simmler, M.-C., Page, D. C., and Weissenbach, J. (1987). A sex chromosome rearrangement in a human XX male caused by Alu-Alu recombination. Cell 51: 417-425.
Sanger, F., Nicklen, S., and Coulson, A. R. (1977). DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA 74: 5463-5467.
Shen, M. R., Batzer, M. A., and Deininger, P. L. (1991). Evolution of the master Alu gene(s). J. Mol. Evol. 33: 311-320.
Shen, S. H., Slightom, J. L., and Smithies, O. (1981). A history of the human fetal globin gene duplication. Cell 26: 191-203.
Sibley, C. G., and Ahlquist, J. E. (1987). DNA hybridization evidence of hominoid phylogeny: Results from an expanded data set. J. Mol. Evol. 26: 99-121.
Smith, G. R. (1983). Chi hotspots of generalized recombination. Cell 34: 709-710.
Spitzner, J. R., and Muller, M. T. (1988). A consensus sequence for cleavage by vertebrate DNA topoisomerase II. Nucleic Acids Res. 16: 5533-5556.
Stallings, R. L., Ford, A. F., Nelson, D., Torney, D. C., Hildebrand, C. E., and Moyzis, R. K. (1991). Evolution and distribution of (GT)n repetitive sequences in mammalian genomes. Genomics 10: 807-815.
Stoppa-Lyonnet, D., Carter, P. E., Meo, T., and Tosi, M. (1990). Clusters of intragenic Alu repeats predispose the human C1 inhibitor locus to deleterious rearrangements. Proc. Natl. Acad. Sci. USA 87: 1551-1555.
Wahls, W. P., Wallace, L. J., and Moore, P. D. (1990). The Z-DNA motif d(TG)30 promotes reception of information during gene conversion events while stimulating homologous recombination in human cells in culture. Mol. Cell. Biol. 10: 785-793.
Willard, C., Nguyen, H. T., and Schmid, C. W. (1987). Existence of at least three distinct Alu subfamilies. J. Mol. Evol. 26: 180-186.
