Proc. Natl. Acad. Sci. USA Vol. 89, pp. 1075-1079, February 1992 Genetics

Duplication-targeted DNA methylation and mutagenesis in the evolution of eukaryotic chromosomes (repeat-induced point mutation/CpG dinucleotide)

MAJA C. KRICKER, JOHN W. DRAKE*, AND MIROSLAV RADMAN† Laboratory of Molecular Genetics, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709

Communicated by Carl R. Woese, October 28, 1991

only 85% identical. This result again suggests that efficient ectopic recombination requires highly homologous segments and that the divergence typical of Alu sequences is sufficient to thwart recombination between them. Thus, it is likely that the existing sequence divergence and polymorphism among repeated sequences account for the current stability of eukaryotic genomes. But how is stability achieved after the amplification of sequences when the resulting repeats are identical? The spread of identical repeats such as transposons should create a genetic time bomb unless efficient mechanisms exist to suppress their amplification and recombination. Sequence divergence by rare spontaneous mutations is not likely to be an efficient mechanism. However, an efficient germ-line process capable of identifying, specifically modifying, and mutating repetitive sequences has been demonstrated in fungal systems. In Neurospora crassa and Ascobolus immersus, DNA duplications that enter the sexual cycle are often extensively methylated, leading to their immediate functional inactivation. In N. crassa, methylation at cytosines is accompanied by a very high rate of G·C → A·T mutation, a phenomenon designated as "repeat-induced point mutation" (RIP) (9) or ripping. Because C → T mutagenesis is associated with extensive cytosine methylation, ripping may occur by the facilitated deamination of 5-methylcytosine (5MeC) to thymine. Despite the existence of a G·T → G·C mismatch repair system, 5MeC acts as an intermediate in C → T mutagenesis both in bacteria and in mammalian cells (10-12). In vertebrate DNA, about 60-90% of CpG dinucleotides are methylated at the 5 position of cytosine (13-15). CpGs occur at only about 1/5 of the expected frequency in bulk DNA, suggesting 5MeC deamination to yield thymine.
About 1% of total vertebrate DNA consists of unmethylated "islands" that have the CpG frequencies expected from their base composition and are composed of unique sequences by the criterion of DNA reassociation. These islands contain about half of the unmethylated CpGs in the genome. They are suspected of playing a role in gene expression because they are associated with housekeeping genes and some tissue-specific genes and because the transcription of some genes with Hpa II tiny fragment (HTF) islands is inhibited when the island is methylated (16). We propose that, in addition to its putative role in gene control, CpG methylation has an important role in the evolution and stability of chromosome structure: it provides a means to specifically mark and diversify duplicated sequences and thereby to protect against recombination-mediated chromosome rearrangements. At present there is

ABSTRACT Mammalian genomes are threatened with gene inactivation and chromosomal scrambling by recombination between repeated sequences such as mobile genetic elements and pseudogenes. We present and test a model for a defensive strategy based on the methylation and subsequent mutation of CpG dinucleotides in those DNA duplications that create uninterrupted homologous sequences longer than about 0.3 kilobases. The model helps to explain both the diversity of CpG frequencies in different genes and the persistence of gene fragmentation into exons and introns.

The human genome harbors about a million interspersed repetitive elements and many gene families, affording a myriad of opportunities for ectopic (out-of-register) recombination generating deletions, additions, and chromosomal rearrangements (1) that can result in disability and disease. Even if the recombination frequency between duplicated sequences were as low as 10^-6, the majority of cells would accumulate gross chromosomal rearrangements or aberrations. Yet mammalian genomes are remarkably stable: chromosomal rearrangements caused by inter-repeat recombination are seen only rarely, as in cellular events causing hereditary disease or cancer. What molecular mechanisms suppress ectopic recombination among repeated sequences in animal and plant chromosomes while allowing those chromosomes to recombine accurately as sister chromatids in mitosis and as homologs in meiosis? Studies in bacteria, yeast, and mammalian cells show that homologous recombination is inhibited by decreasing the degree of sequence similarity; reducing similarity by only a few percent sharply reduces recombination (2-5). These effects can be explained by the known properties of recombination enzymes and the specificity of mismatch repair as an editor of recombination in bacteria (2-4, 6). Defects in bacterial mismatch repair greatly increase both interspecies recombination (6) and chromosomal rearrangements resulting from intrachromosomal recombination between 3.7-kilobase (kb) sequence repeats diverged by about 3-4% (7). Intrachromosomal recombination between identical repeats in mouse cells is an efficient process if segments of sequence identity are longer than about 0.3 kb. However, 19% sequence divergence in the same repeats inhibits their recombination by 1000-fold compared with identical repeats (5). The human growth hormone gene GH1 is flanked by highly homologous sequences and by 48 Alu elements. Familial growth hormone deficiency type 1A is caused by ectopic recombination that deletes both GH1 alleles on homologous chromosomes (8). Of 10 independent deletions, 9 occurred within 99%-identical 594-nucleotide (nt) segments and one within 98%-identical 274-nt segments flanking the GH1 gene. No deletion breakpoints were in Alu sequences, which are

Abbreviations: nt, nucleotide(s); 5MeC, 5-methylcytosine; LINE, long interspersed element; MUP, major urinary protein; TPI, triosephosphate isomerase; SINE, short interspersed element; TK, thymidine kinase.
*To whom reprint requests should be addressed.
†Permanent address: Institut Jacques Monod, 75251 Paris Cedex 05, France.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.


Genetics: Kricker et al.

no coherent explanation why some DNA segments are methylated while others are not. CpG islands extend from ~0.5 to >3 kb and can include the 5' ends of genes, the first few exons, portions of introns, or the 3' regions of genes. We propose that CpG islands remain unmethylated because they are unique sequences. We further suggest that precisely duplicated sequences are targeted for methylation by the pairing of homologous regions. In support of this hypothesis, we present evidence from sequence analyses demonstrating that repeated sequences in mammals preferentially experience a high frequency of transition mutations at sites of cytosine methylation. We propose that this process transforms actively amplifying sequences into inert nonrecombining DNA, thus stabilizing vertebrate genomes. We then estimate the minimum length of homologous sequence required for such a process and suggest that gene fragmentation into exons protects coding sequences against homologous interactions with their own processed pseudogenes.

METHODS The process we propose will leave specific traces in the pattern of DNA sequence evolution. Because the most frequently methylated sequence in vertebrates is the dinucleotide CpG, the cytosine deamination model predicts that repeated sequences should display a history of numerous transitions from CpG to TpG and CpA dinucleotides. Thus, gene families and repeated sequences would have low CpG frequencies compared with unique or unmethylated sequences. We therefore compared the nucleotide and dinucleotide compositions of members of functional gene families, interspersed repetitive elements, pseudogenes, housekeeping genes, introns, and unmethylated sequences. Sequences were obtained from GenBank or EMBL data bases provided by the Genetics Computer Group. Nucleotide compositions and sequence alignments were generated by using the Genetics Computer Group sequence analysis software package (17). In each sequence, the observed CpG frequency was normalized to that expected from the product of its cytosine and guanine frequencies. We note, but will not here elaborate, that other normalizations are possible, such as comparisons against reciprocal GpC, GpT, and ApC dinucleotides or against hypothetical unripped consensus sequences. While perhaps appealing, these alternative normalizations involve additional assumptions and complicate the analyses. They tend to increase the contrasts between unique and repeated sequences and thus to strengthen our conclusions. Mammalian genomes contain over 101 copies of long interspersed elements (LINEs). Each mammalian species contains a single LINE family, always designated L1 regardless of species (18). Each LINE contains two open reading frames, the second homologous to reverse transcriptase genes (19). We analyzed the second open reading frame of the consensus L1 sequence in three species of mice and one of rats, and L1 open reading frames from the factor VIII and hemoglobin genes of humans.
We examined several families of functional genes. (i) The primate β-globin gene family comprises the adult β- and δ-globins, the embryonic ε-globin, the fetal Gγ- and Aγ-globins, and the β pseudogene. All six apparently derive from a single ancestral gene (20). We analyzed all three exons from each gene in the human cluster and the exons of the δ- and -globins from several species. (ii) Mouse major urinary proteins (MUPs) are the most abundant proteins of the male mouse liver. Most of the 35 MUP genes have been classified into two groups with approximately 15 members each (21). We analyzed the seven exons from a representative of the actively expressed group 1 genes. (iii) The cytochrome P450 superfamily comprises 20 gene families, 10 of which are present in all mammals (22). There are 60-200 functional


cytochrome P450 genes in any mammalian species. We analyzed the first exon of two closely related genes in the rabbit alcohol-inducible cytochrome P450 subfamily. (iv) Mammalian genomes have ~35 different cytochrome c sequences (23). We analyzed the single functional human gene. Pseudogenes are under relaxed constraints compared with their progenitor genes. We analyzed the coding sequences of a group-2 MUP pseudogene, three human cytochrome c pseudogenes, a human β-globin pseudogene, and three human triose-phosphate isomerase (TPI) pseudogenes. For unique sequences, we examined 12 housekeeping genes for which there was no evidence of homology with other sequences except for processed pseudogenes. When repeated sequences are eliminated, introns contain unique sequences less subject to selective pressures than exons of housekeeping genes, as witnessed by frequent intra-intronic insertions of repeated elements. We analyzed four examples that were apparently free of insertions. Active mammalian α-globin genes are unmethylated. The ζ-globin gene encodes a fetal form of α-globin replaced during development by the α1 and α2 globins; while β-globin cytosines are methylated, α-globin cytosines are not (24). Drosophila and yeast do not detectably methylate their cytosines (25), and we analyzed three Drosophila LINE-like transposons and a yeast Ty1 element. The mammalian genome contains about 5 × 10^5 copies of short interspersed elements (SINEs), including the human Alu family (26). We examined eight Alu sequences lurking in the introns of the human complement component C1 inhibitor gene, and one newly inserted Alu (27). Diverse criteria were used to choose sequences for analysis. The longest LINEs were chosen to reduce sampling error. Globins were chosen because sequences were available for most members of the globin family from a variety of species and because some were unmethylated; genes from other families were simply chosen without conscious bias. Pseudogenes were chosen as derivatives of either unique genes or functional gene families. A convenient set of Alu sequences was chosen from a single locus; Alu sequences within other regions gave similar results. Housekeeping genes were chosen for which information was available on the presence or absence of pseudogenes. Intron sequences were chosen to be free of repetitive elements. Mobile elements among nonmethylating organisms were chosen for similarity of evolutionary origin to vertebrate elements.
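The CpG normalization described in the Methods can be illustrated with a short sketch. It assumes the definition of the expected count given in the footnote to Table 1, (number of C residues) × (number of G residues) ÷ (total bases). The function name `cpg_observed_expected` and the choice to return NaN for sequences lacking C or G are illustrative assumptions; the original analysis used the Genetics Computer Group package, not this code.

```python
def cpg_observed_expected(seq: str) -> float:
    """Observed/expected CpG ratio for a single DNA strand written 5'->3'.

    Expected CpGs = (#C * #G) / length, following the Table 1 footnote.
    Returns NaN when the expected count is zero (hypothetical convention).
    """
    s = seq.upper()
    total = len(s)
    if total == 0:
        return float("nan")
    observed = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    expected = s.count("C") * s.count("G") / total
    if expected == 0:
        return float("nan")
    return observed / expected


if __name__ == "__main__":
    # One CpG observed; expected = 2 * 2 / 4 = 1
    print(cpg_observed_expected("CCGG"))
```

A ratio near 1 indicates no CpG deficit relative to base composition, whereas values well below 1 indicate depletion of the kind the deamination model predicts for repeated sequences.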

RESULTS AND DISCUSSION Patterns of CpG Depletion. The results of our analyses appear in Table 1. It is immediately obvious that most repeated sequences (LINEs, members of functional gene families, and repeated pseudogenes) contain substantial deficits of CpG dinucleotides compared with most unique sequences (housekeeping genes and unique introns). Table 1 also reveals that repeated but unmethylated sequences (the mammalian α-globin genes and transposons in organisms that do not methylate their DNA) contain substantially more CpGs than do mammalian repetitive sequences or even housekeeping genes. Thus, methylation and ripping are closely associated, as expected if 5MeC is indeed a mechanistic intermediate in the accelerated loss of CpGs. The special case of α-globin genes is discussed below. We performed pairwise comparisons of the means of observed/expected CpG frequencies for each group using the modified Tukey-Kramer method that adjusts for unequal sample sizes (28, 29). The means of repeated groups (LINEs, pseudogenes, members of functional gene families) are not significantly different from each other at the 95% confidence level, but they differ significantly from the means of the


Table 1. CpG dinucleotides in different classes of genes

Gene or sequence                          Size, bp      O      E     O/E

LINEs
  Consensus Mus domesticus                     315      0     14   0.000
  Consensus Mus caroli                         315      1     13   0.078
  Consensus Mus platythrix                     315      1     12   0.085
  L1-h in human β-globin                      2101     10     76   0.132
  L1-c in human factor VIII                   2145     18     86   0.210
  L1 rat consensus                             319      3     12   0.247
  L1-b in human factor VIII                   2145     21     84   0.250
  L1-a in human factor VIII                    477      5     19   0.268
  Mean                                                             0.159
  SD                                                               0.099

Members of functional gene families
  Orangutan δ-globin                           444      2     32   0.063
  Human Aγ-globin                              441      2     32   0.063
  Human Gγ-globin                              441      2     30   0.066
  Human δ-globin                               441      3     32   0.093
  Human β-globin                               441      5     35   0.142
  Human ε-globin                               441      6     35   0.174
  Goat -globin                                 438      6     32   0.186
  Spider monkey δ-globin                       443      7     35   0.199
  Tarsius syrichta δ-globin                    444      8     35   0.232
  Rabbit Cyt P450 IIE2 exon 1                  398      7     28   0.249
  Human Cyt c                                  318      4     15   0.264
  Human HPRT                                   657      8     27   0.299
  MUP group 1                                  543      7     23   0.308
  Rabbit Cyt P450 IIE1 exon 1                  401     10     30   0.332
  Mean                                                             0.191
  SD                                                               0.095

Processed pseudogenes
  Human Cyt c pseudogene 1                     318      1     14   0.073
  MUP group 2 exons 1-7                        776      3     29   0.103
  Human Cyt c pseudogene 2                     317      2     15   0.136
  Human β-globin pseudogene                    439      5     27   0.189
  Human Cyt c pseudogene a                     309      3     15   0.200
  Human TPI pseudogene 13C                    1005     15     53   0.217
  Human TPI pseudogene 19A                    1003     24     71   0.338
  Human TPI pseudogene 5A                     1280     32     81   0.393
  Mean                                                             0.206
  SD                                                               0.111

Unique housekeeping genes
  Human HMG-CoA reductase                     2667     40    128   0.312
  Human G3P dehydrogenase                     1008     29     77   0.379
  Eel Na channel protein                      5463    113    295   0.383
  Human DNA polymerase β                      1008     18     43   0.423
  Mouse TK                                     621     23     47   0.490
  Human Na/K ATPase                            912     24     48   0.499
  Human Cu/Zn SOD                              465     14     27   0.523
  Human TK                                     705     35     64   0.546
  Human APRT                                   841     49     88   0.560
  Human TPI                                   1076     37     62   0.594
  Human G6P dehydrogenase                     1440     77    127   0.606
  Chicken Na/K ATPase                          918     38     63   0.606
  Mean                                                             0.493
  SD                                                               0.098

Unique introns
  Human TK intron d                           1503     42    120   0.349
  Human APRT intron 2                          993     41    100   0.409
  Human HPRT intron a 5'                      1166     58     88   0.659
  Cyt c intron a                              1073     46     64   0.723
  Mean                                                             0.535
  SD                                                               0.183

Unmethylated mammalian genes
  Orangutan α1-globin                          429     33     44   0.760
  Orangutan α2-globin                          429     34     44   0.776
  Human ζ-globin                               429     42     45   0.943
  Mean                                                             0.826
  SD                                                               0.101

Repeated sequences in nonmethylating organisms
  Drosophila jockey element                   1752     62     84   0.735
  Yeast Ty1-H3 element ORF b                  3985    112    137   0.818
  Drosophila Fw element ORF 1                 2576    110    124   0.890
  Drosophila I factor                         1290     39     43   0.899
  Mean                                                             0.835
  SD                                                               0.076

bp, Base pairs; O, observed number of CpGs in a sequence; E, expected number of CpGs in a sequence = (number of C residues in sequence) × (number of G residues in sequence) ÷ (total bases in sequence). APRT, adenosine phosphoribosyltransferase; Cyt, cytochrome; G3P, glyceraldehyde 3-phosphate; G6P, glucose 6-phosphate; HMG-CoA, hydroxymethylglutaryl coenzyme A; HPRT, hypoxanthine phosphoribosyltransferase; ORF, open reading frame; SOD, superoxide dismutase; TK, thymidine kinase.
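As a rough check on the group summaries in Table 1, the sketch below recomputes the mean and standard deviation of the O/E values for two groups. The O/E values are copied from the table; the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not say which estimator was used, and the helper name `summarize` is hypothetical.

```python
from statistics import mean, stdev

# O/E values copied from Table 1.
LINE_OE = [0.000, 0.078, 0.085, 0.132, 0.210, 0.247, 0.250, 0.268]
UNMETHYLATED_OE = [0.760, 0.776, 0.943]


def summarize(values):
    """Return (mean, sample SD) rounded to three decimals.

    Assumes the n - 1 (sample) standard deviation, which the paper
    does not state explicitly.
    """
    return round(mean(values), 3), round(stdev(values), 3)


if __name__ == "__main__":
    print("LINEs:", summarize(LINE_OE))
    print("Unmethylated mammalian genes:", summarize(UNMETHYLATED_OE))
```

Under that assumption, both groups reproduce the tabulated means and SDs to three decimals, which suggests the sample estimator is a reasonable reading of the table.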

unique and unmethylated groups (housekeeping genes, introns, and Drosophila and yeast transposons). The results support our proposal that mammalian genomes possess a general mechanism for speeding the divergence of repetitive sequences and inactivating mobile elements. They also help to rationalize the diversity of CpG frequencies in mammalian genes. An alternative hypothesis for the paucity of CpGs in repetitive sequences is that methylation occurs irrespective of duplication and that duplication merely relaxes the stringency of selection against new transitions at CpGs. Several observations contradict this hypothesis. (i) Families of functional (highly stringent) genes, many carrying essential functions, have as few CpGs as do LINEs. (ii) Unique regions in introns are nonessential sequences but average no fewer CpGs than do exons of housekeeping genes. (iii) CpG-rich islands in active genes are relatively short, unique sequences; they remain unmethylated unless the gene has been switched off (15). Because unmethylated sequences can have CpG values ≥0.8 (Table 1), the mean CpG value of ~0.5 for unique sequences suggests that even they have been ripped, perhaps reflecting their ancient origin by duplication, or limited ripping by currently unrecognized pseudogenes. While our analysis used mostly mammalian examples, repeat-induced methylation and ripping should occur generally in methylating eukaryotes, including vertebrates, some invertebrates, plants, and some fungi. Because the sequence specificity of methylation differs among plants, fungi, and vertebrates, CpG depletion might apply only to vertebrates. In addition, rates of ripping may vary greatly in different eukaryotes. Thus, in N. crassa ripping is rapid, while in A. immersus duplications may be inactivated initially by methylation and subsequently by slow ripping (9). Mutational Specificity at CpG Sites. If CpG sites are ripped by cytosine deamination, transitions to TpG and CpA will predominate. We tested this prediction in three situations. (i) When nine clustered human Alu sequences were aligned and compared with a consensus sequence (30) from another 30 human Alu sequences, more than half of the nucleotide changes were at CpG positions (Fig. 1). The alternate dinucleotide was TpG or CpA at about 90% of the CpG sites, indicating that CpG → TpG and CpG → CpA transitions are the major consequences of ripping in mammals and a major

FIG. 1. Variation at CpG sites in Alu sequences (rows: Alu consensus and Alu 1 through Alu 9). Symbols in the original figure mark each site as CpG, CpA, TpG, TpA, or transversion; absence of a symbol indicates a deletion that includes a CpG site.
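The site categories used in Fig. 1 can be sketched as a small classifier applied to the dinucleotide found at each consensus CpG position. This is a minimal illustration, not the authors' procedure: the function names, the use of "-" for alignment gaps, and the decision to count TpA (both bases changed by transition) among transitions are assumptions made here.

```python
from collections import Counter

# Dinucleotides reachable from CpG by transitions only (C->T and/or G->A).
LABELS = {"CG": "CpG", "TG": "TpG", "CA": "CpA", "TA": "TpA"}


def classify_cpg_site(dinucleotide: str) -> str:
    """Classify the dinucleotide aligned to a consensus CpG site."""
    d = dinucleotide.upper()
    if len(d) != 2 or "-" in d:
        return "deletion"  # gap overlapping the CpG site (assumed gap symbol)
    # Anything not reachable by transitions involves at least one transversion.
    return LABELS.get(d, "transversion")


def tally_sites(dinucleotides):
    """Count site categories across aligned sequences."""
    return Counter(classify_cpg_site(d) for d in dinucleotides)


def transition_fraction(counts):
    """Fraction of substituted sites changed by transition only."""
    transitions = counts["TpG"] + counts["CpA"] + counts["TpA"]
    substitutions = transitions + counts["transversion"]
    return transitions / substitutions if substitutions else float("nan")


if __name__ == "__main__":
    example = ["CG", "TG", "CA", "TA", "GG", "--"]  # made-up sites
    counts = tally_sites(example)
    print(dict(counts), transition_fraction(counts))
```

The example input is invented purely to exercise each category; real use would take the column of dinucleotides from an alignment against the consensus.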

source of Alu sequence divergence. (Alu 9 is a special case that will be considered below.) (ii) The TPI gene and its 13C pseudogene (31) were aligned and CpG positions were analyzed over a 464-nt region of homologous coding sequence. Of the 20 CpGs in the active gene, 14 had mutated in the pseudogene, 13 by transition and 1 by transversion. (iii) We aligned a 2000-nt region in the second open reading frame of primate LINEs and derived a consensus sequence from at least seven LINEs at each position, a number that varies because most LINEs harbor deletions. (The alignment is available from M.C.K. upon request.) We compared each sequence at the 13 consensus CpG sites and detected 66 transitions and 17 transversions. The frequency of transitions was 79% of all substitution mutations at CpGs. Evolutionary Implications. Because of ripping, the number of nucleotide substitutions among Alu sequences, LINEs, and other repetitive elements cannot accurately reflect their chronological ages. Transitions at CpG sites in eight of the Alu sequences accounted for more than 50% of total mutations (Fig. 1), revealing that these positions mutate more rapidly than other positions in the same sequence. The numbers of non-CpG transitions were similar to the numbers of transversions; thus, ripping alters the proportions of transitions and transversions and should be considered when estimating divergence times between sequences. Clearly, therefore, ripping should be factored and separately analyzed when constructing phylogenies, and especially when evaluating the rates of putative evolutionary clocks. Surprisingly, the depletion of CpG dinucleotides in mobile elements such as LINEs presents yet a different hazard to genome stability. If extensively ripped sequences remain capable of transposition, then duplications will arise that are already CpG-poor and cannot efficiently be further methylated or ripped. 
Although LINEs are much older than SINEs (19, 26), LINEs show less intraspecific diversity than do SINEs in humans and rodents (18, 19). Because mammalian LINEs have few CpGs, they appear to derive mainly from already ripped sequences. We noticed that LINEs are AT-rich and that their noncoding strands are 1.5- to 2-fold richer in adenines than other bases. Among human LINEs, G A transitions account for nearly 70% of mutations of adenines. This may be related to a diversifying mechanism in mealworm DNA, where G·C → A·T transitions account for most of the variation in a major DNA component despite its normal CpG content and absence of 5MeC (32). Thus, even when CpG ripping is impossible, an additional mechanism may accelerate divergence.

The accumulation of repetitious noncoding sequences that have been diversified by ripping would explain most of the observed sequence polymorphism within mammalian species. Ripping as a major generator of sequence polymorphism may be instrumental in the recombinational isolation of chromosomes during both mitosis and meiosis, and it could therefore contribute to sterility in mating between closely related species (1-3). Indeed, yeast mutations preventing chromosome pairing and recombination lead to meiotic sterility because of chromosome nondisjunction (33). Protected Sequences. The α-globin gene family remains unripped (Table 1), implying that a special mechanism exists to protect specific multicopy genes from methylation and CpG loss. Although old Alu sequences are extensively ripped, newly transposed Alu sequences such as Alu 9 are both common and unripped (27), suggesting that actively transposing Alu sequences arise from a hidden protected progenitor with a complete CpG content. A similar protective mechanism operates in N. crassa, where ~170 copies of 9.3-kb rRNA-encoding DNA are clustered at the end of a chromosome and remain unripped except when an occasional copy is transposed to an unprotected position (9). Ripping Target Size. Because it is likely to involve a homologous DNA interaction, ripping may require contiguous homologous sequence over some minimum length (l_m). Consider first the TPI gene, with seven exons, six introns, and several pseudogenes (31). The structures of the active gene and a typical pseudogene are shown in Fig. 2. There is about 90% identity between the coding regions of the active gene and three well-characterized pseudogenes. As shown in Table 2, the CpG content of the first six exons of the active gene is similar to that of other housekeeping genes and higher than for the pseudogenes. The CpG content of the seventh exon of the active gene is low, resembling that of repetitive sequences. The pseudogenes have lost all six introns, thus gaining uninterrupted sequence homology with each other over at least 462 nt. The first six exons of the active gene lack continuous homology with the pseudogenes over segments longer than 133 nt, the size of the largest exon. The seventh exon is also short (119 nt), but homology with the pseudogenes extends an additional 445 nt into the 3' noncoding region, so that the pseudogenes and the seventh exon of the active gene are homologous over 564 nt. This suggests that a minimum continuously homologous sequence of 133 < l_m < 462 nt is required for ripping. Consider next the rabbit cytochrome P450 IIE1 and IIE2 genes (22). They share 176 homologous nt in their first exon plus 199 homologous nt 5' to the first exon, providing 375 nt

FIG. 2. Sizes (in nt) and topography of the TPI active gene and pseudogene. Black bars, exons 1-6; hatched bar, exon 7; open bar, 3' noncoding region; open spaces between bars, introns.

Table 2. Contiguous homology requirements for ripping

                                         CpG     Sequence     Uninterrupted homologous
Sequence                                 O/E*    length, bp   sequence, bp

TPI exons 1-6                            0.59       631          133
TPI exon 7                               0.21       564          119
TPI exon 7 + 3' noncoding region         0.15       564          564
TPI pseudogene 13C                       0.20       909          909
TPI pseudogene 19A                       0.35       922          922
TPI pseudogene 5A                        0.37      1066          500
Rabbit Cyt P450 IIE1 exon 1              0.33       176          375
Rabbit Cyt P450 IIE2 exon 1              0.25       224          375
Rabbit Cyt P450 IIE1 exon 2              0.71       132          132
Rabbit Cyt P450 IIE2 exon 2              0.86       132
Mouse TK exons 1-6                       0.55                    120
Mouse TK pseudogene exons 1-6            0.57                    120
Mouse TK exon 7                          0.22       570
Mouse TK pseudogene exon 7               0.23       570          570

*Observed/expected numbers of CpG dinucleotides.

of continuously homologous sequence; here, CpG frequencies are low (Table 2). They share only 130 nt of continuously homologous sequence in their second exon (the adjacent introns lacking homology); here, CpG frequencies are similar to those of housekeeping genes. Thus, 130 < l_m < 375 nt. Consider next the mouse TK gene and its two pseudogenes (34). One appears by hybridization tests to have been extensively rearranged and lacks uninterrupted homology with either the active gene or the second pseudogene. The active gene has six introns, six short exons of ~120 nt, and one long final exon of 570 nt. The second pseudogene is intronless and the longest homology with the active gene is the 570-nt segment. The first six exons of the active gene and the same region in the second pseudogene have CpG frequencies similar to those of housekeeping genes. The long, 570-nt, segment, however, has low CpG frequencies in both the active gene and the second pseudogene. Thus, 120 < l_m < 570 nt.
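The pooling of these bounds can be made explicit with a short sketch. It treats each locus as an open interval for the minimum contiguous homology length (written l_m here) and intersects them; the TPI, cytochrome P450, and TK intervals are those derived in the text, and the Alu entry uses the upper bound of 300 nt stated in the following paragraph. The dictionary layout and the function name `pool_bounds` are illustrative assumptions.

```python
# Per-locus bounds on l_m in nt, as (lower, upper); None means no bound given.
BOUNDS_NT = {
    "TPI": (133, 462),
    "Cyt P450 IIE1/IIE2": (130, 375),
    "TK": (120, 570),
    "Alu": (None, 300),
}


def pool_bounds(bounds):
    """Intersect open intervals: largest lower bound, smallest upper bound."""
    lowers = [lo for lo, _ in bounds.values() if lo is not None]
    uppers = [hi for _, hi in bounds.values() if hi is not None]
    lower, upper = max(lowers), min(uppers)
    if lower >= upper:
        raise ValueError("per-locus bounds are mutually inconsistent")
    return lower, upper


if __name__ == "__main__":
    low, high = pool_bounds(BOUNDS_NT)
    print(f"{low} < l_m < {high} nt")
```

Intersection is the natural pooling rule only if a single l_m applies to all loci, which is itself an assumption of the argument.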

Because Alu sequences of ~300 nt are extensively ripped, l_m < 300 nt. Pooling all values, 133 < l_m < 300 nt, a range comparable with the 150 < l_m < 800-nt range reported for N. crassa (9). This value recalls observations suggesting a similar minimum length for efficient homologous recombination in mammalian somatic cells (35). Because the minimum length for efficient recombination varies in different organisms, so may l_m. Both recombination and ripping presumably initiate with DNA pairing tests for homology, and ripping may occur at an early stage of this interaction. Implications for Genome Organization. Mammalian transgenes can be methylated and inactivated in the germ line in rough proportion to their copy number (36), suggesting that multicopy transgenes are subject to ripping. Effective transgene therapy for genetic disorders may therefore require introducing only a single copy of the gene. Most mammalian genes have spawned multiple processed retropseudogenes (37). Our analysis suggests that pseudogenes might be potent CpG → TpG mutagens for functional genes. Indeed, as many as a third of point mutations causing human genetic disorders and 40% of point mutations causing

some cancers result from transitions at CpG dinucleotides (12). However, we observe that functional parental genes are largely protected from being ripped by their own pseudogenes. Because more than 95% of exons are smaller than 0.3 kb (38), we propose that the fragmentation of coding sequences into exons protects genes from the awesome effect of ripping and from recombination with their retropseudogene homologs. The observation that introns are preferentially located between domain-encoding sequences suggests that exon shuffling might be an important evolutionary strategy (38, 39). Ripping, however, provides a powerful selective pressure to both generate and maintain the fragmented status of genes.

1. Petes, T. & Hill, C. H. (1988) Annu. Rev. Genet. 22, 147-168.
2. Radman, M. (1988) in Genetic Recombination, eds. Kucherlapati, R. & Smith, G. R. (Am. Soc. Microbiol., Washington), pp. 169-192.
3. Shen, P. & Huang, H. V. (1986) Genetics 112, 441-457.
4. Shen, P. & Huang, H. V. (1989) Mol. Gen. Genet. 218, 358-360.
5. Waldman, A. S. & Liskay, R. M. (1987) Proc. Natl. Acad. Sci. USA 84, 5340-5344.
6. Rayssiguier, C., Thaler, D. S. & Radman, M. (1989) Nature (London) 342, 396-401.
7. Petit, M.-A., Dimpfl, J., Radman, M. & Echols, H. (1991) Genetics 129, 327-332.
8. Vnencak-Jones, C. L. & Phillips, J. A., III (1991) Science 250, 1745-1748.
9. Selker, E. U. (1990) Annu. Rev. Genet. 24, 579-613.
10. Radman, M. & Wagner, R. (1986) Annu. Rev. Genet. 20, 523-538.
11. Coulondre, C., Miller, J. H., Farabaugh, P. J. & Gilbert, W. (1978) Nature (London) 274, 775-780.
12. Rideout, W. M., III, Coetzee, G. A., Olumi, A. F. & Jones, P. A. (1990) Science 249, 1288-1290.
13. Bird, A. P. (1986) Nature (London) 321, 209-213.
14. Bird, A. P. (1987) Trends Genet. 3, 342-346.
15. Bird, A., Taggart, M., Frommer, M., Miller, O. J. & MacLeod, D. (1985) Cell 40, 91-99.
16. Antequera, F., Boyes, J. & Bird, A. (1990) Cell 62, 503-514.
17. Devereux, J., Haeberli, P. & Smithies, O. (1984) Nucleic Acids Res. 12, 387-395.
18. Singer, M. F. & Skowronski, J. (1985) Trends Biochem. Sci. 10, 119-122.
19. Hutchison, C. A., III, Hardies, S. C., Loeb, D. L., Shehee, W. R. & Edgell, M. H. (1989) in Mobile DNA, eds. Berg, D. E. & Howe, M. M. (Am. Soc. Microbiol., Washington), pp. 593-617.
20. Hardies, S. C., Edgell, M. H. & Hutchison, C. A., III (1984) J. Biol. Chem. 259, 3748-3756.
21. Shi, Y., Son, H. J., Shahan, K., Rodriguez, M., Costantini, F. & Derman, E. (1989) Proc. Natl. Acad. Sci. USA 86, 4584-4588.
22. Gonzalez, F. J. & Nebert, D. W. (1990) Trends Genet. 6, 182-186.
23. Evans, M. J. & Scarpulla, R. C. (1988) Proc. Natl. Acad. Sci. USA 85, 9625-9629.
24. Perutz, M. F. (1990) J. Mol. Biol. 213, 203-206.
25. Proffitt, J. H., Davie, J. R., Swinton, D. & Hattman, S. (1984) Mol. Cell. Biol. 4, 985-988.
26. Deininger, P. L. & Daniels, G. R. (1986) Trends Genet. 2, 76-80.
27. Stoppa-Lyonnet, D., Carter, P. E., Meo, T. & Tosi, M. (1990) Proc. Natl. Acad. Sci. USA 87, 1551-1555.
28. Kramer, C. Y. (1956) Biometrics 12, 307-310.
29. SAS Institute (1985) SAS User's Guide: Statistics (SAS Inst., Cary, NC), Version 5, pp. 470-476.
30. Britten, R. J., Baron, W. F., Stout, D. B. & Davidson, E. H. (1988) Proc. Natl. Acad. Sci. USA 85, 4770-4774.
31. Brown, J. R., Daar, I. O., Krug, J. R. & Maquat, L. E. (1985) Mol. Cell. Biol. 5, 1694-1706.
32. Ugarkovic, D., Plohl, M. & Gamulin, V. (1989) Gene 83, 181-183.
33. Roeder, S. (1990) Trends Genet. 6, 385-389.
34. Seiser, E., Knöfler, M., Rudelstorfer, I., Haas, R. & Wintersberger, E. (1989) Nucleic Acids Res. 17, 185-197.
35. Bollag, R. J., Waldman, A. S. & Liskay, R. M. (1989) Annu. Rev. Genet. 23, 199-225.
36. Mehtali, M., LeMeur, M. & Lathe, R. (1990) Gene 91, 179-184.
37. Li, W.-H. & Graur, D. (1991) Molecular Evolution (Sinauer, Sunderland, MA).
38. Dorit, R. L., Schoenbach, L. & Gilbert, W. (1990) Science 250, 1377-1382.
39. Gilbert, W. (1978) Nature (London) 271, 501.
