Proc. Natl. Acad. Sci. USA Vol. 89, pp. 1358-1362, February 1992 Genetics

Over- and under-representation of short oligonucleotides in DNA sequences

CHRIS BURGE†, ALLAN M. CAMPBELL*, AND SAMUEL KARLIN†

Departments of †Mathematics and *Biological Sciences, Stanford University, Stanford, CA 94305

Contributed by Allan M. Campbell, October 17, 1991

ABSTRACT Strand-symmetric relative abundance functionals for di-, tri-, and tetranucleotides are introduced and applied to sequences encompassing a broad phylogenetic range to discern tendencies and anomalies in the occurrences of these short oligonucleotides within and between genomic sequences. For dinucleotides, TA is almost universally under-represented, with the exception of vertebrate mitochondrial genomes, and CG is strongly under-represented in vertebrates and in mitochondrial genomes. The traditional methylation/deamination/mutation hypothesis for the rarity of CG does not adequately account for the observed deficiencies in certain sequences, notably the mitochondrial genomes, yeast, and Neurospora crassa, which lack the standard CpG methylase. Homodinucleotides (AA·TT, CC·GG) and larger homooligonucleotides are over-represented in many organisms, perhaps due to polymerase slippage events. For trinucleotides, GCA·TGC tends to be under-represented in phage, human viral, and eukaryotic sequences, and CTA·TAG is strongly under-represented in many prokaryotic, eukaryotic, and viral sequences. The CCA·TGG triplet is ubiquitously over-represented in human viral and eukaryotic sequences. Among the tetranucleotides, several four-base-pair palindromes tend to be under-represented in phage sequences, probably as a means of restriction avoidance. The tetranucleotide CTAG is observed to be rare in virtually all bacterial genomes and some phage genomes. Explanations for these over- and under-representations in terms of DNA/RNA structures and regulatory mechanisms are considered.

Genomic compositional inhomogeneities are widely recognized. For example, denaturation experiments on mammalian DNA (1) implicate compartments (isochores) of an average 200-kilobase (kb) length of high G+C content alternating with compartments of high A+T content. Mammalian coding regions tend to be G+C rich. Assays on λ phage find two half-genomic compartments: one G+C rich, the other A+T rich (2). Other forms of inhomogeneity include CpG suppression prominent in vertebrate genomes, dispersed Alu sequences, satellite tandem repetitive DNA sequences, characteristic telomeric sequences, and repeated extragenic palindromes in the Escherichia coli and Salmonella typhimurium genomes (3). Thus, genomic inhomogeneity occurs widely and on different scales.

In this paper we commence a detailed study encompassing a broad phylogenetic range with the aim of discerning tendencies and anomalies in the occurrences of di-, tri-, and tetranucleotides within and between genomic sequences. In particular, we identify extremes of over- and under-representation of short oligonucleotides. Assessments of dinucleotide relative abundance are usually based on an odds ratio measure, where values sufficiently less than 1 (or greater than 1) indicate that a given dinucleotide is under-represented (over-represented) compared with the random union of its mononucleotide components. Similar evaluations are available for characterizing the relative abundances of tri-, tetra-, and higher-order oligonucleotides (see Methods). The DNA sequences examined (Table 1) range from a low G+C frequency of 33% in yeast up to 69% for the bacterium Streptomyces lividans. The relative abundance functionals control for these biases.

The CG doublet in vertebrate sequences is a paradigm case of significant under-representation (CpG suppression). It is also well known that TA is under-utilized in the DNA of most organisms. For previous tabulations and analyses of doublet relative abundances, see, e.g., refs. 4-8. The rarity of certain tetranucleotides (e.g., the DAM methylase site GATC and the tetranucleotide CTAG) in some enterobacterial species was highlighted in refs. 9 and 10.

In the context of discerning under- and over-represented doublets, triplets, tetrads, etc. within and across genomic sequences, certain more general issues arise. Are there significant similarities and/or differences between prokaryotes and eukaryotes, between viruses and their hosts, between nuclear and organelle DNA, and between coding and noncoding regions? What effects on oligonucleotide frequencies ensue from methylase activities? Do telomeres, centromeres, nucleosome linkers, and other recognized genomic regions have unusual oligonucleotide compositions? Do DNA and chromatin structure impose physical constraints on nucleotide associations? Are replication and transcription control sites correlated with low-abundance sequence patterns?

METHODS

Measures of Over-/Under-representation of Short Oligonucleotides. Let $f_X$ denote the frequency of the nucleotide X (A, C, G, or T) in the sequence at hand, $f_{XY}$ the frequency of dinucleotide XY, $f_{XYZ}$ the frequency of trinucleotide XYZ, and so on. A standard assessment of dinucleotide bias is through an "odds" ratio calculation, namely $\rho_{XY} = f_{XY}/(f_X f_Y)$. For $\rho_{XY}$ sufficiently larger (smaller) than 1, the XY pair is considered over- (under-) represented compared with a random association of mononucleotides. There are classical statistical tests of the contingency table genre in terms of $\rho_{XY}$ (11). The measure $\rho_{XY}$ is suitable for a single sequence, but in comparing sequences from different organisms (or sequences of unknown relative orientation or from different chromosomes), the formula has to be modified to account for the complementary antiparallel structure of double-stranded DNA. This can be accomplished by considering the union of the given DNA sequence $S$ and its inverted complementary sequence $S^I$ into $S + S^I = S^*$. In $S^*$, the frequency $f^*_A$ of the mononucleotide A is $f^*_A = f^*_T = \tfrac{1}{2}(f_A + f_T)$, and $f^*_C = f^*_G = \tfrac{1}{2}(f_C + f_G)$, where $f_A$, $f_T$, $f_C$, and $f_G$ are the mononucleotide frequencies in $S$. Similarly, $f^*_{GT} = \tfrac{1}{2}(f_{GT} + f_{AC})$. The dinucleotide odds ratio measure that accounts for the complementary antiparallel structure of double-stranded DNA is taken to be

$$\rho^*_{GT} = \frac{f^*_{GT}}{f^*_G f^*_T} = \frac{2(f_{GT} + f_{AC})}{(f_G + f_C)(f_T + f_A)},$$

and similarly for other dinucleotides.
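As an illustration of the strand-symmetric odds ratio just defined, the following Python sketch computes $\rho^*_{XY}$ from a single-strand sequence. It is a minimal sketch under stated assumptions, not the authors' program: the function names (`inverted_complement`, `kmer_freqs`, `symmetric_freqs`, `rho_star`) are hypothetical, the input is assumed to be an uppercase string over A, C, G, and T, and the frequencies of $S$ and of its inverted complement are averaged instead of literally concatenating the two strands.

```python
from collections import Counter

# Watson-Crick complement table (assumes an uppercase A/C/G/T alphabet).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def inverted_complement(word: str) -> str:
    """Return the inverted (reverse) complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def kmer_freqs(seq: str, k: int) -> dict:
    """Frequencies of overlapping k-mers on the given strand only."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def symmetric_freqs(seq: str, k: int) -> dict:
    """Strand-symmetric frequencies f*_W = (f_W + f_I(W)) / 2."""
    f = kmer_freqs(seq, k)
    words = set(f) | {inverted_complement(w) for w in f}
    return {
        w: 0.5 * (f.get(w, 0.0) + f.get(inverted_complement(w), 0.0))
        for w in words
    }


def rho_star(seq: str, xy: str) -> float:
    """Strand-symmetric dinucleotide odds ratio f*_XY / (f*_X f*_Y).

    Assumes each nucleotide of `xy` (or its complement) occurs in `seq`;
    otherwise the lookup below raises KeyError.
    """
    f1 = symmetric_freqs(seq, 1)
    f2 = symmetric_freqs(seq, 2)
    return f2.get(xy, 0.0) / (f1[xy[0]] * f1[xy[1]])
```

By construction, `rho_star(seq, "GT")` and `rho_star(seq, "AC")` agree, since GT and AC are inverted complements of each other; this is why only one member of each pair needs to be reported.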

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.


A third-order measure (controlling for all first- and second-order effects) is the expression

$$\gamma^*_{XYZ} = \frac{f^*_{XYZ}\, f^*_X f^*_Y f^*_Z}{f^*_{XY}\, f^*_{YZ}\, f^*_{XNZ}},$$

where $f^*_X$ and $f^*_{XY}$ are the strand-symmetric frequencies, $f^*_{XYZ} = \tfrac{1}{2}(f_{XYZ} + f_{I(XYZ)})$, I(XYZ) denotes the inverted complement of XYZ, and N indicates any nucleotide. Again, the degree of third-order representational bias (relative abundance) depends on the deviation of $\gamma^*_{XYZ}$ from 1. Corresponding higher-order deviation measures are available, labeled $\tau^*_{ijkl}$ for tetranucleotides.

Tendencies. We say that a dinucleotide XY shows a tendency toward under-representation if both members of the vector pair $\rho_{XY}$, $\rho_{I(XY)}$ are less than 1 [where I(XY) is the inverted complement of XY] for all data sets and that XY exhibits a tendency toward over-representation provided $\rho_{XY} > 1$ and $\rho_{I(XY)} > 1$ pervasively. Similarly, we speak of tendencies for relative trinucleotide abundances, etc. A tendency can correspondingly be defined restricted to a specified class of organisms or sequences. In Table 2 we list tendencies for under- and over-representation of di-, tri-, and tetranucleotides for five classes of sequences: bacteriophages (5 sequence sets), bacteria (10 sets), large human viruses (6 sets), eukaryotes (8 sets), and mitochondria (6 sets).

The data include several large complete phage, viral, and organelle genomes and extensive eukaryotic sequences covering a broad range (mammalian, avian, amphibian, invertebrate, plant, and fungal representatives) and diverse eubacterial sequences including Gram-negative and Gram-positive types. The E. coli collection is an ensemble of disjoint contigs. The other phage, bacterial, Caenorhabditis elegans, and Neurospora crassa sequences were culled of redundant entries. The species collections were compiled from all available sequences in the European Molecular Biology Laboratory data bank (June 1991). Large unduplicated samples from the eukaryotic sequences were evaluated, and the results were concordant with those of the total sets. The species sequence collections undoubtedly have biases. For example, the human sequences frequently center on genes of medical interest, the Drosophila set includes a plethora of genes acting in embryogenesis, and many of the Rhizobium meliloti sequences relate to nitrogen fixation.
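The third-order measure and the notion of a tendency described in Methods can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: `symmetric_freqs`, `gamma_star`, and `tendency_count` are hypothetical names, the sequence is assumed to be uppercase A/C/G/T, and the way `tendency_count` tallies data sets (for instance to produce entries of the form 9/10) is an assumption about how the tendency counts were obtained.

```python
from collections import Counter

# Watson-Crick complement table (assumes an uppercase A/C/G/T alphabet).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def inverted_complement(word: str) -> str:
    """Return the inverted (reverse) complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def symmetric_freqs(seq: str, k: int) -> dict:
    """Strand-symmetric k-mer frequencies f*_W = (f_W + f_I(W)) / 2."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    freqs = {}
    for word, n in counts.items():
        # Each occurrence counts half for the word and half for its
        # inverted complement (a palindromic word receives both halves).
        for w in (word, inverted_complement(word)):
            freqs[w] = freqs.get(w, 0.0) + 0.5 * n / total
    return freqs


def gamma_star(seq: str, xyz: str) -> float:
    """f*_XYZ f*_X f*_Y f*_Z / (f*_XY f*_YZ f*_XNZ), with N any nucleotide.

    Assumes the dinucleotides XY and YZ and at least one XNZ word occur
    (on either strand); otherwise KeyError or ZeroDivisionError is raised.
    """
    x, y, z = xyz
    f1 = symmetric_freqs(seq, 1)
    f2 = symmetric_freqs(seq, 2)
    f3 = symmetric_freqs(seq, 3)
    # f*_XNZ: sum over the middle nucleotide N.
    f_xnz = sum(f3.get(x + n + z, 0.0) for n in "ACGT")
    numerator = f3.get(xyz, 0.0) * f1[x] * f1[y] * f1[z]
    denominator = f2[x + y] * f2[y + z] * f_xnz
    return numerator / denominator


def tendency_count(pairs, under: bool = True) -> int:
    """Count data sets whose (value, inverted-complement value) pair lies
    entirely below 1 (under=True) or entirely above 1 (under=False)."""
    if under:
        return sum(1 for a, b in pairs if a < 1 and b < 1)
    return sum(1 for a, b in pairs if a > 1 and b > 1)
```

A tendency in the sense defined in Methods then corresponds to `tendency_count(pairs)` being equal to the number of data sets in the class.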


RESULTS

The two highest and lowest di- and trinucleotides with respect to representation values ($\rho^*_{ij}$ and $\gamma^*_{ijk}$) are reported for the sequences in Table 1. Tendencies (see Methods) in under- and over-representation of di-, tri-, and tetranucleotides are summarized in Table 2. Interpretations and hypotheses on the main results will be considered in Discussion.

Under-representations. (i) Dinucleotides. The TA doublet exhibits a tendency toward under-representation (i.e., $\rho^*_{TA} < 1$) in all of the phage, bacteria, viruses, and eukaryotes and all nonmammalian mitochondrial genomes ($\rho^*_{TA} = 1.01$ in rat and $\rho^*_{TA} = 1.07$ in human). In addition, the representation value for TA is among the two lowest for dinucleotides in most sequence sets with the exception of the mitochondrial genomes. By contrast, the reversed dinucleotide AT shows no consistent pattern in its relative abundance. The low odds ratio score has been repeatedly observed for TA (e.g., refs. 4-8). Examination of $\rho^*_{TA}$ values in coding versus overall (>75% noncoding) sequences in eukaryotes and in E. coli shows TA to be strongly under-represented in both types of sequence, albeit slightly lower in coding regions: $\rho^*_{TA}$ = 0.72 (coding) vs. 0.74 (overall) in E. coli, 0.70 vs. 0.77 in yeast, and 0.53 vs. 0.64 in human sequences. The AC·GT dinucleotide shows a tendency toward under-representation for the eukaryotic sequences and for the mitochondrial genomes, and it is under-represented in all but one of the bacterial sequences. The deficiency of CpG in vertebrate genomes is well known (12). Strikingly, the CG doublet is also strongly under-represented in the six mitochondrial genomes ($\rho^*_{CG}$ values between 0.53 and 0.84), to nearly the same extent as in vertebrate nuclear DNA.

(ii) Trinucleotides. The GCA·TGC pair shows a tendency toward under-representation in the phage, viral, and eukaryotic sequences (Table 2) but not in the bacterial or mitochondrial sequences. Under-representation of CTA·TAG is widespread in nuclear eukaryotic DNA sequences, and $\gamma^*_{TAG}$ is among the two lowest trinucleotide representational values in many organisms (Table 1).

(iii) Tetranucleotides. Five four-base-pair palindromes (AATT, AGCT, GATC, GGCC, and TGCA) exhibit pervasive under-representation tendencies in the bacteriophage sequences; AATT is in addition under-represented in most organisms in each of the four other classes. The palindrome CTAG is markedly under-represented in several phage and almost all bacterial sequences, notably λ phage ($\tau^*_{CTAG} = 0.52$) and E. coli ($\tau^*_{CTAG} = 0.27$) (see also ref. 10).

Over-representation. (i) Dinucleotides. AA·TT is over-represented in all or all but one of the members of each of the five sequence classes (Table 2). The other iterated dinucleotide, CC·GG, tends to be over-represented only in mitochondrial sequences. Of the other doublets, CA·TG exhibits a contrasting pattern of representation, tending to have high $\rho^*$ values in all of the eukaryotic nuclear sequences but low $\rho^*$ values in the mitochondrial genomes.

(ii) Trinucleotides. Of the trinucleotides, CCA·TGG is the most widely over-represented, exhibiting tendencies toward over-representation in the viruses, the eukaryotes, and all but one of the phage sequences. Like CA·TG above, the triplet GAA·TTC has a contrasting pattern of representation, over-represented in phage but under-represented in human viral sequences. The stop codon triplet pair CTA·TAG is under-represented in eukaryotic sequences, whereas another stop codon triplet pair, TCA·TGA, tends to be over-represented in bacterial sequences.

(iii) Tetranucleotides. No over-representations were observed for tetranucleotides with respect to more than one sequence class. Two four-base-pair palindromes tended to be over-represented, GTAC in eukaryotic nuclear sequences and TGCA in mitochondrial genomes, both substantially deficient in bacteriophage genomes.

Table 1. Two highest and lowest di- and trinucleotide representation values

| Organism | Length, kb | % C+G† | Dinucleotides, highest $\rho^*$ | Dinucleotides, lowest $\rho^*$ | Trinucleotides, highest $\gamma^*$ | Trinucleotides, lowest $\gamma^*$ |
|---|---|---|---|---|---|---|
| Bacteriophages | | | | | | |
| T7‡ | 39.9 | 48.40 | AG 1.12, CA 1.08 | AT 0.85, CG 0.88 | CCA 1.24, ACC 1.15 | CCC 0.78, CTA 0.81 |
| λ‡ | 48.5 | 49.85 | GC 1.20, CA 1.16 | TA 0.71, AG 0.87 | CCA 1.25, ATA 1.14 | CTA 0.68, ACA 0.83 |
| PZA‡ | 19.4 | 39.67 | CA 1.11, AC 1.07 | TA 0.86, GC 0.88 | CCG 1.11, GAA 1.11 | CGA 0.89, GCA 0.91 |
| P1 (33% of genome) | 29.4 | 40.80 | GC 1.32, AA 1.22 | TA 0.80, AG 0.87 | CCA 1.14, ATC 1.12 | CTC 0.86, CCC 0.88 |
| T4 (50% of genome) | 103.1 | 35.66 | GC 1.17, AA 1.12 | TA 0.82, AC 0.85 | CCA 1.25, ACC 1.16 | ACA 0.86, AGG 0.87 |
| Bacteria§¶ | | | | | | |
| Rhizobium meliloti (G-, α) | 67.3 | 60.69 | AT 1.29, CG 1.26 | TA 0.49, AC 0.80 | ATA 1.34, ACC 1.16 | CTA 0.77, CCC 0.79 |
| Rhodobacter capsulatum (G-, α) | 58.4 | 64.34 | AT 1.42, AA 1.26 | TA 0.40, AC 0.78 | AAG 1.30, GTA 1.22 | CTC 0.74, GGA 0.76 |
| Agrobacter tumefaciens (G-, α) | 46.9 | 51.63 | GC 1.21, AA 1.19 | TA 0.67, AC 0.79 | ATA 1.23, TCA 1.09 | CTA 0.76, CCC 0.89 |
| Neisseria gonorrhoeae (G-, β) | 68.9 | 51.34 | AA 1.51, CG 1.32 | TA 0.64, AG 0.69 | TCA 1.26, GTA 1.26 | CTA 0.74, CTC 0.83 |
| Escherichia coli (G-, γ) | 1431.7 | 51.57 | GC 1.25, AA 1.21 | TA 0.74, AG 0.83 | CCA 1.29, CAG 1.21 | CTA 0.68, ACA 0.80 |
| Pseudomonas aeruginosa (G-, γ) | 104.9 | 63.15 | GC 1.13, AA 1.13 | TA 0.60, CC 0.88 | GTA 1.35, GAA 1.18 | CTA 0.77, TAA 0.78 |
| Haemophilus influenzae (G-, γ) | 32.8 | 37.17 | GC 1.38, AA 1.19 | TA 0.80, AC 0.85 | ACG 1.16, ACC 1.16 | ACA 0.84, GAC 0.87 |
| Bacillus subtilis (G+) | 142.0 | 43.47 | AA 1.24, GC 1.22 | TA 0.64, AC 0.78 | AGC 1.14, TCA 1.13 | CTA 0.83, GCA 0.84 |
| Streptomyces lividans (G+) | 20.0 | 68.74 | GA 1.17, CG 1.12 | TA 0.65, AT 0.90 | GTA 1.26, GAA 1.18 | TAA 0.75, CTA 0.77 |
| Thermus thermophilus | 26.1 | 67.18 | AA 1.26, CC 1.22 | TA 0.68, CG 0.75 | GTA 1.32, ACC 1.23 | AAC 0.73, GCA 0.75 |
| Human viruses‡ | | | | | | |
| Adeno | 35.9 | 55.20 | AA 1.21, GC 1.16 | TA 0.79, GA 0.88 | CGC 1.17, CCA 1.16 | CGA 0.86, CCC 0.87 |
| Cytomegalo (CMV) | 229.4 | 57.16 | CG 1.19, AA 1.14 | TA 0.80, CC 0.86 | CCA 1.19, AAA 1.13 | ACA 0.86, CCC 0.88 |
| Epstein-Barr (EBV) | 172.3 | 59.94 | CC 1.21, AG 1.16 | CG 0.60, TA 0.75 | TAA 1.14, CCA 1.12 | AAG 0.89, CGA 0.91 |
| Herpes simplex I (HSV1) | 152.3 | 68.73 | AA 1.26, CC 1.07 | TA 0.78, AG 0.85 | TAA 1.22, CAG 1.17 | CTA 0.67, AAG 0.87 |
| Varicella-zoster (VZV) | 124.9 | 46.02 | CC 1.18, AA 1.16 | AG 0.73, GA 0.88 | GGA 1.14, CTC 1.12 | CTA 0.85, CAC 0.88 |
| Vaccinia | 191.7 | 33.40 | GA 1.15, CG 1.11 | GC 0.80, AG 0.95 | CCA 1.23, GGA 1.13 | CCC 0.73, CGA 0.86 |
| Eukaryotes | | | | | | |
| Saccharomyces cerevisiae§ | 1284.2 | 38.56 | AA 1.14, CA 1.09 | TA 0.77, CG 0.80 | CCA 1.13, ACC 1.12 | CCC 0.89, CTA 0.89 |
| Neurospora crassa§ | 204.4 | 52.75 | GA 1.10, AA 1.10 | TA 0.68, CG 0.88 | GTA 1.22, CTC 1.11 | CTA 0.79, ACG 0.92 |
| Caenorhabditis elegans§ | 311.9 | 40.20 | AA 1.23, GA 1.18 | TA 0.59, AC 0.85 | CTC 1.14, CCA 1.13 | CCC 0.81, CTA 0.82 |
| Drosophila melanogaster§ | 1432.7 | 45.73 | GC 1.20, AA 1.17 | TA 0.77, AC 0.86 | CCA 1.14, CTC 1.13 | CTA 0.80, CCC 0.84 |
| Xenopus laevis§ | 659.5 | 44.91 | CA 1.19, CC 1.15 | CG 0.50, TA 0.73 | CCA 1.13, AGC 1.08 | CTA 0.86, GCA 0.93 |
| Chicken§ | 1001.4 | 50.27 | CA 1.23, AG 1.17 | CG 0.50, TA 0.64 | CCA 1.17, AGC 1.08 | CTA 0.87, CGA 0.87 |
| Human (20% of EMBL) | 1410.9 | 50.99 | CA 1.22, AG 1.19 | CG 0.42, TA 0.63 | CCA 1.19, TAA 1.09 | CTA 0.87, GCA 0.90 |
| Zea mays§ | 395.5 | 50.20 | GC 1.13, CA 1.12 | TA 0.78, AC 0.89 | CTC 1.13, ATA 1.06 | CCC 0.85, CTA 0.93 |
| Chloroplasts‡ | | | | | | |
| Rice | 134.5 | 38.99 | CC 1.29, AA 1.17 | AC 0.75, TA 0.82 | TCA 1.11, ACC 1.09 | GGA 0.90, ACA 0.92 |
| Tobacco | 155.8 | 37.85 | CC 1.28, GA 1.18 | AC 0.75, TA 0.78 | GCC 1.14, TCA 1.14 | GGA 0.87, GCA 0.89 |
| Mitochondria†‡ | | | | | | |
| Paramecium aurelia | 40.5 | 41.24 | AA 1.37, AG 1.22 | AT 0.65, AC 0.76 | CGA 1.33, GTA 1.28 | CGC 0.76, AGA 0.85 |
| Paracentrotus lividus | 15.7 | 39.69 | CC 1.31, AG 1.18 | CG 0.58, AC 0.82 | CCG 1.20, AGC 1.09 | CGC 0.85, ATC 0.91 |
| Drosophila yakuba | 16.0 | 21.41 | CC 1.67, GC 1.34 | CG 0.68, AC 0.80 | AGC 1.42, CGA 1.28 | CGC 0.65, GCC 0.65 |
| Xenopus laevis | 17.6 | 36.99 | CC 1.28, AG 1.06 | CG 0.63, AC 0.89 | GCC 1.13, CCG 1.12 | CGC 0.78, CCC 0.87 |
| Rat | 16.3 | 38.68 | CC 1.39, AG 1.05 | CG 0.53, GC 0.88 | CCG 1.25, GCC 1.15 | GCA 0.82, CGC 0.83 |
| Human | 16.6 | 44.37 | CC 1.35, AG 1.09 | CG 0.53, GC 0.87 | GCC 1.16, CCG 1.15 | AAG 0.85, GAC 0.86 |

Of the 10 dinucleotide pairs (AA·TT, AC·GT, AG·CT, CA·TG, CC·GG, GA·TC, AT·AT, CG·CG, GC·GC, and TA·TA; note the last 4 are self-dyads) and the 32 trinucleotide pairs, the two highest and two lowest of the strand-symmetric representation values (see Methods) are recorded. Only one member of each pair is listed. See Methods concerning data selection and cleaning.
†In nuclear, viral, and chloroplast DNAs, A ≈ T and C ≈ G in each strand. For mitochondrial DNAs, where the differences are often significant, the mononucleotide content of one strand is given: P. aurelia A 25.3%, C 21.9%; P. lividus A 30.8%, C 22.5%; D. yakuba A 39.5%, C 12.2%; X. laevis A 33.1%, C 23.5%; rat A 34.1%, C 26.2%; human A 30.9%, C 31.2%.
‡Complete genome.
§All of current (June 1991) European Molecular Biology Laboratory (EMBL) sequences.
¶G-, α indicates Gram-negative, α-purple group; G-, β, Gram-negative, β-purple group; G-, γ, Gram-negative, γ-purple group; G+, Gram-positive.

DISCUSSION

Our extensive analysis of over- and under-representations of di-, tri-, and tetranucleotides by using the strand-symmetric functionals $\rho^*_{ij}$, $\gamma^*_{ijk}$, and $\tau^*_{ijkl}$ revealed invariants and contrasts. Thus:
(i) In agreement with refs. 4-8, the dinucleotide TA is pervasively under-represented with the exception of the mammalian mitochondrial genomes. In addition, the representation value $\rho^*_{TA}$ is one of the two lowest among all dinucleotides in most of the sequences.
(ii) As has been well established, CG is significantly deficient in vertebrate genomes. Intriguingly, CG is also strongly under-represented in all of the mitochondrial genomes, to nearly the same extent as in the vertebrate sequences.
(iii) The AC·GT doublet is also under-represented in both eukaryotic and mitochondrial sequences, whereas the CA·TG doublet is broadly over-represented in eukaryotes but under-represented in mitochondrial genomes.
(iv) The GCA·TGC triplet is pervasively under-represented in phage, viral, and eukaryotic sequences.
(v) The stop codon triplet pair CTA·TAG is consistently under-represented in eukaryotic nuclear sequences and in many human viruses, often with significantly lower representation values than GCA·TGC.
(vi) In nuclear DNA sequences, CCA·TGG is ubiquitously over-represented and commonly exhibits the highest representation value among the 32 trinucleotide triplet pairs.
(vii) The palindromes AATT, AGCT, GATC, GGCC, and TGCA all tend toward under-representation in the bacteriophage sequences. The AATT tetrad is, in addition, under-represented in many other organisms.
(viii) The tetranucleotide CTAG is under-represented in most bacterial sequences, often the rarest of the four-base-pair palindromes (data not shown).

It is worth emphasizing that high and low relative abundances of di-, tri-, and tetranucleotides and high and low occurrence frequencies are generally not congruent phenomena (occurrence frequency data not shown). We consider below a number of possible interpretations of the foregoing observations.

Avoidance of TA. The pervasive under-representation of the TA dinucleotide is tantalizing. One possibility is that TA (or certain TA-containing oligonucleotides) adversely affects supercoiling and/or chromatin structure. From another perspective, in view of the prominent regulatory role of the "TATA box" in mediating transcription and of TA-containing oligonucleotides in transcription termination signals (e.g., AATAAA in higher eukaryotes, TATATA in yeast), occurrences of TA might be minimized to avoid inappropriate binding of transcription or termination factors. Suppression of TA in coding regions might relate to low tyrosine (encoded from TAY) usage and/or avoidance of an inappropriate stop codon. However, $\rho^*_{TA}$ values are in general only slightly lower in coding than overall sequences and retain significant under-representations in noncoding regions


Table 2. Under- and over-representation tendencies for oligonucleotides

Oligonucleotide | Phage | Prokaryotes | Human viruses | Eukaryotes | Mitochondria

pu < 1 and p'(.) < 1 All AC-GT 3/5 9/10 3/6 All All CA*TG 0/5 3/10 1/6 0/8 All 3/5 CG 1/10 2/6 All All TA All All All 4/6 All yiVk < 1 and VI(Uk) < 1 4/5 All 5/6 CCCGGG 6/8 2/6 3/5 8/10 5/6 CTA*TAG All 3/6 0/5 GAA*TTC 1/10 All 6/8 1/6 2/5 6/10 3/6 GACGTC All 4/6 All 6/10 GCA*TGC All 4/6 All Tijid < 1 and 7rI(,kI 1 and p(o) > 1 4/5 9/10 5/6 AA-TT All 5/6 3/5 5/10 3/6 CA*TG All 0/6 2/5 1/10 4/6 CCGG 5/8 All 3/5 8/10 2/6 GC 4/8 All Vyk > 1 and V(wUk) > 1 2/5 1/10 3/6 AGCGCT 6/8 All 1/5 3/10 2/6 ATA-TAT 2/6 All 4/5 5/10 All CCA*TGG All 1/6 All 8/10 3/6 CGCGCG 3/8 0/6 0/5 1/10 All CTCGAG 7/8 2/6 All 7/10 0/6 GAA-TfC 0/8 2/6 2/5 All 3/6 GCC-GGC 5/8 2/6 0/5 1/10 All GGA'TCC 3/8 2/6 3/5 All 1/6 TCA-TGA 4/8 3/6

1/ll

TUkri> 1 and TI(Vk!) >1

3/5 5/10 All ACCCGGGT 5/6 4/8 3/5 4/10 2/6 ATCA*TGAT All 0/5 1/10 All ATTC-GAAT 3/8 3/6 All 2/10 1/6 CATCGATG 4/8 2/6 1/5 2/10 All CGGCGCCG 3/8 3/6 1/5 2/10 3/6 GTAC All 4/6 0/5 4/10 3/6 TGCA 1/8 All

Short oligonucleotides showing an under- (over-) representation for all sequences of at least one of the data groups: phage (5 sequence sets), prokaryotes (10 sets), human viruses (6 sets), eukaryotes (8 sets), and mitochondrial genomes (6 sets). See Methods for precise definitions of tendencies. I(ij) = inverted complement of ij.

5/8

(Results). Furthermore, a comparison of the three frames [codon positions (1, 2), (2, 3), and (3, 1)] reveals no essential difference in the values of $\rho^*_{TA}$ (data not shown). Beutler et al. (7), comparing TA frequencies in cDNA against intron DNA and intergenic DNA in a sample of human genes, suggested that TA is more under-represented in transcribed than untranscribed regions. They provide experimental evidence that UpA is the diribonucleotide most susceptible to ribonuclease activity. The appealing hypothesis suggested by these facts does not account for the significant under-representation in intergenic regions.

An unusual situation exists in Neurospora (also in Ascobolus). Sequences are observed to direct de novo patterns of methylation, where duplicated sequences are altered by RIP (repeat-induced point mutation) processes and subsequently heavily methylated at position 5 of cytosine, mostly at CpA (13). Such CpA methylation would be expected to engender frequent C·G → T·A mutations, resulting in reduced frequencies of CA and elevated frequencies of TA. In fact, just the opposite is observed in Neurospora ($\rho^*_{CA} = 1.10$, $\rho^*_{TA} = 0.68$). Apparently, then, some other (possibly structural) constraint on TA frequencies is active.

CpG Suppression. The well-documented CG deficiency in vertebrate genomes has traditionally (e.g., ref. 14) been ascribed to the methylation/deamination mutation hypothesis (resulting in C·G → T·A mutations). Further support for this hypothesis comes from the observation that CA·TG is markedly over-represented in vertebrate sequences (Table 1). The relatively normal occurrence of CG in Drosophila, C. elegans, and most prokaryotes might be explained by lack of the associated methylase in these organisms. Surprisingly, yeast and Neurospora have reduced values of $\rho^*_{CG}$ (0.80 and 0.84, respectively; not as low as in vertebrate sequences), though they lack the standard methylase. In eukaryotes CpG methylation is considered a regulatory agent linked with suppression of gene activity (e.g., ref. 15) and related to higher-order chromatin structure (16). Regions of under- or nonmethylated CpG appear to be in a relaxed conformation compared with bulk chromatin (17). Methylation in concert with a host of chromosomal proteins could provide the cell with stability and heritability of the chromatin state. Along these lines, evidence is available that DNA methylation plays a decisive role in the maintenance of X-chromosome inactivation in mammals (17).

CpG is strongly under-represented in Epstein-Barr virus (EBV), is of normal relative abundance in herpes simplex virus I (HSV1) and varicella-zoster virus (VZV), but is the most over-represented dinucleotide in cytomegalovirus (CMV). Honess et al. (18) note that EBV and HVS (herpes virus saimiri), both lymphotropic, show CpG deficiency, while the neurotropic HSV1 and VZV do not. They propose that a latent existence inside a regularly dividing (e.g., lymphatic) cell would tend to select against CG because of dynamic methylation activities, while existence in a nondividing cell (e.g., a neuron) would expose CG doublets to only minimal methylation. This model does not explain, however, why CMV, broadly extant in dividing as well as nondividing cell types, has high relative CG abundance.

The remarkable under-representation of CpG in mitochondrial sequences lacks a convincing explanation, since the associated methylase is either absent or occurs at very low levels in these organelles (19). One possible interpretation is that most occurrences of CpG are disadvantageous in both chromosomal and organellar DNA of eukaryotes, and that the methylase helps expedite changes that are desirable in their own right. This would explain why eukaryotes have not evolved an efficient mechanism to counteract the mutagenic activity of the methylase. Consistent with this interpretation, the product of methylase mutations, CA·TG, is not over-represented in mitochondrial sequences as in eukaryotes (Table 2). Several authors (4, 5, 20) have suggested that CG deficiencies may be due to structural constraints at the DNA level. In this vein, distinct modes of DNA packing (e.g., nucleosomes, etc.) in eukaryotes vs. viruses vs. prokaryotes vs. organelles may entail differing constraints on the occurrence of CG, a factor that may relate to the puzzling over-representation of CG in E. coli, B. subtilis, vaccinia virus, and cytomegalovirus.

Other Under-represented Di- and Trinucleotides. Interestingly, the familiar core doublets of donor and acceptor splice junction signals, GT·AC and AG·CT, are widely under-represented (Table 1). Possibly, in line with our speculations about the rarity of TA, these dinucleotides may be maintained at low frequency so as to avoid superfluous (or undesirable) occurrences of splicing signals. The consistent rarity and under-representation of CTA·TAG in pro- and eukaryotic DNA presents a conundrum. It cannot be due simply to its being a stop codon, since the other two stop codon triplets, TAA and TGA, are not particularly under-represented in most organisms. The broad under-representation of the triplet pair GCA·TGC is another phenomenon that lacks a convincing explanation. Low cysteine (coded TGY) usage might contribute to low GCA·TGC frequencies in coding regions, but this fact alone could hardly explain the consistently low representation values observed.

Extreme Rarity of CTAG in Many Prokaryotes and Phages. The unusual rarity of CTAG in most bacterial and phage sequences is intriguing (10). It is possible that the overlap of oppositely oriented TAG stop codons in CTAG is somehow detrimental. However, the tetranucleotide TTAA, which also contains a stop codon in both orientations, has normal representations in most of the data sets. In this context, the near universal under-representation of AATT in nuclear DNA sequences is also intriguing. Other speculative explanations relate to the fact that the consensus binding site for the trpR repressor, ACTAGTTAACTAGT (21), contains two copies of CTAG. There is some evidence from the crystal structure of the trp repressor/operator complex that the two CTAGs "kink" when bound by trpR (22), and it is possible that maintenance of the kinks under supercoiled or other structural conditions is disadvantageous to DNA stability. Parenthetically, the metJ repressor binding site involves an eight-base-pair repeated sequence with consensus sequence containing multiple CTAG words (23). The DCM methylase/short patch repair system (24) could also be involved.
The DCM methylase targets the second C of the pentanucleotide CCAGG, which can then mutate by deamination to CTAGG. The repair system corrects T·G mismatches of this sort back to C·G. If the repair system lacked perfect specificity and sometimes "corrected" legitimate occurrences of CTAG, that might contribute to the rarity of CTAG. In keeping with the expectations of this hypothesis, CCAG·CTGG ($\tau^* = 1.21$) is substantially over-represented in E. coli.

High Relative Abundance of Homo-oligonucleotides. Iterations of a single nucleotide are quite frequent in many organisms, though they tend to be over-represented primarily at the dinucleotide level. Runs of A or T are more common than runs of G or C in our data sets. The high frequency of poly(A) and poly(T) runs might be due to polymerase slippage events at weakly hydrogen bonded A·T base pairs. In the mitochondrial genomes, although C+G frequencies are invariably low, CC·GG is, with one exception (Paramecium), the most over-represented doublet. Relatively high glycine (encoded from GGN) usage in mitochondrial proteins may contribute significantly to this phenomenon, since a large fraction of mitochondrial DNA codes for proteins.

Contrast Between Organelle and Nuclear DNA. There are manifest differences between the oligonucleotide compositions of nuclear and organelle DNA. In particular, the dinucleotides AG·CT and CA·TG exhibit contrasting patterns of representation. Whereas AG·CT tends to be of low relative abundance in prokaryotic and most eukaryotic nuclear sequences, $\rho^*_{AG}$ is close to one in the two chloroplast genomes (1.02 in rice, 0.97 in tobacco) and AG·CT is distinctly over-represented in five of the six mitochondrial genomes. The CA·TG doublet diverges even more sharply: CA·TG is generally over-represented in eukaryotic nuclear sequences but is under-represented without exception in the mitochondrial and chloroplast genomes. There are contrasts too at the trinucleotide level. CTA·TAG, which is of low relative abundance in eukaryotic nuclear sequences, has essentially normal representation in the chloroplast and mitochondrial genomes. Similarly, the predominant over-representation of TGG·CCA in viral and eukaryotic nuclear sequences is reversed in most mitochondrial genomes. The reasons for these differences are unknown but may relate to the evolutionary histories of the organelles: chloroplasts putatively descended from cyanobacteria and mitochondria derived by lateral transfer from the α-purple bacteria, both of which are quite distant phylogenetically from most of the bacteria or any of the eukaryotes.

Of all the tendencies for di-, tri-, and tetranucleotides reported in Table 2, fewer are exhibited for the bacteria (only four) than for the mitochondria (nine), the viruses (eleven), the eukaryotes (thirteen), or the bacteriophages (thirteen). This outcome probably results from the vast phylogenetic range represented by the bacteria in our data set, as compared to the relatively narrow range of phage (mostly coliphage) sequences and viral (mostly human herpesvirus) sequences available. The somewhat lesser number of conserved short oligonucleotide tendencies observed across the mitochondrial sequences may reflect the high rate of nucleotide substitution observed in mitochondrial genomes as well as the relatively broad range of mitochondrial genomic sequences (from Paramecium to human) in our data set. In this context, it is striking that the eukaryotes, extending from yeast and Neurospora to humans, should exhibit many consistent patterns of short oligonucleotide usage.

We are happy to acknowledge discussions and comments by Drs. B. E. Blaisdell, D. Botstein, V. Brendel, M. McClelland, and C. Yanofsky. This work was supported in part by National Institutes of Health Grants HG00335-03, GM10452-28, and AI08573 and by National Science Foundation Grant DMS86-06244.

1. Bernardi, G., Mouchiroud, D., Gautier, C. & Bernardi, G. (1988) J. Mol. Evol. 28, 7-18.
2. Inman, R. B. (1966) J. Mol. Biol. 18, 464-472.
3. Gilson, E., Saurin, W., Perrin, D., Bachellier, S. & Hofnung, M. (1991) Nucleic Acids Res. 19, 1375-1383.
4. Nussinov, R. (1981) J. Biol. Chem. 256, 8458-8462.
5. Nussinov, R. (1987) J. Theor. Biol. 125, 219-235.
6. Ohno, S. (1988) Proc. Natl. Acad. Sci. USA 85, 9630-9634.
7. Beutler, E., Gelbart, T., Han, J., Koziol, J. A. & Beutler, B. (1989) Proc. Natl. Acad. Sci. USA 86, 192-196.
8. Kozhukhin, C. G. & Pevzner, P. A. (1991) Comput. Appl. Biosci. 7, 39-49.
9. McClelland, M. (1985) J. Mol. Evol. 21, 317-322.
10. McClelland, M., Jones, R., Patel, Y. & Nelson, M. (1987) Nucleic Acids Res. 15, 6985-7008.
11. Hollander, M. & Wolfe, D. A. (1973) Nonparametric Statistical Methods (Wiley, New York).
12. Josse, J., Kaiser, A. D. & Kornberg, A. (1961) J. Biol. Chem. 236, 864-875.
13. Selker, E. U. (1990) Annu. Rev. Genet. 24, 579-613.
14. Bird, A. P. (1986) Nature (London) 321, 209-213.
15. Cedar, H. & Razin, A. (1990) Biochim. Biophys. Acta 1049, 1-8.
16. Tazi, J. & Bird, A. (1990) Cell 60, 909-920.
17. Riggs, A. D. (1990) Philos. Trans. R. Soc. London Ser. B 326, 285-297.
18. Honess, R. W., Gompels, U. A., Barrell, B. G., Craxton, M., Cameron, K. R., Staden, R., Chang, N. Y. & Hayward, G. S. (1989) J. Gen. Virol. 70, 837-855.
19. Nass, M. M. K. (1976) Handbook of Genetics, ed. King, R. C. (Plenum, New York), Vol. 5, pp. 477-533.
20. Lennon, G. G. & Fraser, N. W. (1983) J. Mol. Evol. 19, 286-288.
21. Gunsalus, R. P. & Yanofsky, C. (1980) Proc. Natl. Acad. Sci. USA 77, 7117-7121.
22. Otwinowski, Z., Schevitz, R. W., Zhang, R.-G., Lawson, C. L., Joachimiak, A., Marmorstein, R. Q., Luisi, B. F. & Sigler, P. B. (1988) Nature (London) 335, 321-329.
23. Rafferty, J. B., Somers, W. S., Saint-Girons, I. & Phillips, S. E. V. (1989) Nature (London) 341, 705-710.
24. Lieb, M. (1985) Mol. Gen. Genet. 199, 465-470.
