Grammatical rule for all DNA

Electrophoresis 1991,12, 103-108

Susumu Ohno¹, Tetsuya Yomo²
¹Beckman Research Institute of the City of Hope, Department of Theoretical Biology, Duarte, CA
²Department of Fermentation Technology, Faculty of Engineering, Osaka University, Osaka


The grammatical rule for all DNA: Junk and coding sequences

Selfish DNA, coding sequences, and junk DNA in the genome are no strangers to each other; rather, they represent three phases in the life cycle of DNA. Accordingly, they all obey the same grammatical rule of TG/CA/CT excess and CG/TA deficiency. On the one hand, it is this very rule which keeps the isoelectric points of most proteins near the neutral range. On the other hand, this rule creates numerous palindromes, thus maintaining symmetry between complementary strands. Many of these palindromes encode identical oligopeptides on both strands.

1 Introduction

It seems as though "junk DNA" has become legitimate jargon in the glossary of molecular biology. Considering the violent reactions this phrase provoked when it was first proposed in 1972 [1], the aura of legitimacy it now enjoys is amusing, indeed. Nevertheless, it should be pointed out that "junk DNA" is not the same as "selfish DNA" [2, 3], for the latter refers to relatively short sequences that have a way of propagating themselves as though they were parasites. The mammalian genome, comprising roughly 3 × 10⁹ base pairs of DNA, harbors at most 10⁵ genes, the majority of them encoding proteins. In DATABASE, 18,383 entries of amino acid sequences, primarily deduced from cDNA base sequences, are registered. The average length of these proteins is 285 residues [4]; thus, it appears that the portion of the genome actually engaged in encoding proteins totals only 8.55 × 10⁷ base pairs. Various regulatory signal sequences, typically residing in the immediate vicinity of coding regions, are characteristically short. Even if we make a generous allowance of 300 base pairs of regulatory signals per gene, the total regulatory signals in the genome amount to only 3 × 10⁷ additional base pairs. The above leaves 95 % of the genomic DNA as junk. Inasmuch as "selfish DNA" typically amounts to only 35 % of the genome, most of the junk, occurring as intergenic spacers, intervening sequences, and pseudogenes, is not of the selfish variety.

While the above consideration renders the "selfish DNA" of the contemporary world a mere subtype of "junk DNA", its very selfishness reveals the primordial nature of repetitious sequences. As far as nucleic acids (RNA) were concerned, chemical processes operating in the reducing atmosphere of the prebiotic world would have synthesized only oligonucleotides. Included among these oligonucleotides were those with internal repetitiousness, and only these managed to elongate themselves after each replication via unequal pairing.
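The arithmetic in the paragraph above can be retraced in a few lines. The sketch below is only a back-of-envelope check that uses the figures quoted in the text; the function and constant names are illustrative and do not come from the original paper.

```python
# Back-of-envelope check of the genome budget quoted in the Introduction.
# All inputs are figures given in the text; the names are illustrative only.

GENOME_BP = 3_000_000_000        # roughly 3 x 10^9 base pairs
MAX_GENES = 100_000              # at most 10^5 genes
MEAN_PROTEIN_RESIDUES = 285      # average protein length [4]
REGULATORY_BP_PER_GENE = 300     # the "generous allowance" per gene


def genome_budget(genome_bp=GENOME_BP, genes=MAX_GENES,
                  residues=MEAN_PROTEIN_RESIDUES,
                  regulatory_bp=REGULATORY_BP_PER_GENE):
    """Return (coding bp, regulatory bp, fraction of the genome left over)."""
    coding = genes * residues * 3          # three bases per codon
    regulatory = genes * regulatory_bp
    leftover = 1 - (coding + regulatory) / genome_bp
    return coding, regulatory, leftover


if __name__ == "__main__":
    coding, regulatory, leftover = genome_budget()
    print(f"coding: {coding:.3g} bp, regulatory: {regulatory:.3g} bp, "
          f"remainder: {leftover:.1%}")
```

With these inputs the remainder comes to about 96 %, in line with the roughly 95 % quoted above.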
It follows that the first set of coding sequences to emerge at the beginning of life on earth had to be repeats of base tetramers and pentamers [5]. The above view is supported by the fact that, even today, each new gene which arises de novo within a species is largely made of oligomeric repeats; e.g., the circumsporozoite antigen genes of two malarial protozoa, Plasmodium falciparum and P. knowlesi [6, 7]. Viewed in the above light, "junk DNA" and "coding DNA", as well as "selfish DNA" and "altruistic DNA", are no strangers to each other. Rather, they represent different phases in the life cycle of DNA, the youngest being "selfish DNA", while middle age is represented by "coding DNA" and the old, declining age by "nonrepetitious junk DNA." Accordingly, all DNA obeys the same grammatical rule of TG/CA/CT excess and CG/TA deficiency.

Correspondence: Dr. Susumu Ohno, Department of Theoretical Biology, Beckman Research Institute of the City of Hope, 1450 E. Duarte Road, Duarte, CA 91010-0269, USA

© VCH Verlagsgesellschaft mbH, D-6940 Weinheim, 1991

1.1 The antiquity of the universal rule: TG/CA/CT excess, CG/TA deficiency

The above-noted grammatical rule was first proposed for coding sequences [8] but was later found to apply to noncoding regions of DNA as well [9, 10]. The deficiency of CG base dimers has been known for a long time but was usually attributed to the methylation of the C of CG, methylated CG being converted to either TG or CA [11]. TA deficiency, on the other hand, is a more recent discovery [8, 12]. We shall now show that the universal grammatical rule noted above is independent of CG methylation and that it probably existed before the beginning of life on earth.

Influenza virus is an RNA virus and, unlike retroviruses, it is not incorporated into the host genome in the form of reverse-transcribed DNA. Accordingly, viruses of this kind should not have experienced the conversion of CG to TG and CA throughout their entire existence; for the sake of uniformity, RNA sequences are treated here as though they were cDNA. Yet, as Fig. 1 shows, the coding sequence for two hemagglutinins of human influenza virus clearly obeys the universal grammatical rule [13]. This 1698-base-long region is rich in A and poor in C. Nevertheless, CG and TA are deficient, while three dimers, TG, CA, and CT, are in excess. Reciprocal dimers of the above five are present at more or less expected rates.

Among species extensively studied in genetics, Drosophila melanogaster stands out from the rest by virtue of its entire lack of DNA methylase. Thus, it has been thought that the deficiency of CG is not present in this species. However, as shown in Fig. 2, when a long enough sequence, such as the 1743-base-long coding region of the frizzled locus, is dealt with, the deficiency of CG, albeit milder, is still present, together with all the other components of the universal grammatical rule. Thus, it appears that the universal rule transcends CG methylation, indicating its antiquity.

[Figure 1 graphic: base dimer analysis of the hemagglutinin (HA1 + HA2) gene (1,698 bases) of human influenza A virus (A/Aichi/2/68). Legible base composition: A 539 (0.317), C 351 (0.207); the remaining values did not survive extraction.]

Figure 1. Results of the base dimer analysis performed on the coding sequence for 2 hemagglutinins of human influenza RNA virus [13]. At the top, its base composition is shown by the numbers, with fractions in parentheses, of the 4 bases. Immediately below, the observed numbers, accompanied by expected numbers in parentheses, of 5 pairs of reciprocal dimers are shown. The observed/expected ratio is also shown immediately below each dimer. The three base dimers that are always in excess are underlined by thick solid bars, whereas the 2 always-deficient dimers are accompanied by thick open bars. Their reciprocal dimers, occurring in more or less expected numbers, are underlined by shaded bars.

[Figure 2 graphic: base dimer analysis of the frizzled-locus coding sequence (1,743 bases) of the fruit fly (Drosophila melanogaster). Legible base counts (fractions): 483 (0.277), 479 (0.275), 435 (0.249), 346 (0.199); the base labels and most dimer values are garbled in the source.]

Figure 2. Results of the base dimer analysis of the coding sequence for the frizzled-locus tissue polarity protein of Drosophila melanogaster [18]. CG deficiency is present, but only to a mild degree, as a species characteristic. Accordingly, the CT excess that compensates for CG deficiency is negligible (arrow). TA deficiency of the customary degree, on the other hand, caused the regular degrees of compensatory excess of TG and CA.

2 The universal codon assignments are but a compromise to the universal grammatical rule

If this world's proteins descended from random assemblages of 20 building blocks, the sampling of a large enough variety of proteins should result in an average amino acid composition in which all 20 amino acid residues are represented equally, i.e., 5 % each. This is clearly not the case, as shown in Fig. 3 [4]. The average amino acid composition deduced from 18,383 entries in DATABASE clearly reflects the universal codon assignments, with a few notable exceptions. Thus, the bottom eight rankings are occupied by those residues to which only one or two codons are assigned. Met and Trp, with one codon each, occupy the 18th and the very bottom, 20th, rankings, while two codons each are assigned to No. 13 (Asn) and to No. 19 (Cys). The top four rankings are enjoyed by those with six or four codons assigned to them. In hydrophobic proteins, Leu, Ala, and Gly add up to 35 % or more of the total, whereas Ser may become the most abundant residue in proteins made mostly of β-sheet structures, e.g., members of the immunoglobulin family. It is small wonder, then, that six codons are assigned to No. 1 (Leu), as well as No. 4 (Ser), and that Ala and Gly are endowed with four codons each.

A prominent contradiction to the universal codon assignment is found in the fact that proteins of the average amino acid composition are nearly neutral in electric charge, as most proteins are, even though codon assignments are uneven between acidic and basic residues. Only two codons each are assigned to Glu and Asp, while six codons are assigned to Arg, and two to Lys. The reason that No. 9 (Arg) is outranked by Glu and Lys is found in the CG deficiency part of the universal grammatical rule. Four of the six Arg codons are CGX, and as base trimers CGX are only half as numerous as GCX. Accordingly, most of the Arg residues are encoded by the two remaining codons: AGA and AGG. Were it not for CG deficiency, the universal codon assignments would have yielded proteins in which Arg is more numerous than Ala. Such proteins with strong basic charges would have served the cell well only as nucleic acid binding proteins, while they would have been of little use as enzymes and cytoskeletal proteins.

Figure 3 also shows that among aromatic residues, No. 15 (Phe) outranked No. 16 (Tyr). This is a consistent trend seen in most proteins and is due to the TA deficiency part of the universal grammatical rule. In proteins with several transmembrane α-helices, such as members of the rhodopsin family, Phe, encodable by TTC and TTT, becomes a prominent residue enjoying 2nd or 3rd ranking. Such prominence is never achieved by Tyr because the two codons assigned to it are TAC and TAT.

[Figure 3 graphic: the average amino acid composition deduced from 18,383 proteins registered in DATABASE, and the number of codons assigned to each residue. Legible entries: No. 1 Leu 9.2 % (6 codons); Ala 7.8 % (4); Gly 7.4 % (4); No. 4 Ser 7.2 % (6); Val 6.5 % (4); Lys 5.9 % (2); No. 9 Arg 5.4 % (6); No. 11 Pro 5.2 % (4); No. 12 Ile 5.1 % (3); No. 13 Asn 4.2 % (2); No. 15 Phe 3.7 % (2); No. 16 Tyr 3.1 % (2); No. 17 His 2.3 % (2); Cys 1.9 % (2); No. 20 Trp 1.0 % (1). Ratios: (Arg 5.4 % + Lys 5.9 %)/(Glu 6.2 % + Asp 5.3 %) = 0.98; Tyr/Phe = 0.84.]

Figure 3. The average amino acid composition of proteins deduced from 18,383 entries in DATABASE is shown as percentages of the 20 amino acid residues [4]. The numbers of codons assigned to individual residues are also shown. The three residues with the maximum of 6 codons are underlined by the thickest bars, while those with lower numbers of codons are underlined by progressively thinner bars. The bars of the 2 basic and 2 acidic residues are drawn solid, while those of all others are dashed. The average basic-to-acidic ratio as well as the average Tyr-to-Phe ratio is also shown.

As shown in Fig. 4, the amino acid composition of a protein is greatly influenced by the base composition of its encoding sequence. This is because of the codon assignments that are coupled to the universal grammatical rule. As shown in Fig. 1, the two hemagglutinins of human influenza virus are encoded by an A-rich, C-poor base sequence. Inasmuch as AT ranked 2nd after AA among the 16 base dimers, Ile, which in the average composition ranks only as No. 12 (Fig. 3), became No. 1 in this protein, and 19 of the 47 Ile residues were encoded by the ATC codon. In contrast, there were only 9 TAC Tyr codons, although ATC and TAC are made of the same three bases. Such is the consequence of TA deficiency. Thus, the Tyr/Phe ratio of this protein was similar to that of the average, being 0.86. Because of the paucity of C, Ala forfeited its customary No. 2 ranking and fell to position No. 11. Nevertheless, GCX codons encoded a total of 28 Ala residues. In sharp contrast, CGX codons encoded only 6 Arg residues. As a result of this CG deficiency, this protein, too, managed to remain nearly neutral in electric charge.

Because of TA deficiency, TAA and TAG are among the fewest of the 64 base trimers. Accordingly, the choice of these two as chain terminators is sensible. On the other hand, the 3rd chain terminator, TGA, is always numerous because of TG excess. TGA was most likely a Trp codon in the original codon assignment [14].

[Figure 4 graphic: hemagglutinin (HA1 + HA2) protein (566 codons) of human influenza RNA virus (A/Aichi/2/68). Legible codon usage:
No. 1: 47 Ile (ATC 19, ATT 14)
No. 3: 44 Leu (CTG 18, CTT 9, CTA 7, TTG 6, CTC 3, TTA 1)
No. 6: 39 Ser (AGC 15, TCA 9, TCT 6, TCC 4, AGT 3, TCG 2)
No. 7: 31 Asp (GAC 19, GAT 12)
No. 8, 9: 30 Lys (AAA 18, AAG 12); 30 Glu (GAA 16, GAG 14)
Arg, rank not legible (AGG 11, AGA 10, CGG 3, CGC 2, CGA 1, CGT 0)
No. 11: 28 Ala (GCA 11, GCT 10, GCC 5, GCG 2)
No. 14: 21 Phe (TTC 12, TTT 9)
No. 17: 18 Tyr (TAC 9, TAT 9)
Ratios: (Arg + Lys)/(Glu + Asp) = 0.93; Tyr/Phe = 0.86; (TAA + TAG)/TGA = 1.0
Chain-terminating triplets and their complementaries: TAA 14, TAG 16, TGA 30; TTA 10, CTA 15, TCA 32]

Figure 4. More detailed analysis of the coding sequence for 2 hemagglutinins of human influenza virus [13]. At the bottom left, the basic-to-acidic residue ratio, the Tyr-to-Phe ratio, as well as the ratio between the two TA-containing potential chain terminators and the TG-containing one are shown. The remainder contains the rankings and numbers of 9 pertinent residues and their codon usages. Codon usages, as a rule, reflect base trimer frequencies; thus, abundant base trimers such as CTG and ATC, in this instance, also become dominant codons for the numerous Leu and Ile. The frequent usage of GCX base trimers as Ala codons is to be contrasted with the infrequent usage of CGXs as Arg codons. Thus, CG deficiency is mainly responsible for maintaining the balance between basic and acidic residues. The consequence of TA deficiency becomes obvious when the frequent usage of ATC and ATT as Ile codons is contrasted with the infrequent usage of their rearranged trimers TAC and TAT as Tyr codons. The numbers of the three potentially chain-terminating base triplets, TAA, TAG and TGA, are contrasted with those of their reciprocal base trimers, TTA, CTA and TCA. The first two of the latter are potential but infrequently used Leu codons, while the last is a frequently used Ser codon.
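The base dimer analysis behind Figs. 1 and 2, on which the CG and TA deficiencies invoked in this section rest, can be sketched as follows. This is a minimal illustration under one stated assumption: the expected count of a dimer XY is taken as f(X) · f(Y) · (N − 1), i.e., independent neighbouring bases. The excerpt does not spell out the authors' exact formula, although this null model does reproduce one expected value legible in Fig. 2 (0.277 × 0.249 × 1742 ≈ 120). The function name is hypothetical.

```python
from collections import Counter


def dimer_analysis(seq):
    """Observed vs. expected counts for all 16 base dimers of one strand.

    Assumption (not stated in the source): expected = f(X) * f(Y) * (N - 1),
    where f is the base fraction and N the sequence length.
    Returns {dimer: (observed, expected, observed/expected)}.
    """
    # RNA is treated as cDNA, as in the text (U is read as T).
    seq = seq.upper().replace("U", "T")
    n = len(seq)
    base_freq = {b: seq.count(b) / n for b in "ACGT"}
    observed = Counter(seq[i:i + 2] for i in range(n - 1))  # overlapping dimers

    result = {}
    for x in "ACGT":
        for y in "ACGT":
            expected = base_freq[x] * base_freq[y] * (n - 1)
            obs = observed[x + y]
            ratio = obs / expected if expected else float("nan")
            result[x + y] = (obs, expected, ratio)
    return result


if __name__ == "__main__":
    toy = "ACGT" * 5  # toy input, not a sequence from the paper
    for dimer, (obs, exp, ratio) in sorted(dimer_analysis(toy).items()):
        print(f"{dimer}: observed {obs}, expected {exp:.2f}, ratio {ratio:.2f}")
```

A ratio well below 1 for CG and TA, together with ratios above 1 for TG, CA and CT, would correspond to the pattern the text calls the universal grammatical rule.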

3 The universal grammatical rule imposes symmetry between two complementary strands of DNA

Two deficient dimers, TA and CG, are palindromes, and two of the three excessive dimers, TG and CA, are complementary to each other. Accordingly, the universal grammatical rule forces two complementary strands of DNA to remain symmetrical to each other (Fig. 5). The 19002-base-long human serum albumin gene can be considered a microcosm of the entire genome because the 609-codon-long coding region, represented by 14 exons, comprises only 9.6 % of the total [15]. Furthermore, four of the thirteen introns contain five copies of the Alu family of repeats. In Fig. 5, the base trimers of the entire region are ranked from No. 1 to No. 64 in the order of their abundance, and they are represented as 32 complementary pairs. Note that two complementary base trimers occupied consecutive ranks eleven times. For example, 585 ATT ranked No. 3, whereas its complementary, AAT, numbering 561, ranked next, occupying position No. 4. This is as perfect a symmetry as one can expect, for in the complementary strand, 585 AAT ranked No. 3 and 561 ATT ranked No. 4. This region is not only AT-rich (65 %), but also has an unbalanced A/T ratio; 6462 A outnumber 5891 T considerably. Were this region to contain 25 % each of A, T, G, and C, the symmetry between the two complementary strands might have been nearly perfect.

With regard to the three potentially chain-terminating base triplets, symmetry between the influenza hemagglutinin gene and its complementary strand is also evident. As shown at the bottom right of Fig. 4, the coding strand contained 14 TAA, 16 TAG and 30 TGA. Its complementary strand would have contained 10 TAA, 15 TAG and 32 TGA, which are represented as TTA, CTA and TCA in the coding strand.

[Figure 5 graphic: the 64 base trimers of the human serum albumin gene ranked by abundance and shown as 32 complementary pairs. Only fragments of the ranking survive (e.g., No. 45, TCC, 205); the caption is missing from the source.]
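The trimer ranking described for Fig. 5 can be sketched in a few lines. The code below is an illustration only: it counts overlapping base trimers on one strand, ranks them by abundance, and lists each trimer next to its complementary trimer (its reverse complement, so that ATT pairs with AAT as in the text). Function names and the tie-breaking rule (alphabetical) are assumptions, not details taken from the paper.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complementary strand read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def ranked_trimer_pairs(seq):
    """Rank overlapping trimers by abundance and pair each with its complementary.

    Returns a list of tuples
    (trimer, count, rank, complementary trimer, its count, its rank or None).
    A trimer can never equal its own reverse complement, so the 64 trimers
    always fall into 32 such pairs.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    ranked = sorted(counts, key=lambda t: (-counts[t], t))  # ties: alphabetical
    rank = {t: i + 1 for i, t in enumerate(ranked)}

    pairs, seen = [], set()
    for trimer in ranked:
        if trimer in seen:
            continue
        partner = reverse_complement(trimer)
        seen.update((trimer, partner))
        pairs.append((trimer, counts[trimer], rank[trimer],
                      partner, counts.get(partner, 0), rank.get(partner)))
    return pairs


if __name__ == "__main__":
    toy = "ATTGCAATTCATGAATTGCA"  # toy input, not the albumin gene
    for row in ranked_trimer_pairs(toy):
        print(row)
```

On a strand obeying the symmetry described above, the two counts in each row would be close, and the two ranks would often be consecutive.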


4 TGCA (CATG) as the primordial palindromic tetramer

The maintenance of nearly maximal symmetry between complementary strands renders each strand full of palindromes. Anybody who has attempted to determine the secondary structure of a transcript or an mRNA will have realized this immediately. Since a strand is so full of overlapping complementary segments, it can assume not one but an almost unlimited variety of secondary structures.

Needless to say, the word "palindrome" was borrowed from language. In 17th century Japan, contests of palindromic poems were a popular pastime among men of leisure. In the West, Sotades of Crete (3rd century B.C.) is generally credited as the inventor of palindromes. The most ingenious, however, is the following Latin phrase by an unknown author of the Middle Ages: SATOR AREPO TENET OPERA ROTAS. Since the phrase is comprised of five words of five letters each, and the first word consists of the first letters of the five words, the second word of the second letters, etc., the phrase can be arranged as a rectangle. Starting from the top left, it can be read from left to right or from top to bottom. Starting from the bottom right, it can also be read from right to left or from bottom to top. All end up the same:

SATOR
AREPO
TENET
OPERA
ROTAS

which can be loosely translated as "The carpenter named Arepo works a wheel with care."

Can there be an equally ingenious palindrome in base sequences? Of the 256 base tetramers, 16 would be palindromic, but 4 (in reality 2) of them would actually be dimeric repeats, i.e., GCGC and TATA; as repeats, GCGC and CGCG become one and the same (as do TATA and ATAT). Because of the CG/TA deficiency part of the universal grammatical rule, these repeats are expected to be rare, which indeed they are [16, 17], and so are TTAA (AATT) and CCGG (GGCC).

Of the remaining 8 (4) palindromic tetramers containing one each of the four bases, CTAG (AGCT) is incapable of encoding a protein because of the inclusion of the TAG chain terminator, while ACGT (GTAC) as well as TCGA (GATC) are expected to be rare because of the inclusion of the deficient CG and TA. The above leaves only 2 (1) palindromic tetramers: TGCA (CATG). Because of the TG/CA excess part of the universal rule, this pair of tetramers is abundant in all regions of DNA that are so-called unique, occurring at the average rate of once every one hundred bases in sequences of rather balanced base composition [10].

TGCA (CATG) as repeats possess as unique an attribute as that possessed by the above-noted five-word palindromic phrase, thus qualifying as a primordial coding sequence. As shown in Fig. 6, this repeat encodes the same protein of the tetrapeptidic periodicity His-Ala-Cys-Met, not only in all three reading frames of one strand but also in those of the complementary strand. The inclusion of two sulfur-containing residues, Cys and Met, in a primordial polypeptide is also in keeping with the current view that life started in the region around a geothermal vent in the ocean. At the beginning of life, the replication error rate had to be very high (of the order of 10 [exponent lost in source] per base pair per replication); thus, CATG repeats in two complementary strands would soon have started to encode six different proteins.
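The counting argument above (16 palindromic tetramers, of which 4 are dimeric repeats and 8 contain one each of the four bases) can be checked by brute force. The sketch below is illustrative; the helper names are hypothetical, and "palindromic" is taken to mean identical to its own reverse complement, which is how the text uses the term.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def palindromic_tetramers():
    """All tetramers identical to their own reverse complement."""
    tetramers = ("".join(p) for p in product("ACGT", repeat=4))
    return [t for t in tetramers if t == reverse_complement(t)]


def dimers_in_tandem_repeat(tetramer):
    """Dimers present when the tetramer is repeated in tandem,
    including the dimer formed at the junction between two copies."""
    extended = tetramer + tetramer[0]
    return {extended[i:i + 2] for i in range(4)}


def classify():
    """Split the palindromic tetramers into the classes used in the text."""
    pal = palindromic_tetramers()
    dimeric_repeats = [t for t in pal if t[:2] == t[2:]]
    one_of_each = [t for t in pal if len(set(t)) == 4]
    # Tetramers whose tandem repeats avoid the deficient dimers CG and TA.
    survivors = [t for t in one_of_each
                 if not {"CG", "TA"} & dimers_in_tandem_repeat(t)]
    return pal, dimeric_repeats, one_of_each, survivors


if __name__ == "__main__":
    pal, dimeric, one_each, survivors = classify()
    print(len(pal), "palindromic tetramers")
    print("dimeric repeats:", dimeric)
    print("one of each base:", one_each)
    print("repeats free of CG and TA:", survivors)
```

Filtering the one-of-each class on the absence of CG and TA in the tandem repeat leaves TGCA and CATG, the pair singled out in the text.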

[Figure 6 graphic not recoverable: reading-frame translations of TGCA (CATG) repeats (boxed, top), of CAGCTG hexameric repeats (Gln-Leu, Ser-Cys and Ala-Ala periodicities legible), and of other palindromic tetramer repeats (bottom).]

Figure 6. The unique quality of TGCA (CATG) among tetrameric palindromes containing one each of the 4 bases is illustrated. As repeats, TGCA (CATG) encode proteins of the identical tetrapeptidic periodicity, not only in all 3 reading frames of one strand, but also in those of its complementary strand, as shown inside the box at the top. The TG arm of the bar underlining this palindromic tetramer is made solid, while the CA arm is left open. The elongation of CATG to hexameric palindromes, such as the most common, CAGCTG, does no good; not only is its peptidic periodicity reduced from 4 to 2, but the 3 reading frames also give rise to 3 different dipeptidic periodicities, as shown immediately below the boxed-in TGCA (CATG) primordial repeats. As shown at the bottom, repeats of other palindromic tetramers are also of no use. Neither strand of AGCT (CTAG) repeats can encode a protein because of the inclusion of the TAG chain terminator (bottom, left). GATC (TCGA) palindromes are also no good: because of the inclusion of the deficient CG, GATC (TCGA) seldom become tandem repeats. Furthermore, when CG is converted to TG or CA, by CpG methylation and other means, this particular repeat gains a succession of TGA chain terminators.
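The property illustrated in Fig. 6 can be verified directly by translating the repeats. The sketch below builds the standard genetic code from the conventional TCAG-ordered table and translates a repeat in each of its three reading frames; peptides are given in one-letter code (H = His, A = Ala, C = Cys, M = Met, Q = Gln, L = Leu, S = Ser). The function names are illustrative.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq, frame=0):
    """Translate complete codons starting at offset `frame` ('*' = terminator)."""
    return "".join(CODON_TABLE[seq[i:i + 3]]
                   for i in range(frame, len(seq) - 2, 3))


def all_frames(seq):
    """Peptides from the three reading frames of one strand."""
    return [translate(seq, frame) for frame in range(3)]


if __name__ == "__main__":
    tgca = "TGCA" * 6
    print("TGCA repeat, strand 1:", all_frames(tgca))
    print("TGCA repeat, strand 2:", all_frames(reverse_complement(tgca)))
    print("CAGCTG repeat:        ", all_frames("CAGCTG" * 4))
```

Because a TGCA repeat is its own reverse complement, the second strand yields the same three peptides, each a rotation of the His-Ala-Cys-Met period, whereas the CAGCTG repeat gives three different dipeptide periodicities.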

We have already noted that the circumsporozoite antigens of two malarial protozoa arose de novo. TGCA is contained in both of these coding sequences. TGCA comprises one-third of each 12-base-long repeating unit encoding the simpler tetrapeptidic Asn-Ala-Asn-Pro periodicity of the circumsporozoite antigen of Plasmodium falciparum [6]. The 36-base-long repeating unit of the circumsporozoite antigen gene of Plasmodium knowlesi, on the other hand, contains one copy of TGCA and one and two copies of its two single-base-substituted versions, TGGA and AGCA [7]. Furthermore, one of the two alternative reading frames of the latter is still open, and so are all three reading frames of its complementary strand.

5 The abundance of palindromes that still encode homologous peptides on both strands

Because of the symmetry imposed on both strands of DNA by the universal grammatical rule, each strand, whether it be coding or junk, is full of palindromes. Are they still capable of encoding homologous peptide fragments on both strands, reminiscent of the amazing capability of the primordial TGCA (CATG) repeats (Fig. 6)? As already noted, the most consistently abundant palindromic tetramer is TGCA (CATG), and it often grows into palindromic hexamers such as CTGCAG. When each arm of such a hexamer is used as a codon, the hexamer encodes identical dipeptides on both strands. The most common is Leu-Gln, encoded by the just-noted hexamer; not far behind is its reciprocal, Gln-Leu, encoded by CAGCTG, which can be regarded as a two-base-inserted version of CATG. Although less often, Val-His, Ala-Cys, and Thr-Cys, encoded by GTGCAC, GCATGC, and ACATGT, are regularly encountered in all sequences.

Taking one example each from the 3 coding sequences analyzed in Figs. 1, 2, 4, and 5, we now examine three conditions under which palindromes managed to encode homologous oligopeptides on both strands. Shown at the top of Fig. 7 is the simplest example, found in the influenza virus hemagglutinin gene analyzed in Figs. 1 and 4 [13]. The primordial CATG has grown into a perfect tetradecameric palindrome, thus encoding the identical tetrapeptide Gly-Ala-Cys-Pro on both strands.

Shown next is an example of another kind, drawn from the human gene analyzed in Fig. 5 [15]. In a region made of a succession of short palindromes, each usually centered by the primordial TGCA, codons may straddle these centers. However, such a region may still encode homologous oligopeptides on both strands. In this instance, the pentapeptide Ala-Met-Cys-Thr-Ala, encoded by the coding strand, differed from its counterpart Ala-His-Cys-Thr-Ala by a single-residue substitution. Here, homologous peptides are found in positions not exactly corresponding. The example shown in the third row is drawn from the Drosophila gene analyzed in Fig. 2 [18]. The situation is simply that one 6-base-long arm of what should have been a pentadecameric palindrome with a three-base-long bubble at the center sustained an insertion of three bases, thus becoming an uneven octadecameric palindrome. The hexapeptide Pro-Thr-Leu-Ile-Gln-Gly, encoded by the coding strand, differed from its complementary strand counterpart Pro-Leu-Tyr-Gln-Gly by a single insertion and a single substitution. Needless to say, each of the three coding sequences used here contained several of the other three kinds.

In Fig. 4 it was shown that, because of the imposed symmetry, the numbers of TAA, TAG, and TGA base triplets in a given coding strand are approximately the same as those of their complementaries, TTA, CTA, and TCA. In coding sequences that are either AT-rich or only moderately GC-rich, TGA and, therefore, CTA, are present in significant numbers. Thus, not only are the two unused reading frames of the coding strand closed through frequent interruption by chain terminators, but its complementary strand also cannot be translated in any of the three reading frames to yield a long enough polypeptide chain.
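The palindromic hexamers named at the start of this section can be checked mechanically: each is its own reverse complement, so reading its two arms as codons gives the same dipeptide on either strand. The sketch below is illustrative only; it uses the standard genetic code in one-letter notation (L = Leu, Q = Gln, V = Val, H = His, A = Ala, C = Cys, T = Thr), and the function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from the first base ('*' = terminator)."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def peptides_on_both_strands(seq):
    """Peptide read from the given strand and from its complementary strand,
    both in the 5'-to-3' direction and in the matched reading frame."""
    return translate(seq), translate(reverse_complement(seq))


if __name__ == "__main__":
    for hexamer in ("CTGCAG", "CAGCTG", "GTGCAC", "GCATGC", "ACATGT"):
        print(hexamer, peptides_on_both_strands(hexamer))
    # Two successive palindromes do not make a longer palindrome.
    print("CTGCAGCAGCTG", peptides_on_both_strands("CTGCAGCAGCTG"))
```

The last line illustrates the point made about successive palindromes: CTGCAG followed by CAGCTG is no longer a palindrome, and the two strands then encode different, though related, tetrapeptides.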

Not so with an extremely GC-rich coding sequence. One example was the c-src gene of the chicken [19]. This 533-codon-long sequence was extremely GC-rich (61 %). Accordingly, it contained only 1 TTA, 16 CTA, and 21 TCA. Furthermore, only one CTA and two TCA were utilized as Leu and Ser codons. Inasmuch as the two TCA Ser codons occupied the 140th and 216th positions, whereas the only CTA encoded the 533rd residue, Leu, as the last residue of c-src, as shown in the 4th row of Fig. 7, the complementary strand of this coding sequence was endowed with one 316-codon-long reading frame which matched the reading frame of the coding strand. Furthermore, the first Met codon appeared in the 41st position of the complementary strand open reading frame.

In the 316-codon-long region where both strands can encode proteins in the matched reading frame, several long palindromic segments that were capable of encoding homologous oligopeptides on both strands were present. Only three interesting examples are shown in the 5th to 7th rows of Fig. 7. In the 5th row, the first Met residue encodable by the complementary strand was included in the tetrapeptide His-Glu-Val-Met, which is homologous to His-Asp-Leu-Met of c-src encoded by the dodecameric palindrome. Note that immediately to the right of this palindrome another dodecameric palindrome failed to encode homologous peptide fragments. If another reading frame were in use, however, the tetrapeptide Ala-Ser-Ala-Gly encoded by the coding strand would have been homologous with Ala-Ser-Thr-Gly, encodable by its complementary strand. Shown next is the perfect tetradecameric palindrome, where both strands yielded homologous hexapeptides. Ala-Phe-Leu-Gln-Glu-Ala of c-src would have been homologous with the hexapeptide of the complementary strand, because replacement of Ala with Gly is a conservative substitution. Shown last is an example of two successive palindromes not making a longer palindrome. In language, the word "tenet" is a palindrome and so is the word "deed", but "tenet deed" is no longer a palindrome. Similarly, two successive hexameric palindromes, CTGCAG and CAGCTG, encoded Leu-Gln-Gln-Leu of c-src, whereas the complementary sequence encoded Gln-Leu-Leu-Gln.

The hallmark of two proteins encoded by complementary strands in the matched reading frame is a peculiar brand of symmetrical homology, a segment near the amino terminus of one being homologous with a segment near the carboxyl terminus of the other. Blalock and his colleagues [20] presented the view that two proteins of the above kind would be the mirror image of each other, so that they will be complementary, showing strong binding affinity for each other. Whether or not this kind of peculiar symmetrical homology is the cause of the proposed complementarity is an interesting conjecture. At any rate, we now have the means of identifying two genes that were originally derived from two complementary strands of the same DNA.

[Figure 7 graphic: "Remains of the primordial CATG (TGCA) repeats". Palindromes from the influenza hemagglutinin gene (566 codons), the Drosophila frizzled locus (581 codons), the human gene analyzed in Fig. 5, and chicken c-src (533 codons); the sequence detail is not recoverable.]

Figure 7. There are several ways whereby both strands of palindromic base oligomers in modern coding sequences still encode homologous oligopeptides. The most common three are shown in the top 3 rows, taking one example each from the 3 coding sequences analyzed in Figs. 1, 2, 4 and 5. The positions of these palindromes in the coding sequences are indicated by numbered residues. When palindromes are centered by the primordial CATG (TGCA), one arm is underlined by a solid bar. In other instances, one arm of a palindrome is underlined by a shaded bar. The second row also serves as an example to show the abundance of potential chain terminators in most modern coding sequences. One arm of the last hexameric palindrome, TCATGA, is a potential chain terminator. Thus, not only an alternative reading frame of the coding sequence, but also one reading frame of its complementary sequence, is closed. Another such hexameric palindrome, CTATAG, is seen at the extreme right of the 4th row. Shown in the bottom 3 rows are relevant palindromes found in the chicken c-src coding sequence, which had a 316-residue-long reading frame in the complementary strand. The positions of the 2 TCA Ser codons and the one CTA Leu codon in c-src are shown in the 4th row, as well as the position of the first Met codon, ATG, in the 316-residue-long open reading frame of the complementary strand.

Received June 15, 1990

6 References

[1] Ohno, S., in Smith, H. H. (Ed.), Evolution of Genetic Systems, Brookhaven Symposium No. 26, Gordon and Breach 1972, New York, No. 23, pp. 366-370.
[2] Doolittle, W. F. and Sapienza, C., Nature 1980, 284, 601-603.
[3] Orgel, L. E. and Crick, F. H., Nature 1980, 284, 604-607.
[4] Seto, Y., Viva Origino 1989, 17, 153-163.
[5] Ohno, S., J. Mol. Evol. 1987, 25, 325-329.
[6] Zavala, F., Tam, J. P., Cochrane, A. H., Quakyi, I., Nussenzweig, R. S. and Nussenzweig, V., Science 1985, 228, 1436-1440.
[7] Ozaki, L. S., Svec, P., Nussenzweig, R. S., Nussenzweig, V. and Gordon, G. N., Cell 1983, 34, 815-822.
[8] Ohno, S., Proc. Natl. Acad. Sci. USA 1988, 85, 9630-9634.
[9] Yomo, T. and Ohno, S., Proc. Natl. Acad. Sci. USA 1989, 86, 8452-8456.
[10] Ohno, S. and Yomo, T., Proc. Natl. Acad. Sci. USA 1990, 87, 1218-1222.
[11] Bird, A. P., Trends Genet. 1987, 3, 342-347.
[12] Beutler, E., Beutler, B., Gelbart, T., Han, J. and Koziol, J., Proc. Natl. Acad. Sci. USA 1989, 86, 192-196.
[13] Verhoeyen, M., Fang, R., Jou, W. M., Devos, R., Huylebroeck, D., Saman, E. and Fiers, W., Nature 1980, 286, 771-776.
[14] Jukes, T. H., Molecules and Evolution, Columbia University Press, New York 1966.
[15] Minghetti, P. P., Ruffner, D. E., Kuang, W. J., Dennison, O. E., Hawkins, J. W., Beattie, W. G. and Dugaiczyk, A., J. Biol. Chem. 1986, 261, 6141-6157.
[16] Greaves, D. R. and Patient, R. K., EMBO J. 1985, 4, 2617-2629.
[17] Hamada, H., Petrino, M. G. and Kakunaga, T., Proc. Natl. Acad. Sci. USA 1982, 79, 6465-6469.
[18] Vinson, C. R., Conover, S. and Adler, P. N., Nature 1989, 338, 263-264.
[19] Takeya, T. and Hanafusa, H., Cell 1983, 32, 881-890.
[20] Bost, K. L., Smith, E. M. and Blalock, J. E., Proc. Natl. Acad. Sci. USA 1985, 82, 1372-1375.
