Gene, 105 (1991) 61-72
© 1991 Elsevier Science Publishers B.V. All rights reserved. 0378-1119/91/$03.50

GENE 06001

Low-usage codons in Escherichia coli, yeast, fruit fly and primates

(Recombinant DNA; GenBank; codon bias; Saccharomyces cerevisiae; Drosophila melanogaster; Homo sapiens)

Shiping Zhang a,*, Geoffrey Zubay a and Emanuel Goldman b

a Fairchild Center for Biological Sciences, Columbia University, New York, NY 10027 (U.S.A.) Tel. (212)854-4578; and b Department of Microbiology and Molecular Genetics, UMDNJ - New Jersey Medical School, Newark, NJ 07103 (U.S.A.)

Received by W.M. Holmes: 12 September 1990
Revised/Accepted: 17 April/31 May 1991
Received at publishers: 18 June 1991

SUMMARY

Codon usage is compared between four classes of species, with an emphasis on characterization of low-usage codons. The classes of species analyzed include the bacterium Escherichia coli (ECO), the yeast Saccharomyces cerevisiae (YSC), the fruit fly Drosophila melanogaster (DRO), and several species of primates (PRI) (taken as a group; includes eleven species for which nucleotide sequence data have been reported to GenBank; however, greater than 90% of the sequences were from Homo sapiens). The numbers of protein-coding sequences analyzed were 968 for ECO, 484 for YSC, 244 for DRO, and 1518 for PRI. Three methods have been used to determine low-usage codons in these species. The first and most common way of assessing codon usage is by summing the number of times codons appear in reading frames of the genome in question. The second way is to examine the distribution of usage in different genes by scoring the number of protein reading frames in which a particular codon does not appear. The third way starts with a similar notion, but instead considers combinations of codons that are missing from the maximum number of genes. These three methods give very similar results. Each species has a unique combination of eight least-used codons, but all species contain the arginine codons, CGA and CGG. The agreement between YSC and PRI is particularly striking as they share six low-usage codons. All six carry the dinucleotide sequence, CG. The eight least-used codons in PRI include all codons that contain the CG dinucleotide sequence. Low-usage codons are clearly avoided in genes encoding abundant proteins for ECO, YSC and DRO. In all species, proteins containing a high percentage of low-usage codons could be characterized as cases where an excess of the protein could be detrimental. Low codon usage is relatively insensitive to gross base composition. However, dinucleotide usage can sometimes influence codon usage. This is particularly notable in the case of CG dinucleotides in PRI.

Correspondence to: Dr. E. Goldman, Dept. of Microbiology, New Jersey Med. School, 185 S. Orange Ave., Newark, NJ 07103 (U.S.A.) Tel. (201)456-4367; Fax (201)456-3644.

* Current address: Biology Department, Brookhaven National Laboratory, Upton, NY 11973 (U.S.A.)

Abbreviations: aa (a.a. in Tables), amino acid(s); bp, base pair(s); DRO, Drosophila melanogaster; ECO, Escherichia coli; PRI, primates; r, ribosomal; STP, stop codon; YSC, yeast Saccharomyces cerevisiae.

INTRODUCTION

Amino acids (aa) that are represented by more than one codon usually do not use synonym codons equally (e.g., Grantham et al., 1981; Gouy and Gautier, 1982). Indeed, the differential use of codons is most striking and species-specific and raises many interesting possibilities and concerns (Andersson and Kurland, 1990). Studies on ECO and YSC have shown that high abundance proteins show a sharp avoidance of codons that are in low usage in the overall gene population (Post et al., 1979; Ikemura, 1981; 1982; Bennetzen and Hall, 1982), a finding that has led to the suggestion that low-usage codons may be more difficult to translate (Grosjean and Fiers, 1982; Konigsberg and Godson, 1983; Kurland, 1987). This suggestion is supported by the observation that cognate tRNA abundance is roughly proportional to codon usage for most codons (Ikemura, 1985) and that rates of translation can vary with concentrations of charged tRNA, at least in ECO (Rojiani et al., 1990). Direct assays of translatability lend some support to this view (Robinson et al., 1984; Bonekamp et al., 1985; Carter et al., 1986; Hoekema et al., 1987; Sorensen et al., 1989; Curran and Yarus, 1989; Chen and Inouye, 1990). Further, expression of at least one heterologous protein in ECO required oligodeoxynucleotide synthesis of the entire gene with high usage synonym codons of ECO, since message containing the naturally occurring codons of the gene (very rich in ECO's low-usage codons) failed to support measurable translation (Abate et al., 1990; T. Curran, personal communication).

Whereas papers summarizing codon usage have appeared sporadically (e.g., Grantham et al., 1981; Ikemura, 1985; Sharp and Li, 1986; Sharp et al., 1988; Wada et al., 1990), the amount of information keeps increasing at the rate of about 2 × 10^6 bp per year. This necessitates periodic reevaluation and updating of codon usage. There is also the need for new ways of looking at codon usage which requires new perspectives and new computer programs. For example, Gutman and Hatfield (1989) have considered the problem from the perspective of frequency of appearance of codon pairs. For the purpose of this study and to stay within reasonable practical limits, we will confine the information presented in this paper to ECO, YSC, DRO, and PRI. Mitochondrial genomes are not included in this analysis.

Here, we shall describe different ways of estimating codon usage for these four classes of organisms. In addition to serving as an update of previous papers, a new way of evaluating codon usage is introduced and the relationship between codon usage and dinucleotide frequency is addressed. A major purpose of this paper is to give an accurate assessment (as of September 1990) of low-usage codons in different species. A second and more elusive goal is to seek explanations for the evolutionary choices that have been made.

RESULTS AND DISCUSSION

(a) Criteria for identifying low-usage codons

We shall define codon usage as the number of times a codon is translated per unit time. This has not been measured directly in vivo, so estimates of codon usage are obtained by indirect observations. It should be appreciated that codon usage is likely to be different under different conditions of growth. For example, under conditions of rapid growth when there are plenty of nutrients, the overall rate of protein synthesis will be maximal and the synthesis of proteins that are needed for rapid growth should be favored. By contrast, under conditions of nutrient limitation, a new set of proteins is most likely to dominate metabolism and the overall rate of protein synthesis would be reduced (Ingraham et al., 1983). We will describe three ways of estimating codon usage. Each of them has its advantages and drawbacks. Fortunately, they all give approximately the same answer as regards the hierarchy for 'most used' and 'least used' codons within each synonymous codon family.

(1) Sums of codon appearance

The most common way of measuring codon usage is by summing the number of times codons appear in the reading frames of the genome. This approach is summarized in Table I for the four different types of organisms under discussion. This method should over-estimate codons used infrequently and under-estimate codons used frequently. The reason for this is that, when averaging over the entire genome, no weight is given to the number of times different reading frames are used, which is reflected in the variable quantities of protein products. This weighting is a complex resultant of transcriptional efficiency, message stability, and translational efficiency. In general, this weighting is not precisely known but frequently it can be roughly estimated from the amount of gene-encoded protein product.

(2) Absence of particular codons in genes

A second way of estimating codon usage is to examine the distribution of usage in different genes. We have done this by scoring the number of protein reading frames in which a particular codon does not appear. In Table II, these data are directly compared for the four species under consideration. A large number indicates a narrow distribution for the codon in question. No attempt has been made to weight this method by scoring the number of times a codon appears in a gene. Only the presence or absence of a codon within a gene has been scored. This approach to measuring codon usage is based on the prediction that the most used codons should have the broadest distribution and the least used codons should have the narrowest distribution. The data in Table II are presented in a way that is most convenient for making comparisons of relative usage between different organisms. For this purpose, the numbers have been normalized to those actually observed for DRO. The normalization factors are given in the footnote of Table II.
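The first two estimates can be illustrated with a short Python sketch. It is not the authors' program (which, according to the footnote of Table I, was written in C); the function names `split_codons`, `codon_sums` and `absence_counts` and the toy reading frames are hypothetical, and each reading frame is assumed to be an in-frame RNA string.

```python
from collections import Counter

def split_codons(orf):
    """Split an in-frame coding sequence into successive triplets.

    Any trailing partial codon (1 or 2 bases) is ignored.
    """
    return [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]

def codon_sums(orfs):
    """Method 1: total appearances of each codon over all reading frames."""
    totals = Counter()
    for orf in orfs:
        totals.update(split_codons(orf))
    return totals

def absence_counts(orfs, codons):
    """Method 2: number of reading frames in which each codon never appears.

    Only presence or absence is scored; repeated use of a codon within
    one reading frame carries no extra weight.
    """
    present = [set(split_codons(orf)) for orf in orfs]
    return {c: sum(c not in s for s in present) for c in codons}

# Toy reading frames (hypothetical, for illustration only).
toy_orfs = ["AUGCUGAAACUGUAA", "AUGAGGCUGUGA", "AUGAAAAAAUAA"]

sums = codon_sums(toy_orfs)
print(sums["CUG"], sums["AGG"])          # 3 1
print(absence_counts(toy_orfs, ["CUG", "AGG", "AAA"]))
# {'CUG': 1, 'AGG': 2, 'AAA': 1}
```

In this toy set, AGG has both the lowest sum and the highest absence count, which is the kind of agreement between the two estimates that the text describes.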
(3) Combinations of excluded codons

The third way in which we have estimated codon usage starts with a similar notion to that used in the second

TABLE I
A comparison of codon usages a among four different species b

[Table body not reproduced: the per-codon counts for ECO, YSC, DRO and PRI, laid out in the standard genetic-code grid, are scrambled in this extraction and cannot be reliably reassigned to codons. Legible header fragments: 'D.', 'PRI:', 'total codons: 608091,', 'total proteins: 1518'.]

a DNA sequences analyzed were from GenBank version 63 (15 March 1990). Reading frames were based on 'pept' labels (or 'CDS' labels in the current GenBank format) in the FEATURE sections. Repeating sequences were excluded. Computer programs used are written in C and run on a Sun3.
b The species analyzed include the bacterium E. coli (ECO), the yeast S. cerevisiae (YSC), D. melanogaster (DRO), and several species of primates (PRI) (taken as a group; includes eleven species for which DNA sequence data have been reported; however, greater than 90% of the sequences were from man).

method but instead considers combinations of codons that are missing from the maximum number of genes. The assumption here is that combinations of low-usage codons should be excluded from the most active genes, because the more low-usage codons present in a gene, the greater the potential reduction in the amount of translated product. Although this remains to be experimentally tested in a rigorous fashion, there are a number of reasons to predict this result (see INTRODUCTION). The program searches for combinations of codons (two or more) that are excluded from the maximum number of reading frames of the species in question. To make the computer calculations practical, only the 20 least used codons defined by method 2 above were used in the pool to determine combinations of least-used codons. In Table III, results are presented for combinations of up to eight codons with the least favored incidence; the number of protein reading frames that exclude the combination is given at the left in the table. Frequently, there is little difference between first, second and third (not shown) choices, indicating a degree of uncertainty in giving priority to the choices. For ECO, there is a stepwise addition of a new low-usage codon at each stage as the size of the combination increases. This is also true for YSC except after stage 7 (combinations containing seven codons) where GGG is eliminated and AGG and ACC are simultaneously added. For DRO, a similar stepwise pattern of addition is seen as for ECO. The situation is far more complex and intriguing for PRI. The

TABLE II
Relative numbers a of proteins not using a particular codon in four species

[Table body not reproduced: the normalized counts for each of the 64 codons in ECO, YSC, DRO and PRI are scrambled in this extraction and cannot be reliably reassigned to codons.]

a Normalization factor for each species is the ratio of total DRO proteins to the total proteins of the species. Total proteins used are (normalization factors in parentheses): DRO, 244 (1); ECO, 968 (0.252); YSC, 484 (0.504); PRI, 1518 (0.161). To obtain the actual (instead of relative) number of proteins in each species not using a particular codon, multiply the numbers in the table by the reciprocal of the normalization factor. For example, out of the 968 ECO proteins analyzed, the actual number of ECO proteins not using the CUA codon is 127 (from the table) times 3.968 (reciprocal of normalization factor 0.252) = 504.
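The normalization described in the footnote of Table II can be checked numerically. The following is a minimal sketch using only the protein totals quoted in that footnote; the names `TOTAL_PROTEINS`, `normalization_factor` and `actual_count` are hypothetical and introduced here for illustration.

```python
# Protein totals as quoted in the footnote of Table II.
TOTAL_PROTEINS = {"DRO": 244, "ECO": 968, "YSC": 484, "PRI": 1518}

def normalization_factor(species):
    """Ratio of total DRO proteins to the total proteins of `species`."""
    return TOTAL_PROTEINS["DRO"] / TOTAL_PROTEINS[species]

def actual_count(relative, species):
    """Convert a normalized table entry back to an actual protein count
    by multiplying with the reciprocal of the normalization factor."""
    return relative / normalization_factor(species)

for sp in TOTAL_PROTEINS:
    print(sp, round(normalization_factor(sp), 3))
# DRO 1.0, ECO 0.252, YSC 0.504, PRI 0.161

# Worked example from the footnote: ECO proteins not using CUA.
print(round(actual_count(127, "ECO")))   # 504
```

The rounded factors reproduce the values in the footnote, and the CUA example comes out at 504 of the 968 ECO proteins.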

four codons at stage 4 are eliminated at stage 5 never to reappear as part of the favored low-usage combination at higher stages. It should be noted that these four codons are all of the UA type. At stage 5, all four codons ending in CG emerge along with one codon, CGU, carrying a CG in the first two positions. At all future stages, codons containing

TABLE III
Numbers of proteins not using a codon or combination of codons a

A. ECO (total proteins: 968)
Numbers of proteins excluding the favored combination, as legible in this extraction: 714, 536, 352, 255, 203, 159, 129, 93.

B. YSC (total proteins: ...)

[Codon combinations not reproduced: the codon columns are scrambled and the table is truncated in this extraction.]

a The indicated codons are absent ...
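The combination search of method 3, whose results Table III reports, can be sketched in Python as an exhaustive search over a small codon pool. This is an illustration under stated assumptions, not the authors' C program: the function name `best_excluded_combination` and the toy data are hypothetical, and a reading frame is assumed to "exclude" a combination when it uses none of the codons in it. Ties are resolved in favor of the first combination found, whereas the text notes that first, second and third choices often differ little.

```python
from itertools import combinations

def best_excluded_combination(codon_sets, pool, size):
    """Return (combo, count) for the subset of `pool` with `size` codons
    that is entirely absent from the largest number of reading frames.

    codon_sets: one set of used codons per reading frame.
    pool: candidate low-usage codons (the text restricts this to the
          20 least used codons from method 2 to keep the search practical).
    """
    best_combo, best_count = None, -1
    for combo in combinations(pool, size):
        members = set(combo)
        # A frame excludes the combination if it shares no codon with it.
        count = sum(members.isdisjoint(used) for used in codon_sets)
        if count > best_count:
            best_combo, best_count = combo, count
    return best_combo, best_count

# Toy data (hypothetical): codons used by four reading frames.
toy_sets = [
    {"AUG", "CUG"},
    {"AUG", "AGG"},
    {"AUG", "AGA", "CUA"},
    {"AUG", "CUA"},
]
toy_pool = ["AGG", "AGA", "CUA"]

for k in (1, 2):
    print(k, best_excluded_combination(toy_sets, toy_pool, k))
# 1 (('AGG',), 3)
# 2 (('AGG', 'AGA'), 2)
```

With a pool of 20 codons and combinations of up to eight, the largest stage enumerates 125970 subsets, which is consistent with the text's remark that limiting the pool keeps the calculation practical.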
