Proc. Natl. Acad. Sci. USA Vol. 89, pp. 1105-1108, February 1992 Biochemistry
Intron-exon organization of the gene for the multifunctional animal fatty acid synthase (domain boundaries/evolution/gene hfsion/domain shuffling/intron phasing
CHRISTOPHER M. AMY, BRENDA WILLIAMS-AHLF, JURGEN NAGGERT*,
AND
STUART SMITHt
Children's Hospital Oakland Research Institute, 747 Fifty-second Street, Oakland, CA 94609
Communicated by Paul K. Stumpf, November 12, 1991 (received for review August 23, 1991)
directly with the T7 sequencing kit (Pharmacia) and adenosine 5'-[a-[35S]thiojtriphosphate (3000 Ci/mmol; 1 Ci = 37 GBq; Amersham) according to the manufacturer's instructions. Location of Exon-Intron Boundaries. All of the introns were sequenced and their boundaries were established by comparison with the cDNA sequence (12).
ABSTRACT The complete intron-exon organization of the gene encoding a multifunctional mammalian fatty acid synthase has been elucidated, and specific exons have been assigned to coding sequences for the component domains of the protein. The rat gene is interrupted by 42 introns and the sequences bordering the splice-site junctions universally follow the GT/AG rule. However, of the 41 introns that interrupt the coding region of the gene, 23 split the reading frame in phase I, 14 split the reading frame in phase 0, and only 4 split the reading frame in phase II. Remarkably, 46% of the introns interrupt codons for glycine. With only one exception, boundaries between the constituent enzymes of the multifunctional polypeptide coincide with the location of introns in the gene. The significance of the predominance of phase I introns, the almost uniformly short length of the 42 introns and the overall small size of the gene, is discussed in relation to the evolution of multifunctional proteins.
RESULTS AND DISCUSSION The rat fatty acid synthase gene was sequenced in its entirety and the location of introns was established (Fig. 1). The coding sequence is interrupted by 42 introns and the sequences bordering the splice-sitejunctions universally follow the GT/AT rule (14). Of the 41 introns in the gene, 23 split the reading frame in phase I, 14 split the reading frame in phase 0, and only 4 split the reading frame in phase I. Remarkably, 46% of the introns interrupt codons for glycine. It has been pointed out that many mosaic proteins, such as extracellular matrix proteins, blood coagulation proteins, and membrane-associated proteins are composed of modules that appear to have fused in various combinations (15). These modules, which encode discrete functional units such as epidermal growth factor domains, fibronectin domains, and kringle domains, are composed of exons flanked at both ends by introns in the same phase, and it has been suggested that this identical phasing could have promoted the coupling of functional domains through the process of intronic recombination by preserving a continuous open reading frame (15). Although from the sequence data available at present it is not possible to say whether the exons that constitute the multifunctional fatty acid synthase gene represent modules that have been shuffled in various combinations to produce other unique mosaic proteins, the exons flanked by phase I introns clearly cluster together in continuous groups, exons 3-6, 17-21, 26-27, 30-31, and 37-39 (Figs. 1 and 2). Similarly, exons flanked by phase 0 introns cluster together at exons 13-14, 23-24, and 41-42. These observations are consistent with common phasing having played a role in the fusing of some of the domains through intronic recombination. As is typical for a vertebrate gene (17) the first intron, which separates the single 5'-noncoding exon from the first coding exon, and the last exon are relatively long. Indeed, the entire 3'-noncoding region of the mRNA is represented in a single exon, the largest in the entire gene, showing that the 8.3- and 9.1-kilobase mRNA species transcribed in the rat arise not from alternative splicing but from differential utilization of two polyadenylylation signals (12). This conclusion has also been confirmed by amplification of rat spleen DNA by using primers flanking the polyadenylylation signal for the
The de novo synthesis of fatty acids from malonyl-CoA requires seven enzyme activities that in most bacteria and plants exist as discrete monofunctional polypeptides (1). However, in fungi these activities are distributed between two nonidentical polypeptides encoded in two unlinked genes (2-5), and in animals they are integrated into a single polypeptide chain (6, 7). The hypothesis that the multifunctional fatty acid synthase evolved by fusion of the genes for ancestral monofunctional enzymes is supported by the finding that several of its components can be proteolytically cleaved from the complex as discrete catalytically competent monofunctional proteins that exhibit amino acid sequence similarity with present-day monofunctional analogs (7-10). It has been anticipated that structural features of the animal fatty acid synthase gene such as the location of introns may provide clues as to how the gene fusion events occurred (9). The objective of this study therefore was to determine the nucleotide sequence of the rat fatty acid synthase gene and map the location of the introns, particularly in relation to the location of the various catalytic domains of the multifunctional proteinA
EXPERIMENTAL PROCEDURES Genomic Clone Isolation and Sequencing. Genomic libraries, prepared from partial EcoRI or Hae III digests of Sprague-Dawley rat DNA cloned into the EcoRI site of the A vector Charon 4a (11), were screened with probes containing parts of rat fatty acid synthase cDNA clones obtained from a lactating rat mammary gland library (12). Four overlapping clones from the two rat genomic libraries were identified that spanned the entire gene (13). Double-stranded genomic DNAs recloned in pUC vectors were sequenced
Abbreviation: nt, nucleotides. *Present address: The Jackson Laboratory, Bar Harbor, ME 04609. tTo whom reprint requests should be addressed. tThe sequence reported in this paper has been deposited in the GenBank data base (accession no. M84761).
The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.
1105
1106 No.
Proc. Nati. Acad. Sci. USA 89 (1992)
Biochemistry: Amy et al. Exon
location length
1
1
77
2
1265
136
3
1938
153
4
3114
174
5'-splice donor 77
313
Intron -
length 1187 537
566 Ada~ 540
3476
201
ATICWaoaa 884
175
6
3852
123
4062
116
89
8
4267
135
610
9
5012
463
ccacqt 367
87
ccaaXX
188
11
6018
190
12
6712
95
13
7014
135
14
7226
204
15
7535
116
16
7833
173
17
8077
192
18
8343
81
19
8516
180
3186
AAGLZAg
260-Gly
ctacowmITG 1579
0
298-Lys/Val 343-Lys/Val
I
498-Gy
0
1767
504
ttatagM= 2052
105 2506
182
2679
3B71 ~ACT
21
9106
204
22
9571
287
23
10250
387
24
10719
165 119
26
11163
155
27 28
11424
207
11834
151
29
12066
179
30
12332
120
768-Gin/Ala
caccaaual 2690
3'
807-Gy
2507
865-Asp
ctgc A; 2953
I
92
ctacsgA4W 3133
I
157
ctgcqmw3 3313
I
929-Giy 956-Gly 1016-Gly
I
1076-Val
I
1144-Gly
2372
261
cc-atITX 3517 tcccagCl1
Go0G
392
cacc2
4A35 CGAon 4474
82
GATI1gtagga
82
4190
10962
gtgtLfl2C
74
73
303
25
0
0
gtgcagmw= 2391
I
332
Cc~aiAtagct 231S6
0
ttaceXCXC
3132 180
ctgcqfl136 Z187
560-GIn/ile 624-Gly 655-Gln/Ala 700-Lys/Val
1967
71
2952
8853
487
5266
3904
ati 4191
0
1239-Glu/Val
0
1368-Gln/Glu 1423-Lys/Ser
4356 78
ctgcmkT=
0
4475
II
1463-~rg
106
cctcgCA
I
1515-As
203
acaci3AT 498
I
1584-Gly
81
cctcag2XAr
II
1634-Trp 1694-Gly 1734-Gly 1775-Gly
4629
636 mcaw
iMIMI
4630
.37
87
3167 tgacal33
I
115
ccgtr323 5410
I
5409
5287
atacrA1U 5634 atgcag=%= 3936
0
I
1849-Gln/Va 1917-Gly
ttctagm
0
1967-?*t/Val
la
I
1998-Ag 2049-Gly 2131-Gly
I
2194-Glu
I
2270-Ala
0
2343-Gin/Ser 2376-Lys/Val 2460-Gln/Val
31
12567
123
32
12773
224
33
13345
202
34
13662
152
111
35
13925
92
167
36
14184
152
8o
37
14416
246
38
14861
189
178
tcttgm 6478 tgtta:cA 6667 tta p=l
I
39
15228
228
88
tta
40
15544
221
118
41
15883
99
42
16162
252
43
16496 1676
CIC~ftatga 5633 Got
am 37 15w
83
348 115
199 694r im
Kam
7466
152-Gy 219-Gly
0
tcgcaTltx
77
43-Gly 94-Gly
ctgcagG=
237
207
Amino Acid
I
9a51
118
AlTCCA3gatgc 1956
2390
20
I
742 ctcta3AMZ. 8am m6
3578 5593
1 I
1766
10
phase I
541 188
7
Ad
Codon
ccac 214
1023
741
5
3-splice
acceptor
180 82
ccata:E 59"
Mi6
7215 tctagGTOM 7467 ttac X
I
0 0
FIG. 1. Location and characteristics of introns in the rat fatty acid synthase gene. The position of introns between codons is indicated by phase 0, interruption after the first base is indicated by phase I, and interruption after the second base is indicated by phase II; the amino acid(s) encoded at the splice site is indicated. The locations of the splice junctions in the cDNA sequence are indicated as superscripts. Invariant nucleotides at the splice sites are in boldface type.
smaller mRNA species; only fragments with the lengths predicted by the cDNA and genomic clone sequences were produced (details not shown). Two elements of the short interspersed repetitive DNA sequence type (16, 18) are associated with the fatty-acid synthase gene, one between the enoyl reductase and ketoreductase coding sequences [nucleotides (nt) 13154-13254], the other, in the opposite orienta-
tion, in the 5' flanking region (nt -3724 to -3833). The elements exhibit 91% sequence identity, both contain A-rich 3' ends, and both are flanked by direct DNA repeats (AAACCTGATGACCC and AGACATT, respectively). Earlier we proposed precise locations for the functional components of the animal fatty acid synthase based on the hypothesis that the individual catalytic domains should exhibit significant sequence similarity with monofunctional analogs and might be expected to be joined by mutable surface loops, which would be susceptible to proteolytic attack (10). Indeed, similar features have been found at the boundaries ofother multifunctional proteins (19) and it seems likely that fusion of the genes for the primordial monofunctional proteins in a way that results in joining of the polypeptides via a hydrophilic surface loop region is likely to preserve the three-dimensional structure and biological function of each of the fused components. Of the seven functional components of the animal fatty acid synthase, only the dehydrase could not be precisely located, although we suggested that it was likely to be positioned somewhere between the transferase and enoyl reductase domains (10). More recently, S. Donadio and L. Katz (personal communication) and P. F. Leadlay and colleagues (personal communication) have attempted to locate the interdomain boundaries of the multifunctional polyketide synthase of Saccharopolyspora erythraea by similar criteria. The predicted amino acid sequence, the size, and the organization of the individual domains of this protein are strikingly similar to that which we have elucidated for the animal fatty acid synthase, except that the polyketide synthase likely contains not one but multiple fatty acid synthase-like modules, and several of these modules appear to lack specific functional domains. The conclusions reached by these workers concerning the predicted size and location of the functional domains of the polyketide synthase agree remarkably well with the model we advanced earlier for the animal fatty acid synthase. In addition, since in the case of the polyketide synthase only one of the modules contains a dehydrase function, S. Donadio and L. Katz (personal communication) were able to deduce that this enzymatic function is encoded within 140-170 amino acids immediately downstream of the transferase domain. A stretch of 100 amino acids within the corresponding region of the animal fatty acid synthase (residues 875-975) exhibits marginally significant sequence similarity with both the putative dehydrase region of the polyketide synthase and the 170-residue monofunctional 13-hydroxydecanoyl thioester dehydrase that provides the branch point to unsaturated fatty acids in Escherichia coli (ALIGN scores 3-4 SD). This evidence supports our original suggestion that the dehydrase domain of the animal fatty acid synthase is located between the transferase and enoyl reductase and suggests that the most likely location is close to the C terminus of the transferase in a region encoded by exons 16-19. Again, the boundaries of the encoded putative dehydrase domain are marked by a predicted surface region at the N-terminal end and by a highly mutable predicted surface region at the C-terminal end (details not shown). Because of the tentative nature of the evidence relating to the location of the dehydrase domain, it is not possible to pinpoint precisely the location of the associated boundaries. However, for the remaining functional components of the complex their boundaries can be predicted with confidence and, with only one exception, appear to coincide with the location of introns in the gene (Fig. 2). The 3' ends of exons 28 and 37 and the 5' ends of exons 33 and 39 encode mutable amino acid sequences, which, we have deduced, define interdomain boundaries (10). Thus, exons 29-32 appear to encode the enoyl reductase, exons 33-37 encode the ketoreductase, exon 38 encodes the acyl carrier protein, and exons 39-42 encode the thioesterase; exon 42 also includes the
Biochemistry: Amy et al. 2
4 5 77
3
1
6
/
/
I.
10
12 SINE 14 16 23 25 27 29 31 33 35 37 39 41
/ /
cr
'
8 13 15 1719 21
11
9
/ 581
161 /
/
Proc. Natl. Acad. Sci. USA 89 (1992)
/
/
/ ,
1
/
/
/I LXGXCeXG
/ ,
188
XGXXG
1107
18 kb 43
exons
2 5cr
.
I
1r1000 500 EDehy rase Ketoacyl Synthase Trans ferase
1
1500
2000
l
acid 2500 amno residues
Enoylreductase ACP Ketore uctase Thioesterase
FIG. 2. Map of the rat fatty acid synthase gene and the encoded multifunctional polypeptide. Exons are shown as boxes separated by introns depicted as lines. Exons flanked at both 5' and 3' ends by phase I introns are solid boxes, those flanked at both ends by phase 0 introns are darkly shaded boxes, and those flanked by introns of mixed phasing are lightly shaded boxes. SINE, short interspersed repetitive DNA sequence (16). The location of interdomain boundaries in the multifunctional polypeptide (10) is shown in relation to the position of introns in the gene. The locations of active site residues for the ketoacyl synthase, transferase, and thioesterase; the nucleotide-binding site motifs; and the site of 4'-phosphopantetheine attachment in the acyl carrier protein (ACP) domain are also marked; the location of the dehydrase coding region is tentative (see text). Exons, introns, and protein domains are drawn to scale. kb, Kilobases.
entire 3'-noncoding region. A mutable amino acid sequence corresponding to the C-terminal end of the malonyl/acetyl transferase is also encoded in a region of the gene interrupted by an intron (intron 15). However, there is no intron lQcated in the region of the gene encoding the mutable amino acid sequence marking the boundary between the transferase and the ketoacyl synthase; this boundary is apparently encoded =200 nt from the 5' end of exon 9. Exon 9 is, however, the largest in the coding sequence and it is possible that an intron marking this interdomain boundary- has been lost. The central region of the fatty acid synthase, encoded by exons 20-28, has not yet been assigned a functional role. The region is the least conserved in the multifunctional polypeptide and may play a role in stabilizing subunit interactions (10). An extensive, poorly conserved region of unknown function is also present in the corresponding location of each of the fatty acid synthase-like modules that constitute the polyketide synthase of S. erythraea (20). The two readily identifiable reductases contain, in their first exons, the Gly-Xaa-Ser-Xaa-Gly motif characteristic of nucleotide-binding domains (10). The coding sequences for the motif in the enoyl reductase and ketoreductase domains begin 108 nt into exon 29 and 109 nt into exon 33, respectively. The two nucleotide-binding domains exhibit significant sequence homology with the corresponding domains of other pyridine nucleotide-requiring enzymes, and secondary structure analysis predicts the presence of a/3 subdomains typical of a mononucleotide-binding fold (10). In common with other pyridine nucleotide enzymes, the nucleotidebinding structural modules of the multifunctional fatty acid synthase are not encoded by single exons. However, the sizes of the introns within the reductase domains of the fatty acid synthase are considerably smaller than those found, for example, in other vertebrate dehydrogenases (21). If the nucleotide-binding domains of the multifunctional fatty acid synthase are descendants of the same primitive mononucleotide-binding fold, encoded within several exons, that has fused with a variety of substrate-binding domains to give rise to many different enzymes with different specificities (22), then it would appear that the introns within the reductase coding regions must subsequently have been shortened. Indeed, perhaps the most remarkable feature of the fatty acid synthase gene is the overall relatively short length of the introns. The mean length of introns that interrupt the coding regions of vertebrate genes has been estimated as 1127 nt (17); in contrast, the mean length of introns within the coding region of the fatty acid synthase gene is only 191 nt. The significance of the prevalence of shortened introns within the gene is not immediately clear. On fusion of the seven genes
encoding the component activities of the fatty acid synthase it was probably advantageous that individual promoter and enhancer regions be removed to avoid deleterious effects resulting from the presence of multiple regulatory sequences positioned throughout the gene. This process could have resulted in a shortening of a number of the intronic regions. However, the fatty acid synthase introns are almost uniformly short, whether they are located at the 5' end of individual enzyme coding regions or whether they interrupt the internal coding regions of the enzymes, suggesting that perhaps there has been additional selective pressure, either on the size of individual introns or on the upper size limit for the gene. Other possible pressures might have been related to the chromosomal location of the gene or the need to have a compact gene that could be transcribed quickly and efficiently. Packaging of a highly fragmented gene, which encodes a long mRNA into a relatively modest stretch of DNA could avoid the extensive chromatin decondensation that would otherwise be required for activation of a gigantic gene and might allow more rapid transcription in response to regulatory stimuli. The fatty acid synthase gene is highly regulated in a tissue-specific manner according to the nutritional, hormonal, and developmental status of the animal, and both its mRNA and the encoded protein are highly abundant in tissues active in lipogenesis (13). Fatty acid synthases have been found in eukaryotes in two multifunctional forms, the a2 animal form, and the a436 yeast form. The component enzymes of the two types of complex exhibit little sequence similarity and are connected structurally in completely different ways [in yeast: a, acyl carrier protein-ketoreductase-ketoacyl synthase; f, acetyl transferase-enoyl reductase-dehydrase-malonyl/palmitoyl transferase (23)], indicating that they have most likely evolved by separate series of gene fusion events. Until recently, it had seemed likely that the two independent series of gene fusions took place after divergence of the yeast and animal lines. However, the finding that the multifunctional proteins required for polyketide synthesis in S. erythraea and fatty acid synthesis in animals are composed of functional domains of similar size and amino acid sequence linked in the same order [ketoacyl synthase-transferase-dehydrase-enoyl reductaseketoreductase-acyl carrier protein (20, 24)] directly challenges this view and raises the distinct possibility that the gene fusion events leading to a common polyketide synthase/ fatty acid synthase module took place prior to the divergence of prokaryotes and eukaryotes. Most of the clearest examples of exon shuffling involve genes that are unique to vertebrates and that presumably appeared relatively late in eukaryotic evolution (15, 25). However, if the introns located at the
1108
Biochemistry: Amy et al.
boundaries of the coding regions for the constituent enzymes of the fatty acid synthase are indeed relics of the process of assembly of the multifunctional polypeptide, then it seems likely that this assembly process occurred early in evolution, before separation of the prokaryote and eukaryote lines. Of course, the possibility that the animal-type fatty acid synthase module was acquired by prokaryotes by a relatively late "horizontal" gene transfer mechanism cannot be entirely ruled out at present but could be evaluated once additional sequences of related polyketide and fatty acid synthases become available. We thank Drs. R. F. Doolittle (University of California, San Diego) and K. Paigen (The Jackson Laboratory, Bar Harbor, ME) for their helpful comments. We also are very grateful to Drs. S. Donadio and L. Katz (Abbott Laboratories, IL) and Dr. P. F. Leadlay (University of Cambridge, Cambridge, U.K.) for generously sharing with us their unpublished results. The work was supported by Grant DK 16073 from the National Institutes of Health. 1. Alberts, A. W. & Greenspan, M. D. (1984) in Fatty Acid Metabolism and Its Regulation, ed. Numa, S. (Elsevier, Amsterdam), pp. 29-58. 2. Schweizer, E., Muller, G., Roberts, L. M., Schweizer, M., R6sch, J., Wiesner, P., Beck, J., Stratmann, D. & Zauner, I. (1987) Fat Sci. Technol. 89, 570-577. 3. Schweizer, M., Roberts, L. M., Holtke, J., Takabayashi, K., Hollerer, E., Hoffmann, B., Muller, G., Kottig, H. & Schweizer, E. (1986) Mol. Gen. Genet. 203, 479-486. 4. Chirala, S. S., Kuziora, M. A., Spector, D. M. & Wakil, S. J. (1987) J. Biol. Chem. 262, 4231-4240. 5. Mohamed A. H., Chirala, S., Mody, N. H., Huang, W.-Y. & Wakil, S. (1988) J. Biol. Chem. 263, 12315-12325. 6. Smith, S. & Stern, A. (1979) Arch. Biochem. Biophys. 197, 379-387.
Proc. Natl. Acad. Sci. USA 89 (1992) 7. Tsukamoto, Y., Wong, H., Mattick, J. S. & Wakil, S. J. (1983) J. Biol. Chem. 258, 15312-15322. 8. Smith, S., Agradi, E., Libertini, L. & Dileepan, K. N. (1976) Proc. NatI. Acad. Sci. USA 73, 1184-1188. 9. Hardie, D. G. & Coggins, J. R., eds. (1986) Multidomain Proteins-Structure and Evolution (Elsevier, Amsterdam). 10. Witkowski, A., Rangan, V. S., Randhawa, Z. I., Amy, C. M. & Smith, S. (1991) Eur. J. Biochem. 271, 571-578. 11. Sargent, T., Wu, J., Sala-Trepat, J. M., Wallace, R. B., Reyes, A. A. & Bonner, J. (1979) Proc. Natl. Acad. Sci. USA 76, 3256-3260. 12. Amy, C., Witkowski, A., Naggert, J., Williams, B., Randhawa, Z. & Smith, S. (1989) Proc. Natl. Acad. Sci. USA 86, 31143118. 13. Amy, C. M., Williams-Ahlf, B., Naggert, J. & Smith, S. (1990) Biochem. J. 271, 675-679. 14. Senapathy, P., Shapiro, M. B. & Harris, N. L. (1990) Methods Enzymol. 183, 252-278. 15. Patthy, L. (1991) Curr. Opin. Struct. Biol. 1, 351-361. 16. Daniels, G. R. & Deininger, P. L. (1985) Nature (London) 317, 819-822. 17. Hawkins, J. D. (1988) Nucleic Acids Res. 16, 9893-9908. 18. Jelenik, W. R. & Schmid, C. W. (1982) Annu. Rev. Biochem. 51, 813-844. 19. Crawford, I. P., Clarke, M., Cleemput, M. V. & Yanofsky, C. (1987) J. Biol. Chem. 262, 239-244. 20. Donadio, S., Staver, M. J., McAlpine, J. B., Swanson, S. J. & Katz, L. (1991) Science 252, 675-679. 21. Fukasawa, K. M. & Li, S. S.-L. (1987) Genetics 116, 99-105. 22. Rossmann, M. G., Powers, D. A. & Sofer, W. (1977) in Pyridine-nucleotide-dependent Dehydrogenases, ed. Sund, H. (de Gruyter, Berlin), pp. 3-30. 23. Schweizer, E., Mfller, G., Roberts, L. M., Schweizer, M., Rosch, J., Wiesner, P., Beck, J., Stratmann, D. & Zauner, I. (1987) Fat Sci. Technol. 89, 570-577. 24. Cortes, J., Haydock, S. F., Roberts, G. A., Bevitt, D. J. & Leadlay, P. F. (1990) Nature (London) 348, 176-178. 25. Doolittle, R. F. (1987) Trends Biochem. Sci. 10, 233-237.