Gene, 110 (1992) 245-249

© 1992 Elsevier Science Publishers B.V. All rights reserved. 0378-1119/92/$05.00


GENE 06224

Analysis of the 5'-AAUAAA motif and its flanking sequence in human RNA: relevance to cDNA library sorting

(Polyadenylation; messenger; database; human genome)

Ian N.M. Day

University Clinical Biochemistry, Level D, South Laboratory and Pathology Block, Southampton General Hospital, Southampton (U.K.)

Received by M. Salas: 7 March 1991
Revised/Accepted: 21 August/22 August 1991
Received at publishers: 10 October 1991

SUMMARY

The motif, N-8..N-1AAUAAAN1..N8 (where N is A, C, G or U), and its flanking sequence in human mRNA were examined by database analysis. Approximately 20% of 5'-AAUAAA in 3'-noncoding regions appear not to direct mRNA cleavage-polyadenylation. In coding regions, Asn-Lys, Ile-Lys and Ile-Asn are proven not to be unfavourable, and AAUAAA is not an unfavourable choice of coding sequence, occurring in 16% of mRNAs. Neither immediate flanking sequence nor associated motifs bear sufficient information content to account for the cleavage specificity observed. The unusual distribution and properties of the motif, AATAAA, in cDNA invite novel strategies for sorting cDNA libraries.

INTRODUCTION

Much current interest surrounds schemes to obtain a complete description of the human genome and of the genes encoded (Watson, 1990). CpG-rich islands have provided a powerful mapping tool and often mark the 5' boundaries of genes (Bird, 1986). At the genomic level, the cleavage-polyadenylation motif 5'-AATAAA is the most distinctive marker of the 3' boundary of most genes (Proudfoot and Brownlee, 1976; Proudfoot and Whitelaw, 1989). Proudfoot and Brownlee (1976) were the first to report that the sequence AAUAAA occurs consistently at a position (5-30 nt) preceding the polyadenylation tail of messenger RNAs. It has since been established that this sequence is of primary importance in directing two independent but related events, recently reviewed by Proudfoot and Whitelaw (1989): firstly, cleavage of the primary transcript and, secondly, polyadenylation at the new 3' terminus thus created. The sequence 5'-AAUAAA acts as an operational boundary for 80-90% (Proudfoot and Whitelaw, 1989) of the estimated 30000-100000 human genes (AUUAAA being used instead in 10-15%) contained in the 3 × 10^9 bp human genome (Human Gene Mapping Conference 10, 1989). Using the measure of uncertainty

$$H(L) = -\sum_{N=A}^{T} f(N, L) \log_2 f(N, L)$$

bits per nt (N) [where N = A, C, G or T (or U); and f(N, L) is the frequency of N at position L] (Schneider et al., 1986) and human average nt frequencies (Shapiro, 1976), AAUAAA is calculated to carry approx. twelve bits of information compared with a random hexanucleotide. Reassociation studies, detailed analyses of various genomic regions, and estimates of gene number (Human Gene Mapping Conference 10, 1989; Bantle and Hahn, 1976) have suggested genome primary transcript complexities summing to perhaps 3 × 10^8 nt and mature spliced transcript complexities of perhaps 3 × 10^7 nt. To define 30000 cleavage sites within such pools would require at least 13.3 bits, i.e., $\log_2(3 \times 10^{8} / 30000)$, or 10.0 bits of information, respectively. Cleavage may precede splicing, AAUAAA has been noted in coding regions, and AUUAAA can also direct cleavage (Proudfoot and Whitelaw, 1989); therefore even the estimate of 13.3 bits is probably conservative, and additional information must be associated with true cleavage sites. Various associated motifs have been implicated (Proudfoot and Whitelaw, 1989; McLauchlan et al., 1985). This work examines by database analysis the occurrence and nature of the sequence N-8..N-1AAUAAAN1..N8 (as defined in footnote b of Table II) in human mRNA, both to define further the mechanism of cleavage-polyadenylation specificity and to explore its possible application as a 'file handle' for cDNA sorting strategies.

Correspondence to: Dr. I.N.M. Day, Room LD59, University Clinical Biochemistry, Southampton General Hospital, Tremona Rd, Southampton S09 4XY (U.K.) Tel. (44-703) 796871; Fax (44-703) 704062.

Abbreviations: aa, amino acid(s); bp, base pair(s); kb, kilobase(s) or 1000 bp; N, A, C, G or T (U); nt, nucleotide(s); oligo, oligodeoxyribonucleotide; PCR, polymerase chain reaction; UWGCG, University of Wisconsin Genetics Computer Group.
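The information estimates quoted in the Introduction can be retraced with a short Python sketch. It is illustrative only: the function names are hypothetical, and it assumes an equiprobable four-letter background (2 bits per position) rather than the human average nt frequencies of Shapiro (1976), which is why a fully conserved hexanucleotide comes out at exactly twelve bits here.

```python
import math

def uncertainty(freqs):
    """H(L) = -sum over N of f(N, L) * log2 f(N, L), in bits; zero frequencies are skipped."""
    return -sum(f * math.log2(f) for f in freqs.values() if f > 0)

def bits_to_select(pool_nt, n_sites):
    """Minimum information (bits) needed to pick n_sites positions out of pool_nt."""
    return math.log2(pool_nt / n_sites)

# Assumption: equiprobable background, so each position starts with 2 bits of uncertainty.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}
# A perfectly conserved position (always the same nt) has zero uncertainty.
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "U": 0.0}

# Six conserved positions, as in AAUAAA.
motif_bits = 6 * (uncertainty(background) - uncertainty(conserved))

# Transcript pool sizes and site number quoted in the text.
primary_bits = bits_to_select(3e8, 30000)  # primary transcripts
mature_bits = bits_to_select(3e7, 30000)   # mature spliced transcripts

print(f"AAUAAA: {motif_bits:.1f} bits")
print(f"primary transcripts: {primary_bits:.1f} bits; mature transcripts: {mature_bits:.1f} bits")
```

Under these assumptions the motif supplies slightly less information than the 13.3 bits needed in the primary transcript pool, which is the shortfall the text attributes to additional determinants.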

EXPERIMENTAL AND DISCUSSION

(a) Computing

The GenBank databases and UWGCG software were used, as implemented by the U.K. SERC 'Seqnet' facility on a DEC MicroVAX 3600 (Daresbury, Warrington, U.K.), in conjunction with VMS DCL batch procedures, compiled C language routines, and a suitable microcomputer.

(b) AAUAAA in protein coding sequence

The protein coding possibilities of AAUAAA in human mRNA were examined using the Losnuc (version 62) database, which contains only spliced coding region nt of the corresponding GenBank database. All human sequence entries were converted to composite files containing nt sequence and translated sequence for coordinate analysis of aa and nt sequence data. The relevant conditional probabilities are shown in Table I, along with those expected on the basis of known average nt frequencies (Shapiro, 1976) and codon usage (Aota et al., 1988) (published frequencies and usage correspond closely with those found in this database). The dipeptides NK, IK and IN are not unfavourable, and AAUAAA is not an unfavourable means of encoding them. The expected coincidence of AAUAAA with a UAA stop codon is low, but the number found (nine) does not appear inconsistently low. Exceptionally, coding and stop site motifs do also act as true cleavage signals (Bishop et al., 1988; Fiddes and Goodman, 1980). The absence of evolutionary selection against this motif, which is found in coding regions at a frequency of 1.8 × 10^-4/nt (Fig. 1), suggests that secondary discriminants of true cleavage sites must exert an absolute rather than a relative effect.

TABLE I
Conditional probabilities relating to AAUAAA motif usage in human protein coding regions

Motif (a)                             Found (b)    Expected (c)
p(AAU.AAA | Ncodon.Kcodon)            0.14         0.17
p(AAU.AAA | Ncodon.AAA)               0.34         0.42
p(A.AUA.AA | A.Icodon.Kcodon)         0.31         0.12
p(A.AUA.AA | A.Icodon.Ncodon)         0.33         0.12
p(NK | dipeptide)                     0.0020       0.0024
p(IK | dipeptide)                     0.0026       0.0026
p(IN | dipeptide)                     0.00019      0.0018

(a) Conditional probability expressions for particular sequence motifs. For example, p(AAU.AAA | Ncodon.Kcodon) implies 'probability that the nt sequence AAUAAA will be the codon sequence, where an aa sequence displays an Asn (N) followed by a Lys (K)'; p(A.AUA.AA | A.Icodon.Kcodon) implies 'probability that the codon sequence A.AUA.AA will be found, where an aa sequence displays an Ile (I) followed by a Lys (K) and for which the last nt of the codon preceding the Ile codon is A'.
(b) Probabilities found are those of Losnuc database version 62, from which 2885 human sequences spanning a total of 2558244 nt were analysed.
(c) Expected probabilities are calculated from known human codon usage and nt frequencies (Shapiro, 1976), which are very similar to those estimated from this database.

[Fig. 1 diagram not reproduced. Legible labels: protein coding region; 3'-noncoding region; 0.16 per sequence, 1.8 × 10^-4 per nt; 0.13 per sequence, 1.45 × 10^-4 per nt; 0.14 per sequence, 4.2 × 10^-4 per nt; 0.8-0.9 per sequence; 8.5 × 10^-4 per nt.]

Fig. 1. Estimated average occurrence of the motif 5'-AATAAA in human cDNA. For the protein coding region, Losnuc database version 62 was used (analysis of 2885 sequences spanning 2558244 nt); for the 3'-noncoding region, a database of 977 entries (328975 nt) was constructed (see section c). Assuming all sequences are full length, the average mRNA length would be 1224 nt, in close agreement with the average size (1.1-1.4 kb) of poly(A)+ mRNA calculated from reassociation analysis, e.g., Bantle and Hahn (1976). The actual length distributions are, to a first approximation, Poisson, consistent with observed length distributions of individual exons (Hawkins, 1988): variance of motif occurrence (except for true signals) will therefore be approximately equal to the mean. In a cDNA library biased by oligo(dT) priming and truncated first-strand synthesis, true signal sites will predominate; in a perfect (double-stranded) cDNA library, approx. 50-55% of AATAAA 'anchor' sites will be true signal sites, and 80% of such sites will occur in 3'-noncoding regions and hence commonly in the terminal exon of the gene. m, mRNA strand; c, complementary (antisense) strand. The AATAAA motif frequencies are shown above or below the respective strands.
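The average mRNA length and the per-nt motif frequency quoted for Fig. 1 follow from the database sizes by simple division. The Python sketch below retraces that arithmetic; the variable names are illustrative, and it makes the same simplifying assumption as the figure legend, namely that an mRNA is represented by one average coding region plus one average 3'-noncoding region.

```python
# Database sizes quoted for Fig. 1.
CODING_NT, CODING_ENTRIES = 2_558_244, 2885      # Losnuc version 62, human coding regions
NONCODING_NT, NONCODING_ENTRIES = 328_975, 977   # 3'-noncoding region database

# Mean length of each region per sequence entry.
coding_mean = CODING_NT / CODING_ENTRIES
noncoding_mean = NONCODING_NT / NONCODING_ENTRIES

# Assumption (as in the legend): full-length sequences, coding plus 3'-noncoding region.
average_mrna_length = coding_mean + noncoding_mean

# Convert the coding-region occurrence of 0.16 per sequence into a per-nt frequency.
coding_motifs_per_nt = 0.16 / coding_mean

print(f"mean coding region:        {coding_mean:.0f} nt")
print(f"mean 3'-noncoding region:  {noncoding_mean:.0f} nt")
print(f"average mRNA length:       {average_mrna_length:.0f} nt")
print(f"coding AAUAAA frequency:   {coding_motifs_per_nt:.2e} per nt")
```

The sum comes out within rounding of the 1224 nt quoted, and the per-nt figure agrees with the 1.8 × 10^-4/nt given in the text.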

Flanking sequence was examined for additional information content which might distinguish true cleavage signals. Although minor downstream G + T and T-rich motifs (Proudfoot and Whitelaw, 1989; see also Nussinov, 1986) were frequently found, perfect YGTGTTYY motifs (McLauchlan et al., 1985) were not found within 100 nt downstream from any protein coding AAUAAA, and were found for only one 3'-noncoding non-signal AAUAAA, that of His-rich protein mRNA (Dickinson et al., 1987). This supports the previous suggestion (McLauchlan et al., 1985) that the occurrence of this motif in 60% of genes downstream from the cleavage site may be an important determinant of cleavage, but this would still fail to identify 40% of true cleavage signals. The nt frequencies immediately flanking AAUAAA (Table II) bear a small positional bias (e.g., calculated information content of 0.1 bit at position N-1 and 0.05 bit at N-2, where N-1 is the first nt upstream from AAUAAA and N-2 the second), but one insufficient to account for all of the additional minimal 0.85 bit (section c) necessary to distinguish AAUAAA non-signals in mRNA. It is conceivable, however, that the differences of nt frequency between coding and signal motifs, particularly of G at position N-1, may reflect an interaction of this position with the footprint of cleavage-polyadenylation protein complexes (Wilusz et al., 1990), or that summation of differences in bulk property over a longer range could be important.
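The size of this positional bias can be estimated directly from the Table II frequencies. The Python sketch below computes 2 - H(L) for positions N-1 and N-2 of the true polyadenylation signals. It is an approximation under stated assumptions: an equiprobable background of 2 bits is assumed, no small-sample correction is applied, and the exact background used for the figures quoted in the text is not specified, so the results should only be expected to be of the same order as the quoted 0.1 and 0.05 bit.

```python
import math

# Table II frequencies (true polyadenylation signals) at the two positions
# immediately upstream of AAUAAA.
SIGNAL_FLANK = {
    "N-2": {"A": 0.33, "C": 0.23, "G": 0.14, "U": 0.30},
    "N-1": {"A": 0.33, "C": 0.34, "G": 0.09, "U": 0.24},
}

def information_bits(freqs, background_bits=2.0):
    """Information at one position: background uncertainty minus observed uncertainty H(L).

    Assumption: equiprobable background (2 bits), no small-sample correction.
    """
    h = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
    return background_bits - h

info = {pos: information_bits(freqs) for pos, freqs in SIGNAL_FLANK.items()}
for pos, bits in info.items():
    print(f"{pos}: {bits:.2f} bit")
```

Both values are a small fraction of a bit, consistent with the conclusion that the immediate flanking positions cannot by themselves supply the additional 0.85 bit.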

(c) AAUAAA in 3'-noncoding sequence

A human mRNA 3'-noncoding region database of 328975 nt was constructed from the GenBank database (version 64), using all entries designated 'ss-mRNA' in the locus line and with appropriate information provided in the features and comment tables. Of 603 occurrences of AAUAAA found, 457 were within 50 nt of the end of a sequence. The latter group appear invariably to represent true cleavage-polyadenylation signals (as judged by recourse to a selection of papers), although fewer than 300 are actually annotated as such in the database. The 138 occurrences of AAUAAA in 3'-noncoding regions found to be remote (at least 50 nt and usually considerably further) from the 3' terminus must either never act as a true signal or do so with only partial activity. A small number of concatenated AAUAAA motifs near true cleavage sites were omitted from classification. By random chance, this database would be expected to contain [using the frequencies p(A) = p(U) = 0.3] 240 occurrences of AAUAAA; in fact, the reverse complement UUUAUU was represented 280 times. Disregarding the terminal 50 nt of each entry (977 entries), where the selected set of true polyadenylation signals are partitioned, it is concluded that 138 out of an expected 204 AAUAAA motifs are not acting as effective cleavage sites. The probability ratio of signal AAUAAA to all non-signal AAUAAA (Fig. 1) found in mature mRNA demands an associated information content of at least 0.85 bit.
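The expected counts of 240 and 204 can be reproduced from the database size and the assumed nt frequencies. The Python sketch below uses a hypothetical helper, `expected_motif_count`, and the simplification that every nt is an independent draw and every position is a possible motif start; it is a check on the arithmetic, not a reconstruction of the original analysis software.

```python
import math

def expected_motif_count(n_nt, motif, freqs):
    """Expected occurrences of `motif` in n_nt of random sequence.

    Assumes independent nt with the given frequencies and treats every
    position as a possible motif start (edge effects are ignored).
    """
    p_motif = math.prod(freqs[nt] for nt in motif)
    return n_nt * p_motif

FREQS = {"A": 0.3, "U": 0.3}   # p(A) = p(U) = 0.3, as stated in the text
TOTAL_NT = 328_975             # 3'-noncoding region database
ENTRIES = 977
TERMINAL_NT = 50               # region holding the true signals, excluded below

whole_database = expected_motif_count(TOTAL_NT, "AAUAAA", FREQS)
internal_only = expected_motif_count(TOTAL_NT - ENTRIES * TERMINAL_NT, "AAUAAA", FREQS)

print(f"expected in whole database:          {whole_database:.0f}")
print(f"expected outside the terminal 50 nt: {internal_only:.0f}")
```

Comparing the second figure with the 138 remote occurrences actually found gives the ratio on which the section's conclusion rests.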

TABLE II
Nucleotide frequencies in positions flanking human AAUAAA protein coding and polyadenylation signal motifs

Nucleotide frequencies flanking the motif AAUAAA (b)

Position (a)   A     C     G     U        Position (a)   A     C     G     U

AAUAAA motifs in protein coding regions
N-8            0.40  0.20  0.20  0.20     N1             0.35  0.15  0.31  0.19
N-7            0.35  0.17  0.17  0.31     N2             0.30  0.18  0.29  0.24
N-6            0.28  0.21  0.31  0.20     N3             0.33  0.23  0.20  0.24
N-5            0.38  0.20  0.18  0.24     N4             0.29  0.20  0.31  0.20
N-4            0.30  0.21  0.22  0.27     N5             0.36  0.17  0.25  0.21
N-3            0.36  0.13  0.24  0.26     N6             0.36  0.20  0.16  0.28
N-2            0.35  0.14  0.31  0.20     N7             0.30  0.15  0.28  0.26
N-1            0.36  0.24  0.21  0.18     N8             0.35  0.22  0.23  0.20

True polyadenylation signals
N-8            0.28  0.17  0.20  0.35     N1             0.34  0.16  0.28  0.22
N-7            0.26  0.21  0.19  0.34     N2             0.25  0.17  0.21  0.38
N-6            0.26  0.21  0.19  0.33     N3             0.28  0.19  0.17  0.35
N-5            0.33  0.18  0.19  0.30     N4             0.28  0.19  0.15  0.37
N-4            0.31  0.18  0.18  0.32     N5             0.24  0.19  0.19  0.38
N-3            0.34  0.19  0.17  0.30     N6             0.27  0.21  0.14  0.38
N-2            0.33  0.23  0.14  0.30     N7             0.29  0.17  0.13  0.41
N-1            0.33  0.34  0.09  0.24     N8             0.27  0.21  0.15  0.38

(a) The figures are drawn from 326 such protein coding motifs, and 272 polyadenylation signals actually annotated as such in the GenBank database (Losnuc derivative database version 62 and GenBank version 64, respectively). Sequences flanking all AAUAAA in human protein coding regions (upper half of table) or true polyadenylation-cleavage signals (lower half of table) were analysed for nt distribution.
(b) Nucleotide frequencies are presented from N-8 through N8, where N-8 is the eighth nt upstream, N8 the eighth nt downstream, etc.

(d) AAUAAA in relation to cDNA library sorting strategies

Analysis of dinucleotide and trinucleotide frequencies in the strings N-8 to N-1 and N1 to N8 indicates little sequence bias except for a depletion of CpG (expected; Swartz et al., 1962) and slight overrepresentation of A and T runs [f(A) = 0.292, f(AA) = 0.106, f(AAA) = 0.046, f(T) = 0.335, f(TT) = 0.132, f(TTT) = 0.056, where, for example, f(AA) is the frequency of AA amongst all dinucleotides]. Since an 8-nt string offers 65536 combinations, apparently in this instance with little loss of variety due to positional and sequence bias, the possibility exists of using such a short stretch of nt adjacent to the unusually distributed motif AATAAA for the simplified classification and sorting of most human mRNAs. Strategies to access such sites rapidly via cDNA libraries might exploit either properties of the polyadenylation tail (to classify true 3' boundary sites), or the unique distribution of the anchor motif AATAAA defined here. Oligo(dT)-primed cDNA libraries (Buell et al., 1978) tend to be biased further in favour of AATAAA true signal sites, and can be specifically biased to represent the 3' ends of genes by early stoppage of first-strand synthesis. Either selection and directional sequencing of cDNA clones bearing poly(A) tail representations, or a resource of 'sequence-anchored' oligos (e.g., all N-6..N-1AATAAA would represent 4096 sets expected to contain 0-10 genes each) for sorting by hybridization (Wood et al., 1985; Craig et al., 1990) might prove suitable in the simplification of cDNA library sorting.

Theoretical modelling (e.g., Gunby, 1990) and practical experience (e.g., Adams et al., 1991) have clearly established the high level of inefficiency in picking and sequencing cDNA clones at random to create universally accessible sequence databases, because of the bias of representation in cDNA libraries and the random nature of picking, whether in the same or different laboratories. Hybridization with panels of short oligos (e.g., 6-8-mers) to produce fingerprint hybridization patterns to distinguish clones in a library (Craig et al., 1990) is a theoretically efficient means of classifying clones. However, the use of extremely short oligos is technically complex, information from different laboratories and libraries would be difficult to pool, and the analysis of derived information is demanding. Thus the strategy may not readily lend itself to a distributed approach networking many laboratories and many libraries.

The concept of networking genomic mapping laboratories using 'sequence tagged sites' (predefined sequence landmarks accessible to all, using PCR primers) has been elaborated (Olson et al., 1989) and must apply equally to cDNA sorting. For example, a resource of 4096 12-mers or similar, of the form proposed above, would not be expensive on the scale of the Human Genome Program. These could be sent out to any requesting laboratory, to use with any cDNA library, either as very small aliquots of particular oligos for end labelling or as an immobilized array to which labelled clones could be hybridized. The conditions of use (e.g., 'always wash 12-mer hybridizations in 3 M tetramethylammonium chloride at 46°C'; see Wood et al., 1985) would be predefined and the resource could readily be assimilated both in the form of clones and hybridization data. The resource achieved would considerably enhance the efficiency of a sequencing strategy or would lead to considerably simplified pools of clones for subsequent very short oligo hybridization fingerprint sorting. cDNA sorting strategies centred upon the unusually distributed motif AATAAA merit detailed practical investigation.
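The proposed resource of sequence-anchored oligos can be illustrated in silico. The Python sketch below enumerates the 4096 12-mers of the form N-6..N-1AATAAA and sorts cDNA sequences into bins by the six nt immediately upstream of their last AATAAA. It is a toy model only: the function names and clone sequences are invented for illustration, and exact string matching stands in for hybridization, which in practice depends on the wash conditions discussed above.

```python
from collections import defaultdict
from itertools import product

ANCHOR = "AATAAA"

def anchored_oligos(n_flank=6):
    """All oligos consisting of n_flank arbitrary nt followed by the AATAAA anchor."""
    return ["".join(flank) + ANCHOR for flank in product("ACGT", repeat=n_flank)]

def sort_by_anchor(cdnas, n_flank=6):
    """Group cDNA names by the n_flank nt upstream of their last AATAAA.

    Sequences lacking the motif, or lacking n_flank nt upstream of it, are left unsorted.
    """
    bins = defaultdict(list)
    for name, seq in cdnas.items():
        pos = seq.rfind(ANCHOR)
        if pos >= n_flank:
            bins[seq[pos - n_flank:pos] + ANCHOR].append(name)
    return dict(bins)

oligos = anchored_oligos()

# Hypothetical clone sequences, for illustration only.
toy_library = {
    "cloneA": "GGCTAGCATTCGAATAAAGTCTTT",
    "cloneB": "TTGCATTCGAATAAACCGTAAAAA",
    "cloneC": "ACGTACGTACGAATAAATT",
}
bins = sort_by_anchor(toy_library)

print(len(oligos), "anchored 12-mers")
for oligo, clones in sorted(bins.items()):
    print(oligo, clones)
```

Using the last occurrence of the motif is a design choice that favours the true signal near the 3' end; a real scheme would also have to handle the non-signal AATAAA sites estimated in Fig. 1.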

ACKNOWLEDGEMENTS

Related work is supported by UK MRC Human Genome Mapping Program Grant G8916639. Mrs. Wendy Pringle is thanked for typing the manuscript.

REFERENCES

Adams, M.D., Kelley, J.M., Gocayne, J.D., Dubnick, M., Polymeropoulos, M.H., Xiao, H., Merril, C.R., Wu, A., Olde, B., Moreno, R.F., Kerlavage, A.R., McCombie, W.R. and Venter, J.C.: Complementary DNA sequencing: expressed sequence tags and human genome project. Science 252 (1991) 1651-1656.
Aota, S., Gojobori, T., Ishibashi, F., Maruyama, T. and Ikemura, T.: Codon usage tabulated from the GenBank genetic sequence data. Nucleic Acids Res. 16 (1988) (Suppl.) r315-r402.
Bantle, J.A. and Hahn, W.E.: Complexity and characterisation of polyadenylated RNA in the mouse brain. Cell 8 (1976) 139-150.
Bird, A.P.: CpG-rich islands and the function of DNA methylation. Nature 321 (1986) 209-213.
Bishop, D.F., Kornreich, R. and Desnick, R.J.: Structural organisation of the human α-galactosidase A gene: further evidence for the absence of a 3'-untranslated region. Proc. Natl. Acad. Sci. USA 85 (1988) 3903-3907.
Buell, G.N., Wickens, M.P., Payvar, F. and Schimke, R.T.: Synthesis of full length cDNAs from four partially purified oviduct mRNAs. J. Biol. Chem. 253 (1978) 2471-2482.
Craig, A.G., Nizetic, D., Hoheisel, J.D., Zehetner, G. and Lehrach, H.: Ordering of cosmid clones covering the Herpes simplex virus type 1 (HSV-1) genome: a test case for fingerprinting by hybridisation. Nucleic Acids Res. 18 (1990) 2653-2660.
Dickinson, D.P., Ridall, A.L. and Levine, M.J.: Human submandibular gland statherin and basic histidine-rich peptide are encoded by highly abundant mRNAs derived from a common ancestral sequence. Biochem. Biophys. Res. Commun. 149 (1987) 784-790.
Fiddes, J.C. and Goodman, H.M.: The cDNA for the β-subunit of human chorionic gonadotrophin suggests evolution of a gene by readthrough into the 3'-untranslated region. Nature 286 (1980) 684-687.
Gunby, A.: Optimization of Messenger RNA Classification. M.Sc. Thesis, University of Oxford, U.K., 1990.
Hawkins, J.D.: A survey on intron and exon lengths. Nucleic Acids Res. 16 (1988) 9893-9905.
Human Gene Mapping Conference 10. Cytogen. Cell. Genet. (suppl.) (1989).
McLauchlan, J., Gaffney, D., Whitton, J.L. and Clements, J.B.: The consensus sequence YGTGTTYY located downstream from the AATAAA signal is required for efficient formation of mRNA 3' terminus. Nucleic Acids Res. 13 (1985) 1347-1368.
Nussinov, R.: TGTG, G clustering and other signals near non-mammalian vertebrate mRNA 3' termini: some implications. J. Biomol. Struct. Dyn. 3 (1986) 1143-1153.
Olson, M., Hood, L., Cantor, C. and Botstein, D.: A common language for physical mapping of the human genome. Science 245 (1989) 1434-1435.
Proudfoot, N.J. and Brownlee, G.G.: 3' non-coding region sequences in eukaryotic messenger RNA. Nature 263 (1976) 211-214.
Proudfoot, N.J. and Whitelaw, E.: Termination and 3' end processing of eukaryotic RNA. In: Hames, B.D. and Glover, D.M. (Eds.), Transcription and Splicing. IRL Press, Oxford, 1989, pp. 97-129.
Schneider, T.D., Stormo, G.D., Gold, L. and Ehrenfeucht, A.: Information content of binding sites on nucleotide sequences. J. Mol. Biol. 188 (1986) 415-431.
Shapiro, H.S.: Nucleotide frequencies in chordates. In: Fasman, G.D. (Ed.), CRC Handbook of Biochemistry and Molecular Biology (3rd ed.): Nucleic Acids, Vol. II. CRC Press, Cleveland, 1976, p. 273.
Swartz, M.N., Trautner, T.A. and Kornberg, A.: Enzymatic synthesis of deoxyribonucleic acid, XI. Further studies on nearest neighbour base sequence in deoxyribonucleic acids. J. Biol. Chem. 237 (1962) 1961-1967.
Watson, J.D.: The human genome project: past, present and future. Science 248 (1990) 44-49.
Wilusz, J., Schenk, T., Takagaki, Y. and Manley, J.L.: A multicomponent complex is required for the AAUAAA-dependent cross-linking of a 64-kilodalton protein to polyadenylated substrates. Mol. Cell. Biol. 10 (1990) 1244-1248.
Wood, W.I., Gitschier, J., Lasky, L.A. and Lawn, R.M.: Base composition-independent hybridisation in tetramethylammonium chloride: a method for oligonucleotide screening of highly complex gene libraries. Proc. Natl. Acad. Sci. USA 82 (1985) 1585-1588.
