YEAST
VOL.
7: 41 3- 424 ( 1 99 I )
The Complete Sequence of the Unit YCR59, Situated Between CR YI and M A T, Reveals Two Long Open Reading Frames, which Cover 9 1% of the 10.1 kb Segment YANKAI JIA. PlOTR P. SLONIMSKI A N D CHRISTOPHER J. HERBERT C'enrre de Gc;ni.rique Moli.culuire. Lnhoruroire propre du CNRS nssocie d I'Universite Pierre et Marie Curie. F-91 I98 G(fisur-Yverte Cedes, France
Rcccivcd 19 November 1990; accepted 28 November 1990
We have cntirely sequenced YCR59. which is a 10.1 kb segment of the right arm ofchromosome 111, and is part of the clone ESF from the Newlon collection. The segment contains two long open reading frames (ORFs): YCR591 which starts in the adjacent fragment H9G (situated towards C R Y I and the centromere), and continues with 1833 codons in YCR59. The second O R F YCR592 is 1226 codons long and encoded entirely within YCR59. The two ORFs represent 91% of the total length of the segment. Excellent agreement in both location and length is found between the ORFs YCR591 and YCR592 and the transcripts 86 and 87 respectively in the Yoshikawa and lsono (1990) map of chromosome 111. Thc two ORFs correspond to new genes and show no significant similarity with any known genes. KEY WORDS - -
Yeast; Chromosome 111; sequcncc.
INTRODUCTION The complete nucleotide sequence of a 10.1 kb segment of chromosome 111 of Saccharomyces cerevisiae has been determined as part of the European project to sequence the whole of the chromosome. This segment ( Y C R 5 9 ) lies between CRYI and the MA T locus on the right arm of the genetic map of Mortimer et al. (1989). No genetic loci have been reported in this segment. MATERIALS A N D METHODS
proA+ B' , lacIp,IucZAM 15)) and DH5aF' (supE44, AlacU169 ( ( ~ 8 lacZAM15) 0 hsdRI7, recAI, endAI, gyrA96, thi-I. relAI. F). DNA manipulations
All routine plasmid preparations, transformations, subcloning and other D N A manipulations were carried out as described by Maniatis et al. ( I 982).
Sequencing strutegy The starting material for the sequencing was the Strains clone E5F (Newlon et ul., 1986) which is a 22 kb The Escherichia coli strains used were TGI fragment from the small ring of chromosome 111 (A(lac-pro). rhi-1, supE44, hsdD5, F' (traD36, containing the ectopic recombination between 0749-503X 91:W0413 I2 506.00 0 1991 by John Wiley & Sons Ltd
414
Y. JIA, P. P. SLONIMSKI A N D C. J. HERBERT
Figure 1. Restriction map and ORF map of YCR59; the positions of the two ORFs YCR591 and YCR592 are marked by a hatched bar. The start of YCR591 in H9G wasdetermined using data provided by Prof. F. Pohl.
MAT and HML, cloned in YIPS. The 2.1 kb BamHI-EcoRI fragment and the 2.7 and 5.2kb EcoRI fragments (see Figure I ) were subcloned into M I3mp 18 or M 13mp I9 (Pharmacia). Clones suitable for sequencing were generated by exonuclease 111 digestion of the replicative form DNA. Single strand DNAs were sequenced by the chain termination method of Sanger et al. ( I 977) using either the Klenow fragment of the E. cofi DNA polymerase (Boehringer) or T7 DNA polymerase (Pharmacia). Where necessary oligonucleotide primers were synthesized to fill in gaps in the data from the exonuclease 111 generated clones. The data from the individual clones were assembled using the StadenPlus@ software package (Amersham), and the junctions between the restriction fragments were determined by sequencing on the double stranded DNA of E5F using oligonucleotide primers. In this way the entire sequence of both strands of the 10.1 kb segment was determined. The sequence was analysed using the UWGCG programs (Devereux e f al., 1984) and DNA Strider (Marck, 1988).
RESULTS AND DISCUSSION The complete sequence of the 10.1 kb chromosome 111 unit YCR59 (located between CR YI and MA T) has been determined. The right-most EcoRI site of our sequence (see Figure I), is the left-most EcoRI site of the sequence of Thierry et al. ( 1 990); this has been verified by sequencing across the junction on the original clone. An analysis of the sequence shows two long open reading frames (ORFs) which are encoded on the same strand (Figure I). The first O R F (YCR59 1) begins in the preceding fragment H9G; with data from this fragment provided by Prof. F. Pohl we have been able to reconstruct the entire ORF. YCR591 is an exceptionally long O R F of 2167 codons, of which 1833 are encoded in YCR59. The MW of the protein deduced from the sequence is 250.8 kDa; when calculated according to Bennetzen and Hall ( 1 982) there is no codon bias indicating that the gene will be expressed at a low level. When YCR591 was compared to the data banks either as a nucleotide or deduced protein
Figure 2. The complete nucleotide sequence of YCR59; the translation of the section of YCR591 encoded in YCR59 and the translation of YCR592 are marked. The putative 'TATA' box for YCR592 (position 5874) is double underlined, and the putative terminator for YCR592 is underlined.
415
COMPLETE SEQUENCE OF UNIT YCRSY
YCR591 1
G S M I C S I K V Y R F Y L W D G L L T GGATCCATGATTTGTTCAATTAAAGTATATAGATTTTATTTGTGGGATGGATTATTAACA E
61
F
A
I
N
I
L
Q
A
I
G
T
N
Y
Q
Y
T
F
S
K
GAATTTGCGATAAATATACTTCAAGCTATCGGCACCAATTACCAATATACATTTAGCmG
K
K
E
G
P
E
V
L
S
L
C
Q
D
F
L
I
A
K
A
60
120
H
121
AAAAAAGAAGGGCCTGAAGTTTTATCGCTCTGCCAAGACTTCAT
180
181
L M A R P A T E I S S T K Y I D E I E L TTAATGGCCAGGCCTGCAACAGAAATATCTTCTTCCAC~TACATCGATGAGATTGAACTT
240
241
L E M E N I I I D V N P N D I L Q D F T CTTGAAATGGAkAATATCATTATTGATGTTAACCCAAATGATATTCTTCAAGATTTCACC
300
301
GAATCGTCTAATTTTACGGTAAAATTTGAGGAAAGCACAAACTCGAATTCCGGAA
360
361
V G K C Y F Y R S S N L V S K F V S I D GTGGGTAAGTGCTATTTCTATAGGAGTTCAAACTTGGTTTCAAAATTTGTGTCCATTGAT
420
42 1
S I R L A F L N M T E S G S I D D L F H TCTATACGGCTTGCGTTTTTAAACATGACAGAATCCGGTAGTATAGACGATCTGTTTCAT
480
481
~ V S H L M N L L R N I D I L N W F K K CATGTATCACATCTGATGAATCTTTTACGAAATATTGATATTCTTAATTGGTTTAAAAAA
540
541
D F G F P L F A Y T L K Q K I T Q D L S G A C T T T G G C T T C C C T T T A T T T G C T T A T A C T T T ~ C ~ T A A C A C ~ G A T T T A T C T600
601
Q P L N I Q F F N L F L E F C G W D F N CAGCCTCTGAATATCCAATTTTTCAATTTATTCTTAGAATTTTGCGGGTGGGATTTCAAC
660
661
D I S K S I I L D T D A Y E N I V L N L GATATTTCCAAATCCATAATTCTAGATACTGATGCCTACGAAAACATAGTCCTTAACTTG
720
721
D L W Y M N E D Q S S L A S G G L E I I GATTTATGGTATATGAATGAGGATCAAAGTTCTCTGGCGTCAGGCGGATTAGAAATTATC
780
781
R F L F F Q I S S L M E A S I Y S K F N AGATTTCTTTTCTTCCAAATTTCAAGTTTGATGGAAGCCTCTATTTATTCTAAGTTCAAT
840
841
S N K F N D M N I L E K L C L S Y Q A V TCCAATAAATTCAATGATATGAATATCCTAGAAAAACTATGTTTAAGCTATCAGGCTGTC
900
901
T K R E N Q N S K F N E L S N D L I S V ACAAAAAGAGAAAATCAGAACAGTAAATTTAATGAGCTATCAAATGATTTAATTTCTGTA
960
961
F V T L L K S N T D K R H L Q W F L H L T T T G T T A C T T T A T T G A A A T A C T G A T A A A C G A C A C C T ~ A G T ~ T T T T T A C A T ~ T C1020
E
S
S
Y
S
Y
N
F
F
I
T
K
V
R
K
K
F
D
E
V
E
R
S
S
T
T
N
E
S
I
K
I
N
L
I
Q
P
A
E
V
1021
TCATATTACTTTATTAAGAGAAAAGATGTACGTTCTACAGAAATTATA
1080
1081
D Q L F S F Y L D Q G S D E N A K I L S GATCAACTTTTTTCGTTTTACTTAGATCAA~TAGCGACGAAAATGCGAAGATACTTTCA
1140
416
Y. JIA, P. P. SLONIMSKI A N D C. J. HERBERT
E
I
I
P
L
K
L
M
L
M
I
M
D
Q
I
V
E
N
N
E
1141
GAGATTATACCACTTAAGCTAATGCTGATGATTATGGATCAAATAGTGGATAATGAA
1200
1201
S N P I T C L N I L F K V V L T N K P L TCAAACCCTATTACGTGCTTGAATATCTTATCTTATTTAAGGTAGTTCTGACCAATAAACCGCTT
1260
1261
F K Q F Y K N D G L K L I L T M L C K V TTCAAACAATTTTACAAAAATGATGGTTTGAAACTCATATTGACTATGCTTTGTAAGGTA
1320
1321
G K S Y R E E I I S L L L T Y S I G N Y GGGAAAAGCTATCGAGAGGAGATTATTTCTTTGCTTCTCACATATTCTATTGGCAATTAT
1380
1381
T T A N E I F S G A E D M I G G I S N D ACCACAGCTAACGAAATATTTTCAGGTGCTGAAGACATGATTGGAGGAATTTC~CGAC
1440
1441
K I T A K E I I Y L A V N F I E W H V I AAGATAACTGCAAAAGAAATTATTTATTTGGCTGTCAACTTCATTGAGTGGCATGTGATT
1500
1501
N S N A S D S S S V L D L N N H I L R F AATTCTAATGCCAGTGATTCTTCTTCTGTATTGGACCTGAACAACCATATATTAAGATTC
1560
1561
V E D L K S L S A V P I N E S V F D P K GTCGAATATCTGATCTGAAATCGCTGAGCGCTGTTCCGATTAATGAATCTGTATTTGATCCTAAA
1620
1621
K S Y V M V S L L D L S I A L N E S E D AAAAGTTATGTGATGGTTTCATTATTAGATCTCTCGATAGCTTTGAATGAATCGGAGGAC
1680
1681
I S K F K S S S K V I S E L I K G N I M ATCTCAAAGTTCAAGAGCTCTTCAAAAGTGATTTCAGAGCTCATTAAAGGTAATATAATG
1740
1741
C A L T K Y A A Y D F E V Y M S T F F C TGTGCTCTTACGAAATATGCCGCTTATGATTTCGAAGTCTATATGAGCACATTTTTTTGT
1800
1801
CACAGTACAGAATACAAACTGGTTTATCCAAAAACTGTAATTA
1860
1861
E L S F I V T L L P E I L N D L I D S N GAGCTATCATTTATAGTGACACTCCTACCCGAAATACTTAATGACCTGATAGATAGCAAT
1920
1921
N N L N L M M L K H P Y T M S N L L Y F AACAATTTGAACCTGATGATGTTGAAGCATCCATACACGATGTC~TCTCCTTTATTTT
1980
1981
L R K F R P D T S Q I V M P K D F Y F S CTTCGCAAATTTCGACCTGATACGTCACAGATAGTTATGCCTAAAGATTTTTATTTCTCA
2040
2041
S Y T C L L H C V I Q I D K S S F Y H F AGTTATACATGTCTCTTGCATTGTGTTATTCAGATTGATAAATCATCATTTTACCATTTC
2100
2101
K N V S K S Q L L Q E F K I C I M N L I AAAAACGTTTCTAAGTCGCAACTGTTACAGGAATTCAAAATCTGCATAATGAACTTAATA
2160
2161
Y S N T L K Q I I W E K E E Y E M F S E TATTCCAATACTCTAAAGCAGATAATATCTTCTGGGAGAAAGAAGAATACGAGATGTTTTCTGAG
2220
2221
S L M A H Q E V L F A H G A C D N E T V TCACTGATGGCGCATCAGGAATATCTGTTTTATTTGCACATGGAGCATGTGATAATGAGACCGTT
2280
H
S
T
E
Y
K
L
V
Y
P
K
T
V
M
N
N
S
S
Y
L
417
COMPLETE SEQUENCE OF UNIT YCR59
2281
G L L L I F F A N R L R D C G Y N K A V GGCTTATTGTTAATATTTTTTGCCAACAGATTACGTGATTGT~ATACAACAAAGCAGTC
2340
2341
F N C M K V I I K N K E R K L K E V A C TTCAATTGTATGAAAGTGATCATTAAGAACAAGGAAAGGAAACTAAAGGAGGTGGCGTGT
2400
2401
F F D A A N K S E V L E G L S N I L S C TTTTTTGACGCAGCGAATATGAAGTACTCGAA~TTTAAGTAATATCCTCTCATGC
2460
2461
N N S E T M N L I T E Q Y P F F F N N T AATAACTCTGAAACAATGAACCTCATAACTGAACAATACCCATTTTTTTTCAACAATACA
2520
Q
Q
V
R
F
I
N
I
V
T
N
I
L
F
K
N
N
N
F
S
2521
CAACAGGTACGGTTCATAATTGTCACCAATATCTTGTTTAAGAACAACAATTTTTCT
2580
2581
P I S V R Q I K N Q V Y E W K N A R S E CCAATAAGCGTTAGACAGATCAAAAACCAAGTTTTACGAATGGAAAAATGCAAGATCAGAA
2640
2641
Y V T Q N N K K C L I L F R K D N T S L TACGTCACCCAAAACAATAAAAAGTGCCTTATTTTATTTAGAAAAGACAACACATCCTTA
2700
2701
D F K I K K S I S R Y T Y N L K T D R E GATTTTAAAATCAAAAAGTCCATATCAAGATACACTTACAACCTCAAAACGGATAGAGAA
2760
2761
E N A V F Y R N N L N L L I F H L K H T GAAAATGCAGTTTTCTATCGAAATAATTTT~TCTTTTGATTTTTCATCTG~CATACA 2820
2821
L E I Q S N P N S S C K W S L D F A E D CTGGAGATACAATCAAATCCAAATTCGTCGTCCTGCAAGTGGTCATTGGACTTT~CAG~GAT2880
2881
F D G M K R R L L P A W E P K Y E P L I TTTGATGGGATGAAACGGAGGCTTTTGCCTGCTTGGGAACCAAAATATGAACCACTCATT
2941
N E E D A N Q D T I T G G N R Q R R E S AACGAGGAAGATGCTAATCAAGATACTATAACAGGTGGTGGTAACAGAC~GGAGAG~GT 3000
3001
G S I L S Y E F I E H M E T L E S E P V GGAAGCATTTTATCCTACGAATTTATCG~CATAT~AGACTCTTGAGTCGGAGCCAGTT 3060
3061
G D L N E N R K I L R L L K D N D S I A GGAGATTTGAATGAGAATAGAAAAATTCTTCTTAGACTTTTGAAGGAT~CGATTCTATTGCA 3120
3121
T I W N C S L I I G L E I K E G I L I H ACTATTTGGAATTGCAGTTTGATTATTGGATTAG~TTAAGGAGGGGATTTTAATTCAT
3181
G S N Y L Y F V S D Y Y F S L E D K K I GGCAGTAATTACCTTTACTTTGTAAGTGATTACTATTACTATTTTAGTTTAGAGGAT~GATT 3240
3241
L K L S E V S Q E S R D M T V S L I N G CTAAAATTATCAGAAGTATCGCAAGAATCACGGGGATATGACGGTTAGCTTAATTAACG~
3300
3301
P D V K R V S T F L K H E V F V W K L L CCTGATGTTAAAAGGGTATCAACTTTCCTAAAGCACGAAGTCTTTGTTTGG~CTTCTC
3360
3361
D I T F V T K R P F L L R D V A I E L L GATATCACTTTCGTTACCAGACCCTTTCTACTTCGGGTTATTG
3420
2940
3180
418
Y. JIA, P. P. SLONIMSKI AND C. J. HERBERT
3421
F K E R V S A F F S F Y N K R V R D D V TTCAAAGAGAGAGTTAGCGCTTTTTTTAGTTTTTACAACAAAAGAGTGAGAGATGACGTT
3480
3481
L R V L N K I P K H L P A D P I F S S V TTACGGGTACTGAATAAGATCCCGAAGCACCTTC~GCAGATCCAATTTTTTCAAGCGTT
3540
3541
L Q E I N D R G N S I V A R N G I G K A TTACAAGAAATAAACGACCGAGGAAATAGTATAGTGGCAAGAAATGGAATAGGAAAGGCA
3600
3601
S I A S K F T S V F S A N N S L I D G F AGCATTGCTTCCAAATTCACTAGCGTCTTCTCAGCGAA~CAGCCTAATAGATGGATTT
3660
3661
E I S K K W V R G E I S N F Y Y L L S I GAGATCAGCAAAAAATGGGTTAGGGGAGAGAGATTTCTAATTTTTATTACCTGTTGAGTATC
3720
3721
N I L A G R S F N D L T Q Y P V F P W V AACATCCTAGCGGGAAGGTCATTCAACGATTTGACCCAATATCCAGTGTTTCCGTGGGTT
3780
3781
I A D Y E S N V L D L E N P K T Y R D L ATTGCAGATTACGAAAGTAACGTACTCGATTTAGAGAATCCTAAAACTTACCGGGACCTA
3840
3841
S K P M G A Q S E K R K L Q F I E R Y E TCGAAACCTATGGGCGCTCAAAGTGAGAAAAGGAAATTACAGTTTATAGAGCGTTATGAA
3900
3901
A L A S L E N A D S A P F H Y G T H Y S GCTTTGGCTTCCCTGGAAAATGCTGATTCCGCACCATTTCATTATGGCACGCATTATTCC
3960
3961
S A M I V S S Y L I R L K P F V E S F L TCAGCTATGATAGTATCTTCATATCTGATRAGGCTGAAGCCCTTTGTCG~TCCTTTTTG
4020
4021
L L Q G G S F G P A D R L F S S L E R A TTATTGCAAGGCGGAAGTTTTGGCCCTGCAGATCGTTTATTTAGTTCGCTTGAAA~CC 4080
4081
W S S A S S E N T T D V R E L T P E F F TGGAGCTCTGCTTCTTCTGAAAATACAACGGATGTCAG~AATTGACACCTGAATTTTTT
4140
4141
F L P E F L I N V N S Y D F G T D Q S G TTTCTACCTGAATTTTTGATCAACGTTAATAGTTATGACTTTGGTACAGAC~GCGGT
4200
4201
K K V D D V V L P P W A N G D P K V F I AAAAAAGTTGACGACGTCGTACTTCCACCCTGGGCAAATGGTGACCC~GGTTTTCATT 4260
4261
Q K N R E A L E S P Y V S A H L H E W I CAAAAGAATAGAGAAGCTTTAGAAAGTCCTTATGTATCA~ACATTTACATGAATGGATT
4320
4321
D L I F G Y K Q K G E I A V K S V N V F GATTTGATATTTGGTTACAAACAAAAGGGGGAAATTGCTGTGAAATCTGTTAACGTATTC
4380
4381
N R L S Y P G A V N L D N I D D E N E R AACAGATTGAGTTACCCAGGGCTGTAAATCTAGAT~TATTGACGATGAAAATGAGCGC
4440
4441
R A I T G I I H N F G Q T P L Q I F Q E AGAGCTATCACAGGCATTATTCACAACTTTGGTC~CGCCTTTAC~TATTTCAGGAA 4500
4501
CCTCATCCGGAAAAAATAGCCTGCAATGTTCAACAGCTAA
P
H
P
E
K
I
A
C
N
V
Q
Q
L
T
T
E
V
W
R
R 4560
419
COMPLETE SEQUENCE OF UNIT YCR59 V
4561
P
M
K
P
I
F
E
K
T
I
F
N
L
N
E
K
N
R
S
GTTCCAATGAAGCCAATATTTGAGAAGACAATCTTTAATTTGAATG~GAACAGGTCT V
D
Y
V
I
H
D
P
S
Y
F
D
S
L
Y
W
R
G
F
4620
A
4680
4621
GTCGATTATGTTATACACGATCCTAGTTACTTCGATTCATTATACTGGAGG~TTCGCT
4681
F P N L F F R T E E S L V S L R I V H K TTCCCAAACTTGTTTTTCAGAAC~~GAATCGTTAGTGTCATTGAGAATTGTGCAT~4740
4741
N W L K I G L D I F K K T H M A Q I T S AATTGGTTAAAAATTGGACTAGATATTTTTAACGCATATGGCTCAGATTACATCG
4800
4801
F A Y W K L G E F I T G D K N G L I K V TTTGCGTACTGGAAGTTGGGCGAATTCATAACTGGTGATATT
4860
W
4861
K
Y
R
K
D
K
H
S
V
S
G
N
L
E
N
K
K
T
M
TGGAAATATCGTAAAGATAAGCATTCGGTTTCAGGTAACCTTGAGAACWCAATG F
G
H
L
C
E
L
K
E
M
R
C
Y
H
D
Y
N
T
L
4920
L
4921
TTTGGGCACCTATGCGAGCTAAAGG~TGCGCTGTTATCACGACTA~TACGCTTTTA 4980
4981
T L D I S G L V Y V W D M I N F E L V R ACCTTAGACATCAGCGGCTTAGTATATGTCTGGGACATGACATGATT~TTTCGAACTAGTGAGA 5040
5041
Q I T N D A Q K V A I S Q H A G S I M V CAAATAACAAATGATGCGCAAAAGGTCGCAATATCAACATTATGGTA
5101
L T K N N A I S I F N L N G Q I Y T S K T T G A C T A A G A A T A a C G C C A T T T C G A T C T T C A A T C T ~ T ~ A C ~ T A T A T A C A T C A A A G 5160
5161
AAATTCGAACCAGCTAAAATTGTAAGCTC~TTGATTTTTTTGACTTCACTAAGTTAGAC
K
A
F
G
E
Y
P
R
A
K
K
H
I
I
V
Y
S
W
S
I
K
E
D
M
F
F
E
D
I
L
F
L
T
V
K
G
L
F
5100
D
5220
E
5221
GCAGGTTACAGAAAGCATATCTATTGGAAAGAGAGATGG~TACTACTAGTGGGCTTTGAA
5281
D G T I E I Y E L F L T F H N E W A I K G A T G G A A C T A T A G ~ T T T A C G A G C T C T T T T T G A C T T T T C A T ~ T G ~ T ~ G C G A T A A A G5340
5341
CTACTGAAACAGCTCTGTACCGAAAGAGGGAAAGCCATAACTAGCATTAAGGGACAGGGG
5401
AAGACATACCTGTCCCAGAAAAGACGCAAGGATACA~AGA~CTCATGAGATAGAAGTG
5461
ATTGCGGGAACATTAGATGGCAGATTAGCTATTTGGTACTAGGCATGACATCGTAACGCC
5520
5521
TTTCTTTAAATGATTCAATTTTTGTAGTTTATATCTTTACTTTTGAAACTGATTTCTCAT
5580
5581
CCCACCTAGTATTGTAATTGCGTACGTATCCAATATCATTACCAACGCCGGGTATTTTTT
5640
5641
TCTAGTATTTCTTCTCCATTTCGCCTATGGAAAACAGCAAAAGGGTAAAAGWCA
5700
5701
AACGATTAATTCTTCATTGAATTATGTAAAAATCAGCAACCGCAGATTTAATAGAG
5760
5761
ACCAGAAATTCGGATTACTATTGACTTTGTGCACCACCTTC~TTTACTCATTGTTTAA
5820
L
K
I
L
T
A
K
Y
G
Q
L
T
L
S
L
C
Q
D
T
K
G
E
R
R
R
R
L
G
K
A
K
D
I
A
T
W
I
A
T
E
S
P
I
H
K
E
G
I
Q
E
5280
G
5400
V
5460
Y
420
Y. JIA, P. P. SLONIMSKI AND C. J. HERBERT
5880
5821
GACAGGCAGTGGGAAAGAAGCCGTCATATTGCTCGAATCCTTMCAAGC-CAA
5881
CCACTAAATTATTCCGAAAGGGCCTGCTTAATAATTTGCCTACTAACTTGTGCATAGAAC 5940
5941
M G Y P P P T R R L G AGCAAACAGAAACAAAGCGTAAGAAACATGGGGTATCCGCCACCTACACGAAGGCTTGGA
6000
6001
D K K R Y H Y S N N P N R R H P S A V Y GATAAGAAAAGGTACCATTATTCCAATAATCCTAACCGAAGGCATCCTTCCGCTGTTTAT
6060
6061
TCCAAGAATAGCTTTCCACAAGCAATAATGGATTTGTATCTTCTCCTACTGCCGAT
6120
6121
N S T N P S V T P S T A S V P L P T A A AATTCAACAAATCCGTCTGTAACTCCCAGTACTGCATCTGTACCTCTTCCTACAGCGGCA
6180
6181
P G S T F G I E A P R P S R Y D P S S V CCTGGAAGCACGTTTGGTATCGAAGCACCCAGGCCATCTCGATATGATCCGA~TCAGTC
6240
6241
S R P S S S S Y S S T R K I G S R Y N P AGTAGGCCTTCGTCATCATCTTATTCGTCAACAAG~TTGGAAGCCGTTATAACCCA
6300
6301
D V E R S S S T T S S T P E S M N T S T GATGTGGAAAGATCCTCTTCAACCACTAGTTCAACTCCGG~GTATGAATACGAGCACC
6360
6361
ATAACACACACCAATACGGATATCGGAAACTCACGCTCACGCTATTCTCG~CCATGAGCAGA
6421
Y N P Q S T S S T N V T H F P S A L S N TATAATCCTCAATCTACTAGTTCTACAAACTCACGCGTTACCCACTTTCCCTCGGCATTATCAAAC 6480
6481
A P P F Y V A N G S S R R P R S M D D Y GCTCCACCGTTTTATGTTGCCAACGGGAGTTCTCGGAGACCTCGATCAAT~ATGATTAT
6541
AGTCCTGATGTAACGAACAAGCTCGAAACAAATAATGTTTCATCTGTTAATAATAACAGC
6600
6601
P H S Y Y S R S N K W R S I G T P S R P CCTCATTCTTATTACTCTAGGAGCAACAAATGGAGATCCATTGGAACGCCTTCCAGACCA
6660
6661
P F D N H V G N M T T T S N T N S I H Q CCATTTGATAATCATGTCGGCAATATGACGACCACCAGCAATACTAACTCGATCCATCAA
6720
6721
R E P F W K A N S T T I L K S T H S Q S AGGGAACCTTTTTGGAkAGCAAATAGTACTACTATTTT~TCAACTCATTCACAGTCA
6780
6781
TCGCCTTCCCTTCATACTATTCACGATGCGAATAAACTCACGTTGGACAAACCAGAGGCT
6840
6841
S V K V E T P S K D E T K T I S Y H D N TCAGTTAAAGTTGAAACACCCAGTAAAGATGATGAGAC~CCATATCGTACCATGATAAC
6900
6901
AATTTTCCACCAAGAARATCAGTTTCTAAACCT~T~ACCTTTAG~CCCGAT~TATC 6960
6961
K V G E E D A L G K R E V H K S G R E I AAGGTTGGCGAAGAAGATGCAT TGGGGUGAAGTACATWGTGGGCGTGAGATA
YCR592
S
I
S
S
N
K
T
P
P
F
N
H
D
S
P
S
T
V
L
P
F
N
T
H
R
P
T
N
T
K
K
D
K
K
S
S
I
L
K
V
S
G
E
F
S
N
N
T
H
K
N
S
N
D
P
G
R
N
A
N
F
Y
V
N
A
V
S
S
K
P
S
R
S
L
L
S
K
V
D
E
P
T
N
K
P
T
M
N
P
D
A
S
N
E
N
D
R
6420
6540
S
A
I
7020
42 1
COMPLETE SEQUENCE OF UNIT YCR59
7021
A K E H P T P V K M K E H D E L E A R A G C A A A G G A A C A T C C T A C T C C T G T ~ T G M ~ ~ A T G A T G A A C T A G ~ ~ T C 7080 ~GCT
7081
K K V N K I N I D G K Q D E I W T T A K AAAAAAGTAAATAAAATCAATATTGATGGMKAGGACGAAATTTSACaUKAAAA
7141
ACAGTGGCCAGTGCAGTCGAAGTTTCCAAAGMGTCATAAGGAACTAACAC~TCTGTT
7201
GAAAGGAAGGAAAGTCCAGATTAGAGATTATG~GAGCCTG
T
E
K
7261
V
R
T
A
K
D
S
E
A
A
S
T
V
P
K
E
E
L
V
I
T
S
R
V
K
D
D
E
Y
D
S
E
D
H
R
N
K
A
K
E
Y
S
L
D
Y
T
P
E
R
K
E
S
A
P
V
K
V
E
G
C
I
F
P
L
P
K
A
E
T
R
L
W
E
7200
L
7260
L
AAAACAGACGCAACAAAGTTGACAGTAGACGATGATAATAAAAGTTACGAAGAACCTCTT
E
7140
7320
L
7321
GAAAAAGTGGAAGGGTGTATTTTCCCATTACCAAAAGCAGAAACGAGATTATWAATTG
7380
7381
K N Q K R N K I I S K Q K Y L L K K A I AAAAACCAGAAAAGAAACATAATAAGTAAACAAAAGTACTTACTGAAAAA~AATT
7440
R
7441
N
F
S
E
Y
P
F
Y
A
Q
N
K
L
I
H
Q
Q
A
T
AGGAATTTCTCAGAGTATCCTTTTTACGCACAGAACAAACTTATACATCAGGCTACC
G
L
I
L
T
K
I
I
S
K
I
K
K
E
E
H
L
K
K
7500
I
7560
7501
GGACTTATCTTGACGAAAATTATATCAAAGATAAMAAGGAGGAACATTTGWTA
7561
N L K H D Y F D L Q K K Y E K E C E I L A A T T T A A A A C A T G A T T A T T T C G A T C T C C A G A A G A A G T A T G ~ G A A T G C G ~ ~ T T T G7620
7621
T K L S E N L R K E E I E N K R K E H E ACTAAACTGAGTGAAAATTTAAGGAAGGAAGAAATCGAAAAT~CGTAAAGA~ACGAA 7680 L
M
E
Q
K
R
R
E
E
G
I
E
T
E
K
E
K
S
L
R
7740
7681
TTAATGGAGCAGAAAAGACGTGAAGAAGGTATCGAAACAGAGG
7741
H P S S S S S S R R R N R A D F V D D A C A T C C A T C C T C G T C T T C C T C A T C T C G T C ~ A G G G G 7800
7801
E M E N V L L Q I D P N Y K H Y Q A A A GAAATGGAAAATGTATTGCTACAAATCGACCCAAATTATAAACATTATCAGGCT~T~A 7860
7861
T I P P L I L D P I R K Y S Y K F C D V A C A A T T C C T C C G C T A A T T T T AGATCCAA~CC~~TACTCTTAC~TTCTGTGATGTA7920
7921
N N L V T D K K L W A S R I L K D A S D A A T A A C T T G G T T A C A G A C A T T T ~ G T C T A G A A T A T T G M G A C ~ C T C T G A C 7980
7981
N F T D H E H S L F L E G Y L I H P K K AACTTTACTGACCATGAGCACTCT~TATTTTTGGAGGGTTATTTAATTCATCCT~ 8040
8041
F G K I S H Y M G G L R S P E E C V L H TTCGGTAAAATTTCTCACTACATG~GGCTTAAGAAGTCCTGAAGAGTGTGTCCTACAT
8100
8101
Y Y R T K K T V N Y K Q L L I D K N K K TATTATAGAACAAAGAAAACTGT~TTATMCAACTTCTTATCGAT~GAACAAGAAA
8160
422
Y. JIA, P. P. SLONIMSKI A N D C. J. HERBERT
8161
R K M S A A A K R R K R K E R S N D E E AGAAAAATGTCAGCCGCTGCGAAGCGCCGCAAGAGGAAGGAAATTACTGGGAAGTAATGACGAGGAA
8220
8221
V E V D E S K E E S T N T I D K E E K S GTCGAAGTTGATGAGAGTAAAGAAGAGTCAACGAACACGATAGATAAGGAAGWGT
8280
8281
E N N A E E N V Q P V L V Q G S E V K G GAGAACAATGCCGAGGAAAATGTTCAGCCGGTTCTAGTTCAAGGTTCTGAAGTGAAAGGT
8340
8341
D P L G T P E K V E N M I E K R G E E F GATCCATTAGGTACACCGGAAAAAGTTGAAAATATGATTGAAAAGAGAGGCGAAGAGTTT
8400
8401
A G E L E N A E R V N D L K R A H D E I GCAGGTGAATTGGAAAATGCTGAGAGGGTAAATGACTTAAAAAGGGCGCATGATGAAATT
8460
8461
G E E S N K S S V I E T N N E V Q I M A GGAGAAGAGAGCAATAAGTCCAGTGTAATAGAAACCAACAATGAGGTACAAATAATGGCT
8520
8521
P K G G V R N G Y Y P E E T K E L D F S CCAAAAGGAGGTGTTCGGAATGGTTATTATCCAGAGGAGACC~GAACTTGACTTCAGT
8580
8581
L E N A L Q R K K H K S A P E H K T S Y TTAGAGAATGCGTTACAGAGAAATTACTGGGAAATTACTGGCACAAATCTGCACCAGAGCAT~CAAGTTAT 8640 W
S
V
R
E
S
Q
L
F
P
E
L
L
K
E
F
G
S
Q
W
8641
TGGAGTGTTCGTGAATCTCATCTTTCCAGAAT 8700
8701
S L I S E K L G T K S T T M V R N Y Y Q TCTCTCATATCAGAAAAACTGGGTACCAAATCTACTACAATGGTAAGGAATTACTACCAA
8761
R N A A R N G W K L L V D E T D L K R D AGAAATGCAGCTCGCAATGGATGGAAATTACTGGTTACTGGTTGATG~CCGACTTAAAGCGAGAT 8820
8821
G T S S E S V Q Q S Q I L I Q P E R P N GGGACTAGTTCAGAATCTGTACAACAATCTCAAATTTTGATACAACCAGAACGACCAAAC
8880
8881
I N A Y S N I P P Q Q R P A L G Y F V G ATCAATGCCTATAGTAATATTCCTCCTCAACAAATTACTGGGACCGGCTTTGGGTTATTTTGTTGGA
8940
8941
Q P T H G H N T S I S S I D G S I R P F CAACCAACTCATGGGCATAATACATCTATTTCATCTATCGATGGCTCTATAAGACCATTT
9000
9001
G P D F H R D T F S K I S A P L T T L P GGGCCTGATTTTCATCGTGATACCTTTTCTAAAATTAGTGCTCCTTTAACCACTTTACCA
9060
9061
P P R L P S I Q F P R S E M A E P T V T CCACCAAGACTACCATCTATTCAGTTTCCTCGTTCAGAAATGGCAGAACCTACAGTGACA
9120
9121
D L R N R P L D H I D T L A D A A S S V GATTTGCGTAACAGGCCCTTAGACCATATTGACACGTTGGCTGATGCAGCTTCGTCAGTA
9180
9181
T N N Q N F S N E R N A I D I G R K S T ACAAATAATCAAAACTTCAGTAATGAAAGGAATGCAATTGACATTGGCCGTAAATCGACG
9240
9241
T I S N L L N N S D R S M K S S F Q S A ACAATCAGCAATCTATTGAATAATTCGGATCGAAGCATGAAATCTTCTTTCCAAAGCGCT
9300
8760
423
COMPLETE SEQUENCE OF UNIT YCRS9
S
R
H
E
A
Q
L
E
D
T
P
S
M
N
N
I
V
V
Q
E
9301
T C A A G A C A C G A A G C A C A G C T C G A A G A C A C T C C C A T G A
9360
9361
I K P N I T T P R S S S I S A L L N P V ATAAAACCGAATATTACTACGCCAAGATCGAGTTCTATTTCTGCATTACTAAATCCTGTA
9420
9421
N G N G Q S N P D G R P L L P F Q H A I AATGGGAATGGGCAATCAAACCCAGATGGAAGGCCGTTGCTGCCATTTCAGCATWTATT
9480
9481
S Q G T P T F P L P A P R T S P I S R A TCTCAAGGCACTCCTACTTTCCCTTTACCGGCCCCCTC~ACTAGTCCAATAAGTCGTGCG
9540
9541
P P K F N F S N D P L A A L A A V A S A CCTCCAAAGTTCAATTTTTCGAATGATCCGTTGGCA~TTTGGCTGCGGTTGCCTCCGCG
9600
9601
P D A M S S F L S K K E N N N CCAGATGCAATGAGCAGTTTTTTATCTAAAAAGGAAAATAATAATTGAACAAACGGCTGA
9660
9661
GACGGGCAATACATATGCTCTACTTCTTTTCCATCCAATGGTTGGTGAAACTCTCGAGCA
9720
9721
TACATTACCTTACGTGTGT
9780
9781
GGGAGGAGTTTTTAATTATATTGTAATTTCGTATTTTTTCTGCATTATACAGTTTTTTC
9840
9841
CGATTTTAAACGACTTTATTTAAGTGTCGTGTAAATATGTCACATTTTATTTTTGTACGT
9900
9901
ATTCACATGTCCTGGCGTGCGGCCATTGCTGAAAATCGCAAAACCCACAGAGAAATAAAC
9960
9961
ATCGCGAAAAAGTCAATGAAAAATTGGAAAATATTTTTCATTTCACTATTATCCACAAGC
10020
10021
AATTTTGTACAAAGTGAAAAGGTTGAACTAATTATCTTCGTCTAGAAGCCATGAATTC
sequence (EMBL release 24 and MIPS release 17 respectively) no significant similarity was found with any known sequence. YCR591 terminates with a TAG stop codon at position 5500 (see Figure 2), and the second ORF present in YCR59, YCR592, starts at position 5968. YCR592 is also a long ORF of 1226 codons and is encoded entirely within YCR59. The MW of the deduced protein sequence is 138.4 kDa; again the codon usage shows no bias indicating that the gene is expressed at a low level. Comparisons with the EMBL and MIPSdata banks showed no significant similarity with known sequences. As both of the deduced protein sequences are very long we repeated the comparison with the MIPS data bank dividing the sequences into fragments of 500 amino acids which overlapped by 100 amino acids. This should allow us to detect any motifs in the proteins that are similar to those in the data bank but are not seen when the whole sequence is compared as the rest of the sequence (non-similar) reduces the resolution. However this did not affect the result and no significant similarity was detected. When the region 3' to YCR591 was examined no
10078
sequence that could easily be fitted to a consensus terminator (Zaret and Sherman, 1982 and Henikoff and Cohen, 1984) or a polyadenylation signal could be found. However a feature of all terminator sequences is that they are T or A/T rich and several such regions can be found in the 175 bp 3' to YCR591 (see Figure 2). An examination of the region 5' to YCR592 showed a TATA box (Hahn et al., 1985) at position 5874,94 bp upstream of the ATG; none of the sequences identified by Dobson el al. (1982) as elements of efficient promoters are present, which is consistent with the absence of a codon bias. 3' to YCR592 there is a reasonable match to the Zaret and Sherman (1982) consensus terminator sequence { TAG..TAGTiTATGT..(A--T rich)..TTT} (Figure 2), but no obvious polyadenylation sequence. After this structural analysis there remains the question of whether these ORFs are expressed. In their transcript analysis of chromosome 111, Yoshikawa and Isono (1990) have detected two transcripts of 6.5 and 4.2 kb derived from the region corresponding to YCR59. The size of these transcripts is in very good agreement with
424
the size of the two ORFs (6501 and 3678 bp respectively), suggesting that the ORFs are part of genes that are expressed. Also Yoshikawa and Isono (1990) found that the transcripts were of low abundance which is consistent with the absence of a codon bias. In conclusion, as part of the EEC project we have sequenced a 10.1 kb segment of chromosome 111 which contained no previously mapped genetic markers. The sequence has revealed the presence of two large ORFs which show no significant similarity to known proteins. We believe that this shows the value of systematic sequencing projects in identifying new genes that have been ‘missed’ by classical genetics and biochemistry. At present we are engaged in gene deletion experiments to help us determine the function of these new genes. ACKNOWLEDGEMENTS We are grateful to Carol S. Newlon and Steve Oliver for the distribution of the clones, John G. Sgouros and Hans W. Mewes for the computer coordination, Fritz Pohl for the sequence of part of H9G and Denise Menay for synthesizing the oligonucleotides. This work was supported by the Commission of the European Communities under the BAP program of the Division of Biotechnology. REFERENCES Bennetzen, J. F. and Hall, B. D. (1982). Codon selection in yeast. J. Biol. Chem. 257,3026-303 I . Devereux, J., Haeberli, P. and Smithies, 0. (1984). A comprehensive set of sequence analysis programs for the VAX. Nucl. Acids Res. 12,387-395. Dobson, M. J., Tuite, M. F., Roberts, N. A., Kingsman, A. J., Kingsman, S. M., Perkins, R. E., Conroy, S. C.,
Y.JIA, P. P. SLONIMSKI AND C. J. HERBERT
Dunbar, B. and Fothergill, L. A. (1982). Conservation of high efficiencypromoter sequences in Saccharomyces cerevisiae. Nucl. Acids Res. 10,2625-2637. Hahn, S., Hoar, E. T. and Guarente, L. (1985). Each of the three ‘TATA elements’ specifies a subset of the transcription initiation sites at the CYCl promoter of Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. USA 82,8562-8566. Henikoff, S. and Cohen, E. H. (1984). Sequences responsible for transcription termination on a gene segment in Saccharomyces cerevisiae. Mol. Cell. Biol. 4, 15 151520. Maniatis, T., Fritsch, E. F. and Sambrook, J. (1982). Molecular Cloning. A Laboratory Manual. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York. Marck, C. (1988). ‘DNA Strider’: a ‘C’ program for the fast analysis of DNA and protein sequences on the Apple Macintosh family ofcomputers. Nucl. Acids Res. 16,1829-1836. Mortimer, R. K., Shild, D., Contopoulou, C. R. and Kans, J. A. (1989). Genetic map of Saccharomyces cerevisiae. Yeast 5,321404. Newlon, C. S.,Green, R. P., Hardeman, K. J., Kim, K. E., Lipchitz, L. R., Palzkill, T. G., Synn, S. and Woody, S. T. (1986). Structure and organisation of yeast chromosome 111. UCLA Symposium on Mol. Cell. Biol., New Series. 33,211-223. Sanger, F., Nicklen, S. and Coulson, A. R. (1977). DNA sequencing with chain terminating inhibitors. Proc. Natl. Acad. Sci. USA 74,5463-5467. Thierry, A., Fairhead, C. and Dujon, B. (1990). The complete sequence of the 8.2 kb segment left of M A T on chromosome I11 reveals five ORFs including a gene for a yeast ribokinase. Yeast 6,521-534. Yoshikawa, A. and Isono, K. (1990). Chromosome 111of Saccharomyces cerevisiae: An ordered clone bank, a detailed restriction map and analysis of transcripts suggests the presence of 160 genes. Yeast 6,383402. Zaret, K. S. and Sherman, F. (1982). DNA sequence required for efficient termination in yeast. Cell 28, 563-573.