YEAST

VOL.

7: 41 3- 424 ( 1 99 I )

The Complete Sequence of the Unit YCR59, Situated Between CR YI and M A T, Reveals Two Long Open Reading Frames, which Cover 9 1% of the 10.1 kb Segment YANKAI JIA. PlOTR P. SLONIMSKI A N D CHRISTOPHER J. HERBERT C'enrre de Gc;ni.rique Moli.culuire. Lnhoruroire propre du CNRS nssocie d I'Universite Pierre et Marie Curie. F-91 I98 G(fisur-Yverte Cedes, France

Rcccivcd 19 November 1990; accepted 28 November 1990

We have cntirely sequenced YCR59. which is a 10.1 kb segment of the right arm ofchromosome 111, and is part of the clone ESF from the Newlon collection. The segment contains two long open reading frames (ORFs): YCR591 which starts in the adjacent fragment H9G (situated towards C R Y I and the centromere), and continues with 1833 codons in YCR59. The second O R F YCR592 is 1226 codons long and encoded entirely within YCR59. The two ORFs represent 91% of the total length of the segment. Excellent agreement in both location and length is found between the ORFs YCR591 and YCR592 and the transcripts 86 and 87 respectively in the Yoshikawa and lsono (1990) map of chromosome 111. Thc two ORFs correspond to new genes and show no significant similarity with any known genes. KEY WORDS - -

Yeast; Chromosome 111; sequcncc.

INTRODUCTION The complete nucleotide sequence of a 10.1 kb segment of chromosome 111 of Saccharomyces cerevisiae has been determined as part of the European project to sequence the whole of the chromosome. This segment ( Y C R 5 9 ) lies between CRYI and the MA T locus on the right arm of the genetic map of Mortimer et al. (1989). No genetic loci have been reported in this segment. MATERIALS A N D METHODS

proA+ B' , lacIp,IucZAM 15)) and DH5aF' (supE44, AlacU169 ( ( ~ 8 lacZAM15) 0 hsdRI7, recAI, endAI, gyrA96, thi-I. relAI. F). DNA manipulations

All routine plasmid preparations, transformations, subcloning and other D N A manipulations were carried out as described by Maniatis et al. ( I 982).

Sequencing strutegy The starting material for the sequencing was the Strains clone E5F (Newlon et ul., 1986) which is a 22 kb The Escherichia coli strains used were TGI fragment from the small ring of chromosome 111 (A(lac-pro). rhi-1, supE44, hsdD5, F' (traD36, containing the ectopic recombination between 0749-503X 91:W0413 I2 506.00 0 1991 by John Wiley & Sons Ltd

414

Y. JIA, P. P. SLONIMSKI A N D C. J. HERBERT

Figure 1. Restriction map and ORF map of YCR59; the positions of the two ORFs YCR591 and YCR592 are marked by a hatched bar. The start of YCR591 in H9G wasdetermined using data provided by Prof. F. Pohl.

MAT and HML, cloned in YIPS. The 2.1 kb BamHI-EcoRI fragment and the 2.7 and 5.2kb EcoRI fragments (see Figure I ) were subcloned into M I3mp 18 or M 13mp I9 (Pharmacia). Clones suitable for sequencing were generated by exonuclease 111 digestion of the replicative form DNA. Single strand DNAs were sequenced by the chain termination method of Sanger et al. ( I 977) using either the Klenow fragment of the E. cofi DNA polymerase (Boehringer) or T7 DNA polymerase (Pharmacia). Where necessary oligonucleotide primers were synthesized to fill in gaps in the data from the exonuclease 111 generated clones. The data from the individual clones were assembled using the StadenPlus@ software package (Amersham), and the junctions between the restriction fragments were determined by sequencing on the double stranded DNA of E5F using oligonucleotide primers. In this way the entire sequence of both strands of the 10.1 kb segment was determined. The sequence was analysed using the UWGCG programs (Devereux e f al., 1984) and DNA Strider (Marck, 1988).

RESULTS AND DISCUSSION The complete sequence of the 10.1 kb chromosome 111 unit YCR59 (located between CR YI and MA T) has been determined. The right-most EcoRI site of our sequence (see Figure I), is the left-most EcoRI site of the sequence of Thierry et al. ( 1 990); this has been verified by sequencing across the junction on the original clone. An analysis of the sequence shows two long open reading frames (ORFs) which are encoded on the same strand (Figure I). The first O R F (YCR59 1) begins in the preceding fragment H9G; with data from this fragment provided by Prof. F. Pohl we have been able to reconstruct the entire ORF. YCR591 is an exceptionally long O R F of 2167 codons, of which 1833 are encoded in YCR59. The MW of the protein deduced from the sequence is 250.8 kDa; when calculated according to Bennetzen and Hall ( 1 982) there is no codon bias indicating that the gene will be expressed at a low level. When YCR591 was compared to the data banks either as a nucleotide or deduced protein

Figure 2. The complete nucleotide sequence of YCR59; the translation of the section of YCR591 encoded in YCR59 and the translation of YCR592 are marked. The putative 'TATA' box for YCR592 (position 5874) is double underlined, and the putative terminator for YCR592 is underlined.

415

COMPLETE SEQUENCE OF UNIT YCRSY

YCR591 1

G S M I C S I K V Y R F Y L W D G L L T GGATCCATGATTTGTTCAATTAAAGTATATAGATTTTATTTGTGGGATGGATTATTAACA E

61

F

A

I

N

I

L

Q

A

I

G

T

N

Y

Q

Y

T

F

S

K

GAATTTGCGATAAATATACTTCAAGCTATCGGCACCAATTACCAATATACATTTAGCmG

K

K

E

G

P

E

V

L

S

L

C

Q

D

F

L

I

A

K

A

60

120

H

121

AAAAAAGAAGGGCCTGAAGTTTTATCGCTCTGCCAAGACTTCAT

180

181

L M A R P A T E I S S T K Y I D E I E L TTAATGGCCAGGCCTGCAACAGAAATATCTTCTTCCAC~TACATCGATGAGATTGAACTT

240

241

L E M E N I I I D V N P N D I L Q D F T CTTGAAATGGAkAATATCATTATTGATGTTAACCCAAATGATATTCTTCAAGATTTCACC

300

301

GAATCGTCTAATTTTACGGTAAAATTTGAGGAAAGCACAAACTCGAATTCCGGAA

360

361

V G K C Y F Y R S S N L V S K F V S I D GTGGGTAAGTGCTATTTCTATAGGAGTTCAAACTTGGTTTCAAAATTTGTGTCCATTGAT

420

42 1

S I R L A F L N M T E S G S I D D L F H TCTATACGGCTTGCGTTTTTAAACATGACAGAATCCGGTAGTATAGACGATCTGTTTCAT

480

481

~ V S H L M N L L R N I D I L N W F K K CATGTATCACATCTGATGAATCTTTTACGAAATATTGATATTCTTAATTGGTTTAAAAAA

540

541

D F G F P L F A Y T L K Q K I T Q D L S G A C T T T G G C T T C C C T T T A T T T G C T T A T A C T T T ~ C ~ T A A C A C ~ G A T T T A T C T600

601

Q P L N I Q F F N L F L E F C G W D F N CAGCCTCTGAATATCCAATTTTTCAATTTATTCTTAGAATTTTGCGGGTGGGATTTCAAC

660

661

D I S K S I I L D T D A Y E N I V L N L GATATTTCCAAATCCATAATTCTAGATACTGATGCCTACGAAAACATAGTCCTTAACTTG

720

721

D L W Y M N E D Q S S L A S G G L E I I GATTTATGGTATATGAATGAGGATCAAAGTTCTCTGGCGTCAGGCGGATTAGAAATTATC

780

781

R F L F F Q I S S L M E A S I Y S K F N AGATTTCTTTTCTTCCAAATTTCAAGTTTGATGGAAGCCTCTATTTATTCTAAGTTCAAT

840

841

S N K F N D M N I L E K L C L S Y Q A V TCCAATAAATTCAATGATATGAATATCCTAGAAAAACTATGTTTAAGCTATCAGGCTGTC

900

901

T K R E N Q N S K F N E L S N D L I S V ACAAAAAGAGAAAATCAGAACAGTAAATTTAATGAGCTATCAAATGATTTAATTTCTGTA

960

961

F V T L L K S N T D K R H L Q W F L H L T T T G T T A C T T T A T T G A A A T A C T G A T A A A C G A C A C C T ~ A G T ~ T T T T T A C A T ~ T C1020

E

S

S

Y

S

Y

N

F

F

I

T

K

V

R

K

K

F

D

E

V

E

R

S

S

T

T

N

E

S

I

K

I

N

L

I

Q

P

A

E

V

1021

TCATATTACTTTATTAAGAGAAAAGATGTACGTTCTACAGAAATTATA

1080

1081

D Q L F S F Y L D Q G S D E N A K I L S GATCAACTTTTTTCGTTTTACTTAGATCAA~TAGCGACGAAAATGCGAAGATACTTTCA

1140

416

Y. JIA, P. P. SLONIMSKI A N D C. J. HERBERT

E

I

I

P

L

K

L

M

L

M

I

M

D

Q

I

V

E

N

N

E

1141

GAGATTATACCACTTAAGCTAATGCTGATGATTATGGATCAAATAGTGGATAATGAA

1200

1201

S N P I T C L N I L F K V V L T N K P L TCAAACCCTATTACGTGCTTGAATATCTTATCTTATTTAAGGTAGTTCTGACCAATAAACCGCTT

1260

1261

F K Q F Y K N D G L K L I L T M L C K V TTCAAACAATTTTACAAAAATGATGGTTTGAAACTCATATTGACTATGCTTTGTAAGGTA

1320

1321

G K S Y R E E I I S L L L T Y S I G N Y GGGAAAAGCTATCGAGAGGAGATTATTTCTTTGCTTCTCACATATTCTATTGGCAATTAT

1380

1381

T T A N E I F S G A E D M I G G I S N D ACCACAGCTAACGAAATATTTTCAGGTGCTGAAGACATGATTGGAGGAATTTC~CGAC

1440

1441

K I T A K E I I Y L A V N F I E W H V I AAGATAACTGCAAAAGAAATTATTTATTTGGCTGTCAACTTCATTGAGTGGCATGTGATT

1500

1501

N S N A S D S S S V L D L N N H I L R F AATTCTAATGCCAGTGATTCTTCTTCTGTATTGGACCTGAACAACCATATATTAAGATTC

1560

1561

V E D L K S L S A V P I N E S V F D P K GTCGAATATCTGATCTGAAATCGCTGAGCGCTGTTCCGATTAATGAATCTGTATTTGATCCTAAA

1620

1621

K S Y V M V S L L D L S I A L N E S E D AAAAGTTATGTGATGGTTTCATTATTAGATCTCTCGATAGCTTTGAATGAATCGGAGGAC

1680

1681

I S K F K S S S K V I S E L I K G N I M ATCTCAAAGTTCAAGAGCTCTTCAAAAGTGATTTCAGAGCTCATTAAAGGTAATATAATG

1740

1741

C A L T K Y A A Y D F E V Y M S T F F C TGTGCTCTTACGAAATATGCCGCTTATGATTTCGAAGTCTATATGAGCACATTTTTTTGT

1800

1801

CACAGTACAGAATACAAACTGGTTTATCCAAAAACTGTAATTA

1860

1861

E L S F I V T L L P E I L N D L I D S N GAGCTATCATTTATAGTGACACTCCTACCCGAAATACTTAATGACCTGATAGATAGCAAT

1920

1921

N N L N L M M L K H P Y T M S N L L Y F AACAATTTGAACCTGATGATGTTGAAGCATCCATACACGATGTC~TCTCCTTTATTTT

1980

1981

L R K F R P D T S Q I V M P K D F Y F S CTTCGCAAATTTCGACCTGATACGTCACAGATAGTTATGCCTAAAGATTTTTATTTCTCA

2040

2041

S Y T C L L H C V I Q I D K S S F Y H F AGTTATACATGTCTCTTGCATTGTGTTATTCAGATTGATAAATCATCATTTTACCATTTC

2100

2101

K N V S K S Q L L Q E F K I C I M N L I AAAAACGTTTCTAAGTCGCAACTGTTACAGGAATTCAAAATCTGCATAATGAACTTAATA

2160

2161

Y S N T L K Q I I W E K E E Y E M F S E TATTCCAATACTCTAAAGCAGATAATATCTTCTGGGAGAAAGAAGAATACGAGATGTTTTCTGAG

2220

2221

S L M A H Q E V L F A H G A C D N E T V TCACTGATGGCGCATCAGGAATATCTGTTTTATTTGCACATGGAGCATGTGATAATGAGACCGTT

2280

H

S

T

E

Y

K

L

V

Y

P

K

T

V

M

N

N

S

S

Y

L

417

COMPLETE SEQUENCE OF UNIT YCR59

2281

G L L L I F F A N R L R D C G Y N K A V GGCTTATTGTTAATATTTTTTGCCAACAGATTACGTGATTGT~ATACAACAAAGCAGTC

2340

2341

F N C M K V I I K N K E R K L K E V A C TTCAATTGTATGAAAGTGATCATTAAGAACAAGGAAAGGAAACTAAAGGAGGTGGCGTGT

2400

2401

F F D A A N K S E V L E G L S N I L S C TTTTTTGACGCAGCGAATATGAAGTACTCGAA~TTTAAGTAATATCCTCTCATGC

2460

2461

N N S E T M N L I T E Q Y P F F F N N T AATAACTCTGAAACAATGAACCTCATAACTGAACAATACCCATTTTTTTTCAACAATACA

2520

Q

Q

V

R

F

I

N

I

V

T

N

I

L

F

K

N

N

N

F

S

2521

CAACAGGTACGGTTCATAATTGTCACCAATATCTTGTTTAAGAACAACAATTTTTCT

2580

2581

P I S V R Q I K N Q V Y E W K N A R S E CCAATAAGCGTTAGACAGATCAAAAACCAAGTTTTACGAATGGAAAAATGCAAGATCAGAA

2640

2641

Y V T Q N N K K C L I L F R K D N T S L TACGTCACCCAAAACAATAAAAAGTGCCTTATTTTATTTAGAAAAGACAACACATCCTTA

2700

2701

D F K I K K S I S R Y T Y N L K T D R E GATTTTAAAATCAAAAAGTCCATATCAAGATACACTTACAACCTCAAAACGGATAGAGAA

2760

2761

E N A V F Y R N N L N L L I F H L K H T GAAAATGCAGTTTTCTATCGAAATAATTTT~TCTTTTGATTTTTCATCTG~CATACA 2820

2821

L E I Q S N P N S S C K W S L D F A E D CTGGAGATACAATCAAATCCAAATTCGTCGTCCTGCAAGTGGTCATTGGACTTT~CAG~GAT2880

2881

F D G M K R R L L P A W E P K Y E P L I TTTGATGGGATGAAACGGAGGCTTTTGCCTGCTTGGGAACCAAAATATGAACCACTCATT

2941

N E E D A N Q D T I T G G N R Q R R E S AACGAGGAAGATGCTAATCAAGATACTATAACAGGTGGTGGTAACAGAC~GGAGAG~GT 3000

3001

G S I L S Y E F I E H M E T L E S E P V GGAAGCATTTTATCCTACGAATTTATCG~CATAT~AGACTCTTGAGTCGGAGCCAGTT 3060

3061

G D L N E N R K I L R L L K D N D S I A GGAGATTTGAATGAGAATAGAAAAATTCTTCTTAGACTTTTGAAGGAT~CGATTCTATTGCA 3120

3121

T I W N C S L I I G L E I K E G I L I H ACTATTTGGAATTGCAGTTTGATTATTGGATTAG~TTAAGGAGGGGATTTTAATTCAT

3181

G S N Y L Y F V S D Y Y F S L E D K K I GGCAGTAATTACCTTTACTTTGTAAGTGATTACTATTACTATTTTAGTTTAGAGGAT~GATT 3240

3241

L K L S E V S Q E S R D M T V S L I N G CTAAAATTATCAGAAGTATCGCAAGAATCACGGGGATATGACGGTTAGCTTAATTAACG~

3300

3301

P D V K R V S T F L K H E V F V W K L L CCTGATGTTAAAAGGGTATCAACTTTCCTAAAGCACGAAGTCTTTGTTTGG~CTTCTC

3360

3361

D I T F V T K R P F L L R D V A I E L L GATATCACTTTCGTTACCAGACCCTTTCTACTTCGGGTTATTG

3420

2940

3180

418

Y. JIA, P. P. SLONIMSKI AND C. J. HERBERT

3421

F K E R V S A F F S F Y N K R V R D D V TTCAAAGAGAGAGTTAGCGCTTTTTTTAGTTTTTACAACAAAAGAGTGAGAGATGACGTT

3480

3481

L R V L N K I P K H L P A D P I F S S V TTACGGGTACTGAATAAGATCCCGAAGCACCTTC~GCAGATCCAATTTTTTCAAGCGTT

3540

3541

L Q E I N D R G N S I V A R N G I G K A TTACAAGAAATAAACGACCGAGGAAATAGTATAGTGGCAAGAAATGGAATAGGAAAGGCA

3600

3601

S I A S K F T S V F S A N N S L I D G F AGCATTGCTTCCAAATTCACTAGCGTCTTCTCAGCGAA~CAGCCTAATAGATGGATTT

3660

3661

E I S K K W V R G E I S N F Y Y L L S I GAGATCAGCAAAAAATGGGTTAGGGGAGAGAGATTTCTAATTTTTATTACCTGTTGAGTATC

3720

3721

N I L A G R S F N D L T Q Y P V F P W V AACATCCTAGCGGGAAGGTCATTCAACGATTTGACCCAATATCCAGTGTTTCCGTGGGTT

3780

3781

I A D Y E S N V L D L E N P K T Y R D L ATTGCAGATTACGAAAGTAACGTACTCGATTTAGAGAATCCTAAAACTTACCGGGACCTA

3840

3841

S K P M G A Q S E K R K L Q F I E R Y E TCGAAACCTATGGGCGCTCAAAGTGAGAAAAGGAAATTACAGTTTATAGAGCGTTATGAA

3900

3901

A L A S L E N A D S A P F H Y G T H Y S GCTTTGGCTTCCCTGGAAAATGCTGATTCCGCACCATTTCATTATGGCACGCATTATTCC

3960

3961

S A M I V S S Y L I R L K P F V E S F L TCAGCTATGATAGTATCTTCATATCTGATRAGGCTGAAGCCCTTTGTCG~TCCTTTTTG

4020

4021

L L Q G G S F G P A D R L F S S L E R A TTATTGCAAGGCGGAAGTTTTGGCCCTGCAGATCGTTTATTTAGTTCGCTTGAAA~CC 4080

4081

W S S A S S E N T T D V R E L T P E F F TGGAGCTCTGCTTCTTCTGAAAATACAACGGATGTCAG~AATTGACACCTGAATTTTTT

4140

4141

F L P E F L I N V N S Y D F G T D Q S G TTTCTACCTGAATTTTTGATCAACGTTAATAGTTATGACTTTGGTACAGAC~GCGGT

4200

4201

K K V D D V V L P P W A N G D P K V F I AAAAAAGTTGACGACGTCGTACTTCCACCCTGGGCAAATGGTGACCC~GGTTTTCATT 4260

4261

Q K N R E A L E S P Y V S A H L H E W I CAAAAGAATAGAGAAGCTTTAGAAAGTCCTTATGTATCA~ACATTTACATGAATGGATT

4320

4321

D L I F G Y K Q K G E I A V K S V N V F GATTTGATATTTGGTTACAAACAAAAGGGGGAAATTGCTGTGAAATCTGTTAACGTATTC

4380

4381

N R L S Y P G A V N L D N I D D E N E R AACAGATTGAGTTACCCAGGGCTGTAAATCTAGAT~TATTGACGATGAAAATGAGCGC

4440

4441

R A I T G I I H N F G Q T P L Q I F Q E AGAGCTATCACAGGCATTATTCACAACTTTGGTC~CGCCTTTAC~TATTTCAGGAA 4500

4501

CCTCATCCGGAAAAAATAGCCTGCAATGTTCAACAGCTAA

P

H

P

E

K

I

A

C

N

V

Q

Q

L

T

T

E

V

W

R

R 4560

419

COMPLETE SEQUENCE OF UNIT YCR59 V

4561

P

M

K

P

I

F

E

K

T

I

F

N

L

N

E

K

N

R

S

GTTCCAATGAAGCCAATATTTGAGAAGACAATCTTTAATTTGAATG~GAACAGGTCT V

D

Y

V

I

H

D

P

S

Y

F

D

S

L

Y

W

R

G

F

4620

A

4680

4621

GTCGATTATGTTATACACGATCCTAGTTACTTCGATTCATTATACTGGAGG~TTCGCT

4681

F P N L F F R T E E S L V S L R I V H K TTCCCAAACTTGTTTTTCAGAAC~~GAATCGTTAGTGTCATTGAGAATTGTGCAT~4740

4741

N W L K I G L D I F K K T H M A Q I T S AATTGGTTAAAAATTGGACTAGATATTTTTAACGCATATGGCTCAGATTACATCG

4800

4801

F A Y W K L G E F I T G D K N G L I K V TTTGCGTACTGGAAGTTGGGCGAATTCATAACTGGTGATATT

4860

W

4861

K

Y

R

K

D

K

H

S

V

S

G

N

L

E

N

K

K

T

M

TGGAAATATCGTAAAGATAAGCATTCGGTTTCAGGTAACCTTGAGAACWCAATG F

G

H

L

C

E

L

K

E

M

R

C

Y

H

D

Y

N

T

L

4920

L

4921

TTTGGGCACCTATGCGAGCTAAAGG~TGCGCTGTTATCACGACTA~TACGCTTTTA 4980

4981

T L D I S G L V Y V W D M I N F E L V R ACCTTAGACATCAGCGGCTTAGTATATGTCTGGGACATGACATGATT~TTTCGAACTAGTGAGA 5040

5041

Q I T N D A Q K V A I S Q H A G S I M V CAAATAACAAATGATGCGCAAAAGGTCGCAATATCAACATTATGGTA

5101

L T K N N A I S I F N L N G Q I Y T S K T T G A C T A A G A A T A a C G C C A T T T C G A T C T T C A A T C T ~ T ~ A C ~ T A T A T A C A T C A A A G 5160

5161

AAATTCGAACCAGCTAAAATTGTAAGCTC~TTGATTTTTTTGACTTCACTAAGTTAGAC

K

A

F

G

E

Y

P

R

A

K

K

H

I

I

V

Y

S

W

S

I

K

E

D

M

F

F

E

D

I

L

F

L

T

V

K

G

L

F

5100

D

5220

E

5221

GCAGGTTACAGAAAGCATATCTATTGGAAAGAGAGATGG~TACTACTAGTGGGCTTTGAA

5281

D G T I E I Y E L F L T F H N E W A I K G A T G G A A C T A T A G ~ T T T A C G A G C T C T T T T T G A C T T T T C A T ~ T G ~ T ~ G C G A T A A A G5340

5341

CTACTGAAACAGCTCTGTACCGAAAGAGGGAAAGCCATAACTAGCATTAAGGGACAGGGG

5401

AAGACATACCTGTCCCAGAAAAGACGCAAGGATACA~AGA~CTCATGAGATAGAAGTG

5461

ATTGCGGGAACATTAGATGGCAGATTAGCTATTTGGTACTAGGCATGACATCGTAACGCC

5520

5521

TTTCTTTAAATGATTCAATTTTTGTAGTTTATATCTTTACTTTTGAAACTGATTTCTCAT

5580

5581

CCCACCTAGTATTGTAATTGCGTACGTATCCAATATCATTACCAACGCCGGGTATTTTTT

5640

5641

TCTAGTATTTCTTCTCCATTTCGCCTATGGAAAACAGCAAAAGGGTAAAAGWCA

5700

5701

AACGATTAATTCTTCATTGAATTATGTAAAAATCAGCAACCGCAGATTTAATAGAG

5760

5761

ACCAGAAATTCGGATTACTATTGACTTTGTGCACCACCTTC~TTTACTCATTGTTTAA

5820

L

K

I

L

T

A

K

Y

G

Q

L

T

L

S

L

C

Q

D

T

K

G

E

R

R

R

R

L

G

K

A

K

D

I

A

T

W

I

A

T

E

S

P

I

H

K

E

G

I

Q

E

5280

G

5400

V

5460

Y

420

Y. JIA, P. P. SLONIMSKI AND C. J. HERBERT

5880

5821

GACAGGCAGTGGGAAAGAAGCCGTCATATTGCTCGAATCCTTMCAAGC-CAA

5881

CCACTAAATTATTCCGAAAGGGCCTGCTTAATAATTTGCCTACTAACTTGTGCATAGAAC 5940

5941

M G Y P P P T R R L G AGCAAACAGAAACAAAGCGTAAGAAACATGGGGTATCCGCCACCTACACGAAGGCTTGGA

6000

6001

D K K R Y H Y S N N P N R R H P S A V Y GATAAGAAAAGGTACCATTATTCCAATAATCCTAACCGAAGGCATCCTTCCGCTGTTTAT

6060

6061

TCCAAGAATAGCTTTCCACAAGCAATAATGGATTTGTATCTTCTCCTACTGCCGAT

6120

6121

N S T N P S V T P S T A S V P L P T A A AATTCAACAAATCCGTCTGTAACTCCCAGTACTGCATCTGTACCTCTTCCTACAGCGGCA

6180

6181

P G S T F G I E A P R P S R Y D P S S V CCTGGAAGCACGTTTGGTATCGAAGCACCCAGGCCATCTCGATATGATCCGA~TCAGTC

6240

6241

S R P S S S S Y S S T R K I G S R Y N P AGTAGGCCTTCGTCATCATCTTATTCGTCAACAAG~TTGGAAGCCGTTATAACCCA

6300

6301

D V E R S S S T T S S T P E S M N T S T GATGTGGAAAGATCCTCTTCAACCACTAGTTCAACTCCGG~GTATGAATACGAGCACC

6360

6361

ATAACACACACCAATACGGATATCGGAAACTCACGCTCACGCTATTCTCG~CCATGAGCAGA

6421

Y N P Q S T S S T N V T H F P S A L S N TATAATCCTCAATCTACTAGTTCTACAAACTCACGCGTTACCCACTTTCCCTCGGCATTATCAAAC 6480

6481

A P P F Y V A N G S S R R P R S M D D Y GCTCCACCGTTTTATGTTGCCAACGGGAGTTCTCGGAGACCTCGATCAAT~ATGATTAT

6541

AGTCCTGATGTAACGAACAAGCTCGAAACAAATAATGTTTCATCTGTTAATAATAACAGC

6600

6601

P H S Y Y S R S N K W R S I G T P S R P CCTCATTCTTATTACTCTAGGAGCAACAAATGGAGATCCATTGGAACGCCTTCCAGACCA

6660

6661

P F D N H V G N M T T T S N T N S I H Q CCATTTGATAATCATGTCGGCAATATGACGACCACCAGCAATACTAACTCGATCCATCAA

6720

6721

R E P F W K A N S T T I L K S T H S Q S AGGGAACCTTTTTGGAkAGCAAATAGTACTACTATTTT~TCAACTCATTCACAGTCA

6780

6781

TCGCCTTCCCTTCATACTATTCACGATGCGAATAAACTCACGTTGGACAAACCAGAGGCT

6840

6841

S V K V E T P S K D E T K T I S Y H D N TCAGTTAAAGTTGAAACACCCAGTAAAGATGATGAGAC~CCATATCGTACCATGATAAC

6900

6901

AATTTTCCACCAAGAARATCAGTTTCTAAACCT~T~ACCTTTAG~CCCGAT~TATC 6960

6961

K V G E E D A L G K R E V H K S G R E I AAGGTTGGCGAAGAAGATGCAT TGGGGUGAAGTACATWGTGGGCGTGAGATA

YCR592

S

I

S

S

N

K

T

P

P

F

N

H

D

S

P

S

T

V

L

P

F

N

T

H

R

P

T

N

T

K

K

D

K

K

S

S

I

L

K

V

S

G

E

F

S

N

N

T

H

K

N

S

N

D

P

G

R

N

A

N

F

Y

V

N

A

V

S

S

K

P

S

R

S

L

L

S

K

V

D

E

P

T

N

K

P

T

M

N

P

D

A

S

N

E

N

D

R

6420

6540

S

A

I

7020

42 1

COMPLETE SEQUENCE OF UNIT YCR59

7021

A K E H P T P V K M K E H D E L E A R A G C A A A G G A A C A T C C T A C T C C T G T ~ T G M ~ ~ A T G A T G A A C T A G ~ ~ T C 7080 ~GCT

7081

K K V N K I N I D G K Q D E I W T T A K AAAAAAGTAAATAAAATCAATATTGATGGMKAGGACGAAATTTSACaUKAAAA

7141

ACAGTGGCCAGTGCAGTCGAAGTTTCCAAAGMGTCATAAGGAACTAACAC~TCTGTT

7201

GAAAGGAAGGAAAGTCCAGATTAGAGATTATG~GAGCCTG

T

E

K

7261

V

R

T

A

K

D

S

E

A

A

S

T

V

P

K

E

E

L

V

I

T

S

R

V

K

D

D

E

Y

D

S

E

D

H

R

N

K

A

K

E

Y

S

L

D

Y

T

P

E

R

K

E

S

A

P

V

K

V

E

G

C

I

F

P

L

P

K

A

E

T

R

L

W

E

7200

L

7260

L

AAAACAGACGCAACAAAGTTGACAGTAGACGATGATAATAAAAGTTACGAAGAACCTCTT

E

7140

7320

L

7321

GAAAAAGTGGAAGGGTGTATTTTCCCATTACCAAAAGCAGAAACGAGATTATWAATTG

7380

7381

K N Q K R N K I I S K Q K Y L L K K A I AAAAACCAGAAAAGAAACATAATAAGTAAACAAAAGTACTTACTGAAAAA~AATT

7440

R

7441

N

F

S

E

Y

P

F

Y

A

Q

N

K

L

I

H

Q

Q

A

T

AGGAATTTCTCAGAGTATCCTTTTTACGCACAGAACAAACTTATACATCAGGCTACC

G

L

I

L

T

K

I

I

S

K

I

K

K

E

E

H

L

K

K

7500

I

7560

7501

GGACTTATCTTGACGAAAATTATATCAAAGATAAMAAGGAGGAACATTTGWTA

7561

N L K H D Y F D L Q K K Y E K E C E I L A A T T T A A A A C A T G A T T A T T T C G A T C T C C A G A A G A A G T A T G ~ G A A T G C G ~ ~ T T T G7620

7621

T K L S E N L R K E E I E N K R K E H E ACTAAACTGAGTGAAAATTTAAGGAAGGAAGAAATCGAAAAT~CGTAAAGA~ACGAA 7680 L

M

E

Q

K

R

R

E

E

G

I

E

T

E

K

E

K

S

L

R

7740

7681

TTAATGGAGCAGAAAAGACGTGAAGAAGGTATCGAAACAGAGG

7741

H P S S S S S S R R R N R A D F V D D A C A T C C A T C C T C G T C T T C C T C A T C T C G T C ~ A G G G G 7800

7801

E M E N V L L Q I D P N Y K H Y Q A A A GAAATGGAAAATGTATTGCTACAAATCGACCCAAATTATAAACATTATCAGGCT~T~A 7860

7861

T I P P L I L D P I R K Y S Y K F C D V A C A A T T C C T C C G C T A A T T T T AGATCCAA~CC~~TACTCTTAC~TTCTGTGATGTA7920

7921

N N L V T D K K L W A S R I L K D A S D A A T A A C T T G G T T A C A G A C A T T T ~ G T C T A G A A T A T T G M G A C ~ C T C T G A C 7980

7981

N F T D H E H S L F L E G Y L I H P K K AACTTTACTGACCATGAGCACTCT~TATTTTTGGAGGGTTATTTAATTCATCCT~ 8040

8041

F G K I S H Y M G G L R S P E E C V L H TTCGGTAAAATTTCTCACTACATG~GGCTTAAGAAGTCCTGAAGAGTGTGTCCTACAT

8100

8101

Y Y R T K K T V N Y K Q L L I D K N K K TATTATAGAACAAAGAAAACTGT~TTATMCAACTTCTTATCGAT~GAACAAGAAA

8160

422

Y. JIA, P. P. SLONIMSKI A N D C. J. HERBERT

8161

R K M S A A A K R R K R K E R S N D E E AGAAAAATGTCAGCCGCTGCGAAGCGCCGCAAGAGGAAGGAAATTACTGGGAAGTAATGACGAGGAA

8220

8221

V E V D E S K E E S T N T I D K E E K S GTCGAAGTTGATGAGAGTAAAGAAGAGTCAACGAACACGATAGATAAGGAAGWGT

8280

8281

E N N A E E N V Q P V L V Q G S E V K G GAGAACAATGCCGAGGAAAATGTTCAGCCGGTTCTAGTTCAAGGTTCTGAAGTGAAAGGT

8340

8341

D P L G T P E K V E N M I E K R G E E F GATCCATTAGGTACACCGGAAAAAGTTGAAAATATGATTGAAAAGAGAGGCGAAGAGTTT

8400

8401

A G E L E N A E R V N D L K R A H D E I GCAGGTGAATTGGAAAATGCTGAGAGGGTAAATGACTTAAAAAGGGCGCATGATGAAATT

8460

8461

G E E S N K S S V I E T N N E V Q I M A GGAGAAGAGAGCAATAAGTCCAGTGTAATAGAAACCAACAATGAGGTACAAATAATGGCT

8520

8521

P K G G V R N G Y Y P E E T K E L D F S CCAAAAGGAGGTGTTCGGAATGGTTATTATCCAGAGGAGACC~GAACTTGACTTCAGT

8580

8581

L E N A L Q R K K H K S A P E H K T S Y TTAGAGAATGCGTTACAGAGAAATTACTGGGAAATTACTGGCACAAATCTGCACCAGAGCAT~CAAGTTAT 8640 W

S

V

R

E

S

Q

L

F

P

E

L

L

K

E

F

G

S

Q

W

8641

TGGAGTGTTCGTGAATCTCATCTTTCCAGAAT 8700

8701

S L I S E K L G T K S T T M V R N Y Y Q TCTCTCATATCAGAAAAACTGGGTACCAAATCTACTACAATGGTAAGGAATTACTACCAA

8761

R N A A R N G W K L L V D E T D L K R D AGAAATGCAGCTCGCAATGGATGGAAATTACTGGTTACTGGTTGATG~CCGACTTAAAGCGAGAT 8820

8821

G T S S E S V Q Q S Q I L I Q P E R P N GGGACTAGTTCAGAATCTGTACAACAATCTCAAATTTTGATACAACCAGAACGACCAAAC

8880

8881

I N A Y S N I P P Q Q R P A L G Y F V G ATCAATGCCTATAGTAATATTCCTCCTCAACAAATTACTGGGACCGGCTTTGGGTTATTTTGTTGGA

8940

8941

Q P T H G H N T S I S S I D G S I R P F CAACCAACTCATGGGCATAATACATCTATTTCATCTATCGATGGCTCTATAAGACCATTT

9000

9001

G P D F H R D T F S K I S A P L T T L P GGGCCTGATTTTCATCGTGATACCTTTTCTAAAATTAGTGCTCCTTTAACCACTTTACCA

9060

9061

P P R L P S I Q F P R S E M A E P T V T CCACCAAGACTACCATCTATTCAGTTTCCTCGTTCAGAAATGGCAGAACCTACAGTGACA

9120

9121

D L R N R P L D H I D T L A D A A S S V GATTTGCGTAACAGGCCCTTAGACCATATTGACACGTTGGCTGATGCAGCTTCGTCAGTA

9180

9181

T N N Q N F S N E R N A I D I G R K S T ACAAATAATCAAAACTTCAGTAATGAAAGGAATGCAATTGACATTGGCCGTAAATCGACG

9240

9241

T I S N L L N N S D R S M K S S F Q S A ACAATCAGCAATCTATTGAATAATTCGGATCGAAGCATGAAATCTTCTTTCCAAAGCGCT

9300

8760

423

COMPLETE SEQUENCE OF UNIT YCRS9

S

R

H

E

A

Q

L

E

D

T

P

S

M

N

N

I

V

V

Q

E

9301

T C A A G A C A C G A A G C A C A G C T C G A A G A C A C T C C C A T G A

9360

9361

I K P N I T T P R S S S I S A L L N P V ATAAAACCGAATATTACTACGCCAAGATCGAGTTCTATTTCTGCATTACTAAATCCTGTA

9420

9421

N G N G Q S N P D G R P L L P F Q H A I AATGGGAATGGGCAATCAAACCCAGATGGAAGGCCGTTGCTGCCATTTCAGCATWTATT

9480

9481

S Q G T P T F P L P A P R T S P I S R A TCTCAAGGCACTCCTACTTTCCCTTTACCGGCCCCCTC~ACTAGTCCAATAAGTCGTGCG

9540

9541

P P K F N F S N D P L A A L A A V A S A CCTCCAAAGTTCAATTTTTCGAATGATCCGTTGGCA~TTTGGCTGCGGTTGCCTCCGCG

9600

9601

P D A M S S F L S K K E N N N CCAGATGCAATGAGCAGTTTTTTATCTAAAAAGGAAAATAATAATTGAACAAACGGCTGA

9660

9661

GACGGGCAATACATATGCTCTACTTCTTTTCCATCCAATGGTTGGTGAAACTCTCGAGCA

9720

9721

TACATTACCTTACGTGTGT

9780

9781

GGGAGGAGTTTTTAATTATATTGTAATTTCGTATTTTTTCTGCATTATACAGTTTTTTC

9840

9841

CGATTTTAAACGACTTTATTTAAGTGTCGTGTAAATATGTCACATTTTATTTTTGTACGT

9900

9901

ATTCACATGTCCTGGCGTGCGGCCATTGCTGAAAATCGCAAAACCCACAGAGAAATAAAC

9960

9961

ATCGCGAAAAAGTCAATGAAAAATTGGAAAATATTTTTCATTTCACTATTATCCACAAGC

10020

10021

AATTTTGTACAAAGTGAAAAGGTTGAACTAATTATCTTCGTCTAGAAGCCATGAATTC

sequence (EMBL release 24 and MIPS release 17 respectively) no significant similarity was found with any known sequence. YCR591 terminates with a TAG stop codon at position 5500 (see Figure 2), and the second ORF present in YCR59, YCR592, starts at position 5968. YCR592 is also a long ORF of 1226 codons and is encoded entirely within YCR59. The MW of the deduced protein sequence is 138.4 kDa; again the codon usage shows no bias indicating that the gene is expressed at a low level. Comparisons with the EMBL and MIPSdata banks showed no significant similarity with known sequences. As both of the deduced protein sequences are very long we repeated the comparison with the MIPS data bank dividing the sequences into fragments of 500 amino acids which overlapped by 100 amino acids. This should allow us to detect any motifs in the proteins that are similar to those in the data bank but are not seen when the whole sequence is compared as the rest of the sequence (non-similar) reduces the resolution. However this did not affect the result and no significant similarity was detected. When the region 3' to YCR591 was examined no

10078

sequence that could easily be fitted to a consensus terminator (Zaret and Sherman, 1982 and Henikoff and Cohen, 1984) or a polyadenylation signal could be found. However a feature of all terminator sequences is that they are T or A/T rich and several such regions can be found in the 175 bp 3' to YCR591 (see Figure 2). An examination of the region 5' to YCR592 showed a TATA box (Hahn et al., 1985) at position 5874,94 bp upstream of the ATG; none of the sequences identified by Dobson el al. (1982) as elements of efficient promoters are present, which is consistent with the absence of a codon bias. 3' to YCR592 there is a reasonable match to the Zaret and Sherman (1982) consensus terminator sequence { TAG..TAGTiTATGT..(A--T rich)..TTT} (Figure 2), but no obvious polyadenylation sequence. After this structural analysis there remains the question of whether these ORFs are expressed. In their transcript analysis of chromosome 111, Yoshikawa and Isono (1990) have detected two transcripts of 6.5 and 4.2 kb derived from the region corresponding to YCR59. The size of these transcripts is in very good agreement with

424

the size of the two ORFs (6501 and 3678 bp respectively), suggesting that the ORFs are part of genes that are expressed. Also Yoshikawa and Isono (1990) found that the transcripts were of low abundance which is consistent with the absence of a codon bias. In conclusion, as part of the EEC project we have sequenced a 10.1 kb segment of chromosome 111 which contained no previously mapped genetic markers. The sequence has revealed the presence of two large ORFs which show no significant similarity to known proteins. We believe that this shows the value of systematic sequencing projects in identifying new genes that have been ‘missed’ by classical genetics and biochemistry. At present we are engaged in gene deletion experiments to help us determine the function of these new genes. ACKNOWLEDGEMENTS We are grateful to Carol S. Newlon and Steve Oliver for the distribution of the clones, John G. Sgouros and Hans W. Mewes for the computer coordination, Fritz Pohl for the sequence of part of H9G and Denise Menay for synthesizing the oligonucleotides. This work was supported by the Commission of the European Communities under the BAP program of the Division of Biotechnology. REFERENCES Bennetzen, J. F. and Hall, B. D. (1982). Codon selection in yeast. J. Biol. Chem. 257,3026-303 I . Devereux, J., Haeberli, P. and Smithies, 0. (1984). A comprehensive set of sequence analysis programs for the VAX. Nucl. Acids Res. 12,387-395. Dobson, M. J., Tuite, M. F., Roberts, N. A., Kingsman, A. J., Kingsman, S. M., Perkins, R. E., Conroy, S. C.,

Y.JIA, P. P. SLONIMSKI AND C. J. HERBERT

Dunbar, B. and Fothergill, L. A. (1982). Conservation of high efficiencypromoter sequences in Saccharomyces cerevisiae. Nucl. Acids Res. 10,2625-2637. Hahn, S., Hoar, E. T. and Guarente, L. (1985). Each of the three ‘TATA elements’ specifies a subset of the transcription initiation sites at the CYCl promoter of Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. USA 82,8562-8566. Henikoff, S. and Cohen, E. H. (1984). Sequences responsible for transcription termination on a gene segment in Saccharomyces cerevisiae. Mol. Cell. Biol. 4, 15 151520. Maniatis, T., Fritsch, E. F. and Sambrook, J. (1982). Molecular Cloning. A Laboratory Manual. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York. Marck, C. (1988). ‘DNA Strider’: a ‘C’ program for the fast analysis of DNA and protein sequences on the Apple Macintosh family ofcomputers. Nucl. Acids Res. 16,1829-1836. Mortimer, R. K., Shild, D., Contopoulou, C. R. and Kans, J. A. (1989). Genetic map of Saccharomyces cerevisiae. Yeast 5,321404. Newlon, C. S.,Green, R. P., Hardeman, K. J., Kim, K. E., Lipchitz, L. R., Palzkill, T. G., Synn, S. and Woody, S. T. (1986). Structure and organisation of yeast chromosome 111. UCLA Symposium on Mol. Cell. Biol., New Series. 33,211-223. Sanger, F., Nicklen, S. and Coulson, A. R. (1977). DNA sequencing with chain terminating inhibitors. Proc. Natl. Acad. Sci. USA 74,5463-5467. Thierry, A., Fairhead, C. and Dujon, B. (1990). The complete sequence of the 8.2 kb segment left of M A T on chromosome I11 reveals five ORFs including a gene for a yeast ribokinase. Yeast 6,521-534. Yoshikawa, A. and Isono, K. (1990). Chromosome 111of Saccharomyces cerevisiae: An ordered clone bank, a detailed restriction map and analysis of transcripts suggests the presence of 160 genes. Yeast 6,383402. Zaret, K. S. and Sherman, F. (1982). DNA sequence required for efficient termination in yeast. Cell 28, 563-573.

The complete sequence of the unit YCR59, situated between CRY1 and MAT, reveals two long open reading frames, which cover 91% of the 10.1 kb segment.

We have entirely sequenced YCR59, which is a 10.1 kb segment of the right arm of chromosome III, and is part of the clone E5F from the Newlon collecti...
709KB Sizes 0 Downloads 0 Views

Recommend Documents