306

RNA

PREDICTING

SECONDARY

1181

STRUCTURES

[jaeger.fold.seq]tmas.seq 1 1 76 Dangle.Dat Loop.Dat Stack.Dat TStack.Dat TLoop.Dat :R LRNA 2

1 10 100 0 tmal .sav N Y N tma.out 120 Y tmal .ct

Acknowledgments This work was supported by National Institutes of Health Grant GM22939. Zuker is a Fellow of the Canadian Institute for Advanced Research.

[ 183 Detecting Pairing

Michael

Pseudoknots and Other Local BaseStructures in RNA Sequences By

HUGO

M.

MARTINEZ

Introduction Martinez’ describes a suite of programs, RNAFOLD, comprising a workbench for studying RNA secondary structures. We here describe a

1 H. M. Martinez, Nucleic Acids Rex 16, 1789 (1988). METHODS

IN ENZYMOLOGY,

VOL.

183

Copyright 0 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.

1181

PSEUDOKNOTS

AND

HAIRPINS

307

number of extensions to this workbench aimed at finding pseudoknots and hairpins in both single and multiple sequences. As originally described, RNAFOLD consists of four principal programs: DYNPRO for finding global structures of the orthodox type using a dynamic programming algorithm; MONTECARLO for generating a set of orthodox, global structures using an algorithm intended to simulate the folding process; STEMS for finding hairpins; and, finally, DRAWSTRUCT for the drawing of a structure when presented as a base-pairing list. The extensions apply to the first three of these programs. Specifically, (1) an option has been added to STEMS for finding pseudoknots, and (2) all three programs can now deal with more than one sequence at a time to obtain individual and common structures. Thus, given an input set S of sequences, DYNPRO finds a global structure for each member of S and then hairpins common to these global structures. MONTECARLO obtains a set of structures for each member of S, finds a common hairpin list for each such set and then finds the hairpins common to these lists. STEMS generates a set of hairpins for each member of S as a set of local structures and then finds the hairpins common to these sets. When used in the pseudoknot mode, STEMS instead deals only with pseudoknots, finding a set for each member of S and then finding the common ones. In the following these extensions are described in detail, giving the rationale for the extensions and some results of their application to particular sets of RNA sequences. STEMS

Extensions

The pseudoknot option was prompted by two reports: that of Puglisi et ~1.~ regarding the experimental confirmation of a pseudoknotted RNA oligonucleotide and that of McPheeters et ~1.~ implicating a pseudoknot structure as an autogenous regulatory site on the bacteriophage gene 32 messenger RNA. Both of these reports strongly support the thesis of the potential importance of pseudoknots as earlier advocated by Pleij et al4 In trying to deal with this new kind of base-pairing structure which partially violates the “orthodox” constraint of bases within a hairpin loop not pairing with bases outside the loop, the decision was made to only consider them as independent of a global structure, just as is done when finding hairpins independent of global structure considerations. In part this strategy was a matter of expediency because including pseudoknots as parts 2 J. D. Puglisi, J. R. Wyatt, and I. Tinoco, Jr., Nature (London) 331,283 (1988). 3 D. S. McPheeters, G. D. Stormo, and L. Gold, J. Mol. Biol. 201, 5 17 ( 1988). 4 W. A. Pleij, K. Rietveld, and L. Bosch, Nucleic Acids Rex 13, 17 17 (1985).

308

RNA

PREDICTING

SECONDARY

STRUCTURES

1181

of a global structure presents some as yet unresolved technical difficulties and also because of the increasing reliance on comparative methods which base the significance of a local structure on conservation criteria. An example of a pseudoknot is shown in Fig. 1 in two different ways. In the first, the base-pairing sections, identified as stems 1,2, and 3, are drawn in a noncoaxial manner, while in the second a coaxial arrangement has been imposed to emphasize the stacking of stem 2 to stem 1 and the stacking of stem 3 to stem 2. It is this stacking which gives the coaxial extension of stem 1 and of stem 2 to form a larger, single stem comprising the pseudoknot complex. A simpler pseudoknot would consist of just stems 1 and 2 or of just stems 2 and 3. Identifying a pseudoknot as stacked stems readily suggests how to go about finding them. Thus, since stems 1 and 3 are the stems of two hairpins with stem 2 linking these two hairpins together, one need only generate a list of hairpins and determine which pairs of them can be linked together by an intermediate stem. U 52

I 1

\

A AUG I UACGCC AAAAAA CGGUAAG I AUUC AA ! I I I cccccc AAAU 93

52

I

I

\

\ I

A

A I I AUG:CGG/UAAG UAC/GCC:AUUC

AAAAAA \ I

cccccc I AA I’ 93’ 1 ! AAAU FIG. 1. Pseudoknot shown in two different ways: (top) noncoaxial manner and (bottom) coaxial manner. The slashes indicate extended bonds between successive bases. Paired bases are shown as letters in the same column. As referred to in the text, stem 1 is AUG paired with CAU, stem 2 is GCC paired with GCX, and stem 3 is UAAG paired with CUUA.

H81

PSEUDOKNOTS

AND

HAIRPINS

309

The pseudoknot mode in STEMS is selected by setting the pseudoknot option parameter to 1. In this mode STEMS generates the hairpins (with their stems constrained to have no bulges or inner loops) for all members of the sequence set S but does not report them. Instead, each hairpin list (one list for each sequence) is first sorted in increasing order relative to the 5’ starting position of a hairpin and then scanned in the following manner. For a given hairpin in the list, successive ones are examined for linking to it via base pairing in the form of an intermediate stem that forms the actual pseudoknot (a stem 2). The first one found such that the complex has a negative free energy is kept, and no further possibilities are examined. If no such complex can be formed for the given hairpin, the sequence of bases is examined for a stem 2 that can give the simpler form of a pseudoknot complex. Regarding the energy of a pseudoknot complex, this is calculated using the same base-pairing, stacking, and destabilization energies as for orthodox stems and hairpins. Once a pseudoknot list is formed for each member of the sequence set S, a search is made for pseudoknots which are common to those of the first sequence. That two pseudoknots are to be regarded as the same is subject to the following six user-adjustable parameters: 1. Min-pseudoknot-loop-size, which specifies the least number of free bases in the hairpin loop of stem 1 or of stem 2 (default is 2) 2. Max-pseudoknot-loop-size, which specifies the maximum number of free bases in the hairpin of stem 1 or of stem 2 (default is 30) 3. Max-pseudoknot-length, which specifies the maximum number of base pairs in stem 2 (default is 9) 4. Reg-fact, which concerns the allowed difference in position of the pseudoknots in their respective sequences as a fraction of the smaller sequence length (default is 0.3) 5. Len-fact, which concerns the difference in length of two pseudoknots as a fraction of the smaller length (default is 0.3) 6. Align-fact, which concerns the alignment score of the subsequences comprising the pseudoknots as a fraction of the maximum possible (default is 0.6) For each pseudoknot of the first sequence a list is made of the pseudoknots in the other sequences which are identical to it, according to the above criteria. The corresponding list of subsequences is then passed on to a multiple sequence alignment program for an assessment of what is conserved among them. Figures 2 and 3 show the output of STEMS in the pseudoknot mode for the two sequences investigated by McPheeters et al3 This output has been edited to show only the common pseudoknots. There are two common

310

RNA

PREDICTING 1 . 2. 1 2

T4 T2

mRNA mRNA

SECONDARY

f181

STRUCTURES

(5-32) (2-33)

UG CCAGCUAUGAGGU A A II 11111lIIIIIII I I UGaCCAGCUAUGAGGUcAuAcAucGUCAUAGC

AguGUCAUAGC I I I I I I I I I

UG-CCAGCUAUGAGGU-A-A-A--GUCAUAGC

energy

=

-14.3

kcal A-

_ _ _

5

\ \

UGCC AUGGAGUAUCG I UCAUAGC A I 1 I AGUG energy

=

-14.1

A-

I \

in pseudoknot and T2 mRNA

\ 32

kcal

2

FIG. 2. Output of STEMS the sequences of T4 mRNA quences.

l’

_ _ _

\

I

I UGACC I ACUGGAGUAUCG UCAUAGC I U \ / A 33 I CAUCG mode showing one common pseudoknot and the alignment of the corresponding

pair of subse-

pseudoknots, and both are of the simple type in that a stem 3 is missing. The first pair of pseudoknots (Fig. 2) are those reported by McPheeters et ~1.~ The alignment of the corresponding subsequences and the structures themselves clearly show what is conserved. When the pseudoknot option is not invoked, STEMS proceeds to find the hairpins in all the members of S subject to the normal constraints of max-bulge-size, max-inner-loop-size, and max-hairpin-loop-size. As for pseudoknots, it then proceeds to find the hairpins which are common to those of the first sequence. The criteria of when two hairpins are the same are just the last three for pseudoknots. The output shown in Fig. 4 is also

[181

PSEUDOKNOTS 1 . 2. 1 2

T4 T2

mRNA mRNA

AND

311

HAIRPINS

(10-36) (8-35)

GCUAUGAGGU A A AguGUCAUAGCACca IIIIIIIIII I I I I I I I I I I I I I GCUAUGAGGUcAuAcAucGUCAUAGCAC GCUAUGAGGU-A-A-A--GUCAUAGCAC--

energy

=

-2.2

kcal GGUA 10

I \

I

GA

GCUAUGA CGAUACUGU

/ CA

I A

I \

\ 36

I _ _ _ -c

energy

=

-0.2

kcal GGUCA 8

I \

/ GCUAUGA CGAUACUG C

I I \

I _ _ _ _ -A

FIG. 3. Output of STEMS the sequences of T4 mRNA quences.

in pseudoknot and T2 mRNA

CUACAU /

I

\ 35

mode showing another common pseudoknot of and the alignment of the corresponding subse-

edited to show only common hairpins, and it also applies to the two sequences treated with the pseudoknot option. The max-bulge-size parameter has been set to 0 in order to concentrate on hairpins with perfect stems. Note that the common hairpins are just those found in the common pseudoknots. MONTECARLO

Extensions

The Monte Carlo method which we use for folding an RNA sequence is an iterative one. First the most energetic hairpins are found, up to a

312

RNA

PREDICTING

T4 T2

1 .

2.

SECONDARY

STRUCTURES

mRNA

[I81

(10-32) (8-33)

mRNA

1 GCUAUGAGGU

A A AguGUCAUAGC

IIIIiIIIII

I

I

I

I I I I I I I I

2 GCUAUGAGGUcAuAcAucGUCAUAGC GCUAUGAGGU-A-A-A--GUCAUAGC energy

=

10 32

-7.3

GCUAUGA CGAUACU

kcal ggua ; cugu

energy

=

8 33

-7.1

kcal ggucau

GCUAUGA CGAUACU

\ I cugcua

1. 2.

T4 -mRNA T2 -mRNA

(5-20) (2- 19)

1 UG CCAGCUAUGAGGU II

IIIIIIIIIIIII

A I

2 UGaCCAGCUAUGAGGUcA UG-CCAGCUAUGAGGU-A energy 5 20

=

-1.2 agcu

UGCC AUGG

kcal

\ I ggag

energy 2 19

=

-6.5 agcu

UGACC ACUGG

FIG. 4. Output of STEMS

kcal \ I

ggag in hairpin mode. Only common hairpins are shown.

user-specifiable maximum number. Such a list, called LO, serves as the initial one for all the foldings. Regarding LO as a population of stems competing to form, members from LO are selected with a probability that is a function of the free energy relative to the competing population. Once a

1181

PSEUDOKNOTS

AND

HAIRPINS

313

hairpin is selected for adding to the current structure, the remaining population is screened so that only those which are compatible with the new structure can compete. The sequential selection and readjustment of the competing population continues until LO is exhausted of members compatible with the current structure. On exhaustion of LO a new hairpin list Ll is constructed using the unpaired bases of the structure obtained from LO; Ll is then used for the competition until it is exhausted, etc. The structure resulting from this kind of folding is encoded as a list of base pairs so that it is an easy matter to compare the structures resulting from separate foldings and thus tally the frequency with which a particular structure occurs. Only the distinct base pair lists are retained, and from each a hairpin list is constructed consisting only of hairpins whose loops are free of additional hairpins (do not bifurcate). It is these kinds of local structures which are compared from the various foldings using the same comparison criteria as for the Hairpin mode of STEMS. What is reported are the hairpins common to the base pair list with the largest frequency of occurrence. In the case of several sequences, each is folded the prescribed number of times, and for each the corresponding common hairpin list is found. Then these common hairpin lists are compared, and the hairpins common to those of the first sequence are saved. Figure 5 shows the MONTECARLO folding of three small nuclear RNA sequences obtained from the GenBank database: Drosophila (locus drour2b, 99 bp), pheasant (locus phsur2a, 95 bp), and pea (locus peanur2, 96 bp). The foldings (10 of each) have been done with default parameter settings, and in order to conserve space the output has been edited to show only the principal features of the foldings. Thus, in 10 foldings the sequence drour2b yields 5 distinct structures with the corresponding occurrence frequencies and energies shown, and for these 5 distinct structures the common hairpins are shown. The same kind of output applies to the other two sequences, and finally the hairpins common to those of the first sequence droudb are shown. DYNPRO

Extensions

With DYNPRO a sequence is folded using an adaptation of the dynamic programming algorithm of Zuker and Stiegler.” The corresponding minimum free energy structure is in the form of a base pair list which is

5 M. Zuker

and P. Stiegler,

NucleicAcidsRex 9, 133 (1981).

314

RNA

PREDICTING

*Sequence

‘drour2b’

SECONDARY

foldings

(5

1181

STRUCTURES

structures):

struct

#O:

frequency

= 4/10,

energy

= -30.7

kcal

struct

#I:

frequency

= 2/10,

energy

= -28.0

kcal

struct

#2:

frequency

= 2/10,

energy

= -30.7

kcal

struct

#3:

frequency

=

l/10,

energy

= -29.0

kcal

struct

#4:

frequency

=

l/10,

energy

= -29.0

kcal

**

Hairpins

cotmuon

to

the

foldings

6 20

UGAUUuuu ACUAA g g

22 54

GACGGAGuGcuUG: :Gagc CUGUCUC:C::ACcuCguu

57 96

GGGUUGg:::CCCGGuauugca CCCGGCuuuaGGGCCgccaug

*Sequence

‘peaur2’

foldings

(3

of

‘drour2b’

**

structures):

struct

#O:

frequency

= 7/10,

energy

=

-30.1

kcal

struct

#l:

frequency

= 2/10,

energy

=

-29.0

kcal

struct

#2:

frequency

= l/10,

energy

=

-27.6

kcal

**

Hairpins

common

to

the

foldings

of

‘peaur2’

**

::UUGCUAuUGG:GucucuuCGCgugucg 10 GGGG:GaagaGUCacacaguAGC 94 CCCCaCg:::CGG:::::::UCGuuAACGAU:AUCaCguu:::GCGuuuuc *Sequence

‘phsur2a’

struct

**

#O:

Hairpins

foldings frequency

common

(1

structure):

= lO/lO,

to

the

energy

foldings

4 53

CGGAuuuuugggcGCgGGAGuuGGA:Cccgga GCCU:::::::::CG:CCUC::CCUcGuucg

55 92

GCAUCGU:CCCGGuauGGcag CGUGGCAcGGGCCu::CCau FIG. 5. Output of MONTECARLO.

of

=

-51.7

‘phsur2a’

kcal

**

[181 **

PSEUDOKNOTS

HAIRPINS

COh&lON 1 2.

AND

TO THOSE

drour2b phsur2a

OF

315

HAIRPINS

SEQUENCE

‘drour2b’

**

(57-96) (55-92)

1 GggTtGgCCCGGTATtGCAGTACCgCCGGGatttCGGccC I

I

I

IIIIIIlI

IIIIIIII

IIIII

2 GcaTcGtCCCGGTATgGCAGTACCtCCGGG

III

I

caCGGtgC

G--T-G-CCCGGTAT-GCAGTACC-CCGGG----CGG--C energy 57 96 energy 55 92

=

-21.5

kcal

GGGUUGg:::CCCGGuauugca CCCGGCuuuaGGGCCgccaug = -28.3

kcal

GCAUCGU:CCCGGuauGGcag CGUGGCAcGGGCCu::CCau FIG. 5. (Continued)

then converted to a list of hairpins. In the case of several sequences these lists serve to identify the nonbifurcating hairpins common to those of the first sequence just as is done within MONTECARLO. Figure 6 shows the output of DYNPRO for the same set of sequences used by MONTECARLO (Fig. 5). Although edited, the actual structures have been retained rather than just their energies being shown. It will be noticed that the commonality output here differs from that of MONTECARLO. As is often the case, the results agree in certain areas, such as the commonality of hairpins belonging to the same pair of sequences, but differ in which hairpins are common. The difference resides in the different global structures produced, which generally give different sets of hairpins to compare. Summary The current version of RNAFOLD now has the capability for conducting comparative studies relative to nonbifurcating hairpins contained in the global structures produced by the dynamic programming and Monte Carlo methods. It also has the capability of both finding and comparing

316

RNA

PREDICTING

*Sequence

‘drour2b’,

> stem 6 20

22 54

57 96

kcal

a hairpin

2:

a hairpin

3:

a hairpin

‘peaur2’, stem

7 95

1:

bifurcates

energy

= into

-43.9 stems

kcal 2 3

UGaGGG(2)g(3) AC : CCC stem

13 51

53 90

= -37.2

GGGWGg:::CCCGGuauugca CCCGGCuuuaGGGCCgccaug

*Sequence

>

energy

GACGGAGuGcuUGGAGcu CUGUCUC:C::ACCUCgu

> stem

>

1:

I181

STRUCTURES

UGAWuuu ACUAA g g

> stem

>

SECONDARY

2:

a hairpin

G:GAAGAGuCaCaCAGUAGcu CgCUUCUCuG:G:GWAUCgu stem

3:

a hairpin

UGuCGCuuuuGCg:WGCacu AC:GCGgu ::CGuuAACGaua FIG. 6. Output of DYNPRO.

hairpins and pseudoknots independently of global structures. The efficacy of these increased capabilities has been tested for select families of sequences, and the results thus far indicate favorable utility. Under consideration is a further extension designed to incorporate pseudoknots within global structures. Written in the C language, RNAFOLD and its companion program, GENALIGN, for doing multiple alignments in the comparison of pseudoknots and hairpins, is available for UNIX systems on standard j-inch, 9-track tape or on SUN tape cartridges.

1181

PSEUDOKNOTS

*Sequence

‘phsur2a’,

> stem

1:

13 95

energy

bifurcates

317

HAIRPINS

= -58.1 into

kcal

stems

2 3

OF

SEQUENCE

GG:GC(2)(3) CCaCG

> stem

2:

17 56

a hairpin

GCG:GGAGuuGGAcccGGAGcu CGCgCCUCg:CCUc::CCUCgu

> stem

3:

57 90 **

AND

a hairpin

AUCGU:CCCGGuAuGGcag UGGCAcGGGCC:U:CCau HAIRPINS 1. 2.

COWlON

TO THOSE

drour2b phsur2a

1 GacGGAG I

IIll

‘drour2b’

**

(22-54) (17-56) T G CttGGAGCTTGCT

CC

I

II

I

I

IIIIIIIlII

aC CTCtG I

Ill

tC I

I

2 GcgGGAGtTgGaCccGGAGCTTGCTcCCtcCgCTCcGcgC G--GGAG-T-G-C--GGAGCTTGCT-CC--C-CTC-G--C energy 22 54 energy 17 56

= -15.4

kcal

GACGGAGuGcuUGGAGcu CUGUCUC:C::ACCUCgu = -25.0

kcal

GCG:GGAGuuGGAcccGGAGcu CGCgCCUCg:CCUc::CCUCgu FIG. 6. (Continued)

Acknowledgments The Biology 1988. I and for

work reported here was undertaken while the author was visiting the Mathematical Laboratory at the Frederick Cancer Research Facility, Frederick, Maryland, during wish to thank the director, Dr. Jacob Maizel, and staff for many useful discussions providing excellent computational facilities.

Detecting pseudoknots and other local base-pairing structures in RNA sequences.

The current version of RNAFOLD now has the capability for conducting comparative studies relative to nonbifurcating hairpins contained in the global s...
509KB Sizes 0 Downloads 0 Views