306
RNA
PREDICTING
SECONDARY
1181
STRUCTURES
[jaeger.fold.seq]tmas.seq 1 1 76 Dangle.Dat Loop.Dat Stack.Dat TStack.Dat TLoop.Dat :R LRNA 2
1 10 100 0 tmal .sav N Y N tma.out 120 Y tmal .ct
Acknowledgments This work was supported by National Institutes of Health Grant GM22939. Zuker is a Fellow of the Canadian Institute for Advanced Research.
[ 183 Detecting Pairing
Michael
Pseudoknots and Other Local BaseStructures in RNA Sequences By
HUGO
M.
MARTINEZ
Introduction Martinez’ describes a suite of programs, RNAFOLD, comprising a workbench for studying RNA secondary structures. We here describe a
1 H. M. Martinez, Nucleic Acids Rex 16, 1789 (1988). METHODS
IN ENZYMOLOGY,
VOL.
183
Copyright 0 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.
1181
PSEUDOKNOTS
AND
HAIRPINS
307
number of extensions to this workbench aimed at finding pseudoknots and hairpins in both single and multiple sequences. As originally described, RNAFOLD consists of four principal programs: DYNPRO for finding global structures of the orthodox type using a dynamic programming algorithm; MONTECARLO for generating a set of orthodox, global structures using an algorithm intended to simulate the folding process; STEMS for finding hairpins; and, finally, DRAWSTRUCT for the drawing of a structure when presented as a base-pairing list. The extensions apply to the first three of these programs. Specifically, (1) an option has been added to STEMS for finding pseudoknots, and (2) all three programs can now deal with more than one sequence at a time to obtain individual and common structures. Thus, given an input set S of sequences, DYNPRO finds a global structure for each member of S and then hairpins common to these global structures. MONTECARLO obtains a set of structures for each member of S, finds a common hairpin list for each such set and then finds the hairpins common to these lists. STEMS generates a set of hairpins for each member of S as a set of local structures and then finds the hairpins common to these sets. When used in the pseudoknot mode, STEMS instead deals only with pseudoknots, finding a set for each member of S and then finding the common ones. In the following these extensions are described in detail, giving the rationale for the extensions and some results of their application to particular sets of RNA sequences. STEMS
Extensions
The pseudoknot option was prompted by two reports: that of Puglisi et ~1.~ regarding the experimental confirmation of a pseudoknotted RNA oligonucleotide and that of McPheeters et ~1.~ implicating a pseudoknot structure as an autogenous regulatory site on the bacteriophage gene 32 messenger RNA. Both of these reports strongly support the thesis of the potential importance of pseudoknots as earlier advocated by Pleij et al4 In trying to deal with this new kind of base-pairing structure which partially violates the “orthodox” constraint of bases within a hairpin loop not pairing with bases outside the loop, the decision was made to only consider them as independent of a global structure, just as is done when finding hairpins independent of global structure considerations. In part this strategy was a matter of expediency because including pseudoknots as parts 2 J. D. Puglisi, J. R. Wyatt, and I. Tinoco, Jr., Nature (London) 331,283 (1988). 3 D. S. McPheeters, G. D. Stormo, and L. Gold, J. Mol. Biol. 201, 5 17 ( 1988). 4 W. A. Pleij, K. Rietveld, and L. Bosch, Nucleic Acids Rex 13, 17 17 (1985).
308
RNA
PREDICTING
SECONDARY
STRUCTURES
1181
of a global structure presents some as yet unresolved technical difficulties and also because of the increasing reliance on comparative methods which base the significance of a local structure on conservation criteria. An example of a pseudoknot is shown in Fig. 1 in two different ways. In the first, the base-pairing sections, identified as stems 1,2, and 3, are drawn in a noncoaxial manner, while in the second a coaxial arrangement has been imposed to emphasize the stacking of stem 2 to stem 1 and the stacking of stem 3 to stem 2. It is this stacking which gives the coaxial extension of stem 1 and of stem 2 to form a larger, single stem comprising the pseudoknot complex. A simpler pseudoknot would consist of just stems 1 and 2 or of just stems 2 and 3. Identifying a pseudoknot as stacked stems readily suggests how to go about finding them. Thus, since stems 1 and 3 are the stems of two hairpins with stem 2 linking these two hairpins together, one need only generate a list of hairpins and determine which pairs of them can be linked together by an intermediate stem. U 52
I 1
\
A AUG I UACGCC AAAAAA CGGUAAG I AUUC AA ! I I I cccccc AAAU 93
52
I
I
\
\ I
A
A I I AUG:CGG/UAAG UAC/GCC:AUUC
AAAAAA \ I
cccccc I AA I’ 93’ 1 ! AAAU FIG. 1. Pseudoknot shown in two different ways: (top) noncoaxial manner and (bottom) coaxial manner. The slashes indicate extended bonds between successive bases. Paired bases are shown as letters in the same column. As referred to in the text, stem 1 is AUG paired with CAU, stem 2 is GCC paired with GCX, and stem 3 is UAAG paired with CUUA.
H81
PSEUDOKNOTS
AND
HAIRPINS
309
The pseudoknot mode in STEMS is selected by setting the pseudoknot option parameter to 1. In this mode STEMS generates the hairpins (with their stems constrained to have no bulges or inner loops) for all members of the sequence set S but does not report them. Instead, each hairpin list (one list for each sequence) is first sorted in increasing order relative to the 5’ starting position of a hairpin and then scanned in the following manner. For a given hairpin in the list, successive ones are examined for linking to it via base pairing in the form of an intermediate stem that forms the actual pseudoknot (a stem 2). The first one found such that the complex has a negative free energy is kept, and no further possibilities are examined. If no such complex can be formed for the given hairpin, the sequence of bases is examined for a stem 2 that can give the simpler form of a pseudoknot complex. Regarding the energy of a pseudoknot complex, this is calculated using the same base-pairing, stacking, and destabilization energies as for orthodox stems and hairpins. Once a pseudoknot list is formed for each member of the sequence set S, a search is made for pseudoknots which are common to those of the first sequence. That two pseudoknots are to be regarded as the same is subject to the following six user-adjustable parameters: 1. Min-pseudoknot-loop-size, which specifies the least number of free bases in the hairpin loop of stem 1 or of stem 2 (default is 2) 2. Max-pseudoknot-loop-size, which specifies the maximum number of free bases in the hairpin of stem 1 or of stem 2 (default is 30) 3. Max-pseudoknot-length, which specifies the maximum number of base pairs in stem 2 (default is 9) 4. Reg-fact, which concerns the allowed difference in position of the pseudoknots in their respective sequences as a fraction of the smaller sequence length (default is 0.3) 5. Len-fact, which concerns the difference in length of two pseudoknots as a fraction of the smaller length (default is 0.3) 6. Align-fact, which concerns the alignment score of the subsequences comprising the pseudoknots as a fraction of the maximum possible (default is 0.6) For each pseudoknot of the first sequence a list is made of the pseudoknots in the other sequences which are identical to it, according to the above criteria. The corresponding list of subsequences is then passed on to a multiple sequence alignment program for an assessment of what is conserved among them. Figures 2 and 3 show the output of STEMS in the pseudoknot mode for the two sequences investigated by McPheeters et al3 This output has been edited to show only the common pseudoknots. There are two common
310
RNA
PREDICTING 1 . 2. 1 2
T4 T2
mRNA mRNA
SECONDARY
f181
STRUCTURES
(5-32) (2-33)
UG CCAGCUAUGAGGU A A II 11111lIIIIIII I I UGaCCAGCUAUGAGGUcAuAcAucGUCAUAGC
AguGUCAUAGC I I I I I I I I I
UG-CCAGCUAUGAGGU-A-A-A--GUCAUAGC
energy
=
-14.3
kcal A-
_ _ _
5
\ \
UGCC AUGGAGUAUCG I UCAUAGC A I 1 I AGUG energy
=
-14.1
A-
I \
in pseudoknot and T2 mRNA
\ 32
kcal
2
FIG. 2. Output of STEMS the sequences of T4 mRNA quences.
l’
_ _ _
\
I
I UGACC I ACUGGAGUAUCG UCAUAGC I U \ / A 33 I CAUCG mode showing one common pseudoknot and the alignment of the corresponding
pair of subse-
pseudoknots, and both are of the simple type in that a stem 3 is missing. The first pair of pseudoknots (Fig. 2) are those reported by McPheeters et ~1.~ The alignment of the corresponding subsequences and the structures themselves clearly show what is conserved. When the pseudoknot option is not invoked, STEMS proceeds to find the hairpins in all the members of S subject to the normal constraints of max-bulge-size, max-inner-loop-size, and max-hairpin-loop-size. As for pseudoknots, it then proceeds to find the hairpins which are common to those of the first sequence. The criteria of when two hairpins are the same are just the last three for pseudoknots. The output shown in Fig. 4 is also
[181
PSEUDOKNOTS 1 . 2. 1 2
T4 T2
mRNA mRNA
AND
311
HAIRPINS
(10-36) (8-35)
GCUAUGAGGU A A AguGUCAUAGCACca IIIIIIIIII I I I I I I I I I I I I I GCUAUGAGGUcAuAcAucGUCAUAGCAC GCUAUGAGGU-A-A-A--GUCAUAGCAC--
energy
=
-2.2
kcal GGUA 10
I \
I
GA
GCUAUGA CGAUACUGU
/ CA
I A
I \
\ 36
I _ _ _ -c
energy
=
-0.2
kcal GGUCA 8
I \
/ GCUAUGA CGAUACUG C
I I \
I _ _ _ _ -A
FIG. 3. Output of STEMS the sequences of T4 mRNA quences.
in pseudoknot and T2 mRNA
CUACAU /
I
\ 35
mode showing another common pseudoknot of and the alignment of the corresponding subse-
edited to show only common hairpins, and it also applies to the two sequences treated with the pseudoknot option. The max-bulge-size parameter has been set to 0 in order to concentrate on hairpins with perfect stems. Note that the common hairpins are just those found in the common pseudoknots. MONTECARLO
Extensions
The Monte Carlo method which we use for folding an RNA sequence is an iterative one. First the most energetic hairpins are found, up to a
312
RNA
PREDICTING
T4 T2
1 .
2.
SECONDARY
STRUCTURES
mRNA
[I81
(10-32) (8-33)
mRNA
1 GCUAUGAGGU
A A AguGUCAUAGC
IIIIiIIIII
I
I
I
I I I I I I I I
2 GCUAUGAGGUcAuAcAucGUCAUAGC GCUAUGAGGU-A-A-A--GUCAUAGC energy
=
10 32
-7.3
GCUAUGA CGAUACU
kcal ggua ; cugu
energy
=
8 33
-7.1
kcal ggucau
GCUAUGA CGAUACU
\ I cugcua
1. 2.
T4 -mRNA T2 -mRNA
(5-20) (2- 19)
1 UG CCAGCUAUGAGGU II
IIIIIIIIIIIII
A I
2 UGaCCAGCUAUGAGGUcA UG-CCAGCUAUGAGGU-A energy 5 20
=
-1.2 agcu
UGCC AUGG
kcal
\ I ggag
energy 2 19
=
-6.5 agcu
UGACC ACUGG
FIG. 4. Output of STEMS
kcal \ I
ggag in hairpin mode. Only common hairpins are shown.
user-specifiable maximum number. Such a list, called LO, serves as the initial one for all the foldings. Regarding LO as a population of stems competing to form, members from LO are selected with a probability that is a function of the free energy relative to the competing population. Once a
1181
PSEUDOKNOTS
AND
HAIRPINS
313
hairpin is selected for adding to the current structure, the remaining population is screened so that only those which are compatible with the new structure can compete. The sequential selection and readjustment of the competing population continues until LO is exhausted of members compatible with the current structure. On exhaustion of LO a new hairpin list Ll is constructed using the unpaired bases of the structure obtained from LO; Ll is then used for the competition until it is exhausted, etc. The structure resulting from this kind of folding is encoded as a list of base pairs so that it is an easy matter to compare the structures resulting from separate foldings and thus tally the frequency with which a particular structure occurs. Only the distinct base pair lists are retained, and from each a hairpin list is constructed consisting only of hairpins whose loops are free of additional hairpins (do not bifurcate). It is these kinds of local structures which are compared from the various foldings using the same comparison criteria as for the Hairpin mode of STEMS. What is reported are the hairpins common to the base pair list with the largest frequency of occurrence. In the case of several sequences, each is folded the prescribed number of times, and for each the corresponding common hairpin list is found. Then these common hairpin lists are compared, and the hairpins common to those of the first sequence are saved. Figure 5 shows the MONTECARLO folding of three small nuclear RNA sequences obtained from the GenBank database: Drosophila (locus drour2b, 99 bp), pheasant (locus phsur2a, 95 bp), and pea (locus peanur2, 96 bp). The foldings (10 of each) have been done with default parameter settings, and in order to conserve space the output has been edited to show only the principal features of the foldings. Thus, in 10 foldings the sequence drour2b yields 5 distinct structures with the corresponding occurrence frequencies and energies shown, and for these 5 distinct structures the common hairpins are shown. The same kind of output applies to the other two sequences, and finally the hairpins common to those of the first sequence droudb are shown. DYNPRO
Extensions
With DYNPRO a sequence is folded using an adaptation of the dynamic programming algorithm of Zuker and Stiegler.” The corresponding minimum free energy structure is in the form of a base pair list which is
5 M. Zuker
and P. Stiegler,
NucleicAcidsRex 9, 133 (1981).
314
RNA
PREDICTING
*Sequence
‘drour2b’
SECONDARY
foldings
(5
1181
STRUCTURES
structures):
struct
#O:
frequency
= 4/10,
energy
= -30.7
kcal
struct
#I:
frequency
= 2/10,
energy
= -28.0
kcal
struct
#2:
frequency
= 2/10,
energy
= -30.7
kcal
struct
#3:
frequency
=
l/10,
energy
= -29.0
kcal
struct
#4:
frequency
=
l/10,
energy
= -29.0
kcal
**
Hairpins
cotmuon
to
the
foldings
6 20
UGAUUuuu ACUAA g g
22 54
GACGGAGuGcuUG: :Gagc CUGUCUC:C::ACcuCguu
57 96
GGGUUGg:::CCCGGuauugca CCCGGCuuuaGGGCCgccaug
*Sequence
‘peaur2’
foldings
(3
of
‘drour2b’
**
structures):
struct
#O:
frequency
= 7/10,
energy
=
-30.1
kcal
struct
#l:
frequency
= 2/10,
energy
=
-29.0
kcal
struct
#2:
frequency
= l/10,
energy
=
-27.6
kcal
**
Hairpins
common
to
the
foldings
of
‘peaur2’
**
::UUGCUAuUGG:GucucuuCGCgugucg 10 GGGG:GaagaGUCacacaguAGC 94 CCCCaCg:::CGG:::::::UCGuuAACGAU:AUCaCguu:::GCGuuuuc *Sequence
‘phsur2a’
struct
**
#O:
Hairpins
foldings frequency
common
(1
structure):
= lO/lO,
to
the
energy
foldings
4 53
CGGAuuuuugggcGCgGGAGuuGGA:Cccgga GCCU:::::::::CG:CCUC::CCUcGuucg
55 92
GCAUCGU:CCCGGuauGGcag CGUGGCAcGGGCCu::CCau FIG. 5. Output of MONTECARLO.
of
=
-51.7
‘phsur2a’
kcal
**
[181 **
PSEUDOKNOTS
HAIRPINS
COh&lON 1 2.
AND
TO THOSE
drour2b phsur2a
OF
315
HAIRPINS
SEQUENCE
‘drour2b’
**
(57-96) (55-92)
1 GggTtGgCCCGGTATtGCAGTACCgCCGGGatttCGGccC I
I
I
IIIIIIlI
IIIIIIII
IIIII
2 GcaTcGtCCCGGTATgGCAGTACCtCCGGG
III
I
caCGGtgC
G--T-G-CCCGGTAT-GCAGTACC-CCGGG----CGG--C energy 57 96 energy 55 92
=
-21.5
kcal
GGGUUGg:::CCCGGuauugca CCCGGCuuuaGGGCCgccaug = -28.3
kcal
GCAUCGU:CCCGGuauGGcag CGUGGCAcGGGCCu::CCau FIG. 5. (Continued)
then converted to a list of hairpins. In the case of several sequences these lists serve to identify the nonbifurcating hairpins common to those of the first sequence just as is done within MONTECARLO. Figure 6 shows the output of DYNPRO for the same set of sequences used by MONTECARLO (Fig. 5). Although edited, the actual structures have been retained rather than just their energies being shown. It will be noticed that the commonality output here differs from that of MONTECARLO. As is often the case, the results agree in certain areas, such as the commonality of hairpins belonging to the same pair of sequences, but differ in which hairpins are common. The difference resides in the different global structures produced, which generally give different sets of hairpins to compare. Summary The current version of RNAFOLD now has the capability for conducting comparative studies relative to nonbifurcating hairpins contained in the global structures produced by the dynamic programming and Monte Carlo methods. It also has the capability of both finding and comparing
316
RNA
PREDICTING
*Sequence
‘drour2b’,
> stem 6 20
22 54
57 96
kcal
a hairpin
2:
a hairpin
3:
a hairpin
‘peaur2’, stem
7 95
1:
bifurcates
energy
= into
-43.9 stems
kcal 2 3
UGaGGG(2)g(3) AC : CCC stem
13 51
53 90
= -37.2
GGGWGg:::CCCGGuauugca CCCGGCuuuaGGGCCgccaug
*Sequence
>
energy
GACGGAGuGcuUGGAGcu CUGUCUC:C::ACCUCgu
> stem
>
1:
I181
STRUCTURES
UGAWuuu ACUAA g g
> stem
>
SECONDARY
2:
a hairpin
G:GAAGAGuCaCaCAGUAGcu CgCUUCUCuG:G:GWAUCgu stem
3:
a hairpin
UGuCGCuuuuGCg:WGCacu AC:GCGgu ::CGuuAACGaua FIG. 6. Output of DYNPRO.
hairpins and pseudoknots independently of global structures. The efficacy of these increased capabilities has been tested for select families of sequences, and the results thus far indicate favorable utility. Under consideration is a further extension designed to incorporate pseudoknots within global structures. Written in the C language, RNAFOLD and its companion program, GENALIGN, for doing multiple alignments in the comparison of pseudoknots and hairpins, is available for UNIX systems on standard j-inch, 9-track tape or on SUN tape cartridges.
1181
PSEUDOKNOTS
*Sequence
‘phsur2a’,
> stem
1:
13 95
energy
bifurcates
317
HAIRPINS
= -58.1 into
kcal
stems
2 3
OF
SEQUENCE
GG:GC(2)(3) CCaCG
> stem
2:
17 56
a hairpin
GCG:GGAGuuGGAcccGGAGcu CGCgCCUCg:CCUc::CCUCgu
> stem
3:
57 90 **
AND
a hairpin
AUCGU:CCCGGuAuGGcag UGGCAcGGGCC:U:CCau HAIRPINS 1. 2.
COWlON
TO THOSE
drour2b phsur2a
1 GacGGAG I
IIll
‘drour2b’
**
(22-54) (17-56) T G CttGGAGCTTGCT
CC
I
II
I
I
IIIIIIIlII
aC CTCtG I
Ill
tC I
I
2 GcgGGAGtTgGaCccGGAGCTTGCTcCCtcCgCTCcGcgC G--GGAG-T-G-C--GGAGCTTGCT-CC--C-CTC-G--C energy 22 54 energy 17 56
= -15.4
kcal
GACGGAGuGcuUGGAGcu CUGUCUC:C::ACCUCgu = -25.0
kcal
GCG:GGAGuuGGAcccGGAGcu CGCgCCUCg:CCUc::CCUCgu FIG. 6. (Continued)
Acknowledgments The Biology 1988. I and for
work reported here was undertaken while the author was visiting the Mathematical Laboratory at the Frederick Cancer Research Facility, Frederick, Maryland, during wish to thank the director, Dr. Jacob Maizel, and staff for many useful discussions providing excellent computational facilities.