J. Math. Biol. DOI 10.1007/s00285-013-0737-8

Mathematical Biology

Unified approach to partition functions of RNA secondary structures Ralf Bundschuh

Received: 29 March 2013 / Revised: 15 October 2013 © Springer-Verlag Berlin Heidelberg 2013

Abstract RNA secondary structure formation is a field of considerable biological interest as well as a model system for understanding generic properties of heteropolymer folding. This system is particularly attractive because the partition function and thus all thermodynamic properties of RNA secondary structure ensembles can be calculated numerically in polynomial time for arbitrary sequences and homopolymer models admit analytical solutions. Such solutions for many different aspects of the combinatorics of RNA secondary structure formation share the property that the final solution depends on differences of statistical weights rather than on the weights alone. Here, we present a unified approach to a large class of problems in the field of RNA secondary structure formation. We prove a generic theorem for the calculation of RNA folding partition functions. Then, we show that this approach can be applied to the study of the molten-native transition, denaturation of RNA molecules, as well as to studies of the glass phase of random RNA sequences. Keywords RNA folding · Molten-native transition · RNA denaturation · Exact calculation of partition functions Mathematics Subject Classification (2000)

92E10 · 92C40 · 82B05

R. Bundschuh (B) Departments of Physics, Chemistry and Biochemistry, Center for RNA Biology, The Ohio State University, 191 West Woodruff Avenue, Columbus, OH 43210-1117, USA e-mail: [email protected] R. Bundschuh Division of Hematology, Department of Internal Medicine, The Ohio State University, 191 West Woodruff Avenue, Columbus, OH 43210-1117, USA

123

R. Bundschuh

1 Introduction RNA is a very important biomolecule. It is involved in many of the fundamental processes of life such as protein synthesis, development, and gene regulation (Alberts et al. 2002). Beyond this biological importance, RNA also serves as a model system that poses interesting statistical Physics problems (Higgs 2000). This contribution will focus exclusively on the latter aspect of RNA. More specifically, the aspect of RNA of interest here, is the formation of the socalled secondary structure. RNA is a heteropolymer composed of four different bases A, C, G, and U. Each RNA molecule is characterized by the specific sequence in which these four bases are polymerized together. This sequence is called the primary structure of the molecule. Like their counterparts in DNA, these four bases have a strong propensity to form base pairs. The most energetically favorable base pairs are the Watson–Crick base pairs G–C and A–U but the wobble base pairs G–U are also quite frequent. In order to form these base pairs, the RNA molecule has to fold back onto itself. The pattern of base pairs formed is called the secondary structure of the molecule (see Fig. 1a). Regions with consecutive base pairs, called stems, form a relatively rigid double helical structure similar to that of DNA. They are connected by flexible pieces of single stranded RNA containing those bases that have not found a partner base to form a base pair with. The ensemble of stems connected by flexible loops arranges into a specific three-dimensional shape driven by interactions between the stems and loops. This three-dimensional shape is called the tertiary structure of the molecule. In many cases the tertiary or secondary structure of the molecule are directly determining the biological function of a given RNA molecule. It is important to note that the interactions driving the formation of tertiary structure are in general weaker than the base pairing interactions that drive the formation of secondary structure (Tinoco and Bustamante 1999). Thus, the formation of secondary structure does not depend on the formation of the tertiary structure and can be studied in its own right. The study of the statistical Physics of RNA secondary structures has a long history (de Gennes 1968; Waterman 1978; Higgs 1996, 2000). E.g., one can ask how

multi−loop

bulge

k k

l

i

j

l j

i stack

(a)

hairpin

(b)

i j

k l

Fig. 1 Secondary structure of an RNA molecule. a The secondary structure of an RNA molecule is the set of all base pairs (dashed lines) formed between the bases (circles) of an RNA molecule, the backbone of which is indicated by the solid line. Any such secondary structure can be decomposed into different kinds of loops, i.e., areas that are surrounded by backbone and base pairs and do not themselves contain any base pairs. These loops have different names as indicated depending on their topology. b Two base pairs in a valid secondary structure have to be either nested (left) or independent (bottom). A configuration as shown in the top is not allowed. It is called a pseudo-knot and considered part of the tertiary structure

123

Partition functions of RNA secondary structures

many possible secondary structures there are for an RNA molecule with a length of N bases (Waterman 1978). More generally, one can ask how the secondary structure of RNA molecules generically behaves in the thermodynamic limit when the number N of bases becomes large (Higgs 2000). There are various factors that influence this behavior. Entropy favors a coexistence of many different secondary structures or even a complete breaking up of the secondary structure in order for the molecule to explore a maximum of spatial degrees of freedom. (Base pairing) energy in general favors the formation of secondary structures. For a given sequence different structures have different energies which makes one or a few structures favorable. For a generic “random” sequence this will lead to the formation of random structures. If the sequence is designed to fold into a certain structure of biological interest, this will lead to the formation of this designed, so-called “native” structure, if the design is strong enough. All these effects compete against each other. As a consequence, RNA secondary structure formation can be in different thermodynamics phases that are separated by phase transitions. The phase diagram of RNA is believed to qualitatively look like the one shown in Fig. 3 of Bundschuh and Gerland (2006). Completely random sequences at high temperatures are “denatured”, i.e., they form a very small number of base pairs. At very low temperatures, they are “frozen”, i.e., they form a few random structures that depend on the specific sequence. At intermediate temperatures they are “molten”, i.e., they form a significant number of base pairs but explore a large number of different possible structures. If the sequences are not random but instead designed to prefer a specific native structure, eventually any of these three phases of random sequences transitions into a “native” phase in which the native structure is the dominant one. These phases and phase transitions have been characterized quantitatively for quite a long period of time. The first paper on the statistical Physics of RNA secondary structure by de Gennes (de Gennes 1968) characterized the molten phase and the denatured phase and came to the conclusion that there is actually not a true transition but instead a more subtle cross-over separating these two regimes. The transition between the molten and native phase was discussed much later (Bundschuh and Hwa 1999). The existence and properties of the “frozen” (or glass) phase itself and the properties of the transition between the glass and the molten phase have been debated for quite a long time (Higgs 1996; Pagnani et al. 2000; Hartmann 2001; Pagnani et al. 2001; Bundschuh and Hwa 2002a,b; Marinari et al. 2002; Hui and Tang 2006; Monthus and Garel 2007) and—while some approximate theories are emerging (Lässig and Wiese 2006; David and Wiese 2007)—are still a subject of current research. The purpose of this manuscript is to present a general approach to the statistical Physics of RNA structures. The approach presented here provides a much more elegant calculation of the properties of the previously characterized molten native transition (Bundschuh and Hwa 1999) as well as of aspects of the molten glass transition. We will also demonstrate that this method can be applied to much more general situations. We will first review existing approaches to the statistical Physics of RNA secondary structures in general and to the native-molten transition in particular. Then we will introduce a general approach for studying a wide class of RNA secondary structures problems that is very useful for the study of the native molten transition.

123

R. Bundschuh

Lastly, we will discuss applications of this approach to different problems in RNA secondary structure formation. 2 Review of RNA secondary structure formation We start by reviewing RNA secondary structures, the techniques to study their behavior in the limit of infinite sequence length, and how the molten native transition was approached originally before presenting the new, more elegant way to do this calculation. 2.1 RNA secondary structure A secondary structure for an RNA molecule of N bases is the set of all base pairs that are formed. We denote such a secondary structure by S = {(i 1 , j1 ), (i 2 , j2 ), . . . , (i n , jn )} where n = |S| is the number of base pairs in the structure and 1 ≤ i k < jk ≤ N for all k indicate that the base at position i k is paired with the base at position jk . There are two constraints on secondary structure formation. First, every base can be involved in at most one base pair. Second, the base pairs are not allowed to form pseudo-knots, i.e., if (i, j) ∈ S and (k, l) ∈ S are two base pairs in a structure with i < k without loss of generality, they have to either be independent, i.e., fulfill i < j < k < l, or nested, i.e., fulfill i < k < l < j. This concept is explained in Fig. 1b). While both, bases paired with more than one partner, and base pairs forming pseudo-knots, are known to exist in natural RNA structures, they are rare and do not contribute much to the overall binding energy of an RNA structure. Thus, they are considered part of the tertiary rather than the secondary structure (Tinoco and Bustamante 1999). Sometimes an additional constraint is introduced that excludes base pairs between bases that are too close to each other along the backbone (Waterman 1978). This constraint reflects the fact that the RNA double helix is roughly 2 nm across while a single base has a backbone length of only 0.7 nm such that it takes at least three bases to cross the double helix. However, we do not impose this constraint in the definition of secondary structures and instead keep the option of imposing it in the definition of the energy models by assigning such secondary structures an infinite energy if they are to be excluded in a realistic model. 2.2 Energy models In order to do statistical Physics one needs a state space and an energy function. Our state space is the space of all allowed secondary structures defined in the previous paragraph. There are several different energy models that can be considered. A relatively simple energy model assigns an energy εi, j to every base pair (i, j), that is formed (Higgs 1996; Bundschuh and Hwa 1999), i.e., E[S] ≡

 (i, j)∈S

123

εi, j .

(1)

Partition functions of RNA secondary structures

The εi, j usually reflect the identity of the bases at sequence positions i, and j. E.g., a reasonable energy function that incorporates the facts that G–C base pairs are stronger than A–U base pairs which are in turn stronger than G–U base pairs, assigns −3 if the bases at positions i and j are a G and a C, −2 if they are an A and a U, −1 if they are a G and a U, and +∞ in all other cases reflecting the fact that these bases do not form base pairs at all. An even simpler choice could be −1 if the two bases are either a G and C or an A and a U while assigning +∞ in all other cases. This type of energy model is very well suited to study generic aspects of RNA secondary structure formation, such as the properties of the different thermodynamic phases and phase transitions as such generic aspects are not expected to depend on the details of the energy model and the simplicity of the model makes calculations easier. If one is interested in actual RNA secondary structure prediction one has to use a much more refined energy model. Such an energy model assigns the energy of paired bases depending on the neighboring paired bases in order to reflect the fact that actually the strongest contribution to the base pairing energy comes from stacking two base pairs on top of each other in a double helical structure. Also, energy is associated with unpaired bases in loops which reflects bending energy and entropic losses due to the formation of the loops including an infinite energy for hairpin loops with less than the minimum of three unpaired bases as well as partial stacking between the base pairs in the stems connecting to the loops and the unpaired bases in the loops. This leads to energy models with literally thousands of parameters that have been painstakingly measured (Mathews et al. 1999). These energy models, called “nearest neighbor energy models” are very successfully used in RNA structure prediction programs (Zuker and Stiegler 1981; Hofacker et al. 1994; Zuker 2003; Ding et al. 2004; Xayaphoummine et al. 2005) but much too complicated for our purposes here. 2.3 Partition functions The complete thermodynamic information about any physical system is encoded in its partition function. In our notation, the partition function for RNA secondary structure formation is written as Z=

 S



E[S] exp − kB T

 (2)

where T is the temperature in Kelvin and k B is the Boltzmann constant. The sum extends over all allowed secondary structures and is thus in general difficult to calculate. The partition function depends on the individual sequence of an RNA molecule through the energy function E[S]. The beauty of RNA secondary structure formation is that this partition function can be calculated in a very efficient manner for an arbitrary sequence (McCaskill 1990). For this purpose, we define as Z i j the partition function as in Eq. (2) but for an RNA molecule that only includes sequence positions i, i + 1, . . . , j of the original

123

R. Bundschuh

molecule. Then, we consider the partition function Z i, j+1 . In a given term of this partition function the jth base of the original molecule has to be either unpaired or paired with some other base i ≤ k ≤ j. Thus, 

Z i, j+1 =

S|∀k (k, j+1)∈ S

   j E[S] + exp − kB T



k=i S|(k, j+1)∈S

  E[S] exp − kB T

where “S|some condition” refers to all secondary structures S that fulfill that condition. In the base pairing energy model given by Eq. (1) the first of these two terms is simply the partition function Z i, j for positions i, . . . , j. The sum over the structures of the second term can be simplified by realizing that due to the no pseudoknot constraint if base k and base j + 1 are paired no other base pair (l, m) with l < k and m > k can appear in the structure S. Thus,   E[S] exp − kB T S|(k, j+1)∈S    E[S1 ∪ S2 ∪ {(k, j + 1)}] = exp − kB T S1 S2       εk, j+1 E[S1 ]  E[S2 ] , = exp − exp − exp − kB T kB T kB T 

S1

S2

where S1 and S2 represent the allowed secondary structures on the positions i, . . . , k−1 and k + 1, . . . , j, respectively. This leads to the recursion equation Z i, j+1 = Z i, j +

j  k=i

  εk, j+1 Z i,k−1 Z k+1, j exp − kB T

(3)



εi,i+1

where Z m,m−1 ≡ 1 by fiat. Starting from Z i,i = 1 and Z i,i+1 = 1 + e k B T (notice that in the simplest energy model used here hairpins with zero unpaired bases are allowed) this recursion equation can be used to calculate the partition function for all subsequences of length three from the knowledge of the partition functions of all subsequences of length one and two, the partition function for all subsequences of length four from the knowledge of the partition functions of all subsequences of length one, two, and three, etc. Thus, after calculating O(N 2 ) sub-partition functions the calculation of which each involves one summation and thus O(N ) operations each, the total partition function can be calculated with the help of Eq. (3) in O(N 3 ) time. The same idea—yet in a much more complicated incarnation—can still be used for the more complicated but more realistic nearest neighbor energy model and forms the basis of modern secondary structure prediction software (Zuker and Stiegler 1981; Hofacker et al. 1994; Zuker 2003; Ding et al. 2004; Xayaphoummine et al. 2005).

123

Partition functions of RNA secondary structures

2.4 Molten phase The molten phase occurs if temperatures are low enough such that base pairing is favorable but not yet so low that differences in pairing energies of different structures are relevant. Under these circumstances we can ignore all sequence details and just assign an energy to each structure that only depends on the number of base pairs in this structure, i.e., E[S] = ε0 |S|

(4)

with some (negative) ε0 . An example of a molecule in this molten phase suggested by de Gennes (1968) is a molecule with the sequence AUAUAUAU. . .. In such a molecule any consecutive AU can pair with any other consecutive AU and all such pairings contribute the same binding energy. Thus, this molecule is described by Eq. (4) if each pair of original bases is treated as one “effective base”. However, it has been pointed out that any molecule at temperatures close to but below the denaturation temperature behaves as if it were described by the energy model Eq. (4) if one combines several bases into one “effective base” since any piece of random RNA will attract any other piece of random RNA with an approximately equal binding energy (Bundschuh and Hwa 1999). In order to quantitatively characterize the molten phase of RNA, we have to calculate the partition function in this molten phase. We denote the partition function for a molecule of N bases by G(N + 1) and introduce the notation   ε0 q ≡ exp − kB T for the Boltzmann factor of a base pair. Since the energy function Eq. (4) does not depend on the actual sequence, the recursion equation (3) simplifies to G(N + 1) = G(N ) + q

N −1 

G(i)G(N − i)

(5)

i=1

with the boundary condition G(1) = 1. In the limit of large sequence length N this recursion equation can be solved by introducing the z-transform (Jury 1973; El Attar 2006)  ≡ G(z)

∞ 

G(N )z −N .

(6)

N =1

This z-transform is a function of the complex variable z that encodes G(N ) via the inverse z-transform  1 N  dz (7) G(N ) = G(z)z 2πi

123

R. Bundschuh

where the complex integral is along a closed contour outside the singularity with  largest absolute value of G(z). The large N behavior of G(N ) is completely given  in the vicinity of its singularity with the largest by the behavior of the function G(z) real part. This approach is closely related to the technique of generating functions followed by singularity analysis (see e.g., Nebel 2003), where the z-transform used here is equivalent to the generating function of the partition function G(N ) and the calculation of the inverse z-transform via complex contour integrals corresponds to the singularity analysis.  by multiplying Eq. (5) with z −N on both sides and summing We can calculate G(z) over N . Careful inspection of both sides yields  − 1 = G(z)  + q G(z)  2. z G(z)  with the solution This is a simple quadratic equation for G(z)  = z−1 − G(z) 2q





z−1 2q

2 −

1 q

(8)

(the other solution does not fulfill the boundary condition that any z-transform has to vanish as |z| approaches infinity). Applying the inverse z-transform Eq. (7) to this result yields (see e.g., (Liu and Bundschuh 2005))  G(N ) ≈

√ 1 + 2 q −3/2 √ N (1 + 2 q) N 3/2 4πq

(9)

 in the limit of large N which comes from the square root singularity of G(z) at √ z 0 = 1 + 2 q. From this result, we can e.g., read off how many RNA secondary structures there are, since the partition function for ε0 = 0 orequivalently q = 1 simply counts all the

3 N −3/2 3 N for a molecule with N − 1 structures. Thus, we find this number to be 4π bases. This differs from the classical result in Waterman (1978) since in the classical result structures in which two bases that are immediately next to each other are paired are excluded while we allow these structures here.

2.5 Molten native transition Now, we will study what happens if the sequence of an RNA molecule is designed to prefer a specific structure S0 , which we call the “native structure”. In the spirit of the molten phase this is modeled by assigning two different energies to the base pairs, namely one energy ε1 to every base pair in the native structure, and a different energy ε0 to every other base pair, i.e.,

123

Partition functions of RNA secondary structures

εi, j =

ε1 (i, j) ∈ S0 . ε0 (i, j) ∈ S0

(10)

If ε1 = ε0 this is the same as the molten phase but as ε1 becomes more negative than ε0 , there is a preference for the designed structure that becomes the more pronounced the bigger the difference between ε1 and ε0 . In the context of protein folding this model is known as the G¯o model (G¯o 1983). We will now review how the z-transform of the partition function of this model was calculated exactly in Bundschuh and Hwa (1999). In Bundschuh and Hwa (1999) the native structure for a molecule of (even) size M is a simple hairpin, i.e., the q) structure S0 = {(1, M), (2, M − 1), . . . , (M/2, M/2 + 1)}. We denote by Z (N ; q,

the partition function of this model for a molecule of length M = 2N − 2 where q has the same meaning as for the molten phase and   ε1

q ≡ exp − kB T is the Boltzmann factor for the native base pairs, i.e., for the base pairs in the native structure S0 . The first step in calculating this partition function is to note that the sum over all possible structures can be decomposed into a sum over the number n of native base pairs in a structure and within this into a sum over all possible positions i 1 , . . . , i n of these native base pairs as follows  N  −1  E[S] = Z (N ; q,

q) = exp − kB T S





n=0 1≤i 1

Unified approach to partition functions of RNA secondary structures.

RNA secondary structure formation is a field of considerable biological interest as well as a model system for understanding generic properties of het...
355KB Sizes 0 Downloads 0 Views