Modeling RNA Secondary Structures I. Mathematical Structural Model for Predicting RNA Secondary Structures

F. SOLER AND K. JANKOWSKI
Département de mathématique, Département de chimie et biochimie, Université de Moncton, Moncton, New Brunswick, Canada E1A 3E9

Received 12 June 1990; revised 17 January 1991

ABSTRACT A mathematical model for analyzing the secondary structures of RNA is developed that is based on the connection matrix associated with the planar p-h graph. The classification of the elementary structures allows the introduction of the basis of structural space from which to build the global secondary structure. All admissible solutions belong to the configuration space and can be obtained directly from its basis.

1. INTRODUCTION

The accepted models of nucleic acids explain their biochemical behavior in a satisfactory and concerted manner. The search for alternative structures for DNA or RNA is essentially justified by theoretical rather than by physiological, biochemical, and functional interests and considerations. However, these structures, if they are shown to exist, allow us to know better the behavior of the molecule in question in a noncellular environment and, by the same token, to alter the current hypothesis and to some extent advance the quest for a new amendment to the accepted models. In principle, all methodologies applied are based on the concept of re- and denaturation of nucleic acids under several additional imposed constraints. These rules for formation (or breakdown) of nucleic acids are of two origins: thermodynamic and kinetic. In both cases the extremely complex stereochemistry of these compounds should be taken into consideration. The models resulting from such a study should respect the primary structures of the nucleic acids, the stereochemistry of double-strand DNA and a three-dimensional structure of RNA, and the stability analysis of helical or looplike domains resulting from calculations of the conformation

MATHEMATICAL BIOSCIENCES 105:167-190 (1991)

©Elsevier Science Publishing Co., Inc., 1991 655 Avenue of the Americas, New York, NY 10010

167 0025-5564/91/$03.50


done on the oligo or longer chains of nucleosides and nucleotides, and so on. As a result, the calculated models are evaluated for their utility in order to explain their biochemical function and are then tested on larger compounds displaying specific function.

The need for further models is particularly striking in the RNA family. The cloverleaf structure for tRNA is the most popular and the most studied structure. The folding of tRNA in the natural L-shaped structure, with the anticodon (a.c.) and dihydrouracil (D) stems forming one and the amino acid (a.a.) and thymidine (T) stems the second arm of the cloverleaf, leads to the structure with three loops and four minihelix areas. The stereochemical demand for folding of the 70-90-unit polynucleotide into the highly organized cloverleaf structures is such that the presence of invariant bases in key positions is necessary to ensure folding into the L-shaped structure. These bases are involved in a tertiary association when such folding occurs. The other bases as well as the minor and variable nucleotides present could lead to several unusual structures involving some nonclassical as well as interstem base pairings. The other RNAs studied (mRNA, cRNA, and rRNA) reveal the presence of non-cloverleaf structures. This consideration alone justifies the search and stability calculations for these new structures.

The collapse of the polynucleotide chain into some organized structure can be based on free-folding concepts. Any given sequence of nucleic acid (primary structure) in its single-strand form can form hydrogen-bonded base pairs (h-bonds) when allowed by external constraints. Pyrimidine-purine base pairs (A-T or A-U and G-C) display pseudosymmetry and are the conceptual cornerstone of the Watson-Crick structures. The homo or hetero bases can be arranged in 28 ways, but only the above-mentioned Watson-Crick pairs displaying this pseudosymmetry can lead to the normal and regular double-helix or minihelix structures. When allowed to rotate freely, the tRNA sequence of 70-90 nucleotides can eventually adopt the cloverleaf structure. The starting point for analyzing its formation in a noncellular environment is the first base pair of Watson-Crick or non-Watson-Crick type; then the following bases pair by a zipperlike mechanism [1].

2. THE PRIMARY STRUCTURE

Any RNA n-molecule can be represented as an ordered set, a sequence of n nucleotides. Each nucleotide is formed by a base that is covalently bonded to the next by a phosphate bond (p-bond). Any RNA can also be considered as a word built from the alphabet $\mathcal{A} = \{A, U, C, G\}$ of bases. If $I = \{1, 2, \ldots, n\}$, then any ordered mapping $m: I \to \mathcal{A}$ is a primary RNA structure of the n-molecule. To each position $i \in I$ corresponds a base $m(i) \in \mathcal{A}$. The molecule is oriented with the first base m(1) at the 3' end of the molecule and m(n) at the 5' end. The molecule can be represented as a sequence of n bases, 3'-ACC...UCG-5', or by the corresponding positions, $1 \to 2 \to 3 \to \cdots \to n-1 \to n$.
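As a minimal sketch of this representation, the mapping m and the chain of p-bonds can be written down directly. The names `ALPHABET`, `primary_structure`, and `p_bonds`, as well as the six-base word used below, are illustrative assumptions and do not come from the original model.

```python
ALPHABET = {"A", "U", "C", "G"}

def primary_structure(word):
    """Return the mapping m: I -> alphabet as a dict keyed by 1-based position."""
    if not set(word) <= ALPHABET:
        raise ValueError("word contains a symbol outside {A, U, C, G}")
    return {i: base for i, base in enumerate(word, start=1)}

def p_bonds(n):
    """Each base is p-bonded to the next one: pairs (i, i + 1) for i < n."""
    return [(i, i + 1) for i in range(1, n)]

m = primary_structure("GGAUCC")   # hypothetical 6-molecule
print(m[1], m[6])                 # bases at the first and last positions
print(p_bonds(len(m)))            # the n - 1 = 5 p-bonds of the chain
```

Keeping positions 1-based mirrors the index set I used in the text, at the cost of an offset whenever the underlying string is indexed.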

One p-bond begins in each position $i \in I$ and finishes in position i + 1, except that position n has no emerging p-bond. The primary structure is built from $\mathcal{A}$ with the p-bond relation and is consequently a chain that we call the p-graph. For the p-graph the vertices are the bases and the p-bonds the edges. For an n-molecule there are n!/(a! u! c! g!) different primary structures (where a, u, c, and g are the numbers of A, U, C, and G bases, respectively). When considered as words of the $\mathcal{A}$ alphabet, the Kleene closure $\mathcal{A}^*$ is the denumerable set of all RNA molecules. Although any word of $\mathcal{A}^*$ is of chemical interest, only a subset of $\mathcal{A}^*$ is of biological interest.

3. THE SECONDARY STRUCTURE

When an RNA molecule is allowed to interact in a free medium, the molecule folds, and hydrogen bonds, called here h-bonds, are formed. They can be of any of three kinds:

(i) G-C, three h-bonds
(ii) A-U, two h-bonds
(iii) G-U, two h-bonds (wobble base pair).    (3.1)

When only the h-bonds (i) and (ii) are considered, the so-called Watson-Crick pairing takes place; non-Watson-Crick pairing occurs when (iii) is allowed. Some exceptional non-Watson-Crick pairing will be allowed in our calculations when so specified. In an isolated molecule the primary structure folds, making certain h-bonds to minimize the free energy. The folding pathways may be greatly influenced by the sequential appearance of residues [2].

The nonoriented graph (p-h graph) whose vertices are the n bases of the RNA n-molecule and whose edges are the p-bonds and h-bonds is called the secondary structure. Two vertices $i, j \in I$ are said to be p-adjacent if there exists a p-bond p(i, j), where j = i + 1 or i = j + 1 because the p-graph is a chain. They are said to be h-adjacent if there is an h-bond h(i, j). When a bond of either kind, p-bond or h-bond, exists between $i, j \in I$, we write b(i, j). Because any base can have at most two p-bonds and one h-bond, every vertex of the p-h graph is at most of degree 3. Any secondary structure must satisfy the following conditions:

(i) b(i, i) do not exist; $b(i, j) \Rightarrow i \neq j$
(ii) b(i, i + 1), i < n, is a p-bond; p(i, i + 1) for i < n
(iii) $p(i, j) \Rightarrow i = j + 1$ or $j = i + 1$
(iv) p(i, j), i < n, do not exist
(v) $b(i, j) \Rightarrow h(i, k)$ or $h(k, j)$ are impossible for $k \neq j, i$
(vi) h(i, j), $i \neq j$, satisfy (3.1)
(vii) $h(i, j) \equiv h(j, i)$
(viii) $b(i, j) = h(k, l) \Rightarrow i = k$ and $j = l$    (3.2)

We will call set (3.2) the stereochemical constraints.
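Part of these constraints can be checked mechanically. The sketch below is one possible reading, not the authors' procedure: it tests a proposed list of h-bonds only for self-bonds, for h-bonds between p-adjacent bases, for base pairs outside (3.1), and for bases carrying more than one h-bond. The function name `violates_constraints` and the example word are assumptions made for illustration.

```python
# Allowed h-bonds from (3.1): G-C, A-U, and the wobble pair G-U.
ALLOWED_PAIRS = {frozenset("GC"), frozenset("AU"), frozenset("GU")}

def violates_constraints(word, h_bonds):
    """Return a list of violation messages (empty if none are found).

    word    : primary structure as a string; positions are 1-based
    h_bonds : iterable of (i, j) position pairs proposed as h-bonds
    """
    n = len(word)
    problems = []
    paired = set()
    for i, j in h_bonds:
        if not (1 <= i <= n and 1 <= j <= n):
            problems.append(f"h({i},{j}): position outside 1..{n}")
            continue
        if i == j:
            problems.append(f"h({i},{j}): a base cannot bond to itself")
            continue
        if abs(i - j) == 1:
            problems.append(f"h({i},{j}): b(i, i+1) is already the p-bond")
        if frozenset((word[i - 1], word[j - 1])) not in ALLOWED_PAIRS:
            problems.append(f"h({i},{j}): {word[i - 1]}-{word[j - 1]} is not in (3.1)")
        for k in (i, j):
            if k in paired:
                problems.append(f"h({i},{j}): base {k} already has an h-bond")
            paired.add(k)
    return problems

word = "GGGAAACCC"  # hypothetical 9-molecule
ok = violates_constraints(word, [(1, 9), (2, 8), (3, 7)])
bad = violates_constraints(word, [(1, 2), (1, 9)])
print(ok)
print(bad)
```

Returning messages instead of a single boolean makes it easier to see which condition a candidate structure breaks. The check is deliberately partial and does not cover every item of (3.2).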

[FIG. 1. Graphs of the secondary structures with one h-bond; panel labels S_1(3) = 1, S_1(4) = 3, S_1(5) = 6, S_1(6) = 10.]

Theoretically there are many possibilities for the folding of an n-molecule. The number of p-bonds is always n - 1, but the number of h-bonds could be any number less than or equal to int(n/2). The total number of secondary structures an n-molecule can take will be denoted by S(n), and the number of structures with only k h-bonds, where $k \le \mathrm{int}(n/2)$, by $S_k(n)$. Then we have

\[ S(n) = \sum_{k=0}^{\mathrm{int}(n/2)} S_k(n) \tag{3.3} \]

where $S_0(n)$, the number of structures with no h-bonds, is 1, corresponding to the primary structure with n - 1 p-bonds. We represent the number of secondary structures of the n-molecule with only one h-bond by $S_1(n)$. This will be shown later to be

\[ S_1(n) = \frac{(n-1)(n-2)}{2} \quad \text{for } n > 2 \tag{3.4} \]

corresponding to the triangular numbers. This gives us the values and graphs shown in Figure 1, where 1, 3, 6, 10, ... are triangular numbers. The preceding structures can easily be obtained if we represent the p-h graph as a linear or circular graph as in Figure 2, where the dashed line in both cases represents the h-bond h(3, n - 1).
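The triangular-number count can be checked by brute force. The sketch below assumes that base identities are ignored and that the only restrictions applied are i ≠ j and the exclusion of neighbouring positions (i, i + 1), which already share a p-bond. Under these assumptions the enumeration agrees with the closed form (n - 1)(n - 2)/2. The function names are illustrative.

```python
from itertools import combinations

def one_h_bond_structures(n):
    """List all position pairs (i, j), i < j, that can carry a single h-bond.

    Base identities are ignored; neighbours (i, i + 1) are excluded
    because they are already joined by a p-bond.
    """
    return [(i, j) for i, j in combinations(range(1, n + 1), 2) if j - i >= 2]

def s1(n):
    """Closed form (3.4) for the number of one-h-bond structures."""
    return (n - 1) * (n - 2) // 2

# The enumeration and the closed form agree for small n.
for n in range(3, 8):
    assert len(one_h_bond_structures(n)) == s1(n)

print(one_h_bond_structures(4))        # [(1, 3), (1, 4), (2, 4)]
print([s1(n) for n in range(3, 7)])    # [1, 3, 6, 10]
```

The count follows from taking all n(n - 1)/2 unordered pairs and removing the n - 1 neighbouring pairs.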

Without restrictions (3.2), the total number of structures built with n bases and k h-bonds is

\[ \binom{\sigma}{k} = \frac{\sigma!}{k!\,(\sigma - k)!} \tag{3.5} \]

where $\sigma = n \times n$. However, when we introduce conditions (3.2) this number is smaller. If we suppose for one h-bond that b(i, j) = b(j, i), then $\sigma = n(n + 1)/2$. If we eliminate b(i, i), then $\sigma = n(n - 1)/2$. If we add the restriction that h(i, i + 1) are not allowed, then $\sigma = (n - 1)(n - 2)/2$.

[FIG. 2. Linear and circular representations of the p-h graph; the dashed line represents the h-bond h(3, n - 1).]

With none of these restrictions the total number of structures with any possible number of h-bonds is deduced from (3.5) to be

\[ \binom{\sigma}{0} + \binom{\sigma}{1} + \cdots \tag{3.6} \]

With $S_k(n)$ we represent the number of structures built with n - 1 p-bonds and k h-bonds, which satisfy conditions (3.2).
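To see how quickly the restrictions shrink the count, the following sketch tabulates σ under each successive rule and evaluates the binomial (3.5) for small k. The function names and the choice n = 6 are assumptions made for illustration only.

```python
from math import comb

def sigma_values(n):
    """Number of candidate bond slots sigma under successively tighter rules."""
    return {
        "all ordered pairs": n * n,
        "b(i,j) = b(j,i)": n * (n + 1) // 2,
        "no b(i,i)": n * (n - 1) // 2,
        "no h(i,i+1)": (n - 1) * (n - 2) // 2,
    }

def structures_with_k_bonds(sigma, k):
    """Formula (3.5): ways to choose k bonds among sigma slots."""
    return comb(sigma, k)

n = 6
for rule, sigma in sigma_values(n).items():
    print(f"{rule:20s} sigma = {sigma:2d}  k=1: {structures_with_k_bonds(sigma, 1):2d}"
          f"  k=2: {structures_with_k_bonds(sigma, 2)}")
```

For k = 1 the last row gives 10, the fourth triangular number. For k ≥ 2 the binomial is only an upper bound on $S_k(n)$, because it still counts selections in which one base carries two h-bonds.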

[FIG. 3. Example structures drawn as arcs over the chain of bases.]

[FIG. 4. Three representations of the same secondary structure.]

Examples of such structures are illustrated in Figure 3, where the number of arcs built must equal k. The three representations in Figure 4 show the same secondary structure with five h-bonds. We found in the literature all three representations of the p-h graph. We add restrictions (3.2) so that no knotted structures will be allowed; that is, for any consecutive h-bonds, $h(i_1, j_1), h(i_2, j_2), h(i_3, j_3), \ldots, h(i_l, j_l)$, we then have $i_1$
