Publisher: Taylor & Francis. Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Journal of Biomolecular Structure and Dynamics Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/tbsd20

An Algebraic Representation of RNA Secondary Structures

Yuri Magarshak & Craig J. Benham

Biomathematical Sciences Department, Box 1023, Mount Sinai School of Medicine, 1 Gustave Levy Place, New York, NY 10029

Published online: 21 May 2012.

To cite this article: Yuri Magarshak & Craig J. Benham (1992) An Algebraic Representation of RNA Secondary Structures, Journal of Biomolecular Structure and Dynamics, 10:3, 465-488, DOI: 10.1080/07391102.1992.10508663 To link to this article: http://dx.doi.org/10.1080/07391102.1992.10508663


Journal of Biomolecular Structure & Dynamics, ISSN 0739-1102, Volume 10, Issue Number 3 (1992), ©Adenine Press (1992).

An Algebraic Representation of RNA Secondary Structures


Yuri Magarshak and Craig J. Benham
Biomathematical Sciences Department, Box 1023
Mount Sinai School of Medicine
1 Gustave Levy Place
New York, NY 10029

Abstract

This paper develops mathematical methods for describing and analyzing RNA secondary structures. It was motivated by the need to develop rigorous yet efficient methods to treat transitions from one secondary structure to another, which we propose here may occur as motions of loops within RNAs having appropriate sequences. In this approach a molecular sequence is described as a vector of the appropriate length. The concept of symmetries between nucleic acid sequences is developed, and the 48 possible different types of symmetries are described. Each secondary structure possible for a particular nucleotide sequence determines a symmetric, signed permutation matrix. The collection of all possible secondary structures consists of all matrices of this type whose left multiplication with the sequence vector leaves that vector unchanged. A transition between two secondary structures is given by the product of the two corresponding structure matrices. This formalism provides an efficient method for describing nucleic acid sequences that allows questions relating to secondary structures and transitions to be addressed using the powerful methods of abstract algebra. In particular, it facilitates the determination of possible secondary structures, including those containing pseudoknots. Although this paper concentrates on RNA structure, this formalism also can be applied to DNA.

RNA Sequences and their Symmetries

Nucleic Acid Sequence Vectors

To develop algebraic methods for describing and analyzing nucleic acid sequences, the four bases C, G, U (or, in DNA, T), and A are designated by the complex numbers -1, 1, -i, and i, respectively. In consequence of this choice, the numbers associated to complementary bases sum to zero, and bases of the same type (i.e. both purines or both pyrimidines) have the same signs. This arrangement allows the sequence of nucleotides in a single stranded nucleic acid molecule to be written as a complex vector

$$\vec{\lambda} = (\lambda_1, \ldots, \lambda_n).$$


The indices 1, ..., n enumerate the bases of the sequence, as they are encountered when the molecule is traversed in the 5'-3' direction. The value of the k-th coordinate of this sequence vector is the complex number associated to the k-th base according to our rule: A $\to$ +i, C $\to$ -1, G $\to$ +1, T or U $\to$ -i.

Complementary sequences within a single stranded RNA molecule can base pair to form a duplex structure, provided the complementary bases in the two participating strands have opposite 5'-3' orientations. Vectors $\vec{\lambda}$ and $\vec{\mu}$ corresponding to perfectly complementary sequences of length n are related by

$$\lambda_k = -\mu_{n-k+1}, \quad k = 1, \ldots, n.$$

The construction of a sequence complementary to a given one requires two operations. First, each complementary base must be found, which involves multiplying by -1 in the present formulation. Then these complementary bases must be ordered with the reverse orientation to that of the original sequence. A sequence containing two perfectly complementary subsequences, one comprising bases j+1, ..., j+n and the other comprising bases j+m+1, ..., j+m+n (m > n), has the structure

$$\lambda_{j+k} = -\lambda_{j+m+n+1-k}, \quad k = 1, \ldots, n. \qquad [1]$$
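The encoding rule and the complementarity relations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`sequence_vector`, `reverse_complement_vector`, `find_complementary_pair`) are hypothetical and do not come from the paper, and sequence vectors are stored as plain lists of complex numbers.

```python
# Base-to-number rule quoted from the text; all function names are illustrative.
BASE_TO_COMPLEX = {"A": 1j, "C": -1 + 0j, "G": 1 + 0j, "U": -1j, "T": -1j}


def sequence_vector(seq):
    """Return the sequence vector as a list of complex numbers, read 5'->3'."""
    return [BASE_TO_COMPLEX[base] for base in seq.upper()]


def reverse_complement_vector(vec):
    """Apply mu_k = -lambda_(n-k+1): negate each entry and reverse the order."""
    return [-x for x in reversed(vec)]


def find_complementary_pair(vec, n):
    """Return all (j, m) for which equation [1] holds for subsequences of length n.

    As in the text, the subsequences occupy bases j+1..j+n and
    j+m+1..j+m+n (1-based), with m > n.
    """
    hits = []
    total = len(vec)
    for j in range(total):
        for m in range(n + 1, total - j - n + 1):
            # 0-based form of lambda_(j+k) = -lambda_(j+m+n+1-k)
            if all(vec[j + k - 1] == -vec[j + m + n - k] for k in range(1, n + 1)):
                hits.append((j, m))
    return hits


lam = sequence_vector("GGAC")
mu = sequence_vector("GUCC")  # GUCC is the reverse complement of GGAC
print(reverse_complement_vector(lam) == mu)
print(find_complementary_pair(sequence_vector("GGAAAACC"), 2))
```

In the second call the search looks for two-base complementary subsequences in GGAAAACC; the pair GG (bases 1-2) and CC (bases 7-8) corresponds to j = 0 and m = 6.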

A sequence containing such a complementary pair of subsequences sometimes is denoted $(\ldots\lambda_1\ldots\lambda_1^c\ldots)$. (Although sequence vectors actually are column vectors, to save space they are written within the text in this transposed form.)

The collection of all possible RNA sequences containing n total bases comprises a set $\Lambda_n$ of vectors residing in the n-dimensional complex vector space $\mathbb{C}^n$. $\Lambda_n$ is not itself a subspace because it is not closed under addition (all components must be ±1 or ±i) (1). Its geometric realization is the set of vertices of the 2n (real) dimensional hypercube in $\mathbb{C}^n$ whose diagonals, when projected on each complex coordinate plane, are [-1, 1] and [-i, i]. ($\Lambda_n$ becomes a group when it is endowed with either of two binary operations, either multiplication or the complex dot product of corresponding components. However, this structure plays no role in the developments reported here.)

Symmetry Relations within RNA Sequences

Subsequences within a nucleic acid molecule can be related to each other in several regular ways, some of which are known to be biologically important. The simplest example is the direct repeat, a structure which occurs frequently throughout eukaryotic genomes. A sequence containing a direct repeat, one copy of which occurs at bases j+1, ..., j+n and the other at bases j+m+1, ..., j+m+n (m > n), has the structure

$$\lambda_{j+k} = \lambda_{j+m+k}, \quad k = 1, \ldots, n.$$

A mirror repeat occurs when two subsequences have the identical order of nucleotides, where one is read in the 5'-3' direction and the other is read in the 3'-5' direction. A sequence containing a mirror repeat has the structure

$$\lambda_{j+k} = \lambda_{j+m+n+1-k}, \quad m > n, \quad k = 1, \ldots, n.$$

Palindromes are a special case of a mirror image in which the two copies abut. The only difference between direct repeat and mirror repeat sequences is the direction in which the nucleotides are read. The reverse complementarity needed for duplex formation within a single stranded RNA molecule is a third important type of symmetry, in which the subsequences involved are related as shown in Equation [1] above. To construct either sequence from the other, each base is mapped to its complement and the sequences are arranged in reverse order.

Many more types of symmetries are available to pairs of subsequences within a nucleic acid molecule, most of which have not been considered previously. Given a particular subsequence, there are 4! = 24 ways of constructing a related subsequence by altering the bases through one-to-one reassignments. For example, one could construct a new sequence from an existing one by replacing A by U, U by A, G by C and C by G. This may be written as A $\leftrightarrow$ U, G $\leftrightarrow$ C. In our algebraic notation this association is given by $x \to -x$. This mapping replaces a given pyrimidine with its complementary purine, and vice versa. If the constructed sequence is ordered in the same direction as the original, one has a direct complement. If the ordering is reversed, the reverse complement sequence results. The mapping $x \to ix$ associates the bases in the order G $\to$ A $\to$ C $\to$ U $\to$ G. Complex conjugation, which reverses the sign of every imaginary component while leaving real components fixed, associates A $\leftrightarrow$ U.

Because the subsequences related by each of the 24 possible mappings can be oriented in either direct or inverted order, a total of 48 different possible relationships exist. Biologically important roles are known for direct and mirror repeats, and for reverse complementary sequences. To date little consideration has been given to the possible biological significance of other types of symmetry relationships between subsequences.

It is clear that various of the associations described above will preserve physico-chemical attributes such as possible secondary structures, hydrogen bond numbers therein, natural curvature, or bending or torsional stiffnesses. The distributions of specific attributes between the two subsequences involved will be related by a symmetry which depends on their relative orientations. Any of these attributes (or several others) could potentially be important in the biological context.
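The three algebraic mappings named above, and the count of 48 relationships, can be checked with a short sketch. The helper names (`map_sequence`, `negate`, `times_i`, `conjugate`) are assumptions made for illustration; only the base-to-number rule is taken from the text.

```python
from itertools import permutations

# Base-to-number rule from the text (RNA alphabet only, for brevity).
VALUE = {"G": 1 + 0j, "A": 1j, "C": -1 + 0j, "U": -1j}


def _key(z):
    """Round a complex number to an integer pair so that -0.0 and 0.0 agree."""
    z = complex(z)
    return (round(z.real), round(z.imag))


BASE_OF = {_key(v): b for b, v in VALUE.items()}


def map_sequence(seq, f, reverse=False):
    """Apply the number map f to every base; optionally invert the order."""
    out = [BASE_OF[_key(f(VALUE[b]))] for b in seq]
    return "".join(reversed(out) if reverse else out)


def negate(x):
    """x -> -x, i.e. A <-> U and G <-> C."""
    return -x


def times_i(x):
    """x -> ix, i.e. G -> A -> C -> U -> G."""
    return 1j * x


def conjugate(x):
    """Complex conjugation, i.e. A <-> U with G and C fixed."""
    return x.conjugate()


seq = "GACU"
direct_complement = map_sequence(seq, negate)
reverse_complement = map_sequence(seq, negate, reverse=True)
rotated = map_sequence(seq, times_i)
conjugated = map_sequence(seq, conjugate)

# 4! one-to-one base reassignments, each in direct or inverted order.
n_relationships = 2 * len(list(permutations("GACU")))
print(direct_complement, reverse_complement, rotated, conjugated, n_relationships)
```

Only three of the 24 reassignments are written out here because they are the ones the text expresses algebraically; the remaining ones would need an explicit lookup table rather than a single complex operation.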

Algebraic Representations of RNA Secondary Structures

Self-complementary subsequences within a single-stranded RNA molecule can associate to form regions of intra-strand base pairing. Each collection of compatible base pairs permitted by the base sequence constitutes a secondary structure for the molecule. In each secondary structure any residue can pair with at most one other, and duplex formation can only occur between reverse complementary subsequences. The definition of secondary structure used here does not consider distance or steric constraints on base pair formation, so some possible secondary structures as presently defined may not be realizable in practice.

Two methods of describing RNA secondary structures are presented here. The first involves a matrix representation that provides a useful classification of all secondary structures available to a particular sequence. The matrices developed here describe individual secondary structures. The collection of all possible secondary structure matrices is shown to obey a simple eigenvalue equation. This approach is different from earlier matrix treatments, in which a matrix was used to describe the base pairings available to a given sequence (2,3). The second method uses diagrammatic techniques that permit clear visualization of the structures involved.

Secondary Structure Matrices

The secondary structures compatible with a sequence $\vec{\lambda}$ are given by matrices having a particular form. If the sequence contains n nucleotides, all of its structure matrices have dimension $n \times n$. The matrix S corresponding to a particular secondary structure is defined as follows. If base i is paired with base j in this secondary structure, then $s_{ij} = s_{ji} = -1$. If base i is not paired with base j ($i \neq j$), then $s_{ij} = s_{ji} = 0$. Also, $s_{ii} = 1$ if base i is not paired with any other base, and $s_{ii} = 0$ otherwise. Because any base can pair with at most one other, S has a single non-zero entry in each row, and in each column. All off-diagonal entries are either 0 or -1, while all diagonal entries are either 0 or 1. Thus structure matrices are signed permutation matrices (signed in the sense that all non-zero off-diagonal elements have negative signs).

Two examples illustrate these constructions. First, the matrix corresponding to the secondary structure in which no base pairs are formed (hereafter called the trivial structure) is the $n \times n$ identity matrix $I_n$. The sequence AUA can form two non-trivial secondary structures, in which the U base pairs with either the last or the first A. The matrices corresponding to these structures are

$$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & -1 & 0 \end{pmatrix}, \qquad \begin{pmatrix} 0 & -1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}. \qquad [2]$$
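The definition of a structure matrix can be turned into a small constructor. The sketch below assumes a hypothetical helper `structure_matrix(n, pairs)` that takes 1-based base pairs; it reproduces the two AUA matrices of expression [2] and checks that each is symmetric and squares to the identity.

```python
def structure_matrix(n, pairs):
    """Build the n x n structure matrix for a list of 1-based base pairs (i, j)."""
    S = [[0] * n for _ in range(n)]
    paired = set()
    for i, j in pairs:
        if i == j or i in paired or j in paired:
            raise ValueError("each base can pair with at most one other base")
        paired.update((i, j))
        S[i - 1][j - 1] = S[j - 1][i - 1] = -1   # s_ij = s_ji = -1
    for k in range(1, n + 1):
        if k not in paired:
            S[k - 1][k - 1] = 1                   # s_kk = 1 for unpaired bases
    return S


def matmul(A, B):
    """Plain-Python matrix product for small integer matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


S_last = structure_matrix(3, [(2, 3)])    # U pairs with the last A
S_first = structure_matrix(3, [(1, 2)])   # U pairs with the first A
trivial = structure_matrix(3, [])         # no base pairs: the identity matrix

for S in (S_last, S_first):
    print(S, S == transpose(S), matmul(S, S) == trivial)
```

The constructor checks only that no base is used twice; whether the paired bases are actually complementary is a property of the sequence, not of the matrix.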

A structure matrix S has entries $s_{ij} = s_{ji} = -1$ exactly when bases i and j are hydrogen bonded. It follows that S is a symmetric matrix. Structure matrices also have the property that $SS = I_n$. In consequence, $S = S^{-1}$, so S is a symmetric, orthogonal matrix, whose only eigenvalues are ±1 (1).

One can view the off-diagonal entries of S as describing the collection of base pairs that must be formed in passing from the trivial, unpaired structure to the pattern of base pairing characterizing the given structure. Then $S^{-1}$ describes the base pairs that must be disrupted in passing from the existing secondary structure back to the trivial one. Because $S = S^{-1}$, structure matrices describe the bases whose pairing is altered, either formed or disrupted, in passing between the given structure and the trivial one, in either direction. This perspective will be developed further in the section below treating transitions between secondary structures.

The relationship between the structure matrix S and the sequence vector is given by the eigenvalue equation:

$$S\vec{\lambda} = \vec{\lambda}. \qquad [3]$$
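Equation [3] gives a direct test of whether a sequence can adopt a given structure. A minimal sketch, assuming a hypothetical function name `satisfies_eigenvalue_equation` and the AUA matrices of expression [2]:

```python
# Base-to-number rule from the text.
BASE_TO_COMPLEX = {"A": 1j, "C": -1 + 0j, "G": 1 + 0j, "U": -1j}


def satisfies_eigenvalue_equation(S, seq):
    """Return True when S * lambda == lambda for the sequence vector of seq."""
    lam = [BASE_TO_COMPLEX[b] for b in seq]
    S_lam = [sum(s * x for s, x in zip(row, lam)) for row in S]
    return S_lam == lam


S_last = [[1, 0, 0], [0, 0, -1], [0, -1, 0]]    # base 2 paired with base 3
S_first = [[0, -1, 0], [-1, 0, 0], [0, 0, 1]]   # base 1 paired with base 2

print(satisfies_eigenvalue_equation(S_last, "AUA"))   # U pairs with the last A
print(satisfies_eigenvalue_equation(S_first, "AUA"))  # U pairs with the first A
print(satisfies_eigenvalue_equation(S_last, "AAA"))   # A cannot pair with A
print(satisfies_eigenvalue_equation(S_last, "GUA"))   # G unpaired, U-A paired
```

A paired row replaces the entry for base i with minus the entry for its partner, so the equation holds exactly when every paired couple sums to zero, which is the complementarity condition.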

The sequence vector is an eigenvector of every matrix corresponding to an admissible secondary structure for that sequence, whose associated eigenvalue is 1. All sequences of length n satisfying this equation can adopt the secondary structure corresponding to S. Conversely, the collection of secondary structures possible for a given sequence is precisely the collection of all signed, symmetric $n \times n$ permutation matrices for which this eigenvalue equation holds. It follows that this equation provides a complete characterization of the possible secondary structures for a given nucleotide sequence.

To find all secondary structures available to a particular sequence by this algebraic method, one must solve the reverse eigenvalue problem of equation [3]. Instead of finding the eigenvalues and eigenvectors associated to a specified matrix (the standard eigenvalue problem), here one must determine the collection of matrices S having the specified sequence vector as eigenvector with eigenvalue +1. The standard eigenvalue problem finds all sequence vectors $\vec{\lambda}$ which can assume a particular secondary structure given by S.

Many orientation-preserving symmetries of the types described in the previous section also preserve secondary structure. In particular, $\vec{\lambda}$, $-\vec{\lambda}$, $i\vec{\lambda}$ and $-i\vec{\lambda}$ all have identical collections of possible secondary structures, as substitution in equation [3] demonstrates.

Choose a particular base sequence whose vector is $\vec{\lambda}$, and denote by $\Sigma$ the collection of all its possible secondary structures. Then $\Sigma$ consists of all matrices S having the above structure for which the eigenvalue equation [3] holds. If $S_1$ and $S_2$ are two matrices in the set $\Sigma$, then

$$S_1 S_2 \vec{\lambda} = S_1 (S_2 \vec{\lambda}) = S_1 \vec{\lambda} = \vec{\lambda}.$$

Therefore products of elements in $\Sigma$ also satisfy equation [3]. However, $\Sigma$ does not form a group because products need not have the correct form to be structure matrices. In particular, products of structure matrices need not be symmetric. For example, the product of the structure matrices given in expression [2] above, corresponding to the two non-trivial secondary structures of the sequence AUA, is

$$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & -1 & 0 \end{pmatrix} \begin{pmatrix} 0 & -1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 0 & -1 & 0 \\ 0 & 0 & -1 \\ 1 & 0 & 0 \end{pmatrix}. \qquad [4]$$

The multiplication of structure matrices is given meaning in terms of transitions between secondary structures, as described in the next section. This multiplication is not commutative, but instead satisfies

$$S_1 S_2 = (S_2 S_1)^T, \qquad [5]$$

where $S^T$ denotes the transpose of the matrix S.
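The reverse eigenvalue problem can be solved by brute force for short sequences. The sketch below is not the authors' algorithm: it simply enumerates every set of compatible base pairs (pseudoknots included, and with no steric constraints, as in the definition used here) and builds the corresponding matrices. The names `all_structure_matrices`, `matmul` and `transpose` are hypothetical.

```python
BASE_TO_COMPLEX = {"A": 1j, "C": -1 + 0j, "G": 1 + 0j, "U": -1j}


def all_structure_matrices(seq):
    """Enumerate every structure matrix S with S * lambda == lambda."""
    lam = [BASE_TO_COMPLEX[b] for b in seq]
    n = len(lam)
    partner = [None] * n  # partner[k] is the 0-based partner of base k, if any

    def build():
        S = [[0] * n for _ in range(n)]
        for k, p in enumerate(partner):
            if p is None:
                S[k][k] = 1
            else:
                S[k][p] = -1
        return S

    def extend(k):
        if k == n:
            yield build()
        elif partner[k] is not None:
            yield from extend(k + 1)
        else:
            yield from extend(k + 1)  # base k stays unpaired
            for j in range(k + 1, n):
                # bases k and j may pair only if their numbers sum to zero
                if partner[j] is None and lam[k] == -lam[j]:
                    partner[k], partner[j] = j, k
                    yield from extend(k + 1)
                    partner[k] = partner[j] = None

    return list(extend(0))


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


structures = all_structure_matrices("AUA")
S1 = [[1, 0, 0], [0, 0, -1], [0, -1, 0]]
S2 = [[0, -1, 0], [-1, 0, 0], [0, 0, 1]]
product = matmul(S1, S2)
print(len(structures), product, product == transpose(matmul(S2, S1)))
```

For AUA the enumeration finds the trivial structure and the two structures of expression [2]. The product of the latter two is not symmetric, so it is not itself a structure matrix, although it agrees with the transpose of the product taken in the opposite order. The enumeration grows very quickly with sequence length, so this is only practical for short examples.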


Figure 1: This figure shows three examples of secondary structures, each of which is represented in the three diagrammatic styles described in the text. Part a) shows two sequential hairpins. Part b) depicts a pseudo-knotted structure, while part c) shows a pair of loops.

Because structure matrices for long molecules are cumbersome to write or manipulate, it is advantageous to describe them in terms of the standard blocks of which they are composed. First, a run of base pairing that hydrogen bonds the bases i+1, ..., i+m to the bases j+1, ..., j+m will contribute two symmetrically placed antidiagonal $m \times m$ blocks $A_m$ to the structure matrix, each having the following form:

$$A_m = \begin{pmatrix} 0 & 0 & \cdots & 0 & -1 \\ 0 & 0 & \cdots & -1 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & -1 & \cdots & 0 & 0 \\ -1 & 0 & \cdots & 0 & 0 \end{pmatrix}. \qquad [6]$$

This is an anti-diagonal submatrix, all of whose non-zero elements are -1. A run of m consecutive unpaired bases will contribute a subblock consisting of the $m \times m$ identity matrix, $I_m$, located on the diagonal. Finally, large rectangular subblocks whose entries all are zero will be found. Such subblocks are written either as 0 or as $0_{qr}$, the $q \times r$ matrix with all the elements equal to zero.
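The three kinds of blocks can be assembled programmatically. As an illustration, the sketch below builds the matrix of a single hairpin in which a stem of i base pairs encloses j unpaired bases, an arrangement assumed here for concreteness; the function names are hypothetical.

```python
def antidiagonal_block(m):
    """The m x m block A_m: -1 on the anti-diagonal, 0 elsewhere."""
    return [[-1 if c == m - 1 - r else 0 for c in range(m)] for r in range(m)]


def hairpin_matrix(i, j):
    """Assemble the block layout [[0, 0, A_i], [0, I_j, 0], [A_i, 0, 0]]."""
    n = 2 * i + j
    S = [[0] * n for _ in range(n)]
    A = antidiagonal_block(i)
    for r in range(i):
        for c in range(i):
            S[r][i + j + c] = A[r][c]   # upper-right copy of A_i
            S[i + j + r][c] = A[r][c]   # lower-left copy of A_i
    for k in range(j):
        S[i + k][i + k] = 1             # identity block for the unpaired run
    return S


for row in hairpin_matrix(2, 1):
    print(row)
```

With i = 2 and j = 1 the result pairs base 1 with base 5 and base 2 with base 4, leaving base 3 unpaired.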


A secondary structure is said to be reducible if it consists of two disjoint parts. That is, there is a position k in the sequence which is a cut point for the secondary structure. In other words, the secondary structure divides into a non-trivial part involving only bases before the k-th, and another non-trivial part involving only the k-th and later bases. The secondary structure shown in Figure 1a is reducible, consisting of two separate, sequential hairpins. The structure matrix of a reducible secondary structure consists of diagonal blocks, one for each component.

Examples

A hairpin secondary structure can be formed by a sequence of the form

$$\vec{\lambda} = (\ldots\lambda_1\ldots\lambda_1^c\ldots).$$

If $\lambda_1$ spans bases k+1, ..., k+m and its complementary subsequence $\lambda_1^c$ spans bases k+j+1, ..., k+j+m (j > m), then the structure matrix corresponding to complete base pairing between these subsequences has entries -1 along the appropriate antidiagonal:

$$s_{k+i,\,k+j+m+1-i} = -1, \quad i = 1, \ldots, m.$$

A single hairpin containing i duplex base pairs separated by j intervening unpaired nucleotides has structure matrix

$$S = \begin{pmatrix} 0 & 0 & A_i \\ 0 & I_j & 0 \\ A_i & 0 & 0 \end{pmatrix}.$$

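An example of a secondary structure containing two hairpins is shown in Figure 1a. A pseudo-knot can be formed in sequences having the structure $\vec{\lambda} = (\ldots\lambda_1\ldots\lambda_2\ldots\lambda_1^c\ldots\lambda_2^c\ldots)$. The structure matrix corresponding to this pseudo-knot has the anti-diagonal entries corresponding to the two runs of base pairing both occurring within the same diagonal block. A pseudo-knot secondary structure is shown in Figure 1b. When the four subsequences abut and the two runs contain i and j base pairs, the structure matrix of a pseudo-knot is

$$S = \begin{pmatrix} 0 & 0 & A_i & 0 \\ 0 & 0 & 0 & A_j \\ A_i & 0 & 0 & 0 \\ 0 & A_j & 0 & 0 \end{pmatrix}.$$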

A loop can be formed in sequences having the structure $\vec{\lambda} = (\ldots\lambda_1\ldots\lambda_2\ldots\lambda_2^c\,\lambda_1^c\ldots)$. Figure 1c shows an example containing two loops. An important special case occurs when a second copy of one of these subsequences exists (4-9), so the molecule has the structure $(\ldots\lambda_1\ldots\lambda_1\ldots\lambda_1^c\ldots)$. In this case $\lambda_1^c$ can pair with either copy of $\lambda_1$. Indeed, its terminal bases can pair with the first copy and its initial bases with the second copy. As the point separating the bases that pair with the two copies can occur anywhere within $\lambda_1^c$, the possibility exists that this junction point may migrate, leading to loop motion. This possibility will be examined in a later section.

Other Applications

Several important problems can be attacked in intriguing new ways using this approach. For example, biologically important RNA secondary structures have low energy, hence are highly base paired. Finding the collection of low energy secondary structures can be formulated here as a minimization problem. The number of unpaired bases in a secondary structure equals the number of non-zero diagonal entries in its structure matrix. As each such entry has value 1, the number of unpaired bases in the structure equals the sum of the diagonal entries of its structure matrix, which is called the trace of the matrix. Thus the most highly base paired conformations are those whose structure matrices have minimal trace.

As developed to this point, the current formalism does not consider G · U base pairs, and does not ascribe free energies to secondary structures. These factors can be accommodated by simple modifications that are described in the Discussion and Conclusions section below.
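The minimal-trace formulation can be illustrated by exhaustive search over a short sequence. This is a sketch under stated assumptions, not an efficient folding method: it enumerates all compatible pairings (Watson-Crick pairs only, pseudoknots allowed, no steric constraints) and uses the fact that a structure with p base pairs on n bases has trace n - 2p. All function names are hypothetical.

```python
BASE_TO_COMPLEX = {"A": 1j, "C": -1 + 0j, "G": 1 + 0j, "U": -1j}


def all_pairings(seq):
    """Yield every set of compatible base pairs as a list of 1-based (i, j)."""
    lam = [BASE_TO_COMPLEX[b] for b in seq]
    n = len(lam)

    def extend(k, used, pairs):
        while k < n and k in used:
            k += 1
        if k == n:
            yield pairs
            return
        yield from extend(k + 1, used, pairs)  # base k stays unpaired
        for j in range(k + 1, n):
            if j not in used and lam[k] == -lam[j]:
                yield from extend(k + 1, used | {j}, pairs + [(k + 1, j + 1)])

    yield from extend(0, frozenset(), [])


def trace_of_structure(n, pairs):
    """Trace of the structure matrix, i.e. the number of unpaired bases."""
    return n - 2 * len(pairs)


def minimal_trace_structures(seq):
    """Return the minimal trace and every pairing that attains it."""
    candidates = list(all_pairings(seq))
    best = min(trace_of_structure(len(seq), p) for p in candidates)
    return best, [p for p in candidates if trace_of_structure(len(seq), p) == best]


best_trace, best_structures = minimal_trace_structures("GGAACC")
print(best_trace, best_structures)
```

For GGAACC the two A bases have no partner, so the minimal trace is 2. It is attained both by the nested pairing (1,6), (2,5) and by the crossed pairing (1,5), (2,6), which shows why trace alone does not single out a physically realizable structure.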

Structure Diagrams

RNA secondary structures also can be depicted diagrammatically. Although many ways to do this have been used, here we adopt procedures that facilitate the visualization of transitions between alternative secondary structures, which we propose may occur.

A diagram corresponding to a given RNA secondary structure is constructed as follows. First, draw a line denoting the polymer backbone. The individual residues are vertices along this line. If nucleotides j and k are hydrogen bonded, then vertices j and k of the diagram are connected by an edge. If nucleotide j is not base paired, no edges originate from its vertex other than those of the backbone.

The correspondence between a structure matrix and its diagram is quite direct. If nucleotide j is unpaired, then its corresponding vertex in the diagram is free, and the only non-zero element of both row j and column j is the diagonal entry $s_{jj} = 1$. If nucleotides j and k are base paired, then their vertices in the diagram are connected by a supplementary edge. In this case the entries $s_{jk} = -1$ and $s_{kj} = -1$ occur in the structure matrix, and all other elements of rows j and k, and of columns j and k, are equal to zero.

Structure diagrams can be drawn in any of several formally equivalent ways. Here we indicate the three standard diagram styles, each of which is suitable for illuminating particular problems. These styles are illustrated in Figure 1. In the first style, the backbone line is drawn in circular form, and the edges corresponding to hydrogen bonding are represented by interior arcs joining the vertices corresponding to the participating residues. In the second style the backbone is depicted as a line segment, and the hydrogen bonded residues are joined by arcs drawn above the backbone line. In both of these representations pseudo-knotted structures produce diagrams in which the edges corresponding to these bonds cross. The third representation draws the stems, loops and other parts of the secondary structure with the backbone arranged so as to minimize or eliminate overlaps. This is the most common diagrammatic method for representing RNA secondary structures (10).

Figure 2: A secondary structure is shown that consists of a loop, a hairpin and stems. The letters i, j, k and l designate the numbers of nucleotides in corresponding regions. Here the unpaired region comprising the loop is depicted in a circular form that is equivalent to the more conventional bubble. This is done to emphasize the symmetries of the participating sequences. The matrix corresponding to this secondary structure is shown in the text.

As described here, structure diagrams are mathematical objects, corresponding both to a given secondary structure and to its associated matrix. Although they are similar to diagrams presented elsewhere (10), here they implicitly are endowed with the properties associated with their corresponding matrices. In this sense they can be multiplied, each structure has associated eigenvalues and eigenvectors, an inverse structure (which happens to be itself), etc.

As an example of the relationship between a structure diagram and its matrix, the diagram depicted in Figure 2 has the following associated matrix:

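$$S = \begin{pmatrix}
0_{ii} & 0_{ij} & 0_{ik} & 0_{il} & 0_{ik} & A_i \\
0_{ji} & I_j & 0_{jk} & 0_{jl} & 0_{jk} & 0_{ji} \\
0_{ki} & 0_{kj} & 0_{kk} & 0_{kl} & A_k & 0_{ki} \\
0_{li} & 0_{lj} & 0_{lk} & I_l & 0_{lk} & 0_{li} \\
0_{ki} & 0_{kj} & A_k & 0_{kl} & 0_{kk} & 0_{ki} \\
A_i & 0_{ij} & 0_{ik} & 0_{il} & 0_{ik} & 0_{ii}
\end{pmatrix}$$

Here the entries $I_n$, $A_k$ and $0_{ik}$ are the submatrices described in the previous section.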
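The correspondence between a structure matrix and its diagram can be sketched by reading the edges back out of the matrix and testing whether any two arcs cross when drawn above a straight backbone, which is how a pseudo-knot shows up in the first two diagram styles. The helper names are assumptions made for illustration.

```python
def structure_matrix(n, pairs):
    """Build the n x n structure matrix for 1-based base pairs (i, j)."""
    S = [[0] * n for _ in range(n)]
    paired = {b for pair in pairs for b in pair}
    for i, j in pairs:
        S[i - 1][j - 1] = S[j - 1][i - 1] = -1
    for k in range(1, n + 1):
        if k not in paired:
            S[k - 1][k - 1] = 1
    return S


def edges_from_matrix(S):
    """Return the diagram edges (1-based, i < j) encoded by the -1 entries."""
    n = len(S)
    return [(i + 1, j + 1) for i in range(n) for j in range(i + 1, n) if S[i][j] == -1]


def free_vertices(S):
    """Return the 1-based vertices with no hydrogen-bond edge (s_jj = 1)."""
    return [k + 1 for k in range(len(S)) if S[k][k] == 1]


def has_crossing(edges):
    """True when two arcs (a, b) and (c, d) interleave as a < c < b < d."""
    return any(a < c < b < d for a, b in edges for c, d in edges)


hairpin = structure_matrix(5, [(1, 5), (2, 4)])
pseudoknot = structure_matrix(4, [(1, 3), (2, 4)])
print(edges_from_matrix(hairpin), free_vertices(hairpin), has_crossing(edges_from_matrix(hairpin)))
print(edges_from_matrix(pseudoknot), has_crossing(edges_from_matrix(pseudoknot)))
```

Because every ordered pair of edges is examined, the single interleaving test covers both orders in which two crossing arcs can be listed.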

Transitions between Secondary Structures

The multiplicity of possible secondary structures available to an RNA molecule suggests the possibility that transitions between alternative structures may occur. A case has been reported where an RNA molecule was found by NMR to exist in two distinct secondary structures, one of which contains a pseudo-knot (11,12). The possibility of transitions among available secondary structures has several important consequences. First, a population of identical RNA molecules may be distributed among its available conformations, as the observations reported by Wyatt, Puglisi and Tinoco (12) confirm. If the population were to achieve equilibrium, this distribution would accord with Boltzmann's law. Moreover, we propose here that the secondary structure of an RNA molecule may be dynamic, with motions of loops and other structures occurring in cases where the sequence permits such behavior. This in turn suggests the possibility that biological activities of RNA may depend on dynamic factors as well as static conformations. These matters will be explored below. First we present two ways of describing transitions between secondary structures, one algebraic and the other diagrammatic.

Transition Matrices

Suppose that an RNA molecule changes its secondary structure from $S_1$ to $S_2$. This process is described by a transition matrix $T_{2,1}$, which is defined to be

$$T_{2,1} = S_2 S_1^{-1}. \qquad [7]$$

This matrix product describes the sequence of operations in which $S_1^{-1}$ first unfolds the original structure into the trivial one, then $S_2$ refolds the trivial structure into the final conformation. Because structure matrices are their own inverses, $S_1 = S_1^{-1}$, it follows that $T_{2,1} = S_2 S_1$. Transition matrices also have the structure of signed permutation matrices. There is a single non-zero entry in each row and each column, whose value is ±1. In contrast to structure matrices, transition matrices need not be symmetric, and they can have off-diagonal entries that are positive. These two attributes are shown in the example given by equation [4] above. Because the structure matrix of the trivial conformation is $I_n$,

$$S_k I_n = I_n S_k = S_k.$$

Thus a structure matrix $S_k$ can be considered to be a transition matrix between the trivial conformation and the given one, in either direction. According to equation [5] above, transition matrices need not be symmetric, but instead obey the rule

$$(T_{1,2})^T = T_{2,1} = T_{1,2}^{-1}. \qquad [8]$$

As such, transition matrices are orthogonal.
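Equations [7] and [8] can be verified numerically for the two non-trivial AUA structures. In this sketch the helper names are hypothetical, and the inverse in equation [7] is replaced by the matrix itself, which is legitimate because structure matrices are their own inverses.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def transition_matrix(S_final, S_initial):
    """T = S_final * S_initial^{-1}, using S_initial^{-1} = S_initial."""
    return matmul(S_final, S_initial)


I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
S1 = [[1, 0, 0], [0, 0, -1], [0, -1, 0]]   # AUA with U paired to the last A
S2 = [[0, -1, 0], [-1, 0, 0], [0, 0, 1]]   # AUA with U paired to the first A

T21 = transition_matrix(S2, S1)  # passage from S1 to S2
T12 = transition_matrix(S1, S2)  # passage from S2 to S1

print(T21)
print(transpose(T12) == T21)       # first equality of equation [8]
print(matmul(T12, T21) == I3)      # T21 is the inverse of T12
print(matmul(T21, S1) == S2)       # applying the transition to S1 yields S2
print(transition_matrix(S2, I3) == S2)  # transition from the trivial structure
```

Note that T21 contains a positive off-diagonal entry and is not symmetric, the two features that distinguish transition matrices from structure matrices.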

Transition diagrams

Transitions between alternative secondary structures may be represented using diagrams. In this case two types of edges must be distinguished. If nucleotides k and l are disconnected in the transition process, a disconnective edge (abbr. d-edge) is drawn as a wavy line joining their corresponding vertices. If nucleotides i and j are connected in the transition process, a connective edge (c-edge) is drawn between their vertices. This can be smoothly curved or straight, as best suits the problem at hand, but it is distinguished from a d-edge by the absence of inflection points. Vertices whose connections (either present or absent) are not changed in the transition process need not be shown in the diagram. However, if there is a need to display these invariant connections, a third type of edge must be used. Examples of transition diagrams are shown in Figure 3.

Figure 3: Parts a) and c) show a cd-passage in each of the two transition diagram styles. Parts b) and d) depict the changes in secondary structure involved in this passage. Part e) shows a more complex transition, whose matrix is given in expression [9] of the text.

A connective-disconnective passage (abbr. cd-passage) is the passage from vertex i of the transition diagram along a connective edge il and thence along a disconnective edge lk. This corresponds to disconnecting residue l from residue k, and connecting it to residue i. The transition diagrams corresponding to a cd-passage are shown in Figures 3a and 3c, while the resulting changes in secondary structure are depicted in Figures 3b and 3d. A cd-passage is also referred to as a two-step passage. A one-step passage may consist of either a connective or a disconnective edge. Thus a two-step passage consists of two one-step passages, one of each type, with one vertex in common. In a transition diagram each vertex can be the terminus for at most one c-edge and at most one d-edge.

The transition matrix corresponding to a transition diagram may be found as follows. We number the nucleotides in the order they are encountered in the 5'-3' direction, and consider each in turn. In considering the edges of a transition diagram, connective edges are given precedence over disconnective ones. Thus, the edges of a cd-passage are traversed starting at the end of the connective edge, passing along that edge to the common vertex, and thence along the disconnective edge. The nature of the connective and/or disconnective edges starting at each vertex determines the entry in the corresponding row of the transition matrix. First, if no connective or disconnective edge occurs at the vertex corresponding to nucleotide i, then the only non-zero element in the i-th row of the transition matrix is the diagonal entry $t_{ii} = 1$. If two edges occur at vertex i, then they must be one of each type. In this case the c-edge takes precedence. The d-edge determines the matrix entry only if there is no c-edge at the same vertex. In that case, if the d-edge connects vertex i to vertex l, then $t_{il} = -1$. Otherwise, the nature of the c-edge takes precedence, in which situation two cases can occur.
First, suppose that the c-edge at nucleotide i is the start of a cd-passage that passes through intermediate vertex k, and terminates at nucleotide l. In the corresponding transition, base pairing is disrupted between nucleotides k and l, and formed between nucleotides i and k. Then the only nonzero element in row i occurs in column l of the transition matrix and has value $t_{il} = 1$. Alternatively, suppose that the c-edge at vertex i terminates at vertex k without there being a d-edge at vertex k to complete a cd-passage. In this case $t_{ik} = -1$.

Here transition diagrams are drawn in two different styles. In the usual style stems are displayed as parallel sections of backbone, and unpaired regions are denoted by arcs. Thus, mispaired or unpaired bases occurring within a stem structure are denoted by bulges in the backbone. Connective edges are denoted by short straight connections, and disconnective edges by wavy lines. Base pairs that are unchanged in the transition are denoted implicitly by depicting the corresponding sections of the molecular backbone as parallel straight lines. This style of transition diagram corresponds to the standard style for depicting secondary structures. In the alternative style the molecular backbone is drawn as an arc that is nearly a complete circle. Connective and disconnective edges are drawn as straight and wavy chords of this circle, respectively, joining the appropriate vertices. This style corresponds to an alternative way to depict secondary structures. Examples of these transition diagram styles are shown in Figures 3a and 3c.

As an example, Figure 3e shows a transition diagram involving six nucleotides. The transition passes from a secondary structure in which base pairings occur between nucleotides 2 and 6, and between nucleotides 3 and 5, to one in which nucleotides 1 and 6, 2 and 5, and 3 and 4 are base paired. The transition matrix corresponding to this situation is the following:

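\[
\begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 \\
-1 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
\qquad [9]
\]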
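The row-by-row rules stated in the text can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the function name `transition_matrix` and the representation of edges as pairs of 1-based nucleotide indices are assumptions made for illustration. Applied to the six-nucleotide example (connective edges 1-6, 2-5, 3-4; disconnective edges 2-6, 3-5), it yields a matrix whose only nonzero entries are $t_{12} = t_{23} = t_{45} = t_{56} = 1$ and $t_{34} = t_{61} = -1$.

```python
def transition_matrix(n, c_edges, d_edges):
    """Build the n x n transition matrix from connective (c) and
    disconnective (d) edges, given as pairs of 1-based indices.

    Hypothetical helper: it follows the row rules stated in the text.
    """
    # Each vertex is the terminus of at most one c-edge and one d-edge,
    # so one partner lookup table per edge type is enough.
    c_partner, d_partner = {}, {}
    for table, edges in ((c_partner, c_edges), (d_partner, d_edges)):
        for a, b in edges:
            if a in table or b in table:
                raise ValueError("a vertex has two edges of the same type")
            table[a], table[b] = b, a

    t = [[0] * n for _ in range(n)]
    for i in range(1, n + 1):
        if i in c_partner:                 # the c-edge takes precedence
            k = c_partner[i]
            if k in d_partner:             # cd-passage i -> k -> l
                col, val = d_partner[k], 1
            else:                          # c-edge with no d-edge at k
                col, val = k, -1
        elif i in d_partner:               # d-edge only
            col, val = d_partner[i], -1
        else:                              # no edge at this vertex
            col, val = i, 1
        t[i - 1][col - 1] = val
    return t


T9 = transition_matrix(6, c_edges=[(1, 6), (2, 5), (3, 4)],
                       d_edges=[(2, 6), (3, 5)])
for row in T9:
    print(row)
```

Each row receives exactly one nonzero entry, so the result is always a signed permutation matrix, as the rules require.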


Secondary Structure Dynamics - Loop Motions

Here we illustrate the possibility of dynamic transitions between alternative RNA secondary structures by considering the case of loop motions. We note that other types of dynamic structural transitions also are possible, such as the transient formation of pseudo-knot structures that has been observed by Wyatt, Puglisi and Tinoco (12). A complete consideration of the dynamics of structural transitions will be presented in a later contribution.

As described in a previous section, a loop can be formed in a sequence having

the structure

\[
\lambda = (\ldots \lambda_1 \ldots \lambda_1^{c} \ldots \lambda_1^{c} \ldots)
\]

when the subsequence $\lambda_1$ pairs partly with the first copy of the complementary sequence $\lambda_1^{c}$ and partly with the second. This causes the region between the two copies, as well as parts of each, to form a loop as shown in Figure 4. If the partitioning of the base pairing of $\lambda_1$ between the two complementary copies changes, effective motion of the loop results. This motion consists of a succession of steps in which a base in $\lambda_1$ next to the separation point lets go of its partner in one copy of $\lambda_1^{c}$ and forms a pair with the corresponding base in the other copy. In the transition diagram this is a cd-passage. A single step of this type is called a shift. In a shift the base pairing of one nucleotide in $\lambda_1$ is moved to the identical nucleotide in the other copy. The one that originally was in the loop pairs and joins the stem while the other one, originally in the duplex stem, unpairs and joins the loop. This elementary shift transition moves the loop one step and simultaneously rotates it by one nucleotide, as shown in Figure 4. Our initial theoretical estimate of the time required for a single shift is approximately $10^{-5}$ sec (13). This time has the same order of magnitude as that for one step of RNA synthesis by polymerase.

In a slight modification of this scheme, one or more unpaired nucleotides could occur at the join region. This might be necessary because of steric effects in this region, which could preclude complete pairing. The transition diagram corresponding to a fundamental shift in this case is shown in Figure 5c. For the balance of this paper we assume that fundamental shifts occur through cd-passages, as described in the previous paragraph.

Figure 4: The structure diagram of a loop is shown, together with the transition diagram corresponding to a fundamental shift one step to the left.

The fundamental attributes of a shift can be demonstrated by considering the sequence AAU. Suppose the initial structure pairs the terminal U with the second A, and the final structure pairs the U with the first A. The transition from this initial structure to the final one is called a left shift. The transition matrix corresponding to this shift is

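\[
\begin{pmatrix} 0 & 0 & -1 \\ 0 & 1 & 0 \\ -1 & 0 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & -1 & 0 \end{pmatrix}
=
\begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & -1 \\ -1 & 0 & 0 \end{pmatrix} .
\]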
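As a check on the left-shift example, the sketch below forms the product of the final and initial structure matrices for AAU and confirms that both structure matrices, and the resulting transition matrix, leave the sequence vector unchanged. Several things here are assumptions made for illustration rather than statements from this section: the numerical encoding A = 1, U = -1, the convention that a structure matrix carries 1 on the diagonal for an unpaired base and -1 at the two symmetric positions of each base pair, and the helper names `structure_matrix`, `matmul` and `matvec`.

```python
def structure_matrix(n, pairs):
    """Signed symmetric permutation matrix for a secondary structure.

    Assumed convention: 1 on the diagonal for an unpaired base,
    -1 at (i, j) and (j, i) for each base pair (1-based indices).
    """
    s = [[0] * n for _ in range(n)]
    paired = set()
    for i, j in pairs:
        s[i - 1][j - 1] = s[j - 1][i - 1] = -1
        paired.update((i, j))
    for i in range(1, n + 1):
        if i not in paired:
            s[i - 1][i - 1] = 1
    return s


def matmul(a, b):
    """Plain-Python product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matvec(m, v):
    """Plain-Python matrix-vector product."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]


seq = [1, 1, -1]                             # AAU under the assumed encoding
s_initial = structure_matrix(3, [(2, 3)])    # U pairs with the second A
s_final = structure_matrix(3, [(1, 3)])      # U pairs with the first A
t_shift = matmul(s_final, s_initial)         # left-shift transition matrix

print(t_shift)
print(matvec(s_initial, seq) == seq, matvec(s_final, seq) == seq)
```

Because each structure matrix is symmetric and orthogonal, it is its own inverse, so multiplying `t_shift` by `s_initial` recovers `s_final`.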

A succession of such shifts results in a rolling motion of the loop. In the absence of base pairing within the loop, this is expected to be an approximately isoenergetic process because the sizes of all loops and the numbers and types of all base pairs formed are identical throughout. In a succession of shifts, the loop appears to roll over the duplex region of the molecule. Although a complete examination of the dynamics governing this process will be deferred until a later paper, the loop can be considered as a travelling dislocation exhibiting soliton-like behavior. The isoenergetic character of loop rolling suggests that its progress along the molecule may approximate a random walk.

Figure 5: Parts a) and b) show the structure and transition diagrams for a transition in which base pairing is disrupted at the base of the loop. Part c) depicts the motion of this loop by a single shift. The single dotted line in this Figure indicates the mirror symmetry between the sequence of the loop and the sequence of the upper strand. The double dotted line indicates the complementary symmetry between the upper and lower strands.

When the loop reaches either end of its range, $\lambda_1$ is completely bonded to one of the copies of its complementary subsequence. Either of two events can occur, depending on the state of base pairing of the non-complementary region abutting the free complementary copy. If that region is unpaired, then the free copy that occurs as the loop reaches this end simply contributes more unpaired nucleotides to this region and the loop disappears. This effectively constitutes an absorbing barrier for the random walk. On the other hand, if the abutting region is and remains base paired, then when the loop reaches this end it does not disappear. In this case the end acts as a reflecting barrier.

The initiation of looping from an absorbing barrier requires the formation of the loop, which is expected to be energetically disfavored. The size of this energy barrier to initiation is expected to govern the overall rate at which transitions between the two end states occur. Loop motions between two reflecting barriers would be an approximately permanent form of RNA structural polymorphism that may have biological significance. The loop involved cannot be absorbed without unpairing of at least one of the adjoining duplexes. These matters will be examined in detail in a later contribution.

Figure 6: Two loops moving on the same strand are shown colliding and unifying into one large loop.

An RNA sequence may possess local symmetries that permit multiple loops to form. These present the possibility that several types of interesting interactions may occur. To describe these possibilities we distinguish in the accompanying figures between the upper and lower strands, even though these may be two regions on the same molecule.
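The random-walk picture of single-loop rolling suggested earlier in this section can be made concrete with a small calculation. The sketch below is an illustration under stated assumptions, not a result from this paper. It treats each shift as an unbiased step of fixed duration (taken as the order-of-magnitude estimate of 10^-5 sec per shift quoted in the text), places a reflecting barrier at position 0 and an absorbing barrier at position n, and computes the mean number of shifts before absorption. The function name `mean_shifts_to_absorption`, and the choice that a loop at the reflecting end always steps back inward, are hypothetical modelling choices.

```python
SHIFT_TIME_SEC = 1e-5   # order-of-magnitude estimate per shift quoted in the text


def mean_shifts_to_absorption(n):
    """Mean number of unbiased +/-1 steps needed to travel from a
    reflecting barrier at 0 to an absorbing barrier at n.

    Hypothetical model. Let E[k] be the expected number of steps from
    position k, with E[n] = 0, E[0] = 1 + E[1], and
    E[k] = 1 + (E[k-1] + E[k+1]) / 2 for 0 < k < n.
    The differences D[k] = E[k] - E[k+1] then satisfy D[0] = 1 and
    D[k] = D[k-1] + 2, and E[0] is the sum of D[0] .. D[n-1].
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    total, diff = 0, 1
    for _ in range(n):
        total += diff
        diff += 2
    return total


for n in (5, 10, 20):
    steps = mean_shifts_to_absorption(n)
    print(n, steps, steps * SHIFT_TIME_SEC)
```

Under these assumptions the mean number of shifts grows as the square of the range, so a loop with 10 positions to traverse would need on the order of 100 shifts, or roughly 10^-3 sec, before being absorbed. Any energetic bias or barrier to initiation would change this estimate.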

Figure 7: Two loops moving on the same strand that pass through (or, equivalently, reflect off) each other.

Two loops can occur on the same strand in a molecule having the following type of sequence:

--- -

A= (...f....IAtAt···f....1···)·

[10]

In this case either no loops, one loop or two loops can form. If there are two loops present on the same strand, both may move through a succession of elementary shifts. When they collide two possibilities can occur. Either they amalgamate into one large loop, as shown in Figure 6, or they pass through (or, equivalently, reflect off) each other, as shown in Figure 7. If they amalgamate to form a large loop, this may subdivide by the reverse process into two smaller loops.

Figure 8: Two loops moving on opposite strands are shown. Their collision can result in enlargement, shrinkage, annihilation, or passage.


Two loops can occur on opposite strands in a molecule whose sequence has the following attribute:


[11]

In this case also both loops can move by a succession of elementary shifts. When they encounter each other, several alternative possibilities can occur. The loops can either enlarge, shrink, annihilate, or slide by each other, as shown in Figure 8. At the time of collision both loops can shrink through the formation of duplex base pairs between complementary bases on the upper and lower strands. Alternatively, the loops can enlarge by the reverse process to shrinkage. If the two loops are the same size, then repeated subtractions can result in their complete annihilation. If one loop contains more nucleotides than the other, then complete subtraction annihilates only the smaller loop. Figure 8 depicts a two-loop rolling collision. Some rolling loops preserve the numbers of secondary structure elements. In the process of single-loop rolling only the lengths of stems change, but not their dispositions relative to each other, nor the numbers of loops or stems. Two-loop unifications (Figure 6) each decrease the number of loops and stems by one. A two-loop annihilation decreases the number of loops by two.

Discussion and Conclusions

The mathematical formalism presented here permits important questions regarding macromolecular sequences and secondary structures to be formulated. Representation of nucleotides as complex numbers suggested consideration of symmetry relations other than those examined to date, namely direct and mirror repeats and reverse complementarity. The possibility exists that some of these non-standard symmetries may have biological importance that has not been recognized to date. The authors are initiating searches of nucleic acid sequences to determine the distribution of pairs of subsequences related by any of the 48 symmetries described here. Either the presence, the absence, or the locations of symmetrically related subsequences may be biologically important, so each is being evaluated. Preliminary investigations have found several non-standard types of symmetries in HIV RNA sequences that are unlikely to be the results of random events. The results of this work will be presented elsewhere.

The representation of nucleotides as complex numbers suggests that base sequences correspond to complex vectors. This in turn permits their secondary structures to be described using matrices. This approach permits the powerful techniques of abstract algebra to be applied to the problem of finding all secondary structures available to a given RNA sequence. In this new approach, the possible secondary structures are given by the set of all signed, symmetric, unitary permutation matrices for which the sequence vector is an eigenvector whose eigenvalue is 1. These methods provide a new way to attack this important problem that does not impose artificial restrictions on the solution set. In particular, secondary structures containing pseudo-knots will be found.
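The eigenvector criterion lends itself to a brute-force check on very short sequences. The following sketch is an illustration only. The encoding A = 1, U = -1, G = i, C = -i is assumed here (the paper's encoding is defined in an earlier section), the names `ENCODING`, `matchings` and `structures` are hypothetical, and no steric or loop-size constraints are applied. It enumerates every set of disjoint base pairs, builds the corresponding signed symmetric permutation matrix, and keeps those that leave the sequence vector unchanged.

```python
# Assumed encoding of nucleotides as complex numbers (complement = negative).
ENCODING = {"A": 1, "U": -1, "G": 1j, "C": -1j}


def matchings(positions):
    """Yield every set of disjoint pairs drawn from `positions`."""
    if not positions:
        yield []
        return
    first, rest = positions[0], positions[1:]
    # Case 1: `first` stays unpaired.
    yield from matchings(rest)
    # Case 2: `first` pairs with some later position.
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for m in matchings(remaining):
            yield [(first, partner)] + m


def structures(sequence):
    """Return all pair lists whose structure matrix S satisfies S v = v."""
    v = [ENCODING[base] for base in sequence]
    n = len(v)
    found = []
    for pairs in matchings(list(range(n))):
        # Build S: -1 at paired positions, 1 on the diagonal otherwise.
        s = [[0] * n for _ in range(n)]
        paired = set()
        for i, j in pairs:
            s[i][j] = s[j][i] = -1
            paired.update((i, j))
        for i in range(n):
            if i not in paired:
                s[i][i] = 1
        sv = [sum(s[i][k] * v[k] for k in range(n)) for i in range(n)]
        if sv == v:
            found.append(sorted(pairs))
    return found


for pairs in structures("AAU"):
    print(pairs)   # 0-based index pairs; [] is the fully unpaired structure
```

Because the enumeration places no nesting condition on the pairs, crossing (pseudo-knotted) pairings are included automatically. The number of candidate matchings grows very rapidly with sequence length, so this exhaustive search is only practical for toy examples.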


Once a set of possible structures has been found, a second order refinement of each can be performed to determine the collection of additional G·U pairs that can occur in it. A free energy can be associated to each resulting structure by applying the standard rules (14). The results of this analysis may be used to find the low energy secondary structures, equilibrium distributions among states, and other attributes of interest. In principle this approach improves on dynamic programming methods in two ways: it can find all low energy secondary structures, and it includes pseudo-knotted conformations. Dynamic programming algorithms artificially preclude pseudo-knots, although they are known to occur in RNA molecules. The present approach provides an alternate method to attack this important problem. However, at present it is too early to determine whether the calculations needed to implement this method will be tractable in practice.

We have shown that sequences related by several types of simple symmetries have the same collection of secondary structures available to them. For example, $\lambda$, $i\lambda$, $-\lambda$, $-i\lambda$ and the complex conjugate of $\lambda$ all have the same secondary structures available to them, as insertion into the eigenvalue equation demonstrates. However, the energy associated to a given secondary structure in each case depends on the sequence, because A·U and G·C pairs have different energies and some of these symmetric structures change the identity of the base pairs involved.

This formalism illuminates many intriguing properties of secondary structure which may have biological significance, including movable loops (12,13). In later contributions questions relating to the dynamics of loop motion will be explicitly addressed.
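The statement that these symmetry-related sequences share the same structures follows in one line from the eigenvalue condition. As a short worked check, write $S$ for a structure matrix (a symbol assumed here for illustration) and $\lambda$ for the sequence vector, with $S\lambda = \lambda$ and $S$ real. Then for any scalar $c$, in particular $c \in \{1, i, -1, -i\}$, and for the complex conjugate $\lambda^{*}$:

```latex
\begin{align*}
S(c\lambda) &= c\,(S\lambda) = c\lambda, \\
S\lambda^{*} &= (S^{*}\lambda)^{*} = (S\lambda)^{*} = \lambda^{*}.
\end{align*}
```

Every structure matrix admitted by $\lambda$ is therefore admitted by each of these related sequences as well, although the base pairs it describes, and hence its energy, may differ.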

Acknowledgments

The authors gratefully acknowledge discussions with Dr. A. Kister, who first suggested the possibility of movable loops to us. We also thank Dr. V. Ivanov for fruitful discussions. This work was supported in part by grant DMB 88-96284 from the National Science Foundation.

References and Footnotes

1. B. Noble, Applied Linear Algebra, Prentice Hall, Englewood Cliffs NJ (1969).
2. I. Tinoco, O.C. Uhlenbeck and M.D. Levine, Nature 230, 362 (1971).
3. E.N. Trifonov and G. Bolshoi, J. Mol. Biol. 169, 1 (1983).
4. A.A. Mironov and A.E. Kister, J. Biomol. Str. Dyn. 4, 1 (1986).
5. J.N. Topper and D.A. Clayton, J. Biol. Chem. 265, 13254 (1990).
6. K. Rietveld, K. Linschovten, C.W.A. Pleij and L. Bosch, The EMBO Journal 3, 11, 2613 (1984).
7. K. Rietveld, R. Van Poelgeest, C.W.A. Pleij, J.H. Van Boom and L. Bosch, Nucl. Acids Res. 10, 1929 (1982).
8. K.H. Johnson and D.M. Gray, J. Biomol. Str. Dyn. 9, 733 (1992).
9. S.P. Teng, S.A. Woodson and D.M. Crothers, Biochem. 28, 3901 (1991).
10. B.A. Shapiro, J. Maizel, L.E. Lipkin et al., Nuc. Acids Res. 12, 75 (1984).
11. J.R. Wyatt, J.D. Puglisi and I. Tinoco, Bioessays 11, 100 (1989).
12. J.R. Wyatt, J.D. Puglisi and I. Tinoco, J. Mol. Biol. 214, 455 (1990).
13. Y. Magarshak, C.J. Benham, J. Malinsky and L. Blumenfeld, Biophysics (USSR) 37, 4 (1992).
14. Freier et al., Proc. Natl. Acad. Sci. USA 83, 9373-9377 (1986).

Date Received: July 15, 1991, April 20, 1992

Communicated by the Editor Ed Trifonov
