This article was downloaded by: [University of Toronto Libraries] On: 16 March 2015, At: 07:46 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Journal of Biomolecular Structure and Dynamics Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/tbsd20

Explicit Distance Geometry: Identification of all the Degrees of Freedom in a Large RNA Molecule

Mark Andrew Hadwiger & George E. Fox


Dept. of Biochemical and Biophysical Sciences, University of Houston, Houston, Texas 77204-5500. Published online: 21 May 2012.

To cite this article: Mark Andrew Hadwiger & George E. Fox (1991) Explicit Distance Geometry: Identification of all the Degrees of Freedom in a Large RNA Molecule, Journal of Biomolecular Structure and Dynamics, 8:4, 759-779, DOI: 10.1080/07391102.1991.10507843 To link to this article: http://dx.doi.org/10.1080/07391102.1991.10507843


Downloaded by [University of Toronto Libraries] at 07:46 16 March 2015


Journal of Biomolecular Structure & Dynamics, ISSN 0739-1102, Volume 8, Issue Number 4 (1991), ©Adenine Press (1991).

Explicit Distance Geometry: Identification of all the Degrees of Freedom in a Large RNA Molecule


Mark Andrew Hadwiger and George E. Fox
Dept. of Biochemical and Biophysical Sciences
University of Houston
Houston, Texas 77204-5500

Abstract

An alternative approach to distance geometry ("explicit" distance geometry) is being developed for problems, such as the modeling of RNA folding in the ribosome, where relatively few distances are known. The approach explicitly identifies minimal sets of additional distances that can be added to a distance matrix in order to calculate structures that are consistent with all the known information without distorting the original input data. These additional distances are bounded to the extent possible by the known distances. These explicitly added distances can be treated as degrees of freedom and used to explore the full range of alternative foldings consistent with the original input in an organized way. The present paper establishes that it is practical to explicitly determine such degrees of freedom for even very large RNAs. To demonstrate the feasibility of the approach, tRNA was represented as a simple undirected graph containing all relevant information represented in the usual cloverleaf secondary structure and nine base-base tertiary interactions. Using a three atom representation for each residue, a total of 206 degrees of freedom are explicitly identified. To accomplish this, a graph theoretic approach was used in which a minimal covering cycle basis was determined.

Introduction

Although structural biology appropriately focuses much of its attention on high resolution structure determination, there are important problems that can at present only be approached at lower resolution. One such problem is the folding of the ribosomal RNAs in functional ribosomes. Numerous methods have been employed to determine the spatial location of the ribosomal proteins, including partial reconstitution studies (1,2), protection from chemical probes and enzymatic digestion (3-5), chemical crosslinking (6), immune electron microscopy (7,8) and, most informatively, neutron diffraction (9). The rRNA itself can be partially folded as a result of predictive studies of secondary (10-12) and tertiary (13,14) structure as well as a variety of experiments that reveal protected and exposed regions. Orientation relative to the proteins can be partially deduced from a variety of experiments that relate various portions of the rRNAs to specific proteins (15-18). Several investigators have developed physical models on effectively intuitive grounds to represent the existing data (19-23). These models do not explicitly represent much of the actual bulk of the "single-stranded" regions or the flexibility of the double stranded regions, and they are not easily modified to accommodate new data that become available after they are generated. A more recent model is superior to these models in both respects (24); however, none of these studies can readily reveal the full range of global RNA foldings that remain consistent with the data.

The long range goal of this work is to develop a more appropriate modeling system that can be readily used by investigators in the ribosome field. The possibility of using the formalism known as distance geometry is being explored. In recent years distance geometry has emerged as an important analytical component in the determination of high resolution structure with 2D and 3D NMR, where numerous interproton distances can be measured. For the types of modeling studies of interest here, distance geometry has two especially attractive features that suggest that it may be useful. The first of these is that it routinely handles any information that can be represented as distances, regardless of source. Hence it can handle distances that are fairly directly measured, such as crosslinks, or distances that are inferred, such as distance bounds deduced from shape and center of mass information. Each known distance greatly reduces the number of possible structures, especially tertiary contact information, which places indirect constraints on all the degrees of freedom in the intervening polymer chain. Secondly, once all the information is available, distance geometry can identify geometric inconsistencies.

The usual distance geometry approach with macromolecules is to use the triangle and tetrangle constraints imposed by the interatom distances that are already known to form bounds, or limits, on the values of the remaining unknown interatom distances. A value is randomly assigned to every unknown distance such that it is within the triangle and tetrangle bounds imposed on it by the input distance information.
After a choice for each distance is made, the unknown distances are adjusted until each unknown distance is consistent with the bounds placed on it by the choices made for all the other distances, whether their values were input or guessed. The process that correlates the unknown distances into mutual consistency with one another averages out any random local fluctuations in such a way that the actual distances used are frequently near the average of the upper and lower bounds initially placed on the object. Since the distances are not checked or correlated into mutual consistency with respect to the pentangle inequality (sometimes not even with the tetrangle inequality), the object represented by this completely filled distance matrix is more than three-dimensional. The next step is to mathematically embed the overspecified object in three dimensions in order to obtain coordinates. Because the object specified is not three-dimensional, it must be projected down onto three dimensions, and this results in distortions of the original input distance information that must be corrected by minimization techniques. If the number of distances added to the matrix is too large compared to those originally provided, they will strongly influence the embedding process and control the solution that is obtained. If the averaged values placed in the unknown distances are reasonable reflections of the values placed into "known" distances, then the structure obtained will not be difficult to correct for "projection" distortions, and will be reasonable. The result obtained will be highly dependent on the process used to correlate the values used for the unknown distances, and on the particular distances which are given the most emphasis during this procedure.

In problems where the number of unknown distances is relatively small compared to the known distances, it is possible that the systematic errors introduced by the usual distance geometry approach will either be inconsequential or at least localized to the less constrained areas of the structure, and that the answer obtained is truly the only possibility allowed by the input. In the kind of application of interest here, low resolution modeling of rRNA folding, there are always many more unknown distances than known distances. In this situation the correlated "random" values placed on the unknown distances will provide substantial bias in the final three dimensional coordinates. The result is that many generated structures are qualitatively similar, and it is thus virtually impossible to fulfill the important goal of exploring the available configurational space in a meaningful way without calculating prohibitive numbers of structures. For example, distance geometry calculations employing the usual approach have been conducted on tRNA (Hadwiger, Davison and Fox, unpublished results) using only secondary and tertiary contact information. The results consistently oriented the helical axes of coaxial helices, like the D-helix and anticodon helix, at approximately right angles to one another rather than correctly aligning them in parallel. Even when constraints were provided that would assist in this orientation, they were averaged out because they represented extremes in relationship to the other input distance bounds. A proper sampling of the configuration space would have given orientations covering the full range of the relative helical orientations.

The present report furthers the ongoing effort to develop an alternative version of the distance geometry approach that can work with sparse information problems.
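For orientation, the conventional bound-setting step described above (triangle bounds on unknown distances, followed by a random choice within them) can be sketched as follows. This is only an illustrative sketch, not the authors' program: the function names `smooth_bounds` and `random_trial_distances` are hypothetical, only the triangle inequality is applied (no tetrangle bounds), and the three-point example is invented for the purpose.

```python
import math
import random


def smooth_bounds(lower, upper):
    """Tighten distance bounds using the triangle inequality only.

    lower and upper are symmetric n x n lists of lists. An unknown upper
    bound is math.inf and an unknown lower bound is 0.0 (assumption).
    """
    n = len(upper)
    lo = [row[:] for row in lower]
    up = [row[:] for row in upper]
    # Pass 1: upper bounds, u_ij <= u_ik + u_kj (shortest-path form).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if up[i][k] + up[k][j] < up[i][j]:
                    up[i][j] = up[i][k] + up[k][j]
    # Pass 2: lower bounds, l_ij >= l_ik - u_kj and l_ij >= l_kj - u_ik.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                bound = max(lo[i][k] - up[k][j], lo[k][j] - up[i][k])
                if bound > lo[i][j]:
                    lo[i][j] = bound
    return lo, up


def random_trial_distances(lo, up, rng):
    """Pick one trial value per pair, uniformly between its bounds.

    Assumes every upper bound is finite after smoothing.
    """
    n = len(up)
    trial = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            trial[i][j] = trial[j][i] = rng.uniform(lo[i][j], up[i][j])
    return trial


# Invented example: d(0,1) = 5 and d(1,2) = 1 are known, d(0,2) is not.
INF = math.inf
lower = [[0.0, 5.0, 0.0], [5.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
upper = [[0.0, 5.0, INF], [5.0, 0.0, 1.0], [INF, 1.0, 0.0]]
lo, up = smooth_bounds(lower, upper)
trial = random_trial_distances(lo, up, random.Random(0))
```

In this small example the unknown distance d(0,2) ends up bounded between 4 and 6. With many unknown distances, each trial value is drawn independently, which is the source of the averaging and bias discussed in the text.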
It was previously shown (25) that not all distances need to be known to specify, or fix, the structure of a molecule absolutely in three dimensions. Instead, the distances contained in one tetrahedron, together with four distances to every other point, are sufficient to fix the structure of the object in three dimensions (26,27). Four non-coplanar points are the minimum number of points necessary to specify a three-dimensional object. Knowing the distances between these four points allows the structure of that tetrahedron to be built, with the chirality of the tetrahedron being the only remaining degree of freedom. If the distances between three points of this tetrahedron and another point are known, then those three points and that other point form a tetrahedron whose coordinates may be embedded by a similar procedure, and that may be oriented with respect to the first tetrahedron on the basis of the three non-colinear points they have in common. The fourth distance to the other point fixes the relative chirality of the new tetrahedron with respect to the first. The remainder of the points may be attached to the growing structure in a similar fashion.

These findings provide the basis for an alternative approach which promises to be very effective in the ribosomal RNA problem. First, all relevant information is converted to either specific distances or distance ranges and assembled into a distance matrix. A search is then conducted to determine how many and which additional distances need to be added to the matrix in order to convert it from the underconstrained condition to a fully, but not over, constrained condition. Once an explicit set of appropriate additional distances is located, all the remaining distances can be specifically calculated in terms of known distances. The result is a fully filled matrix which is not overconstrained. Coordinates for specific structures can be calculated from this matrix in the usual way. The key additional distances that are added can actually be treated as degrees of freedom. By systematically varying them it is possible to explore the full extent of the configuration space that is consistent with the known data.

In previous work, the fundamental rules of construction (25) needed for identifying the required additional distances were presented. In the present report, a practical method is presented for constructing, from an arbitrary input distance matrix containing all known distance information, minimum sets of distances that can be used in exploring major alternative foldings of a large RNA molecule. The feasibility of the ideas demonstrated here is established by identifying all the required degrees of freedom in tRNA when only base-base interactions (i.e., secondary and tertiary structure) are provided. From a formal perspective it is shown that a distance matrix represents an irregular polyhedral object; that that object may be decomposed into a covering set of distinct cycles; and that a set of distances specifying the exact flexibility of each cycle may be inserted with a simple standard set of procedures. Since the degrees of freedom are known exactly, any of the structures geometrically allowed by the distance matrix may be calculated with no distortion of input distance information. The degrees of freedom serve as a state space tree which is amenable to search using common computer algorithms.
Furthermore it is found that if the degrees of freedom are grouped according to the cycle in which they are found, each long range contact point serves as a node in a branch and bound search of the state space tree.
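The construction recalled above, in which a new point is attached to three already-placed points by three distances and a fourth distance settles its chirality, can be illustrated with a short trilateration sketch. The function names `candidate_positions` and `attach_point` and the numerical example are hypothetical and are not taken from reference 25; the sketch assumes exact, mutually consistent distances and a fourth reference point that does not lie in the plane of the first three.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def _scale(a, s):
    return tuple(x * s for x in a)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def candidate_positions(p1, p2, p3, r1, r2, r3):
    """Return the two mirror-image points lying at distances r1, r2, r3
    from the non-colinear points p1, p2, p3."""
    d = math.dist(p1, p2)
    ex = _scale(_sub(p2, p1), 1.0 / d)          # local x axis along p1 -> p2
    v = _sub(p3, p1)
    i = _dot(ex, v)
    ey_raw = _sub(v, _scale(ex, i))
    j = math.sqrt(_dot(ey_raw, ey_raw))
    ey = _scale(ey_raw, 1.0 / j)                # local y axis in the p1,p2,p3 plane
    ez = _cross(ex, ey)                         # normal to that plane
    x = (r1 * r1 - r2 * r2 + d * d) / (2.0 * d)
    y = (r1 * r1 - r3 * r3 + i * i + j * j) / (2.0 * j) - (i / j) * x
    z = math.sqrt(max(r1 * r1 - x * x - y * y, 0.0))
    base = _add(p1, _add(_scale(ex, x), _scale(ey, y)))
    return _add(base, _scale(ez, z)), _sub(base, _scale(ez, z))


def attach_point(p1, p2, p3, p4, r1, r2, r3, r4):
    """Use the fourth distance (to p4) to choose between the mirror images."""
    return min(candidate_positions(p1, p2, p3, r1, r2, r3),
               key=lambda c: abs(math.dist(c, p4) - r4))


# Invented example: a reference tetrahedron and a point to be recovered.
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
target = (0.3, 0.4, 0.5)
dists = [math.dist(target, p) for p in ref]
placed = attach_point(*ref, *dists)
```

Three distances alone leave two candidates on either side of the plane through the three shared points; the fourth distance removes that ambiguity, which is why four distances per added point appear in the construction.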

Identification of Degrees of Freedom in a Large RNA

Three Point Representation of the RNA

A practical technique that can be used for identifying the key distances needed to fully constrain the distance matrix associated with the global folding of an RNA backbone will be demonstrated in the following paragraphs. The first issue to resolve is the desired level of resolution. Regardless of the problem, a large number of the initially known distances come from the covalent, or primary, structure. Therefore it is natural to initiate clustering of distances by grouping atoms into the chemical functional groups that they are part of, and in which all inter-atomic distances are known. For RNA, examples of this would be the phosphorus and its oxygens, the atoms in the heterocycle, and the carbons in the sugar ring after fixing the sugar pucker. Not all of the atoms associated with these chemical functional groups are actually needed to represent the flexibility of the molecule. In fact, it can be shown that if the position of nine particular atoms in each residue is specified, the entire nucleotide structure is fixed absolutely in three dimensions. This means that a relatively large RNA fragment can be worked with in full detail. Nevertheless it is clear that the modeling of large RNAs must balance the needs of computational feasibility with an appropriate representation of the actual bulk and connectivity of the molecule. If nine atoms per residue were used, even yeast phenylalanine tRNA, which contains 76 residues, would require an immense distance matrix. In a modeling application where that matrix would be underspecified, the number of degrees of freedom that would need to be explored would quickly become prohibitive.

Since the long range intention of this work is to develop techniques suitable for the low resolution study of the possible foldings of the rRNA backbone, it is more than appropriate to use a simplified representation. Hence in the discussion that follows, a lower resolution RNA model will be used that employs one point from each chemically significant group. These three points are the phosphorus from the phosphate, carbon 4 from the sugar, and the central base-pairing nitrogen (N3 on pyrimidines, and N1 on purines) from the heterocycle. These three atoms will be referred to as group points. All the distances between these points will be fixed, or considered to be known. Each nucleotide is now represented by a single completely-connected set of three points, and these may be linked covalently into a continuous path. Obviously, there is some reduction in the structural information with this representation. However, since each chemically rigid (or semi-rigid) group is still represented, most of the space occupancy information is retained. In addition, a reasonable amount of flexibility still exists between the significant chemical groups such that, after solving for the positions of these points, all atom representations can be superimposed upon them with minimal deviation from the actual position of the points used in the calculations.

Representation of an RNA Structure - Generation of a Segment Matrix

For phenylalanine tRNA the 76 sets of group points are clustered along the diagonal of the distance matrix according to the residue they represent. All the distances within a residue are considered to be known. These distances form boxes along the main diagonal of the matrix that are referred to as residue sets, which are of course ordered along the diagonal according to the primary sequence of the RNA. A connection set is the set of distances in the intersection between two residue sets. Connection sets may contain no known distances; they may contain fewer distances than are necessary to completely-connect the two residue sets; or they may contain enough distances to completely-connect the two residue sets into a single completely-connected set. Covalent distances that link residue sets are in the connection sets adjacent to the main diagonal, while tertiary distances that link residue sets are in connection sets off the main diagonal.

In a typical RNA modeling problem a number of helical regions comprising the secondary structure of the molecule will be well known and will divide the RNA into well defined single stranded and double stranded regions. Experimental evidence from double stranded regions in the crystal structure of tRNA and smaller model RNAs establishes that variability exists in the structural details of helical regions. Nevertheless, at the level of resolution associated with the rRNA modeling problem these structures are not significantly different. To a first approximation then, they can be treated as being like the A' helices of double stranded RNA (28) or an alternative standard structure derived from what has been encountered in tRNA or model compounds. In either case one has quite a good idea of the structure of such regions and the flexibility associated with these canonical forms. It is thus a reasonable approximation to represent these stretches of residues as rigid bodies. Such completely-connected sets of residues, whether they consist of one, or more than one, residue will be referred to hereafter as segments. A segment then may be represented by a subset of the total number of group points of its constituent residues. The problem of fixing the group points of two such helical segments with respect to one another was addressed previously (25). Group points in the single stranded segments may be fixed with respect to one another by consecutively linking each residue to the residue immediately preceding it using the intersection graph of two distinct bodies (25).

Once all the group points contained in a segment are fixed as a rigid body, we need only to know the positions of three of these group points in order to orient the complete rigid segment in three dimensions. This subset of points is referred to as the representative subset. In order to avoid diluting input distance information, the representative subset should include all the points to which there is a known distance, and, therefore, may number more than three points. Fixing the structure of a string of residues reduces the number of degrees of freedom in the system, and representing the segments by a representative set of group points reduces the size of necessary data structures. The effect of these considerations is to reduce considerably the immediate complexity of the problem.

Figure 1: Segment matrix of tRNAPhe. The letters in the matrix represent the type of interaction between the segments: T representing a tertiary interaction, C representing a covalent connection, and H representing a helical interaction. The diagonal element indicates the segment's position in the primary sequence, and the left of the matrix indicates the residues contained in each segment and the anatomical portion of the tRNA they belong to. Because of the limited geometrical influence of the anticodon loop on the rest of the molecule, the residues of the anticodon loop have been lumped into segments to minimize the number of superfluous degrees of freedom.

Before proceeding it is useful to reformulate the original distance matrix as a new matrix that will be referred to as the segment matrix. An example of the segment matrix that is derived from the base-base secondary and tertiary constraints known to exist in yeast phenylalanine tRNA is shown in Figure 1. A total of 34 segments are present in this matrix, including eight that represent the helical regions. Analysis of the segment matrix will allow the determination of an orderly scheme for determining the degrees of freedom in the system, calculating the bounds on each unknown distance, systematically guessing a value for each distance that is a degree of freedom, and determining the structure represented by these guessed distances. The key to this is identification of subsets of unknown distances upon which the known distances place obvious bounds that may be readily calculated. This involves subdivision of the total problem into smaller problems that can be solved individually, such as was done previously (25) for the case of a loop (polymer chain with end to end contact). Once a hierarchy of these smaller problems is established, one can use it to systematically sample the available configurations with minimal effort.

Identification of Appropriate Subproblems - The Optimal Spanning Path and Minimal Covering Cycle Basis

In order to divide any particular segment matrix into an optimal hierarchy of subproblems a graph theoretic approach is used. For this purpose the segment matrix can be regarded as the adjacency matrix of a graph in which each segment is represented as a vertex, and the connection sets between segments in which there are known distances become edges. In this graph the connection sets that represent covalent structure and those that represent tertiary structure are now indistinguishable. The graph theoretic method of subdividing the segment matrix into readily solvable subproblems is to subdivide the segment matrix into a covering set of cycles, or loops. One way of identifying cycles is to first find all spanning paths of the segment graph. Each spanning path may be represented by an adjacency matrix with the first segment in the spanning path being the first element in the matrix, and the remainder of the elements ordered in the sequence of the spanning path. The connection sets that represent the spanning path will fill the diagonal immediately adjacent to the main diagonal in this particular segment matrix. The edges, or connection sets, that constitute the spanning path may be referred to as the pseudo-covalent structure, since some of them actually represent tertiary contacts. The remaining connection sets, those not in the spanning path, fall off the diagonal in this particular segment matrix; each represents a fundamental cycle of the total graph (29), and will be referred to as a cycle connection set. Each fundamental cycle may be denoted by the two segments that its cycle connection set connects, the lesser being referred to as the first segment, the greater of these being referred to as the end segment, and the segments in between being referred to as intervening segments.
In order to be a cycle, there must be a non-null connection set between each consecutive segment in the spanning path, and a non-null connection set between the first and the end segments.
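The relation between a spanning path and its cycle connection sets can be made concrete with a small sketch. The names `spanning_paths` and `cycle_connection_sets` and the five-segment graph are hypothetical, chosen only for illustration; the brute-force enumeration of spanning paths is exponential in the number of segments and is shown only because the example is tiny.

```python
def spanning_paths(adj):
    """Yield every spanning path (each vertex visited exactly once) of an
    undirected graph given as {vertex: set_of_neighbours}."""
    n = len(adj)

    def extend(path, seen):
        if len(path) == n:
            yield tuple(path)
            return
        for v in sorted(adj[path[-1]]):
            if v not in seen:
                seen.add(v)
                path.append(v)
                yield from extend(path, seen)
                path.pop()
                seen.remove(v)

    for start in sorted(adj):
        yield from extend([start], {start})


def cycle_connection_sets(adj, path):
    """List (first, end) positions along the path for every edge that is
    not part of the path itself, ordered by increasing end segment."""
    pos = {v: k for k, v in enumerate(path)}
    cycles = []
    for u in adj:
        for v in adj[u]:
            first, end = pos[u], pos[v]
            if end - first >= 2:      # off the diagonal next to the main one
                cycles.append((first, end))
    return sorted(cycles, key=lambda c: (c[1], c[0]))


# Invented graph: a chain 0-1-2-3-4 plus two long-range contacts.
adj = {0: {1, 3}, 1: {0, 2, 4}, 2: {1, 3}, 3: {0, 2, 4}, 4: {1, 3}}
n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
paths = list(spanning_paths(adj))
cycles = cycle_connection_sets(adj, (0, 1, 2, 3, 4))
```

For any spanning path of a connected graph with V vertices and E edges, exactly E - V + 1 edges remain off the path, so every ordering of this example yields two cycle connection sets; what changes between orderings is which segments act as first and end segments.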


This simple, general cycle structure may be solved using the approach developed in this paper. If an extra connection occurs across a cycle linking two segments that are not consecutive elements of the spanning path, where the first and the end segments are considered to be "consecutive" for the particular cycle being considered, the possibility of overspecifying the structure arises using the approach developed here. In order to avoid this, each cycle must have a distinct end segment, or in other words each cycle connection set must fall into a distinct column above the diagonal. The cycles must be solved in the order of increasing end segments. In this way, at the time the cycle is fixed, it contains at least one segment that has not been previously fixed. Once a cycle is fixed, all the segments within it become a single completely-connected segment, and that is how it is treated during the solution of the remainder of the fundamental cycles. If two cycle connection sets occur in adjacent columns above the diagonal, then the second of these must be less than completely-connected. In this case the end segment of the second cycle is connected to a segment in the first cycle through the cycle connection set, as well as being connected to the end segment of the first cycle through the pseudo-covalent backbone. Essentially, there are two distinct connection sets connecting the end segment of the second cycle to the first cycle. When the segments of the first cycle are fixed into a single segment, if one of the connection sets to the end segment of the second cycle is completely-connected, the position of the end segment of the second cycle is overspecified. Therefore, there must be at least one column above the diagonal that does not contain any cycle connection sets prior to a completely-connected cycle connection set.
The fixed structure of the previous cycles may still be such that the second cycle cannot form, but this can be detected without introducing overspecifications into the matrix. It is difficult to guarantee the presence of a spanning path with a set of unique end segments. Care must be taken in the number of connections to each segment, or in graph theoretical terms, the degree of each segment. In order for every segment to be enclosed in a cycle, each segment must be connected to at least two other segments. If there is a segment that is connected to only one cycle, then there exists a path in the graph that cannot be made into a cycle. A path is solved by sequential connections of connected segments, as opposed to the cycle solution given in this paper. Such is the case for the CCA tail of the tRNAPhe, and for this reason it is not considered in the segment matrix given. The presence of two connections assures that there is one unique path into the segment, and a different unique path out of that segment. The presence of a third connection guarantees that this segment will be the start or the end segment for at least one unique cycle. This can be seen in the segment matrix by counting the number of non-null connection sets in the column of the matrix corresponding to the segment of interest: there is one connection to the segment before it in the pseudo-covalent backbone, one connection to the segment after it in the pseudo-covalent backbone, and a third connection to some other segment. If this third connection set is above the diagonal, it must be considered when determining the feasibility of the spanning path; otherwise, it is not. If this third connection set is above the diagonal, the segment in that column is connected to a previous segment in the primary sequence of the pseudo-covalent backbone. Most segments have only two or three connections.
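Two of the conditions just described, a distinct end segment for every cycle and a degree of at least two for every segment, lend themselves to a simple check on a proposed ordering. The sketch below is an assumption-laden illustration: `check_spanning_path` and its result keys are hypothetical names, the example edge lists are invented, and the additional rule about completely-connected cycle connection sets in adjacent columns is not tested because it requires knowing how many distances each connection set holds.

```python
from collections import Counter


def check_spanning_path(order, edges):
    """Check a proposed segment ordering against two feasibility rules.

    order : list of segment labels in spanning-path sequence
    edges : iterable of (u, v) pairs with a non-null connection set
            (no self-connections are expected)
    """
    pos = {seg: k for k, seg in enumerate(order)}
    edge_set = {frozenset(e) for e in edges}
    # Every consecutive pair in the ordering must actually be connected.
    path_ok = all(frozenset((order[k], order[k + 1])) in edge_set
                  for k in range(len(order) - 1))
    degree = Counter()
    end_columns = Counter()
    for edge in edge_set:
        u, v = tuple(edge)
        degree[u] += 1
        degree[v] += 1
        first, end = sorted((pos[u], pos[v]))
        if end - first >= 2:          # a cycle connection set
            end_columns[end] += 1
    shared = sorted(order[c] for c, count in end_columns.items() if count > 1)
    low = sorted(seg for seg in order if degree[seg] < 2)
    return {
        "path_ok": path_ok,
        "shared_end_segments": shared,   # columns holding more than one cycle
        "low_degree_segments": low,      # dangling ends, like a terminal tail
        "feasible": path_ok and not shared and not low,
    }


order = [0, 1, 2, 3, 4]
bad = check_spanning_path(order, [(0, 1), (1, 2), (2, 3), (3, 4), (0, 3), (1, 3)])
good = check_spanning_path(order, [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2), (1, 4)])
```

In the first example two cycle connection sets share end segment 3 and segment 4 has only one connection, so the ordering is rejected; the second example passes both tests.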

Difficulty arises when a segment has a degree greater than three. A segment of

Downloaded by [University of Toronto Libraries] at 07:46 16 March 2015

Explicit Distance Geometry

767

degree greater than three can follow, in the sequence of the spanning path, no more than two of the segments to which it is connected, where one of those segments will immediately precede it in the sequence. If it comes after more than two, then there will be two cycle connection sets in its column above the diagonal that are not a part of the pseudo-covalent backbone. The situation is more complicated when there are more than two segments of degree greater than three. In order to find a feasible spanning path, there must be a path that connects the segments with degree greater than three and satisfies the following conditions: First, no segment of degree greater than three comes before more than two of the segments to which it is connected. Second, the segments in this path may not be connected to another segment that follows more than one position after in the sequence, that is connected to a segment of degree greater than three, and that follows the particular segment of degree greater than three to which it is connected. Obviously, the more segments in the graph of degree higher than three, and the more times the segments of degree higher than three are connected to one another, the more difficult it is to find this path. If a feasible spanning path does not exist, it would be necessary to remove appropriate connection sets in order to use this particular strategy. The connection sets removed could be replaced by constraining the corresponding degrees of freedom once they have been found. The optimal spanning path partitions the cycles in a manner that minimizes the number of fixed segments in the largest cycle, remembering that once a cycle is fixed it becomes a single segment when considering the remaining cycles. Each segment will have a certain number of degrees of freedom orienting it with respect to some other segment, and that number will not exceed ten. It is important to make sure that all the cycles contain approximately the same number of segments.
This ensures that the degrees of freedom are partitioned as equally as possible amongst the cycles. This becomes significant when each cycle is considered as a node in a state space tree, as it helps to avoid creation of a bottleneck in a branch and bound search of that tree by preventing the occurrence of too many possible structures at any cycle. Figure 2 shows the segment connectivity matrix obtained for yeast phenylalanine tRNA when the ideas used in this section are employed to order the 34 segments according to the optimal spanning path.
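The first ordering rule stated above (a segment of degree greater than three may follow no more than two of the segments to which it is connected) can be checked mechanically for a candidate ordering. The following is only a minimal sketch of that single rule, not the authors' software: the function name `high_degree_violations`, the adjacency-dictionary representation, and the toy segment labels are all assumptions made for illustration, and the second, more involved condition is not attempted.

```python
from typing import Dict, List, Set


def high_degree_violations(adjacency: Dict[str, Set[str]],
                           order: List[str]) -> List[str]:
    """Return the segments of degree > 3 that follow more than two of
    the segments they are connected to in the proposed ordering.

    adjacency: hypothetical segment graph, segment -> connected segments.
    order:     candidate spanning-path ordering of all segments.
    """
    position = {segment: index for index, segment in enumerate(order)}
    violations = []
    for segment in order:
        neighbours = adjacency[segment]
        if len(neighbours) <= 3:
            continue  # the rule only concerns segments of degree > 3
        preceding = sum(1 for n in neighbours if position[n] < position[segment])
        if preceding > 2:
            violations.append(segment)
    return violations


# Toy graph (assumed): hub segment H is connected to four other segments.
toy = {"H": {"A", "B", "C", "D"},
       "A": {"H"}, "B": {"H"}, "C": {"H"}, "D": {"H"}}
print(high_degree_violations(toy, ["A", "B", "C", "H", "D"]))
print(high_degree_violations(toy, ["A", "B", "H", "C", "D"]))
```

An empty result means the ordering passes this one test; it does not by itself establish that the ordering is a feasible or optimal spanning path.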

Explicit Identification of Appropriate Degrees of Freedom in the Original Distance Matrix

Once the optimal spanning path is found, it is computationally feasible to explicitly identify a useful set of distances that can be added to the original distance matrix to make it fully constrained and which can serve as degrees of freedom for exploring the configuration space defined by the original underconstrained distance matrix. Many of the ideas needed to do this were presented earlier (25), and liberal use is made of the concepts and terminology of that earlier work in what follows. To begin the process, a new matrix known as the Degree Of Freedom matrix, or DOF matrix, is constructed. This is the matrix that will actually be searched for an appropriate set of degrees of freedom (i.e., distances to be explicitly added to the original distance matrix in order to complete it) that will allow thorough exploration of the configuration space defined by the original input distance data.

Figure 2: Optimal spanning path for the segment matrix of tRNAPhe. The lines enclose the segments contained in each cycle. The cycles are solved from left to right in the order of their rightmost line. The other conventions are the same as those in Figure 1. The cycles are designated by Roman numerals I-XIII, which indicate their position in the sequence in which the cycles are solved.

In order to construct the DOF matrix, each segment is replaced with its representative set of group points; the reordered segment matrix is expanded; the group points are grouped according to segment; the segments are ordered along the diagonal in the order of the optimal spanning path; the intra-segment distances are placed between the appropriate group points in the on-diagonal sub-matrix representing each segment; and connection distances are placed in the appropriate elements of the off-diagonal sub-matrices representing the connection sets. There are two types of bounds that the known distances in the DOF matrix place on unknown distances: direct bounds and indirect bounds. Determining whether the bounds existing on any particular unknown distance are direct or indirect requires identifying all paths, through known distances, between the two points representing the distance being bounded. For simplicity, consider a single path. If that path contains the two points representing the distance being bounded and a third point, then the bounds on the distance may be calculated by the solution of a single triangle inequality generated from the appropriate Cayley-Menger determinant. These are direct bounds. If the path contains the two points representing the distance being bounded and n other points, where n > 1, this requires the solution of n triangle inequalities for each consecutive sequence of three points along the path. Such bounds are indirect. The upper bound arising from such an extended path structure can be obtained by summing the upper bounds from each of its constituent triangle inequalities. Determination of the lower bound structure requires forming all possible combinations of one of either the upper or lower bounds from each triangle inequality, making all possible sums and differences of each of these combinations, and retaining the lowest absolute value from all of these sums. Without exception, indirect bounds require more computation than direct bounds. Characterization of the paths becomes more complicated when structures connecting the two points representing the distance being bounded contain triangles, tetrangles, and completely connected structures, but the analysis is similar. It is clearly desirable to work primarily with those distances upon which all bounds are direct, as long as this is feasible. Identifying directly-bounded distances becomes complicated as the DOF matrix is filled and more connections are present. Each point is adjacent to, or connected by a single known distance to, a set of other points that shall be referred to as its adjacent set. If the intersection of the adjacent sets for the two points representing the distance being bounded is also a cutset between the two points representing the distance being bounded, then all bounds on that distance are direct.
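The direct-bound calculation and the adjacent-set cutset test described above can be illustrated with a small sketch. It is an assumption-laden illustration rather than the authors' implementation: known distances are held in a dictionary keyed by unordered point pairs, the helper names (`pair`, `adjacent_set`, `direct_bounds`, `all_bounds_direct`) are hypothetical, and the plain triangle inequality is used in place of an explicit Cayley-Menger determinant, to which it is equivalent for three points.

```python
from collections import deque
from typing import Dict, FrozenSet, Optional, Set, Tuple

Known = Dict[FrozenSet[str], float]  # unordered point pair -> known distance


def pair(a: str, b: str) -> FrozenSet[str]:
    return frozenset((a, b))


def adjacent_set(known: Known, point: str) -> Set[str]:
    """Points joined to `point` by a single known distance."""
    return {q for edge in known if point in edge for q in edge if q != point}


def direct_bounds(known: Known, a: str, b: str) -> Optional[Tuple[float, float]]:
    """Tightest triangle-inequality bounds on d(a, b) through shared neighbours."""
    shared = adjacent_set(known, a) & adjacent_set(known, b)
    if not shared:
        return None  # no three-point path, hence no direct bound
    lower = max(abs(known[pair(a, k)] - known[pair(b, k)]) for k in shared)
    upper = min(known[pair(a, k)] + known[pair(b, k)] for k in shared)
    return lower, upper


def all_bounds_direct(known: Known, a: str, b: str) -> bool:
    """True if the shared neighbours form a cutset between a and b, i.e.
    b cannot be reached from a once the shared neighbours are removed."""
    shared = adjacent_set(known, a) & adjacent_set(known, b)
    seen = {a}
    queue = deque([a])
    while queue:
        p = queue.popleft()
        for q in adjacent_set(known, p) - shared - seen:
            if q == b:
                return False  # a path avoids the shared set: indirect bounds exist
            seen.add(q)
            queue.append(q)
    return True


# Toy data (assumed): a and b share neighbour k; a longer path a-m-n-b also exists.
known = {pair("a", "k"): 3.0, pair("b", "k"): 4.0}
print(direct_bounds(known, "a", "b"), all_bounds_direct(known, "a", "b"))
known.update({pair("a", "m"): 1.0, pair("m", "n"): 1.0, pair("n", "b"): 1.0})
print(direct_bounds(known, "a", "b"), all_bounds_direct(known, "a", "b"))
```

In the second call the path through m and n bypasses the shared neighbour, so the test reports that indirect bounds are also present on the distance.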

When solving each fundamental cycle of the DOF matrix, there are two possible types of cycles:

CASE 1: A Cycle Without Intervening Segments

The first type occurs when the number of intervening segments is zero, and one is attempting to link two adjacent segments. In this case, it is appropriate to link the two separate segments by sequentially completely-connecting three points on one segment to the other segment as was illustrated previously (25). In order to do this it is convenient to reorder the sequence of points within each completely-connected set, in decreasing order of connectivity with the other set. As an example of this, Figure 3 illustrates how the reordering of an input data matrix clusters the known distances in a corner of the connection set sub-matrix where additional distances need to be added. For this particular connection set there are a number of different sequences for inserting the remaining distances necessary to completely-connect the two segments, in which all inserted distances are directly bounded, and the bounds are easily identified from the matrix at the time of insertion. However, many other possible connection set matrices exist. Other connection sets may require a specific sequence in which the distances are inserted to avoid complication, and/or may require the calculation of indirect bounds on the inserted distances. As an example, take Figure 4, where three points on the first segment are connected to four points on the second segment. The fact that there are four points on the second segment connected to the first segment requires that points on the first segment be sequentially completely-connected to points on the second segment. Sequential complete connection of three points on the second segment to points on the first segment would always leave an extra distance not considered during connection of the first three points. This

Figure 3: Reordering of points within a pair of segment sets to facilitate identification of bounds on, and insertion of, the distances required to completely-connect the two groups. Following the conventions established previously (12), the sevens represent distances that are known, and the periods represent distances that are not known. Since the elements of the main diagonal are zero, they are used to hold the name of the point represented by that element of the matrix.

Figure 4: Connection set that requires a particular sequence of insertion of distances and calculation of fixed distances to minimize the number of indirect bounds calculated on the distances added at the time they are added. Notation is as in Figure 3.

extra distance would cause overspecification unless corrected for using the complicated procedure detailed in the previous reference (25). Connecting point F first requires calculation of fewer indirect bounds merely because there are fewer connections to points D and E. Bounds on the distance required to completely-connect point F to the second segment can often be calculated without error considering only the connections to point F, thereby ignoring the indirect bounds placed on this distance by the connections to points D and E. Connecting point F to the second segment fixes point F with respect to that segment, such that the distances between point F and all other points on the second segment are fixed and can be calculated. The distances in the direct bounding structures on points E and D created by fixation of point F with respect to the second segment can be calculated. Connecting point F to the point on the second segment to which either point E or point D is attached eliminates the need to calculate one of these distances. It is wise to choose connection sets such that the connectivities that require the calculation of indirect bounds, like this latter example, are avoided. Creation of a general procedure for linking two segments with an arbitrary connection set connectivity is central to the algorithm because of the numerous times it must be done in the determination of a complete set of distances for the molecule.
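The reordering step used in CASE 1 (listing the points of each segment in decreasing order of connectivity with the other segment, so that the known distances cluster in one corner of the connection set sub-matrix) can be sketched as follows. This is a hedged illustration only: the function name `reorder_by_connectivity`, the set-of-pairs representation of the connection set, and the toy point labels are assumptions, and ties are simply left in their input order.

```python
from typing import FrozenSet, List, Set, Tuple


def reorder_by_connectivity(seg_a: List[str], seg_b: List[str],
                            known: Set[FrozenSet[str]]
                            ) -> Tuple[List[str], List[str]]:
    """Sort the points of each segment by decreasing number of known
    distances to the other segment (stable sort, so ties keep input order)."""
    def links(point: str, other: List[str]) -> int:
        return sum(1 for q in other if frozenset((point, q)) in known)

    new_a = sorted(seg_a, key=lambda p: -links(p, seg_b))
    new_b = sorted(seg_b, key=lambda p: -links(p, seg_a))
    return new_a, new_b


# Toy connection set (assumed): C is linked to D and E, B is linked to D.
connection_set = {frozenset(("C", "D")), frozenset(("C", "E")),
                  frozenset(("B", "D"))}
print(reorder_by_connectivity(["A", "B", "C"], ["D", "E", "F"], connection_set))
```

Placing the first segment's points in reverse of the returned order next to the second segment's points in the returned order is one way to gather the known entries near a single corner of the off-diagonal block; the exact layout convention of Figure 3 is not reproduced here.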

CASE 2: A Cycle With Intervening Segments

A second type of situation occurs when the number of intervening segments is more than zero. In general in these cases, the cycle connection set can be completely-connected or not. However, the discussion presented here will assume it is not, since the former is actually a special case of the latter. One can begin with the matrix and graph given in Figure 5. The intervening segments are linked to a single point in the first segment, and that point is referred to as the central point. As illustrated previously (25) in an analogous example of a polymer chain with end to end contact, the segments are linked sequentially, starting with the segment adjacent to the first segment, and ending with the end segment. In the solution given in the previous paper (25) each and every point in the intervening segments was linked to the central point in the order they occurred in the matrix. Since this was an ideal case, no problems were encountered. However, when the intervening segments contain more than three points, sequential calculation of the distance between every point on the segment and the central point would mean the calculation of fixed, and therefore unnecessary, extra distances. Because this group of distances must be calculated in a sequential order, vectorization or parallelization would not be possible. Furthermore,

Figure 5: The matrix and graph for a cycle with intervening segments. The lines in the matrix delineate the groups that constitute the representative subset of each segment. The thick lines in the graph represent known distances, while the thin lines in succeeding figures will represent degrees of freedom. The other conventions are as in Figure 4.

because of the numerical instability of the determinant solution of values for the bounds, this can also lead to the propagation of error. In order to avoid these complications, it is best to use as few distances in this sequence as possible. Each segment has one point that is maximally connected to the segment that immediately precedes it, and one point that is maximally connected to the segment that follows it (usually the two points are different points). In order to proceed, the central point is linked to both of these points in each intervening segment, as is done in Figure 6. This structure is called the fan. The fan is a series of consecutive triangles that each share a common edge. There is an indirect bound on all the distances chosen in the fan. However, it is ignored until the last distance, where it is a direct bound. This last added distance has two sets of direct bounds on it: one formed out of choices made for the fan and the other from known distances. If these two sets of bounds overlap, choices made for the fan are within their respective indirect bounds, and if not, one or more indirect bounds are violated. In this way all the indirect bounds can be checked without the necessity of actually calculating them. If the two bounds do not overlap, then either the choices made for the fan violate the indirect bound imposed by the fixed distances in the cycle, or the fixed distances in the cycle are such that it is impossible for the bounds to overlap. If the choices violate the indirect bound, then a new set of choices would have to be made. Situations where the bounds on the last distance added to the fan cannot overlap for any possible set of values for the fan distances often arise when one of the segments in the cycle is itself a cycle that was fixed previously.
Then distances in one of the constituent segments of the cycle, preferably the segment that is a previously fixed cycle, have to be adjusted before a geometrically-valid structure can be generated. This is the principal means by which geometrical inconsistencies within the structure, whether arising from the input distances or from the added distances, can be detected. The fan connects a set

Figure 6: A graph and matrix for a fan. The threes in the matrix represent triangle bounded degrees of freedom. The other conventions are as in Figure 5.

of points that lie on the shortest path, starting with the central point, traveling by known distances through every segment in the cycle, and returning to the central point. Therefore, the fan represents the smallest number of sequentially-chosen distances needed to check either the consistency of the chosen distances in the fan with the indirect bound specified by the fixed distances in the cycle, or the geometrical consistency of the fixed distances of the cycle itself. After a valid fan is in place, all the points in the cycle will have at least a triangle bound between them and the central point, Figure 6. In other words, all points in the cycle are directly bound to the central point. The distance from the central point to all remaining points in each intervening segment is directly constrained by at least a tetrangle inequality. A choice is made for one tetrangle per segment. Now each segment is rigid with respect to the central point. After fixing the chiralities of the appropriate tetrahedra, all distances between the central point and every other point in the cycle are known, and now all points in the cycle are directly bounded to one another through the central point. These distances orient the bulk of the intervening segments about the triangles of the fan, and for this reason, they are called fan orienting distances. These tetrangle choices are independent of each other and can be made in any order. Figure 7 illustrates these fan orienting distances. The intervening segments are now merged into a single segment by sequentially completely-connecting each intervening segment to the intervening segment that immediately precedes it. This requires the technique for linking adjacent segments given previously.
When trying to connect the segment of interest to the segment that immediately precedes it, priority of connectivity should be given first to points maximally connected to the segment immediately preceding the segment of interest, second to points of maximal connectivity with respect to the segment that immediately follows, and last to unconnected points. This clusters the distances into complete

Figure 7: A graph and matrix for a fan after addition of the fan orienting distances. The fours represent tetrangle constrained degrees of freedom. The other conventions are as in Figure 6.

Figure 8: A graph and matrix for a fan after addition of the fan folding distances. The ones in the matrix represent distances fixed after choices are made for the chiralities of the appropriate tetrahedra. These are not represented on the graph. The other conventions are as in Figure 7.

bounding structures, or submatrices, such that all bounds on subsequent distances added are direct and complete at the time of utilization. This is essential if pentangle distances are used to fix the chiralities of the structure. This will maximize the connectivity of the points in the fan, and place these distances between points on the fan. These distances control directly the folding of the fan, and for that reason will be referred to as fan folding distances. These tetrangle choices are independent of each other and can be made in any order. The fan folding distances are illustrated in Figure 8. At this stage, the intervening segments now form one single completely-connected segment, called the loop segment. Most of the distances placed in the matrix so far are those involved in connecting adjacent intervening segments. It would be straightforward to fix each segment in the chain with respect to its neighbor one at a time using the solution for adjacent segments. Because this is a cycle, the end segment is connected to both the segment immediately preceding it in the cycle and the first segment, and both of these connection sets have to be considered when attempting to fix the position of the end segment, as the possibility of overspecification is increased. In order to deal with this, the matrix is divided into three regions, as shown in Figure 9. At this point, three possibilities arise: (1) the first segment may be linked to the loop segment, at which point the first segment and the loop segment become a single segment, and all the distances in all connection sets to the right of the rightmost set of vertical double lines form the connection set between the newly combined first-loop segment and the end segment; (2) the end segment may be linked to the loop segment, at which point all the distances in all connection sets above the highest set of horizontal double lines become the connection set between the newly combined end-loop segment and the first segment; or (3) the end segment may be linked to the first segment, and the distances between the two sets of horizontal double lines and to the right of the rightmost set of vertical double lines, and the distances between the two sets of vertical double lines and above the highest set of horizontal double lines, all become the connection set between the newly-combined first-end segment and the loop segment. This last case is in fact equivalent to having the cycle connection set completely-connected.

Figure 9: Delineation of the intersection zones involved in connecting the first and the end segment to the loop segment.

It is apparent that either a careful choice of the exact sequence of linkage must be made, or careful consideration must go into the types of connectivity patterns allowed to exist in the connection sets. The latter is the simpler option. The types of connection sets that will be allowed to exist are three: the single connection between two segments, the case with three connections between two points on each segment, and the completely-connected case. For the tRNA segment matrix given earlier, covalent connection sets are of the first type, tertiary connection sets are of the second type, and helical connection sets are of the third type. Care must be taken in the numbers of each type allowed to exist in the matrix. Considering the three intersection zones between the first, end, and loop segments, if two of these were completely-connected the loop could not be solved with the procedure detailed here. The order of the segments would have to be rotated in order to remove at least one of the completely-connected connection sets from the first-end, first-loop, or end-loop segment connection sets. The "three connections between the two points on each segment" case is useful, because it allows another point besides the ones that are already connected to be completely-connected to the other segment without overspecifying the connection, while, in fact, exactly specifying the connection. After the fan orienting distances are completed, the central point becomes completely connected to all points in every segment, such that every segment shares at least one completely connected point with every other segment. This must be taken
into account during all further linkage of segments. The selection of the central point in the example illustrated in Figures 5-9 was made to utilize the single connection in the first-loop segment connection set. That leaves no further connections between points in the first segment to points in the loop segment, other than the central point, which has already been incorporated into every segment during the construction of the fan. This allows us to ignore the first-loop connection set, and link the end segment to the loop segment considering only the end-loop connection set, Figure 10, and then the first segment to the end-loop segment considering only the end-first connection set, Figure 11. This choice of the central point effectively eliminates one

Figure 10: Connection of end segment to loop segment. Conventions are the same as in previous figures.

Figure 11: Connection of first segment to loop segment and completion of cycle structure. Conventions are as in previous figures.

connection set when linking the first and end segments to the loop segment. Now the problem becomes one of successive linkage of a path of adjacent segments, as opposed to successive linkage of a cycle of segments. The distances added to fix the first and end segments are referred to as completion distances. The structure of this particular cycle is complete, and the entire cycle may now be treated as a single completely-connected segment for the rest of the calculation.
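The fan consistency check used in CASE 2 can be illustrated with a simplified sketch. It assumes a fan reduced to plain triangle inequalities: the central point is joined to a chain of fan points, each chosen distance must satisfy the triangle bound formed with the previous one, and the last distance is accepted only if its fan-derived bounds overlap the bounds imposed by known distances. The names `spoke` (a distance from the central point to a fan point) and `rim` (a known distance between consecutive fan points), and the function `close_fan`, are hypothetical and do not come from the paper.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]


def triangle_interval(x: float, y: float) -> Interval:
    """Bounds on the third side of a triangle with sides x and y."""
    return abs(x - y), x + y


def close_fan(first_spoke: float, rim: List[float],
              chosen_spokes: List[float],
              known_interval: Interval) -> Optional[Interval]:
    """Walk the fan and return the overlap of the two sets of bounds on the
    last spoke, or None if they do not overlap.

    first_spoke:    known distance from the central point to the first fan point.
    rim:            known distances between consecutive fan points.
    chosen_spokes:  values chosen for every later spoke except the last one.
    known_interval: bounds on the last spoke from known distances in the cycle.
    """
    if len(rim) != len(chosen_spokes) + 1:
        raise ValueError("need exactly one more rim distance than chosen spokes")
    spoke = first_spoke
    for edge, choice in zip(rim, chosen_spokes):
        low, high = triangle_interval(spoke, edge)
        if not low <= choice <= high:
            raise ValueError("chosen spoke violates its triangle bound")
        spoke = choice
    fan_low, fan_high = triangle_interval(spoke, rim[-1])
    low = max(fan_low, known_interval[0])
    high = min(fan_high, known_interval[1])
    return (low, high) if low <= high else None


# Toy fan (assumed values): one chosen spoke, then the closing distance.
print(close_fan(1.0, [1.0, 1.0], [1.5], (0.0, 1.0)))
print(close_fan(1.0, [1.0, 1.0], [1.5], (3.0, 4.0)))
```

A `None` result corresponds to the situation described in the text in which either new choices must be made for the fan or the fixed distances of the cycle are themselves inconsistent; the sketch does not distinguish between the two.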

Building the Structure - Chirality of the Tetrahedra
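The construction treated in this section (coordinates for a completely-connected four-tuple taken from its six distances, with one of two chiralities imposed) can be made concrete with a short sketch. It is an illustration under stated assumptions and not the authors' FORTRAN code: the function names `embed_tetrahedron` and `signed_volume` are hypothetical, the four-tuple is built in its own local frame, and the subsequent orientation against three shared points of the existing structure is not shown.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def embed_tetrahedron(d01: float, d02: float, d03: float,
                      d12: float, d13: float, d23: float,
                      chirality: int = 1) -> List[Point]:
    """Place four points from their six mutual distances.

    Point 0 sits at the origin, point 1 on the +x axis, point 2 in the
    xy-plane with y > 0, and point 3 above (chirality=+1) or below
    (chirality=-1) that plane.
    """
    x2 = (d01**2 + d02**2 - d12**2) / (2.0 * d01)
    y2_sq = d02**2 - x2**2
    if y2_sq <= 0.0:
        raise ValueError("points 0, 1, 2 are collinear or violate the triangle inequality")
    y2 = math.sqrt(y2_sq)
    x3 = (d01**2 + d03**2 - d13**2) / (2.0 * d01)
    y3 = (d02**2 + d03**2 - d23**2 - 2.0 * x2 * x3) / (2.0 * y2)
    z3_sq = d03**2 - x3**2 - y3**2
    if z3_sq < 0.0:
        raise ValueError("distances cannot be embedded in three dimensions")
    z3 = chirality * math.sqrt(z3_sq)
    return [(0.0, 0.0, 0.0), (d01, 0.0, 0.0), (x2, y2, 0.0), (x3, y3, z3)]


def signed_volume(p: List[Point]) -> float:
    """Signed volume of the tetrahedron; its sign is the chirality."""
    a, b, c = [tuple(p[i][k] - p[0][k] for k in range(3)) for i in (1, 2, 3)]
    cross = (b[1] * c[2] - b[2] * c[1],
             b[2] * c[0] - b[0] * c[2],
             b[0] * c[1] - b[1] * c[0])
    return sum(a[k] * cross[k] for k in range(3)) / 6.0


# Regular tetrahedron with unit edges, in both chiralities.
right = embed_tetrahedron(1, 1, 1, 1, 1, 1, chirality=+1)
left = embed_tetrahedron(1, 1, 1, 1, 1, 1, chirality=-1)
print(signed_volume(right), signed_volume(left))
```

The two embeddings reproduce the same six distances and differ only in the sign of the volume, which is the discrete two-valued choice that distances alone cannot fix.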

After identification and assignment of values for the degrees of freedom in the matrix, each point in the DOF matrix is contained in a completely-connected four-tuple that shares three points with at least one other completely-connected four-tuple. The coordinates for each four-tuple are extracted from the distance matrix, a particular chirality is forced upon it, and it is oriented with respect to the rest of the structure using the three points it shares with it. Obviously, there are two possible values of chirality for each constituent tetrahedron, or four-tuple, of the DOF matrix. This provides the last degree of freedom, or choice, that must be made in order to complete a structure. This degree of freedom differs from the others in that it has only two possible discrete values, whereas the other degrees of freedom have a continuous range of values. Every tetrangle degree of freedom completes at least one tetrahedron. Each tetrahedron that contains a tetrangle degree of freedom will have a chirality associated with it. In this way, there is at least one discrete chiral degree of freedom associated with each continuous tetrangle degree of freedom. Choosing a chirality for each tetrahedron (except one) is equivalent to choosing a pentangle constrained distance between pairs of tetrahedra. Working with chiralities, however, is computationally more efficient and less likely to generate error than working with pentangle distances.

Discussion

The minimal covering cycle basis technique described herein, coupled with the basic rules of construction described previously (25), was successfully used to determine the complete set of degrees of freedom associated with a three atom resolution model of tRNA in which base-base hydrogen bonding interactions were provided. A flow chart that summarizes the procedure is provided in Figure 12 to assist the reader in getting an overview of what has necessarily been a very detailed discussion. In the test case a tRNA was broken up into a total of 34 meaningful segments, and these were oriented through the solution of 13 cycles that were found to naturally characterize the problem, Figure 2. In the end a total of 206 degrees of freedom (Table I) were identified. The location and type of each degree of freedom is stored in a single matrix which can be provided to any interested reader. It is not printed here, however, since the structure of tRNA is of course known and the matrix itself is therefore not of prime interest. Performing similar calculations with much larger RNAs (e.g., 16S rRNA) is unquestionably feasible and will require no new techniques. Software which facilitates calculation of the DOF matrix has been developed for the Digital Equipment Corporation (Maynard, Mass.) VAX-3100 computer in FORTRAN running under the VMS operating system. Actual calculation of the DOF matrix for

[Flow chart, only partially recoverable from the extraction: Input Data; Construct Segment Matrix; Find Optimal Spanning Path]
