Bulletin of Mathematical Biology, Vol. 41, pp. 1-20. Pergamon Press, Ltd. 1979. Printed in Great Britain. © Society for Mathematical Biology

ON THE ALGORITHMS FOR DETERMINING THE PRIMARY STRUCTURE OF BIOPOLYMERS

YA. S. SMETANIČ and R. V. POLOZOV, Institute of Biological Physics, Academy of Sciences of the USSR, Pushchino, Moscow Region, USSR

The algorithms for determining the primary structure of biopolymers from complete and partial digests are analyzed. The problem of determining the primary structure is formulated as a problem of word reconstruction, and the corresponding algorithms are analyzed within this framework. Difficulties arising in constructing algorithms for determining the primary structure of nucleic acids from a partial digest are discussed; they appear to be due to the extensive testing of variants. For a certain scheme of initial data from a partial digest we propose an economical testing (searching) algorithm. The scheme of an effective algorithm for reconstructing the primary structure from N complete digests is given.

Introduction. Determining the primary structure of proteins and nucleic acids is an urgent problem in modern biology. There are two radically different methods of determining primary structures: (a) successive removal of monomer units from the end of a polymer chain, the stepwise degradation method, and (b) splitting of a polymer into overlapping blocks, the overlap method. The stepwise degradation method has not so far gained wide recognition because of the difficulties in selecting optimal conditions for the monomer removal reaction (asynchronous chain cleavages, etc.). The overlap method has turned out to be an effective way to determine the primary structures of biopolymers. It is with this method that the task has been solved for many proteins and nucleic acids (Salser, 1974; Mandeles, 1972; Sanger, 1952; Holley et al., 1965). Two approaches can be used for the reconstruction of a primary structure: (a) an intuitive-combinatorial approach used by an investigator in routine work, and (b) a mathematical method that allows the construction of algorithms that can be realized on a computer.


Recent achievements in the development of new biochemical methods of determining primary structures have changed ideas about the possibilities of the overlap method (Sanger et al., 1977; Maxam and Gilbert, 1977). Nevertheless, the mathematical approach developed for the overlap method can play an important role of its own, for instance in constructing computer programs for word reconstruction on the basis of any initial data. The first papers to introduce formal aspects into this problem were (Bernard et al., 1963; Rice et al., 1963; Tumanyan et al., 1963; Bradley et al., 1964). A number of theoretical approaches and some algorithms for the reconstruction of primary structures were then considered (Merril et al., 1965; Dayhoff, 1965; Mosimann et al., 1966; Hutchinson, 1969). Another, independent approach was proposed in (Smetanič, 1971; Polozov et al., 1972; Smetanič, 1973). In the present paper some known algorithms for determining primary structures are analyzed. The possibility of constructing a "good" algorithm for determining the primary structure of nucleic acids from a partial digest is considered. The difficulties arising here are due to the extensive testing of variants. For a certain scheme of initial data from a partial digest we propose an economical algorithm for testing variants. The scheme of an effective algorithm for reconstructing the primary structure from N complete digests is presented.

1. Formulation of the problem. 1. We shall use notions and terms similar to those considered in the papers (Mosimann et al., 1966; Hutchinson, 1969; Smetanič, 1971; 1973). The overlap method of determining primary structures consists in generating overlapping fragments of a molecule, on the basis of which the monomer sequence of the entire molecule can be reconstructed. From the formal point of view, the primary structure of a biopolymer can be considered as a word in an appropriate alphabet. Subwords of this word correspond to the separate fragments generated by a complete or partial digest. The distinction between a one-letter word and a word consisting of several letters is clear from the context. For the sake of clarity, we shall use the geometrical representation of words. We now introduce the term decomposition of a word. Suppose that the word X is represented in the form X = X1X2X3, where X1 = ABC, X2 = ABD, X3 = BBB. Figure 1 shows the decomposition of the word X into the three subwords X1, X2, X3, each of which we shall call a section of the word. Figure 2 demonstrates another decomposition of the word X, where X = Y1Y2Y3Y4, Y1 = AB, Y2 = CAB, Y3 = DBB, Y4 = B.
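
The notion of a decomposition can be checked mechanically. The following minimal Python sketch verifies that both lists of sections given above concatenate to the same word X. The helper name `is_decomposition` is introduced here for illustration only and does not appear in the paper.

```python
def is_decomposition(word, sections):
    """Return True if the non-empty sections, joined in order, spell the word."""
    return all(sections) and "".join(sections) == word


X = "ABCABDBBB"
x_sections = ["ABC", "ABD", "BBB"]      # X1, X2, X3
y_sections = ["AB", "CAB", "DBB", "B"]  # Y1, Y2, Y3, Y4

print(is_decomposition(X, x_sections))  # True
print(is_decomposition(X, y_sections))  # True
```

Both calls print `True`: the two decompositions cut the same word at different positions, which is the situation the overlap method exploits.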


X = ABC | ABD | BBB

Figure 1

X = AB | CAB | DBB | B

Figure 2

Both these decompositions of the word X are given in Figure 3. Figure 4 demonstrates two complete digests of a fragment of the E. coli 5S RNA molecule (Mandeles, 1972) in the form of two decompositions. One digest is performed by pancreatic ribonuclease and the other by ribonuclease T1. Let the word X be represented in the form X = X1X2, and the word Y in the form Y = X2X3, where X1, X2, X3 are arbitrary words. We shall say that the word X overlaps with the word Y by the word X2, and that the word X2 is the overlapping of the words X and Y. Let X = aYb; then the letters a and b will be called the boundary letters. If X = Y1aY2, then a is said to occur in the word X. The term occurrence of one and the same letter (or a subword)

X = ABCABDBBB

Figure 3

Z = GUAGCGCCGAUGGUAG

Figure 4


in different sites of the word is defined in a similar way. We shall use the terms covering of a word and connected covering introduced in the papers (Smetanič, 1971; 1973). Any proper subset of the system of sections constituting a given decomposition of the word X is said to be an incomplete covering of X. Figure 5 presents an incomplete covering of the word X by the sections Zi, 1 _, (aa, 5), ,, } ;

C2[Y1] = {(ab, 3), , (aa, 1), (bb, 3), (b, 1), , (bb, 1), (a, 1), (1, b)}; C2[Y3] = {, , (aa, 1), (bb, 2), (a, 1), (1, a)}; C2[Z3] = {(ab, 2), (ba, 2), (aa, 0), (bb, 3), (b, 1), (1, b)}. This is presented schematically in Figure 8 by the sections Y1, Y2, Y3 and Z1, Z2, Z3 of the decompositions Y and Z, respectively. Naturally, we do not know how these sections are arranged in each decomposition. We now have to carry out the procedures described in Section 3, point 3.
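
To make the composition notation concrete, the following Python sketch computes a composition of the kind listed above. It rests on a reading of the notation that is inferred from context, since the definition of C2 lies outside this excerpt: an element (xy, n) is taken to mean that the two-letter subword xy occurs n times, while (b, 1) and (1, b) are taken to record the first and last letters. The function name `composition2` and the sample word are hypothetical. The paper does not state the word Z3 itself, and other words share the same composition.

```python
from collections import Counter


def composition2(word):
    """Return (pair_counts, first_letter, last_letter) for a non-empty word."""
    if not word:
        raise ValueError("word must be non-empty")
    # Count every pair of adjacent letters, overlapping pairs included.
    pairs = Counter(word[i:i + 2] for i in range(len(word) - 1))
    return pairs, word[0], word[-1]


# Hypothetical word chosen only because its composition agrees with the
# listed C2[Z3]; it is not claimed to be the actual section Z3.
pairs, first, last = composition2("bbababbb")
print(pairs["ab"], pairs["ba"], pairs["aa"], pairs["bb"], first, last)
# prints: 2 2 0 3 b b
```

A `Counter` returns 0 for a pair that never occurs, which matches the explicit zero entries such as (aa, 0) in the listed sets.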


(1) The first stage of the algorithm. Form the "sum" of the compositions C2[Yi], 1 _, (ba, 0), (aa, 1), (bb, 1), (a, 1), (1, b)}. Form the pairs (ab, 1), (aa, 1), (bb, 1) from the boundary letters of the compositions C2[Yi], 1 ≤ i ≤ 3. Add the elements so obtained to the difference C2[X]\C 3. Thus we obtain the set: C2[Y] = {(ab, 1), (ba, 0), (aa, 2), (bb, 2), (a, 1), (1, b)}. Form in a similar way C2[Z] = {(ab, 1), (ba, 0), (aa, 3), (bb, 1), (a, 1), (1, b)}.

(2) The second stage of the algorithm. Now we need to form the arrays M_Y and M_Z. In forming the permutations of compositions we shall, for brevity, write Yi and Zi instead of C2[Yi] and C2[Zi], respectively. We shall assign indices to the elements of M_Y and M_Z. Let us form the array M_Y = {m_i^Y}, 1 ≤ i ≤ 6:

Y1Y2Y3 → (b b a b a a)1 = m_1^Y
Y2Y3Y1 → (a b a a b b)2 = m_2^Y
Y2Y1Y3 → (a b b b a a)3 = m_3^Y
Y1Y3Y2 → (b b a a a b)4 = m_4^Y
Y3Y1Y2 → (a a b b a b)5 = m_5^Y
Y3Y2Y1 → (a a a b b b)6 = m_6^Y.

Form the array M_Z = {m_i^Z}, 1_
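
The construction of the array for the decomposition Y can be sketched in Python as follows. The boundary letters of Y1, Y2, Y3 and the set C2[Y] are taken from the text above. The excerpt breaks off before stating how the array is used, so the filtering step, which keeps only those permutations whose boundary-letter word has the composition C2[Y], is an assumption made for illustration. All identifiers (`BOUNDARY`, `boundary_word`, `matches_target`) are hypothetical.

```python
from collections import Counter
from itertools import permutations

# (first letter, last letter) of each section, as read off the array above.
BOUNDARY = {"Y1": ("b", "b"), "Y2": ("a", "b"), "Y3": ("a", "a")}

# C2[Y] from the first stage; the zero entry (ba, 0) is left implicit.
TARGET_PAIRS = Counter({"ab": 1, "aa": 2, "bb": 2})
TARGET_FIRST, TARGET_LAST = "a", "b"


def boundary_word(order):
    """Concatenate the boundary letters of the sections in the given order."""
    return "".join(first + last for first, last in (BOUNDARY[n] for n in order))


def matches_target(word):
    """Assumed test: compare the word's pair counts and end letters to C2[Y]."""
    pairs = Counter(word[i:i + 2] for i in range(len(word) - 1))
    return (pairs == TARGET_PAIRS
            and word[0] == TARGET_FIRST
            and word[-1] == TARGET_LAST)


# Enumerate all six orderings, then keep those consistent with C2[Y].
array_my = {order: boundary_word(order) for order in permutations(BOUNDARY)}
admissible = [order for order, word in array_my.items() if matches_target(word)]
print(admissible)  # [('Y3', 'Y2', 'Y1')]
```

Under this assumed test only the ordering Y3Y2Y1, with boundary word aaabbb, is consistent with C2[Y]. Enumerating all permutations grows factorially with the number of sections, which is the "extensive testing of variants" the paper identifies as the main difficulty.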
