Vol.8, no 6. 1992 Pages 529-534

CABIOS

Fast statistically based alignment of amino acid sequences on the base of diagonal fragments of DOT-matrices

V.B.Streletc, I.N.Shindyalov, N.A.Kolchanov and L.Milanesi¹

Abstract

Introduction

In general, alignment algorithms are used to investigate evolutionary and functional correlations of protein or nucleotide sequences. In this context, the problem of alignment may be formulated as follows: compare two sequences and choose the set of elementary transformations that best reflects their similarity according to some definite measure. Transformations that correspond to real mutational events, such as deletions or insertions, are considered elementary. Among the simplest visual methods, the matrix of dot homology (DOT-matrix) is widely used. In this matrix D, each element $D_{ij}$ reflects the similarity (according to some measure) of the symbol at position $i$ of the first sequence to the symbol at position $j$ of the second sequence. Parts of diagonals consisting of non-zero elements reflect a certain homology of long sequence regions (Gibbs and McIntyre, 1970). Thus, the problem of alignment may be represented as finding the path within the DOT-matrix that has maximal weight according to the above measure. Figure 1 gives an example of such an optimal path and the corresponding alignment of the sequences.
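As a minimal illustrative sketch (not taken from the original program), the DOT-matrix and the runs of non-zero elements along its diagonals might be computed as follows. The identity-based similarity measure, the function names `dot_matrix` and `diagonal_runs`, and the `min_len` parameter are assumptions made for this example only.

```python
def dot_matrix(seq_a, seq_b):
    """Return D with D[i][j] = 1 if seq_a[i] == seq_b[j], else 0.

    Assumption: similarity is plain symbol identity; any other
    measure could be substituted here.
    """
    return [[1 if a == b else 0 for b in seq_b] for a in seq_a]


def diagonal_runs(d, min_len=1):
    """Return (i, j, length) for every maximal run of non-zero
    elements on a diagonal of d; (i, j) is the 0-based start."""
    la, lb = len(d), len(d[0])
    runs = []
    for offset in range(-(la - 1), lb):      # offset = j - i
        i, j = max(0, -offset), max(0, offset)
        start, length = None, 0
        while i < la and j < lb:
            if d[i][j]:
                if length == 0:
                    start = (i, j)
                length += 1
            else:
                if length >= min_len:
                    runs.append((start[0], start[1], length))
                length = 0
            i += 1
            j += 1
        if length >= min_len:                # run reaching the matrix edge
            runs.append((start[0], start[1], length))
    return runs


d = dot_matrix("TAAATCAGCGGG", "TAAATCGGG")
runs = diagonal_runs(d, min_len=3)
```

Raising `min_len` discards short diagonal runs, which are the ones most likely to arise from chance matches.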

The classical Needleman-Wunsch (NW) algorithms are characterized by the following features: (i) efficiency: the number of operations is of the order L × L (where L is the sequence length); (ii) memory required: the number of bytes necessary for the storage and treatment of the matrix is likewise of the order L × L.
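The quadratic cost can be seen in a bare-bones global alignment score computation. This is a sketch under assumed scoring values (match +1, mismatch -1, gap -1), not the scoring scheme of the original NW paper or of the present work; the function name `nw_score` is hypothetical.

```python
def nw_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Return (global alignment score, number of matrix cells stored).

    The full (len_a + 1) x (len_b + 1) matrix is kept in memory and
    every cell is visited once, hence the L x L time and storage.
    """
    la, lb = len(seq_a), len(seq_b)
    f = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(1, la + 1):
        f[i][0] = i * gap                    # leading gaps in seq_b
    for j in range(1, lb + 1):
        f[0][j] = j * gap                    # leading gaps in seq_a
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            f[i][j] = max(f[i - 1][j - 1] + s,   # align the two symbols
                          f[i - 1][j] + gap,     # gap in seq_b
                          f[i][j - 1] + gap)     # gap in seq_a
    return f[la][lb], (la + 1) * (lb + 1)
```

For two sequences of 1000 residues each, this already means about a million cells to fill and store.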

Of all the attempts to improve the classical algorithms, 'segment' alignment should be mentioned first. Here an optimal path is defined within the DOT-matrix. The path involves a number of small sub-areas, and the correct alignment for each area is obtained conventionally (Boswell and McLachlan, 1984). Although this method certainly speeds up the algorithm, it requires certain restrictions on the area of the search for potentially neighbouring elements along the optimal path within the DOT-matrix (Roitberg, 1984). Considerable attention should also be paid to the algorithms based on the so-called 'diagonal fragments' (Df), i.e. parts of diagonals of the DOT-matrix corresponding to certain homologous sequences. For example, the simplest filter imposed on the Df length (l > 2) eliminates short homologies, thus speeding up the search for the optimal path (Maizel and Lenk, 1981; Novotny, 1982). But such a speed-up (by a factor of 5-10) does not solve the storage problem, which gives rise to the idea of saving only the 'best' Df (Roitberg, 1984). Using this approach along with the classical NW algorithms results in a marked loss of accuracy (here that of alignment is equal to a

[Figure 1 panel: DOT-matrix of Seq A (TAAATCAGCGGG) against Seq B (TAAATCGGG), with the optimal path marked through the 1st and 2nd homology regions. Corresponding alignment:]

Seq A: T A A A T C A G C G G G
Seq B: T A A A T C - - - G G G

Fig. 1. Example of the DOT-matrix for sequences A and B and their optimal alignment. Marked are elements of the maximum-matched path (parts of diagonals corresponding to regions of homologous subsequences).

Institute of Cytology and Genetics of the Siberian Department of Russian Academy of Sciences, Lavrentyeva avenue 10, 630090 Novosibirsk, Russia and ¹Istituto di Tecnologie Biomediche Avanzate, Consiglio Nazionale Delle Ricerche, via Ampere 56, 20131 Milano, Italy

© Oxford University Press


Downloaded from http://bioinformatics.oxfordjournals.org/ at University of Birmingham on March 20, 2015

We present a new pairwise alignment algorithm that uses iterative statistical analysis of homologous subsequences. Unlike the classical conversion of the DOT-matrix characteristic of the Needleman-Wunsch (NW) algorithm, we use only those matrix elements that correspond to the most non-random subsequence homologies. The most reliable elements of the DOT-matrix are written to compact competition matrices. The algorithm then searches for an alignment on the basis of these matrix elements only. Our algorithm has low storage and memory requirements, yet provides a reliable alignment for sequences of weak homology (or at least for their homology regions). In such cases classical NW algorithms often produce unreliable results at the level of statistical noise, owing to the accumulation of random matches throughout the aligned sequences.


Statistical estimation and selection of diagonal fragments

Let us consider two sequences A and B of lengths $L_A$ and $L_B$, respectively. This pair of sequences is represented in the DOT-matrix D by the elements $D_{ij}$ ($1 \le i \le L_A$, $1 \le j \le L_B$).

[Figure 4: axes labelled 'Depth of Df storage' and 'Length of seq B'.]

Sequence A: TAAATCAGCGGG
Sequence B: TAAATCGGG

Weight 5: AgCGGG (A, positions 7-12) / AtCGGG (B, positions 4-9)
Weight 7: TAAATCaG (A, positions 1-8) / TAAATCgG (B, positions 1-8)

Fig. 5. Considered homologies of subregions for the sequences given in Figure 1 (with arbitrary weights).

$$Df_k > 0 \tag{9}$$

(i.e. only if the element $Df_k$ describes matching of positions of sequences A and B) and

$$W_{x,\,N_{A1}+k-1} = \min_i \left( W_{i,\,N_{A1}+k-1} \right) \quad \text{when } i = 1, \ldots, M_{\min} \tag{10}$$

(i.e. after the minimal-weight element in column $N_{A1}+k-1$ and some row $x$ of the weight matrix W is found) and

$$V(Df) > W_{x,\,N_{A1}+k-1} \tag{11}$$

(i.e. only if the weight of the element $Df_k$ from the fragment Df being written is more than the minimal weight found in column $N_{A1}+k-1$ of the weight matrix W), then

$$C_{x,\,N_{A1}+k-1} = N_{B1}+k-1 \tag{12}$$

(i.e. the coordinate of the position in sequence B corresponding to the element $Df_k$ is written to matrix C) and

$$W_{x,\,N_{A1}+k-1} = V(Df) \tag{13}$$

(i.e. the weight of the element $Df_k$ is written to matrix W).

Information on elements $Df_k$ that correspond to the absence of homology between positions of A and B is not written to the matrices (Figure 4). Thus, like elements of the DOT-matrix describing non-homologous positions, they are not used in the alignment: as a rule, their weight is zero, so they contribute nothing to the average weight of the optimal path. According to rules (9)-(13), if no free cell is found in column $N_{A1}+k-1$ of matrix W (or C), the previously written element of an earlier analysed Df is 'expelled' by the more optimal element of the next Df. In effect, rules (9)-(13) ensure a 'competition' within columns between elements of different Df. Such competition is possible because the number of rows in matrices C and W ($M_{\min}$) is much less than $L_B$. The choice of $M_{\min}$ is based on the analysis of results obtained from various test alignments of real amino acid sequences. The analysis has shown that, to approach the alignment obtained with the NW algorithm for closely related protein sequences (homology >70%), it was sufficient to consider only two or three elements of different Df for each position of the first sequence (i.e. $M_{\min} = 3$). And only five were required for the case of homology
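The column-wise competition described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: free cells are taken to have weight 0, a whole fragment carries a single weight that is passed in by the caller (the statistical weighting itself is not reproduced here), and the names `new_competition_matrices` and `write_fragment` are hypothetical.

```python
def new_competition_matrices(len_a, m_min):
    """Create empty C (coordinates in B) and W (weights) matrices
    with m_min rows and one column per position of sequence A."""
    c = [[None] * len_a for _ in range(m_min)]
    w = [[0] * len_a for _ in range(m_min)]   # assumption: free cell = weight 0
    return c, w


def write_fragment(c, w, seq_a, seq_b, n_a1, n_b1, length, weight):
    """Write one diagonal fragment into C and W.

    n_a1, n_b1: 1-based start positions of the fragment in A and B.
    """
    for k in range(1, length + 1):
        pos_a, pos_b = n_a1 + k - 1, n_b1 + k - 1
        # only matching positions are considered
        if seq_a[pos_a - 1] != seq_b[pos_b - 1]:
            continue
        col = pos_a - 1                        # 0-based column index
        # find the row x holding the minimal weight in this column
        x = min(range(len(w)), key=lambda i: w[i][col])
        # the new element must outweigh that minimum; if so it takes the cell
        if weight > w[x][col]:
            c[x][col] = pos_b
            w[x][col] = weight


seq_a, seq_b = "TAAATCAGCGGG", "TAAATCGGG"
c, w = new_competition_matrices(len(seq_a), m_min=1)
write_fragment(c, w, seq_a, seq_b, n_a1=7, n_b1=4, length=6, weight=5)
write_fragment(c, w, seq_a, seq_b, n_a1=1, n_b1=1, length=8, weight=7)
```

With a single row, each position of sequence A keeps only its heaviest candidate; a larger `m_min` retains several competing candidates per position at a proportional memory cost.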
