MATCH-BOX: a fundamentally new algorithm for the simultaneous alignment of several protein sequences.

Vol. 8, no. 5, 1992, Pages 501-509

CABIOS

Eric Depiereux and Ernest Feytmans

Facultés Universitaires Notre-Dame de la Paix, Department of Biology, 61 rue de Bruxelles, B-5000 Namur, Belgium

Abstract

Original algorithms for simultaneous alignment of protein sequences are presented, including sequence clustering and within- or between-groups multiple alignment. The way of matching similar regions is fundamentally new. Complete matches are formed by segments more similar than expected by random, according to a given probability limit. Any classic or user-defined score matrix can be used to express the similarity between the residues. The algorithm seeks complete matches common to all the sequences without performing pairwise alignment and regardless of gap weighting. An automatic screening delineates all the similar regions (boxes) that may be defined for a given maximal shift between the sequences. The shift can be large enough to allow the matching of any region of a sequence with any region of another one. It can also be short and used to refine the alignment around anchor points. The algorithm provides the most likely optimal alignment and a comprehensive list of the alignment dilemma. Duality between automatism and interactivity is provided. Depending on the problem complexity, a final alignment is obtained fully automatically or requires some interactive handling to discriminate alternative pathways.

Introduction

Much effort has already been devoted to provide a tool for the multiple alignment of protein sequences, since it offers the most reliable information for delineating conserved regions in related proteins and for building likely structural models by template forcing on a known structure. Nevertheless, finding an optimal alignment of several sequences remains a challenge, and the advantages of interactive alignments are claimed by several authors, even for pairwise alignments (Rechid et al., 1989; Schuler et al., 1991). The main problems in the automatic multiple alignment procedures seem to be related to the successive pairwise alignment approach, and to the choice of a gap weighting. Most of the approaches are fundamentally based on pairwise alignments (Barton and Sternberg, 1987; Taylor, 1987; Henneke, 1989; Subbiah and Harrison, 1989), or on an alignment of a sequence with a consensus (Gribskov et al., 1987). However, the optimal alignment of several sequences is not necessarily the intersection of the pairwise alignments. Other pairwise methods are based on a previous clustering, the closest sequences being aligned first (Corpet, 1988; Higgins and Sharp, 1989) or on the detection of anchor points common to all the sequences (Vingron and Argos, 1989, 1991). A first approach of simultaneous alignment was based on a generalization of the original Needleman and Wunsch algorithm (Murata et al., 1985). Johnson and Doolittle (1986) also proposed a method of strictly simultaneous alignment. Both methods consume large amounts of CPU time and are therefore severely limited. All existing methods remain dependent on a gap penalty, which is known to be a crucial empirical parameter (Fitch and Smith, 1983; Lipman et al., 1989; Altschul, 1989; Argos et al., 1991). Only small changes of this parameter may lead to very different alignments and the user may be confronted with an unsolved dilemma. Recently Schuler et al. (1991) proposed an attractive workbench to perform a multiple alignment interactively, both simultaneously for all the sequences and independently of a gap penalty. Although this environment seems free of most of the limitations raised by the other methods, alignment is performed exclusively by hand and no information is provided on its practicability in terms of requested know-how and of time consumption.

We previously described a methodology for detecting structural and functional conserved regions in protein sequences (Depiereux and Feytmans, 1991). We showed the interest of this approach in the general problem of multiple alignment, since several sequences are aligned simultaneously according to their multivariate physicochemical profile. In an extension to this work, we present here a set of routines that may be used with any scoring matrix. The algorithms produce automatically an optimal or near-optimal alignment within and between groups of related sequences. Besides the main alignment, a comprehensive output of alignments dilemma allows the user to be aware of possible misalignments. Different interactions with the program make possible the selection of the alignment judged optimal when a non-intelligent process fails in doing so. This duality between automatism and interactivity presents two advantages: (i) the final alignment is obtained quickly when the sequences are similar and free of very long shifts and gaps; and (ii) information is provided to guide an interactive handling when the sequences, or some groups of sequences, are poorly related or considerably shifted.

© Oxford University Press


System and methods


Programs were written in FORTRAN 77 on a VAX 6620 under VMS 5.4-2. The executable version is available on request ([email protected]) for academic use. On VAX, outputs are listings formatted for line printers. In the present work, graphics and final editing of the alignments were handled on a Macintosh SE with Microsoft Excel 2.2. The whole package of programs has been implemented in the BIOSTRUCTURE molecular modeling software and is available from Biostructure s.a., Les Algorithmes, Bat. Euclide, Parc d'Innovation, 67400 Illkirch-Graffenstaden, France. The seven sequences used in the example are rhabdovirus N proteins (numbered from 1 to 7). Information about these sequences is available in: (1) M.Rossius, P.de Kinkelin and M.E.Thiry (in preparation), (2) Gallione et al. (1981) [NCAP$VSVJO], (3) Crysler et al. (1990) [NCAP$VSVSJ], (4) Banerjee et al. (1984), (5) Gilmore and Leong (1988), (6) Tordo et al. (1986) and (7) Bernard et al. (1990) [NCAP$RABVP]. The code within square brackets refers to the SWISS-PROT database; it is given when the sequence is available in release 15 of October 1990.


Algorithms

Definitions

Let us consider a set of r protein sequences of different lengths $L_i$ (i = 1 to r) to be aligned simultaneously. A 'window' is a segment of sequence of constant length w (number of residues). Let w be large enough to determine a structurally meaningful segment, but not so large as to overlap several motifs of secondary structure, and/or several regions of deletions or insertions. In this paper, a seven-residue window is considered as the default value. A 'match' is defined as a set of two similar windows belonging to different sequences. To form a match, the two windows must have a significantly higher similarity (or a significantly lower distance) than windows selected in unrelated proteins. Finding a match requires one to (i) delineate the regions to be compared in different sequences; (ii) define a measure of similarity (or of distance) between windows; (iii) select the limit of similarity (or of distance) that will be used to decide if two windows are matching; and (iv) define criteria for selecting the best match when several matches are found in the compared regions. These steps are organized in the MATCHING algorithm. A 'complete match' is a set of r windows (one per sequence) in which each pair is a match. A complete match is associated with a 'pattern of gaps' g, a vector in which $g_i$ (i = 1 to r) represents the number of gaps introduced in each sequence to align the complete match independently of any other one. Figure 1(a) represents a complete match associated with a pattern g = 0.
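As an illustration of the pattern of gaps, the following minimal sketch derives g from the start positions of the r windows of a complete match. It assumes a convention that the text does not spell out: gaps are inserted ahead of each window so that all windows start in the same column, so that $g_i$ is the latest start position minus the start position in sequence i. The function name `gap_pattern` and the positions are hypothetical.

```python
def gap_pattern(starts):
    """Gaps needed in each sequence so that all windows start in the same column.

    starts[i] is the position of the first residue of the window in sequence i.
    Assumed convention (not from the paper): g_i = max(starts) - starts[i].
    """
    if not starts:
        raise ValueError("at least one window start is required")
    latest = max(starts)
    return [latest - x for x in starts]


# Four windows starting at the same position need no gaps (g = 0).
print(gap_pattern([12, 12, 12, 12]))
# Shifted windows: each sequence receives the gaps needed to catch up with the latest one.
print(gap_pattern([12, 10, 15, 12]))
```

Under this convention, a complete match whose windows all start at the same position has the pattern g = 0, as in Figure 1(a).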


Fig. 1. Schematic view of the screening for four hypothetical sequences. (a) Complete match of seven-residue windows. (b) Box A is formed by several overlapping complete matches. (c) Boxes A and B are associated with the same pattern of gaps and form a significant box. (d) (1) Box C overlaps box A but has another pattern of gaps. (2) Residues included in C and not in A form segments of length $l_i$, all located at the same side of the box A. The residual length of box C, given A, is the smallest value of $l_i$. (e) Part of the box C can be aligned without disrupting the alignment of the significant box A,B. $g_i$ is the number of gaps for aligning sequence i in the box C, given that boxes A and B are already aligned. (f) Box D has another pattern of gaps. Residues included in D and not in A form segments of length $l_i$, located on both sides of the box A. The residual length of box D, given A, is zero.

A whole set of overlapping complete matches associated with only one pattern of gaps forms a 'box' (Figure 1b). A box delineates uninterrupted similar regions in all the sequences. The length of a box is defined as the number of residues included in the box per sequence. A box can be formed by only one single complete match. A whole set of boxes associated with only one pattern of gaps forms a 'significant box' (Figure 1c). The length of a significant box is defined as the cumulated lengths of all the boxes included in it, and this length must be greater than a given limit value. If the limit value is w, a significant box may be formed by a single complete match. A significant box delineates similar regions in all the sequences, interrupted by one or several residues but not by gaps. The difference between two patterns of gaps, $g_j - g_{j'}$, is a vector in which $(g_{ij} - g_{ij'})$ represents the number of gaps or deletions introduced in each sequence (i = 1 to r) to align a significant box j', assuming that the significant box j is already aligned. The distance between the two patterns $g_j$ and $g_{j'}$ is defined as:

$$d_{jj'} = \sum_{i=1}^{r} (g_{ij} - g_{ij'})^2 \qquad (1)$$
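The distance of equation (1), and its square root (which the text uses as the gap cost between two significant boxes), can be sketched as follows. This is only an illustration: the function names and the example patterns are hypothetical, and each pattern is represented as a plain list with one entry per sequence.

```python
import math


def pattern_distance(g_a, g_b):
    """Sum over the sequences of the squared differences between two gap patterns."""
    if len(g_a) != len(g_b):
        raise ValueError("both patterns must cover the same r sequences")
    return sum((a - b) ** 2 for a, b in zip(g_a, g_b))


def gap_cost(g_a, g_b):
    """Square root of the distance between two gap patterns."""
    return math.sqrt(pattern_distance(g_a, g_b))


# Two hypothetical patterns for r = 4 sequences.
g_first = [0, 0, 0, 0]
g_second = [3, 5, 0, 3]
print(pattern_distance(g_first, g_second))  # 9 + 25 + 0 + 9
print(gap_cost(g_first, g_second))
```

Two significant boxes that share the same pattern have a distance of zero, so moving from one to the other adds no gap cost.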

The gap cost between the significant boxes j and j' is defined as the square root of this distance. When two boxes A and C are associated with different patterns of gaps, the 'residual length' of C, given A, is defined in the following way. Let $l_i$ be the number of residues of a sequence i included in box C and not included in box A (with i = 1 to r). If the segments of length $l_i$ are located at the same side of the box A, the residual length of C, given A, is the minimum value of $l_i$ (Figure 1d). This residual length represents the part of the box C that can be aligned without disrupting the alignment of box A (Figure 1e). Figure 1(f) shows a box D, in which the segments of length $l_i$ are found on both sides of box A. Alignment of box D is incompatible with the alignment of box A. The residual length of D, given A, is then zero. Delineation of boxes and significant boxes, and selection of an optimal set of significant boxes are handled by the SCREENING algorithm. The aim is to align simultaneously a homogeneous group of several sequences, but also to align several groups of sequences. We describe first the within-group alignment, and then the features that are specific to the handling of several groups.

Matching

The matching algorithm depends on two sets of parameters. The first set delineates the regions to compare (scanning parameters). The second set defines the matching criteria (matching parameters). The matching algorithm is independent of the gap cost.

Scanning procedure. The sequences are scanned with two nested loops. An initial window is moved in single-residue steps in each sequence. Let $x_{ij}$ (j = 1 to $L_i - w + 1$) be the position of the first residue of the initial window in the sequence i (i = 1 to r), as shown in Figure 2. For each position $x_{ij}$ of the initial window, a running window is moved in single-residue steps in segments of length (2m + 1) of all the other sequences. Let $x_{ks}$ (s = $m_0 - m$ to $m_0 + m$) be the position of the first residue of the running window in the sequence k (k = 1 to r with k ≠ i). In a first run of the matching algorithm, $m_0$ is the position corresponding to the first residue of the initial window ($m_0 = j$), juxtaposing the sequences i and k from the first residue. When refining a previous alignment, $m_0$ is the position corresponding to the first residue of the initial window, corrected for the cumulated shifts and gaps stored in the data file ($m_0 \neq j$). The value of m defines the length of the segment to be scanned by the running window in the sequence k, and has to be fixed by the user. Of course, the scanning is asymmetric at the N- and C-terminal regions of the sequences ($1 \le s \le L_k - w + 1$).

Fig. 2. Schematic view of the scanning procedure for two hypothetical sequences. If the gaps are included in the data file, $m_0 \neq j$. If not, $m_0 = j$.

Matching procedure. A matching is performed between the w pairs of residues that are obtained by putting an initial and a running window side by side. One pair of residues is formed by one residue of the initial window and by the residue facing it in the running window. Two criteria are taken into account for matching: the number of identical pairs in the two windows (Id) and the sum of the distances observed between the residues of each pair (D). D is obtained in reference to a scoring matrix, i.e. a matrix that defines a distance for any pair of residues. A standard scoring matrix such as the PAM250 scoring matrix of Dayhoff (1983), or a user-defined scoring matrix such as the physicochemical scoring matrix of Depiereux and Feytmans (1991), can be chosen by the user. If the scoring matrix to be used is a similarity matrix, it is easily transformed into a distance matrix by changing its sign. Let Y be the 20 x 20 scoring matrix. The cumulated distance for the pairwise comparison of two windows is obtained by

$$D = \sum_{k=1}^{w} y_{ij} \qquad (2)$$
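A minimal sketch of the two matching criteria, Id and D of equation (2), for an initial and a running window placed side by side. The distance function used here (0 for identical residues, 1 otherwise) is only a stand-in chosen for illustration; a real run would look up the distance in a scoring matrix Y such as a sign-changed PAM250. The function names and the two windows are hypothetical.

```python
def identities(win_a, win_b):
    """Id: number of positions at which the two windows carry the same residue."""
    if len(win_a) != len(win_b):
        raise ValueError("windows must have the same length w")
    return sum(a == b for a, b in zip(win_a, win_b))


def cumulated_distance(win_a, win_b, dist):
    """D: sum over the w positions of the distance between the facing residues."""
    if len(win_a) != len(win_b):
        raise ValueError("windows must have the same length w")
    return sum(dist(a, b) for a, b in zip(win_a, win_b))


def toy_distance(a, b):
    """Stand-in for a lookup in the scoring matrix Y (assumption for this sketch)."""
    return 0 if a == b else 1


initial = "PGQLAVT"  # hypothetical seven-residue initial window
running = "PGQMAVS"  # hypothetical seven-residue running window
print(identities(initial, running))
print(cumulated_distance(initial, running, toy_distance))
```

Passing the distance as a function keeps the sketch independent of any particular matrix, in line with the statement that any classic or user-defined scoring matrix can be used.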

In equation (2), k (k = 1 to w) represents one position in the windows, i and j (i, j = 1 to 20) represent the pair of amino acids at position k, and $y_{ij}$ is the distance defined for the pair ij in the matrix Y. The two criteria Id and D are associated with limit values, or cutoffs. In the text, we use the term cutoff for both criteria, assuming that it always refers to the corresponding criterion. The problem of selecting a suitable cutoff will be discussed later. Let us say, in general, that a cutoff is a limit value unlikely to be obtained when matching windows of w residues in unrelated proteins. The matching procedure is executed for each initial window. The algorithm compares one initial window of sequence i (i = 1 to r) with the corresponding running windows and selects the best match in each sequence k (k = 1 to r and k ≠ i). The selection is performed according to the minimum distance method, which uses the following criteria. (i) If the greatest value of Id ($Id_{max}$) is greater than the cutoff, the corresponding running window is selected; if several running windows are selected because they have the same $Id_{max}$, the running window with the lowest value of D is selected among them. (ii) If $Id_{max}$ is lower than the cutoff and if the lowest value of D is lower than the cutoff, the corresponding window is selected. (iii) If one of these criteria is satisfied by several running windows because they have the same value of D, the running window for which $|x_{ij} - x_{ks}|$ is minimum is selected. (iv) If Id is lower than the cutoff and if D is greater than the cutoff for all the running windows of a sequence, no match is selected in this sequence. The selected running windows are then compared two by two, because similarity is not transitive. They match if Id is greater than the cutoff or if D is lower than the cutoff. Two by two comparisons fill up an r x r truth matrix (1 for a match, 0 otherwise), which is stored in a file with the references of the windows. Each sub-matrix filled with 1 defines a complete match. The same matching procedure is then re-executed with the next initial windows, until the last of them. Each sequence is scanned against all the others, because reciprocal comparisons are not redundant. Indeed, when a running window B is selected among several running windows B1, B2, B3, ... as the best match of an initial window A, the running window A, considered among several running windows A1, A2, A3, ..., is not necessarily the best match of the initial window B. Let the file created by the matching procedure be the 'database of matches'.

Screening

The screening algorithm has two main steps. In a first step, complete matches are screened in the database of matches, and organized in boxes and significant boxes according to their patterns of gaps. A cutoff limit value fixes the minimal length of a significant box. Matches that are not combined into a significant box are rejected as random noise. This step produces the whole set of significant boxes that can be obtained within the limits of the scanning and matching parameters chosen by the user. Significant boxes and patterns of gaps are stored in a file. Let this file be the 'database of significant boxes'. The significant boxes cannot always be combined all together in one alignment. Significant boxes associated with incompatible patterns of gaps may result from random noise or from alternatives between different possible alignments in some regions.

The second step of the screening performs a selection of the significant boxes leading most likely to the optimal alignment. The probability of finding a significant box between unrelated regions decreases dramatically as a function of its length. Thus the longest significant box is the most reliable and is selected first. The selection of a significant box includes the selection of all the boxes forming it, here named the selected boxes. When a significant box is selected, all the non-selected boxes of the database are compared with the selected boxes. The lowest residual length of a non-selected box, given each selected box, is computed and stored as its new length. A non-selected box with a residual length lower than w is definitively deleted. The significant boxes are then selected again, according to these changes. The gap cost of a non-selected significant box, given each selected significant box, is computed and the ratio between the residual length and the gap cost is stored. The significant box obtaining the highest value for this ratio is selected. This procedure favours long significant boxes and penalizes additional gaps. This process is repeated until all the boxes are either selected or rejected.

Fig. 3. Flowchart of the main procedures implemented on a VAX. Shadowed boxes represent automatic procedures. Round-edged boxes represent files. Arrows indicate the input-output flow.

SIMIL

A multiple alignment often involves several groups of sequences more similar within than between groups. Algorithms are proposed to determine groups of sequences, by cluster analysis or principal coordinates analysis. These procedures can be omitted if the groups are defined according to an a priori knowledge of the sequences. The algorithm of clustering is described in detail elsewhere (Depiereux and Feytmans, 1991). Briefly, an r x r similarity matrix S is computed from the database of matches, each element $S_{ij}$ being the total number of matches between initial windows defined in sequence i and running windows defined in sequence j. As the procedure of matching is not symmetric, S is not symmetric. If a principal coordinates analysis is requested, the algorithm stops after obtaining and storing the matrix S. If a cluster analysis is requested, the two sequences with the highest similarity are grouped. Then an (r - 1) x (r - 1) similarity matrix S' is computed, each element $S'_{ij}$ being the total number of matches between the sequences, or the number of complete matches between the sequences and the group. According to the highest similarity found, a third sequence is added to the group, or a new group is initialized. The procedure is repeated, always taking into account only the complete matches to evaluate the similarity between groups. At each step, the size of S' is reduced by one; the references of the sequences grouped and the similarity of the group formed are stored. The procedure ends when all the sequences form a single group or when no complete match may be found between the groups already formed. Results are printed in a file and may be presented in a specific diagram or dendrogram as shown in Figure 6(a) (not handled by the program).
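The first grouping step of the cluster analysis can be sketched as follows. Because S is not symmetric, the sketch scores a pair of sequences by $S_{ij} + S_{ji}$; this symmetrization is an assumption made here for illustration and is not stated in the text. The matrix values and the function name `closest_pair` are hypothetical, and the later steps (which rely on complete matches between sequences and groups) are not covered.

```python
from itertools import combinations


def closest_pair(S):
    """Return the pair of sequence indices (i, j), i < j, with the highest similarity.

    S[i][j] counts the matches between initial windows of sequence i and
    running windows of sequence j. The pair score S[i][j] + S[j][i] is an
    assumption of this sketch, not a rule taken from the paper.
    """
    r = len(S)
    return max(combinations(range(r), 2),
               key=lambda p: S[p[0]][p[1]] + S[p[1]][p[0]])


# Hypothetical 4 x 4 asymmetric similarity matrix.
S = [
    [300, 40, 10, 120],
    [35, 310, 150, 20],
    [12, 160, 305, 15],
    [110, 25, 18, 298],
]
print(closest_pair(S))  # sequences 1 and 2 (0-based) are grouped first
```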

Fig. 4. Cumulated frequencies of the physicochemical distances. The y-axis represents the number of matches pertaining to a class of distances; frequencies are cumulated and the sum + 1 is transformed in $\log_{10}$. The x-axis represents the distance derived from 10 physicochemical factors. All the pairs of seven-residue windows (~500 000) are considered twice, for the sequences described in the text (•) and after randomization of the residues (□).

SYMMETRIC

This procedure reads the matrix S written by SIMIL. An r x r normed matrix R is computed by equation (3), whose body is illegible in the source apart from a min operator. In this equation, $S_{ii}$ and $S_{jj}$ are the numbers of initial windows in sequences i and j ($S_{ii} = L_i - w + 1$ and $S_{jj} = L_j - w + 1$) and $R_{ij}$ ($0 \le R_{ij} \le 1$) is a measure of similarity between the sequences i and j, independent of the length of the sequences. It estimates the proportion of residues that may be included in boxes for this pair of sequences. This estimation is influenced by the scanning and matching parameters: for a given cutoff, the correct value may be obtained only for aligned sequences scanned with m = 0. Matrix R is symmetrical.

FACTOR

This program performs a principal coordinates analysis of the matrix R (Gower, in Sneath and Sokal, 1973). The eigenvalues and eigenvectors decomposition of R is performed by the algorithm of Sparks and Todd (in Hill and Griffiths, 1985). Similar sequences are represented by close values in an eigenvector, each eigenvector being associated with an eigenvalue proportional to the similarity of the sequences. Groups of similar sequences may be represented on two-dimensional graphs, in which the two axes are the eigenvectors associated with the highest eigenvalues. The first factor is generally trivial and is not plotted.

Within- and between-groups matching and screening

When groups of sequences are formed, matching and screening can be performed within and between groups. In the matching procedure, the scanning parameter m of the running window can be fixed independently for the sequences pertaining to the same group ($m_{within}$) or to different groups ($m_{between}$). This allows groups to be shifted to a considerable extent, without disrupting the within-groups alignments. The screening procedure can then be executed several times on the same database of matches, each time with a different group of sequences. In each truth matrix, only the rows and columns corresponding to the group of sequences considered are selected. A complete match corresponds to a sub-matrix filled with 1 values. Depending on the group of sequences, each truth matrix will reveal different complete matches, within or between the groups.

Implementation

A flowchart of the main procedures implemented on VAX is presented in Figure 3. It is briefly commented hereafter. CPU time is expressed for a VAX 6620.

Input files

Sequences are stored in separate files under the standard UWGCG format (University of Wisconsin Genetics Computer Group; Devereux et al., 1984). Gaps are allowed in the sequences and are taken into account in the alignment. Up to 15 sequences of up to 999 residues each can be analyzed

L Y F I II
15 (default value).

Matching

Matching is performed according to the parameters stored in the file PARAM.DAT. The CPU time depends on the number and the length of the sequences and on the scanning parameters. The range is between ~10 s (three sequences, m = 15) and 3 min (15 sequences, m = 150). The database of matches is stored in the file MATCH.DAT.

Screening

Screening can be executed several times on the same database of matches. A short interactive session allows the user to select sequences and to specify a size cutoff limit for significant boxes. The CPU time depends on the size of the database of matches and on the number of significant boxes screened. Average CPU time for seven sequences is ~1 min. In the output, the selected boxes are sorted according to the sequences, the one-letter code of the residues being printed side by side for all the boxes. The rejected boxes are printed in the same way, but not sorted. A critical inspection of the listing allows the user to evaluate the consistency of the selection. A box judged inappropriately selected can be deleted from the database of significant boxes and the second step of the screening can be performed again, in a few seconds of CPU. When accepted, the final selection can be handled by the procedure GAP.

GAP

This program provides a formatted output of the sequences, and a new input file which can replace the file PROT.DAT in order to refine the alignment. Gaps can be included from the file GAP.INP, which contains the patterns needed to align all the boxes selected by the screening. Aligned sequences are printed in the file GAP.LIS, vertically and side by side, the residues being represented by lower-case letters if they are included in a box and by upper-case letters if not. A new data file containing the gaps is prepared, allowing one to start a new matching procedure and to refine the alignment. Parameters can be changed (usually the parameter m is lowered and the cutoff limits are relaxed) and a new matching can be performed. Alternatively, gaps can be introduced in the files of sequences by any program compatible with the UWGCG format. The new input file is then built by the program SEQUENCE. When the files of sequences contain gaps, the program GAP can be used

Example

The multiple alignment of seven rhabdovirus nucleoproteins was performed from scratch and blindly, without any a priori knowledge of the sequences. A seven-residue window is defined for the scanning. This value has been fixed empirically as the most convenient in many alignments. In a range of 5-10 residues, its value does not substantially modify the final results. Distance (D) was computed with equation (2), with $y_{ij}$ derived from the 10 physicochemical factors of Kidera et al. (1985). Thus D is the distance between two physicochemical profiles and follows approximately a $\chi^2$ distribution with 70 degrees of freedom when sequences are randomized. The parameter Id (number of identities) is fixed independently to force the matching of patterns of identical residues, regardless of the physicochemical distance between the non-identical residues. If Id is set equal to w, it does not influence the matching. The probability of finding a given number of identical residues in


Within- and between-groups screening is then performed. Several significant boxes are detected, and among them an additional box including all the sequences, with a conserved cysteine (Figure 5, position 240). Two additional long gaps are introduced, at position 183 and 269 respectively. Alignment is refined with m = 5 and with relaxed cutoff limits (D = 50, Id = 3). Finally, some tests are performed in order to confirm that the A'-terminal regions of sequences 5 - 6 do not match with the corresponding regions of the other sequences. Figure 5 shows the final results. It is of great interest that the most highly conserved region detected by the complete matches between the physicochemical profiles (positions 209—347) corresponds

508

to the region discussed by Crysler et al. (1990). These authors suggest that this region of the proteins is involved in the interaction with the RNA. This conclusion, based on sequence similarities, is in accordance with electron microscopic observations obtained by Thomas et al. (1985).

Discussion Standard examples of alignments performed with this method (serine proteases and immunoglobulins) are discussed elsewhere (Depiereux and Feytmans, 1991). We choose the present alignment because it is not trivial to obtain. Some of the sequences are poorly related and few regions are conserved over all the sequences. Moreover, a group of sequences is considerably shifted with respect to the others, showing how crucial the problem of gap weighting can be when using other algorithms. These sequences have been handled intensively with CLUSTAL (Higgins and Sharp, 1989) but no comlete intersection could be found. At this stage, no advantage has ever been found in modifying the window size and the user can use the default value w = 7. The only critical step of the MATCH-BOX method is to fix adequately the parameter m. If it is too short in the first steps, long shifts and gaps remain undetected. It is is too long in the final steps, a high level of random noise impedes an optimal delineation of the significant boxes. Nevertheless, the handling of this parameter is much less empirical than a gap penalty. By choosing a given value for m, the user gives the limits within which it is relevant to match segments and accepts that gaps longer than m that were already included in the sequences will be retained. Then, the matching performed within these limits depends only on the scoring matrix, which is not the case in the methods involving a gap weighting for computing the score. The cutoff values are less critical to fix. Matching randomized sequences provides a guideline to fix limit values corresponding to a given probability. Within the range of l%o —10%, the cutoff limits modify only slightly the limits of the boxes. On the other hand, the screening parameters (size cutoff for boxes and significant boxes) do not influence at all the matching of segments. 
These additional tools help the user to perform an alignment interactively from the database of significant boxes. This is sometimes necessary when matching distant sequences that generate several possible alignments. Ultimately, the fundamental limitation of the alignment method is the scoring matrix used to match the windows. This is also true for any other method. In the context discussed here, Dayhoff's scoring matrix, PAM250, associated with its specific cutoff limit, produces good results. Alignments of serine proteases and immunoglobulins (Depiereux and Feytmans, 1991) illustrate that in some conditions the physicochemical distances allow an accurate prediction of the structurally conserved regions in


unrelated segments of size w can be estimated from a binomial distribution Bi(w; 0.05). For w = 7, the probability of finding at least four identities is P < 0.001. These statistical considerations have been developed elsewhere (Depiereux and Feytmans, 1991). A first scan is performed with a large value of m = 150 for the actual and randomized sequences. This allows all the pairwise comparisons of windows (~500 000) to be performed. The distribution of cumulative frequencies (on a logarithmic scale) of D is given in Figure 4 for actual and randomized sequences. It clearly indicates that at least some regions of the sequences present a similarity higher than expected by chance. The suggested cutoff is the limit of superposition of the two distributions (D = 45). As $\chi^2_{70;\,0.01} = 45$, the probability of matching unrelated segments is ~1%. A first matching is then performed with m = 150, D = 45 and Id = 4. The limit for a significant box is set to seven residues in order to detect all the possible anchor points. This search detects only one reliable anchor point (Figure 5, positions 290-297). Residues 'PGQ' are conserved in all the sequences and other residues are conserved in several pairwise comparisons. The alignment of this significant box requires a very long shift (116 residues) for sequences 5 and 6 with respect to the others. This shift is introduced at the N-terminal extremity of the sequences and a new matching is performed with m = 15. This reduces the random noise considerably and makes the detection of complete matches more efficient. After matching, a cluster analysis and a principal coordinates analysis are performed. The groups of sequences obtained are presented in Figure 6. The dendrogram (Figure 6a) shows three pairs of relatively close sequences (3-2, 1-4 and 5-6). Between 300 and 400 matches are counted between them. A group of four sequences can be detected (1-2-3-4), with ~250 complete matches.
Sequence 7 is less related, but close to the group of four sequences, with about 80 complete matches. Finally, the linkage of all the groups reveals some complete matches between all the sequences. The clustering is confirmed by the principal coordinates analysis (Figure 6b). The plane formed by the second and third factors (18% and 7% of the total variability, respectively) clearly shows two groups of sequences, with sequence 7 in an intermediate position.
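The binomial estimate quoted above for the number of identities between unrelated windows can be checked with a few lines of arithmetic. The sketch below evaluates the upper tail of Bi(w; 0.05) for w = 7; the function and variable names are illustrative assumptions, not part of MATCH-BOX.

```python
from math import comb


def binomial_tail(n: int, p: float, k_min: int) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


w = 7              # window size
p_identity = 0.05  # identity probability per position, as in Bi(w; 0.05)
tail = binomial_tail(w, p_identity, 4)
print(f"P(at least 4 identities in {w} positions) = {tail:.2e}")
```

The tail probability comes out at about 1.9 × 10⁻⁴, which is consistent with the bound P < 0.001 stated for at least four identities.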


homologous proteins. This option is open to the user, who can easily incorporate any scoring matrix into the algorithm.

Acknowledgements

We thank M.Rossius, from the Laboratoire de Biologie Moleculaire et de Genie Genetique, University of Liege, Belgium, for suggesting the sequences of the rhabdovirus N proteins that illustrate our method of simultaneous alignment.

References

Received on October 3, 1991; accepted on March 2, 1992

Note added in proof Version 2.0 of MATCH-BOX will now handle 200 sequences simultaneously with a maximum of 100 000 residues.



Altschul,S.F. (1989) Gap costs for multiple sequence alignments. J. Theor. Biol., 138, 297-309.
Argos,P., Vingron,M. and Vogt,G. (1991) Protein sequence comparison: methods and significance. Protein Engng., 4, 375-383.
Banerjee,A.K., Rhodes,D.P. and Gill,D.S. (1984) Complete nucleotide sequence of the mRNA coding for the N protein of vesicular stomatitis virus (New Jersey serotype). Virology, 137, 432-438.
Barton,G.J. and Sternberg,M.J.E. (1987) A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparison. J. Mol. Biol., 198, 327-337.
Bernard,J., Lecocq-Xhonneux,F., Rossius,M., Thiry,M.E. and de Kinkelin,P. (1990) Cloning and sequencing the messenger RNA of the N gene of viral haemorrhagic septicaemia virus. J. Gen. Virol., 71, 1669-1674.
Corpet,F. (1988) Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res., 16, 10881-10890.
Crysler,J.G., Lee,P., Reinders,L. and Prevec,L. (1990) The sequence of the nucleocapsid protein (N) gene of Piry virus: possible domains in the N protein of vesiculoviruses. J. Gen. Virol., 71, 2191-2194.
Dayhoff,M.O., Barker,W.C. and Hunt,L.T. (1983) Establishing homologies in protein sequences. Methods Enzymol., 91, 524-545.
Depiereux,E. and Feytmans,E. (1991) Simultaneous and multiple alignment of protein sequences. Correspondence between physicochemical profiles and structurally conserved regions (SCR). Protein Engng., 4, 603-613.
Devereux,J., Haeberli,P. and Smithies,O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res., 12, 387-395.
Fitch,W.M. and Smith,T.F. (1983) Optimal sequence alignments. Proc. Natl. Acad. Sci. USA, 80, 1382-1386.
Gallione,C.J., Greene,J.R., Iverson,L.E. and Rose,J.K. (1982) Nucleotide sequence of the mRNA's encoding the vesicular stomatitis virus N and NS proteins. J. Virol., 39, 529-535.
Gilmore,R.D. and Leong,J.C. (1988) The nucleocapsid gene of infectious hematopoietic necrosis virus: a fish rhabdovirus. Virology, 167, 644-648.
Gribskov,M., McLachlan,A.D. and Eisenberg,D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. USA, 84, 4355-4358.
Henneke,C.M. (1989) A multiple sequence alignment algorithm for homologous proteins using secondary structure information and optionally keying alignments to functionally important sites. Comput. Applic. Biosci., 5, 141-150.
Higgins,D.G. and Sharp,P.M. (1989) Fast and sensitive multiple sequence alignments on a microcomputer. Comput. Applic. Biosci., 5, 151-153.
Hill,I.D. and Griffiths,P. (1985) Applied Statistics Algorithms. John Wiley, New York.
Johnson,M.S. and Doolittle,R.F. (1986) A method for the simultaneous alignment of three or more amino acid sequences. J. Mol. Evol., 23, 267-278.
Kidera,A., Konishi,Y., Oka,M., Ooi,T. and Scheraga,H.A. (1985) Statistical analysis of the physical properties of the 20 naturally occurring amino acids. J. Protein Chem., 4, 23-53.
Lipman,D.J., Altschul,S.F. and Kececioglu,J.D. (1989) A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. USA, 86, 4412-4415.
Murata,M., Richardson,J.S. and Sussman,J.L. (1985) Simultaneous comparison of three protein sequences. Proc. Natl. Acad. Sci. USA, 82, 3073-3077.
Rechid,R., Vingron,M. and Argos,P. (1989) A new interactive protein sequence alignment program and comparison of its results with widely used algorithms. Comput. Applic. Biosci., 5, 107-113.
Schuler,G.D., Altschul,S.F. and Lipman,D.J. (1991) A workbench of multiple alignment construction and analysis. Proteins: Struct. Funct. Genet., 9, 180-190.
Sneath,P.H.A. and Sokal,R.R. (1973) Numerical Taxonomy. W.H.Freeman, San Francisco.
Subbiah,S. and Harrison,S.C. (1989) A method for multiple sequence alignment with gaps. J. Mol. Biol., 209, 539-548.
Taylor,W.R. (1987) Multiple sequence alignment by a pairwise algorithm. Comput. Applic. Biosci., 3, 81-87.
Thomas,D., Newcomb,W.W., Brown,J.C., Wall,J.S., Hainfeld,J.F., Trus,B.S. and Steven,A.C. (1985) Mass and molecular composition of vesicular stomatitis virus: a scanning transmission electron microscopy analysis. J. Virol., 54, 598-607.
Tordo,N., Poch,O., Ermine,A., Keith,G. and Rougeon,T. (1986) Walking along the rabies genome: Is the large G-L intergenic region a remnant gene? Proc. Natl. Acad. Sci. USA, 83, 3914-3918.
Vingron,M. and Argos,P. (1989) A fast and sensitive multiple sequence alignment algorithm. Comput. Applic. Biosci., 5, 115-121.
Vingron,M. and Argos,P. (1991) Motif recognition and alignment for many sequences by comparison of dot matrices. J. Mol. Biol., 218, 33-43.
