Vol. 8, no. 5, 1992, Pages 451-459

ADSP - a new package for computational sequence analysis

D.J.Parry-Smith and T.K.Attwood1

Department of Genetics and 1Department of Biochemistry and Molecular Biology, University of Leeds, Leeds LS2 9JT, UK

© Oxford University Press

Downloaded from http://bioinformatics.oxfordjournals.org/ at OCLC on July 26, 2015

Abstract

A new protein sequence analysis package, ADSP, is described, of which the SOMAP Screen-Oriented Multiple Alignment Procedure forms an integral part. ADSP (Algorithms and Data Structures for Protein sequence analysis) incorporates facilities to generate potent pattern-recognition discriminators and offers four algorithms with which to scan any NBRF format sequence database: the package has been designed, in particular, to interface with the OWL composite sequence database, one of the largest, distributed non-redundant sources of sequence data of its kind. The system incorporates a powerful method for compound feature analysis, which provides the basis for characterizing and predicting the occurrence of complete protein superfamilies and for pinpointing the emergence of related sub-families. Used iteratively, the approach allows diagnostic performance to be rigorously refined and its efficacy to be assessed both qualitatively and quantitatively, and results in the generation of refined structural or functional features suitable for entry into a database: this compilation of characteristic signatures is distinct from, but complementary to, widely used compendia of pattern templates such as PROSITE.

Introduction

There is currently a substantial dearth of protein structural information compared with the abundance of available sequences, such that entries in the OWL composite sequence database (Bleasby and Wootton, 1990) now outweigh those in the Brookhaven structural database (Bernstein et al., 1977) by ~70:1. This disparity has created an urgent need to develop analytical methods that exploit primary sequence data itself, in particular to allow pattern recognition, structural homology detection and structural or functional motif prediction. At the heart of the methods for searching out characteristic motifs lies the task of multiple sequence alignment, and there have been numerous efforts to address this problem, both manual and automatic, e.g. GCG's LINEUP (Devereux et al., 1984), HOMED (Stockwell and Petersen, 1987), MASE (Faulkner and Jurka, 1988), CLUSTAL (Higgins and Sharp, 1988), LUPES' MANALIGN (Akrigg et al., 1988), MSA (Lipman et al., 1989), ESEE (Cabot and Beckenbach, 1989) and SOMAP (Parry-Smith and Attwood, 1991). Development of sound sequence alignments is of central importance because these provide the foundation on which to build sensitive analytical strategies, from which to attempt rational discriminator design.

There are two distinct components to alignment analysis: the first is the qualitative examination of the alignment for patterns of critically conserved residues or residue groups; the second involves using such features for database exploration, to seek possible clues to their interpretation in terms of structure and function. Although the LUPES package was designed for this purpose, it provides a cumbersome line-mode alignment routine for only 10 sequences and its emphasis is on subjective discriminator design; the GCG package allows alignment of up to 30 sequences but the main thrust of its analytical options is for nucleic acids; and similarly, the Intelligenetics Suite provides relatively few facilities for protein sequences, the emphasis of its analytical options being on rapid pattern-searching (Abarbanel et al., 1984) rather than on the construction of definitive feature discriminators.

In addressing some of these problems, we have developed powerful, quantifiable methods for feature definition and refinement, known collectively as ADSP. Features may consist of any conserved element or elements of an alignment: they may constitute fingerprints or signatures corresponding to known functional or structural motifs, or their significance may be unknown. It is sufficient that the patterns they characterize have been resolved rigorously, and that they are consequently both descriptive of their own true set and predictive of any subsequent occurrence of that pattern.

Terms such as feature, motif, pattern, profile and template tend to be used interchangeably, and for this reason the nomenclature can be rather confusing. We generally use the terms motif or feature to describe a single conserved region in a multiple sequence alignment; groups of motifs or features, which together characterize a particular family of sequences, are then referred to as patterns or compound features. A motif-set is the actual set of equivalent aligned sequence motifs excised from a given position in an alignment and used as a database discriminator; and groups of motif-sets, which together encode a pattern, are then referred to as composite discriminators. As we will see, the scanning algorithms provided in ADSP convert the sequence information contained in motif-sets into residue frequency data: they are thus identity scans. Since no other information is used routinely (e.g. similarity data), the discriminators are not profiles (e.g. Gribskov et al., 1987). The methods used in ADSP work on the principle of iterative


System and methods

ADSP was designed to operate on DEC VAX processors running the VMS operating system. The system is written in VAX C with a user interface implemented in Digital Command Language (DCL). The programs PLOT and ROC both produce graphical output by calling Graphical Kernel System (GKS) routines, allowing output to be directed to a wide range of display devices including Tektronix terminals and laser printers. A terminal offering VT100 functionality is required to support SOMAP's screen-mode capability, which is itself implemented by calls to the standard CURSES screen library. Single key presses are read from the keyboard via function calls to the VMS screen management library. In order to provide a user interface attractive to potential users, an X-windows implementation is currently in progress. The system interfaces with the OWL composite sequence database (Bleasby and Wootton, 1990) and is compatible with NBRF sequence file format.

Algorithms and implementation

The general scheme of alignment and analysis encompassed by the ADSP system is shown in Figure 1. Principal facilities include: SOMAP, an interactive multiple sequence alignment procedure; SCAN, for database-searching; COMPARE, for hitlist correlation; PLOT, for qualitative assessment of diagnostic performance; ANALYSE, for producing residue frequencies and propensities, and for listing true-positive and false-positive hits; ROC, a procedure to allow quantitative evaluation of discriminating power; and CONVERT, for providing interfaces to automatic alignment methods and to alternative sequence analysis packages. The manner in which ADSP is used to construct a composite


Fig. 1 General outline of the ADSP system, illustrating the relationships between the various package components. Sequences are aligned using SOMAP, sets of aligned motifs are output and used to SCAN the OWL composite sequence database, the results are correlated using COMPARE, and new motif-sets are fed back into the system for iterative refinement. Diagnostic performance is assessed qualitatively using the single-sequence graphical PLOT option, or quantitatively by using ANALYSE to derive statistical data suitable for input to ROC.

discriminator follows the scheme outlined above: an alignment is first constructed using SOMAP (either from scratch, or by prior use of the CLUSTAL automatic alignment system, using CONVERT); conserved features are identified with the aid of SOMAP's symbolic display of identities and similarities, and these are output in the form of a number of motif-sets; motif-sets are converted into a form suitable for database scanning (this is the table-file format), and the database is searched independently with each motif-set/table-file using SCAN; resulting hitlists are correlated with COMPARE to determine which sequences have matched all elements of the composite discriminator, and motif-sets from these alone are automatically output for a second database scan; database searching is continued until no more new sequences matching all elements of the discriminator are identified by COMPARE: this is the point of convergence for that particular database. The composite discriminator may also be used to scan any individual sequence for the compound feature of interest and the results displayed graphically using PLOT. Finally, a value for the discriminating power of individual elements of the discriminator can be ascertained and the results output graphically using ANALYSE and ROC. To explain this scheme in more detail, each of these components will be described in turn.

SOMAP

SOMAP is a screen-oriented, menu-driven procedure that permits interactive alignment of any number of sequences of any length (limited only by the operating system process quotas). Facilities provided to aid alignment include symbolic displays of identities and PAM250 (Point Accepted Mutation) similarities (Dayhoff, 1978), and strict and fuzzy (ambiguous) pattern-matching. A variety of output options is also afforded, notable among which is the ability to save selected parts of the alignment


maximization of sequence data, and hence motif-sets also differ from pattern templates, which encode single motifs in the form of simple consensus sequences derived from multiple alignments (e.g. Bairoch, 1991). The rationale behind many pattern-recognition methods is to provide rapid searching and data retrieval with a single database query. To achieve this, such methods tend to require either explicit definitions of secondary structure, application of gap penalties, assignment of flexible gaps between pattern elements, definition of residue groups, manual alteration of residue weights or various combinations of these (e.g. Abarbanel et al., 1984; Lesk et al., 1986; Taylor, 1986; Gribskov et al., 1987; Lathrop et al., 1987; Akrigg et al., 1988; Staden, 1988; Cockwell and Giles, 1989; Barton and Sternberg, 1990; Sibbald and Argos, 1990; Smith et al., 1990). ADSP departs from these methods by demanding no additional structural information and in applying no such rules to database searches: the results can therefore never be compromised by the application of overstringent rules, inappropriate choice of residue weights, gap penalties and so on.

Computational sequence analysis

SCAN

SCAN offers four methods for scanning any NBRF format database. Motif-sets and matrices output from SOMAP are first converted into table-files: these contain the frequency data held within conventional frequency matrices, but also accommodate residue distance information. Two types of table-file are exploited: the first corresponds to single amino acid frequencies, where residue separation is zero; and the second to paired residue frequencies, where the residue separation increases from 1 up to the length of the motif. Thus the first two scanning methods are merely frequency searches, the first relating to the occurrence of single amino acids and the second to amino acid pairs, and are referred to as SINGLE and PAIR respectively.

A further, very simple, procedure is also provided that improves the discriminating power of a search by enhancing the score of any match with the probe motif-set in proportion to the number of shared identities: a perfect match would therefore have its score multiplied by a factor equal to the length of the probe. This has the effect of helping genuine matches to cluster nearer the top of the search hitlist. The option can be applied to both the singlet- and pairwise-frequency searches, thereby yielding the final two scanning methods, known as NSINGLE and NPAIR. Thus, NSINGLE and NPAIR are both frequency scans that offer improved signal-to-noise ratio over their SINGLE and PAIR counterparts. The increase in discrimination achieved by the use of this procedure is gained without significant sacrifice to scanning speed, and is therefore the recommended method here.

In choosing between the use of pairwise- or singlet-frequency searches, it is important to understand the type of information represented in the motif-sets. Pairwise scans are the more stringent and inherently less noisy option because, by definition, they examine all residue pairs in a motif-set.
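The four scanning methods can be illustrated with a short Python sketch. It is not ADSP's VAX C implementation, and several details are assumptions made for illustration: the function names are hypothetical, the tables hold raw counts rather than the table-file format, the pair table covers every position pair within the motif, and the N-variants multiply a window's score by the number of identities it shares with the closest single motif in the set (the text states only that the boost is proportional to the number of shared identities).

```python
from collections import Counter

def single_table(motifs):
    """Per-position residue counts for a set of equal-length aligned motifs."""
    length = len(motifs[0])
    assert all(len(m) == length for m in motifs), "motifs must be aligned"
    return [Counter(m[i] for m in motifs) for i in range(length)]

def pair_table(motifs):
    """Residue-pair counts for every position pair (i, j) with i < j."""
    length = len(motifs[0])
    return {(i, j): Counter((m[i], m[j]) for m in motifs)
            for i in range(length) for j in range(i + 1, length)}

def best_identities(window, motifs):
    """Largest number of positions at which the window equals any one motif."""
    return max(sum(a == b for a, b in zip(window, m)) for m in motifs)

def scan(sequence, motifs, method="NSINGLE"):
    """Score every window of the sequence; return (score, start), best first."""
    length = len(motifs[0])
    singles, pairs = single_table(motifs), pair_table(motifs)
    hits = []
    for start in range(len(sequence) - length + 1):
        w = sequence[start:start + length]
        if method in ("SINGLE", "NSINGLE"):
            score = sum(singles[i][w[i]] for i in range(length))
        else:  # "PAIR" or "NPAIR"
            score = sum(tbl[(w[i], w[j])] for (i, j), tbl in pairs.items())
        if method.startswith("N"):
            # Assumed form of the boost: multiply by the identity count.
            score *= best_identities(w, motifs)
        hits.append((score, start))
    return sorted(hits, key=lambda h: (-h[0], h[1]))

motif_set = ["ACD", "ACE", "GCD"]            # toy motif-set, not real data
for name in ("SINGLE", "NSINGLE", "PAIR", "NPAIR"):
    print(name, scan("TTACDTT", motif_set, name)[0])
```

In this toy case a window that reproduces one motif exactly has its score multiplied by the motif length (3), which is the behaviour the text describes for a perfect match.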
However, they will only provide better discrimination than singlet-frequency scans if the motif-sets actually contain conserved pairs of residues, as might be found for example in a well-conserved structural motif. Singlet scans are best applied to situations where the residue information in motif-sets is more diverse and distant sequence relationships are being sought.

SCAN will search NBRF format databases interactively, if the database contains tens or hundreds of sequences, or in batch if thousands of entries are to be scanned. Absolute scan times cannot easily be given because the duration of any particular search will depend on a number of factors, including: the current size of the database; the length of the motif; the number of motifs in the motif-set; the number of features (for composite scans); the number of hits requested; and the nature of the search itself, i.e. whether singlet or pairwise. Of these factors, the size of the database is the most significant and, in view of its continuing growth, we are now looking at new methods for improving search efficiency. Nevertheless, to give some idea of performance, a singlet table-file derived from a motif-set containing 10 aligned 15-residue motifs takes ~50 min to scan 40 000 sequences on a MicroVAX 3600 (and ~5 min on a Stardent Titan single-processor Unix system) using the NSINGLE method; the equivalent SINGLE scan takes the same time, and the PAIR and NPAIR scans ~5 h longer. Doubling the number of motifs in the motif-set has little effect on the singlet methods, but increases the scanning times for the pairwise methods by ~70 min. A composite scan with n elements will take a little less than n times the duration of a single-component discriminator scan. It is important to stress that singlet methods, principally NSINGLE, are used for routine sequence analysis, while pairwise methods are reserved for situations where particular structural relationships are being sought.

Results from SCAN are output into hitlists, which may be any user-defined length and contain the best rank-ordered matches with the discriminator. In the case of composite scans, output is directed into individual hitlists, one for each element of the discriminator.
This allows the user to discover whether any of the separate features of interest occurs in any other sequence in the database.

COMPARE

COMPARE is provided to allow systematic comparison of sets of hitlists produced from a composite database search. Essentially, it allows the user to determine immediately which sequences have matched with all of the features of the composite discriminator, and which have matched with only some of them. The comparison imposes two constraints on the hitlist analysis, i.e. matches with each element of the discriminator must appear in the correct order along the sequence, and they must not overlap to any appreciable extent. Results are displayed in the form of a Compound Feature Table (CFT): this simply lists those sequences that occur in all of the hitlists and, in addition, it also details which sequences have occurred in only n - 1 of the hitlists, which have occurred in only n - 2 hitlists, and so on down to those occurring in just two hitlists. The performance of individual elements of a discriminator is detailed


in the form of either motif-sets, frequency matrices or PAM-weighted matrices: a frequency matrix is a matrix that reflects the frequency of occurrence of every amino acid at each position in a motif-set, and a PAM-weighted matrix is a weighted frequency matrix that reflects the degree of evolutionary relatedness of every amino acid at each position in a motif-set; the value for the degree of relatedness or similarity is assigned with reference to the Dayhoff PAM250 Mutation Data Matrix. Motif-sets and matrices may be derived from any conserved features of an alignment, whether these are believed to have some known structural or functional significance or whether their significance is unknown. Families of such motif-sets or matrices may then be used as the framework for constructing composite discriminators.


in the form of a Compound Feature Index (CFI), which indicates the number of matches found with n features, the number found with n - 1 features, and so on (see Figure 2). COMPARE is intended as a tool to aid rigorous, iterative discriminator refinement. Use of the CFT allows identification of 'true' members of a family, true hits being regarded as only those sequences that match with all n features of a composite discriminator; motif-sets from this set alone are therefore

Alpha-lactalbumin family
  involving 6 features   17 codes
  involving 5 features    0 codes
  involving 4 features    0 codes
  involving 3 features    5 codes
  involving 2 features    8 codes

Compound Feature Table - Iteration 2 (100)

  Features  Codes
  6         LAGT LCASCAPHI LCASBOVIN LCAISHEEP LABO LAHU LAHO LCABSHORSE LAGP LACM EZEC228 LCAIPAPCY GP1LACTAL LART2 LART LARB LAKGAW
  5         (none)
  4         (none)
  3         LYCSEQUAS LYCIHORSE LYCUPIG LYC2SP1G LYC3SPIG
  2         LZPY LZQJEC LZQJEB LZUH LZRT LZBA LZDK3 LZOVE

Compound Feature Index

  Element    6    5    4    3    2
  1         17    0    0    5    7
  2         17    0    0    0    0
  3         17    0    0    5    8
  4         17    0    0    5    1
  5         17    0    0    0    0
  6         17    0    0    0    0

Fig. 2. Typical form of output from COMPARE, showing a brief summary of the analysis, a Compound Feature Table (CFT), with an indication of the iteration number and length of hitlist sampled, and a Compound Feature Index (CFI). The example shows the CFT of a composite discriminator for the family of α-lactalbumins (D.Perkins, in preparation). The discriminator comprises six features, which have been used to scan OWL 11.0, with hitlists of length 100. The result reveals that 17 α-lactalbumins have matched with all six elements of the discriminator (i.e. have been found in all six hitlists), and can therefore be regarded as true hits. By contrast, no sequences have been identified that match with only five or only four elements of the discriminator: there is thus an apparent cut-off for discrimination immediately after the identification of the true set. In this case, however, the analysis has also diagnosed the presence of a subfamily, i.e. the five lysozymes that match with three features of the discriminator. Inspection of the CFI, which allows the diagnostic performance of individual elements of the discriminator to be assessed, reveals that these five lysozymes have all matched with elements 1, 3 and 4. Thus elements 2, 5 and 6 together match only α-lactalbumins, while elements 1, 3 and 4 also match lysozymes.
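The hitlist correlation performed by COMPARE can be sketched in Python as follows. This is an illustrative reading of the two stated constraints (correct order along the sequence, no appreciable overlap), not the package's own code. The function name, the `(sequence_id, start)` hit representation, the `max_overlap` parameter and the use of only the best-ranked hit per sequence and element are all assumptions.

```python
from collections import defaultdict

def compound_feature_table(hitlists, motif_lengths, max_overlap=0):
    """Group sequence ids by how many discriminator elements they match.

    hitlists[k] is the rank-ordered list of (sequence_id, start) hits for
    element k; only the best-ranked hit per sequence and element is used.
    """
    best = defaultdict(dict)                  # seq_id -> {element: start}
    for k, hits in enumerate(hitlists):
        for seq_id, start in hits:
            best[seq_id].setdefault(k, start)

    table = defaultdict(list)                 # number matched -> [seq ids]
    for seq_id, starts in best.items():
        elems = sorted(starts)                # elements in discriminator order
        chain = {}                            # longest valid chain ending at j
        for j in elems:
            chain[j] = 1 + max(
                (chain[i] for i in elems
                 if i < j
                 and starts[i] + motif_lengths[i] - max_overlap <= starts[j]),
                default=0)
        n = max(chain.values())
        if n >= 2:                            # the CFT stops at two hitlists
            table[n].append(seq_id)
    return dict(table)

# Toy hitlists for a three-element discriminator (hypothetical ids).
h0 = [("A", 10), ("B", 10), ("C", 50)]
h1 = [("A", 30), ("B", 12), ("D", 5)]
h2 = [("A", 60), ("B", 40), ("C", 20)]
print(compound_feature_table([h0, h1, h2], [5, 5, 5]))
```

Here sequence A matches all three elements in order, B is credited with two because its first two matches overlap, and C and D fall below the two-hitlist threshold.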


PLOT

By contrast with SCAN, PLOT searches a single user-defined sequence, rather than a database, for the conserved feature or features described by a discriminator. This provides an immediate qualitative evaluation of diagnostic performance in that sequence. The form of output produced by PLOT is shown in Figure 3. The horizontal axis is the full length of the sequence; the vertical axis is divided by the n elements of the composite discriminator and represents their percentage score (0-100 per element). The number of elements that can be displayed by PLOT is limited only by the number that can be realistically viewed on the screen. A peak in the plot denotes a residue-by-residue match of an element of the discriminator with the sequence, its leading edge marking the first position of the match. PLOT can be used to determine whether a sequence possesses the whole of a compound feature, just some part of it, or none of it. In the example shown in Figure 3(b), the sharpness and uniqueness of the peaks in each element of the plot indicates



automatically output for subsequent database scans. Following a second composite search, the hitlists are compared to produce a new CFT, from which the new true set can be diagnosed and hence a further database search performed. This process is repeated until convergence, i.e. the point at which the true set remains constant. A good discriminator should then show all true hits in the n feature column, 0 hits in lower columns of the table, and perhaps some matches in the two feature column, which should be attributable only to noise (random matches with two motifs); i.e. there should be a clear cut-off for discrimination. Thus, with successive iterations, true hits can be seen to migrate from the level of noise towards the axis of true discrimination (the n feature column), until convergence.

Any interpretation of diagnostic performance made from the CFT must be placed in the context of the length of hitlist sampled. Diagnostic performance of composite discriminators carries a slightly different connotation from that of single-component discriminators: in the latter case, perfect discrimination is achieved if all true hits are identified and isolated from all others; this also holds for a composite discriminator, except that here perfect discrimination can be achieved even when its individual elements perform less than perfectly. Composite discriminators are inherently more powerful than their constituent parts because the recognition of individual elements is made mutually dependent: relating composite discriminating power to the length of hitlist gives an index of how well these elements have actually performed.

A further aspect of COMPARE is the incorporation of a post-convergence option to reduce noise from the CFT. This imposes distance criteria between the discriminator elements, and has the effect of eliminating those hits in which the distances between features are inconsistent with known intervals between motifs in the final alignment.
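The iteration to convergence described above amounts to a fixed-point loop over the true set. The Python sketch below shows only that control flow; `scan_and_compare` is a hypothetical stand-in for one round of SCAN followed by COMPARE, and the toy relation used to exercise it is invented for illustration.

```python
def refine(seed_ids, scan_and_compare, max_iterations=20):
    """Repeat scan/compare rounds until the true set stops changing.

    scan_and_compare(true_set) must return the ids that match all n
    elements when the discriminator is rebuilt from true_set.
    """
    true_set = frozenset(seed_ids)
    for iteration in range(1, max_iterations + 1):
        new_set = frozenset(scan_and_compare(true_set))
        if new_set == true_set:               # point of convergence
            return true_set, iteration
        true_set = new_set
    raise RuntimeError("no convergence within max_iterations")

# Toy stand-in: each sequence "recruits" the relatives listed for it.
RELATIVES = {"s1": {"s1", "s2"}, "s2": {"s2", "s3"}, "s3": {"s3"}}

def toy_scan_and_compare(true_set):
    return set().union(*(RELATIVES[s] for s in true_set))

print(refine(["s1"], toy_scan_and_compare))
```

Convergence is always relative to the database being scanned, so a converged set would need to be re-derived when the database is updated.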

[Figure 3: four PLOT panels, (a) to (d). Horizontal axis: residue number (0-300); vertical axis: % score (0-100 for each discriminator element).]

Fig. 3. Examples of the type of output produced by PLOT: the figure is a comparison of ADSP's four scanning methods, in which the sequence of ovine rhodopsin has been searched with a composite discriminator specific for G-protein-coupled receptor transmembrane helices: each of the seven peaks in the four plots corresponds to a match with one of the seven transmembrane helices. The separate graphs illustrate scans with: (a) the SINGLE method; (b) the NSINGLE method; (c) the PAIR method; and (d) the NPAIR method. The comparison reveals the improvement in signal-to-noise ratio in using pairwise-frequencies rather than singlet-frequencies, but further improvement is achieved in applying the NSINGLE and NPAIR scans. Pairwise scans are more stringent and inherently less noisy than singlet scans, because they examine all residue pairs in a motif-set, and consequently take longer to run. In spite of the increased rigours of the search, however, the pairwise option may not always be the most appropriate, as can be seen by comparing (a) and (b) with (c) and (d). In this example pairwise relationships are demonstrably less important and the singlet searches provide the best discrimination.


that the composite discriminator is potently diagnostic of all seven features in the sequence against which it is scanning.


ANALYSE

ANALYSE is a facility with which to derive database, hitlist and motif-set statistics. For example, a motif-set may be 'analysed' to give a list of residue frequencies and propensities for each position along its length. Propensity values, $P_i$, for the ith amino acid are defined as the ratio of the fractional occurrence, $A_i$, of an amino acid within a motif and the fractional appearance, $T_i$, in the whole sequence sample:

$P_i = A_i / T_i$
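As a worked illustration of the propensity definition, the following Python sketch computes $P_i = A_i / T_i$ at each motif position. The function name and the toy data are hypothetical, and it assumes that every residue in the motif-set also occurs in the sequence sample (otherwise $T_i$ would be zero).

```python
from collections import Counter

def propensities(motifs, sample_sequences):
    """Return, for each motif position, {residue: A_i / T_i}.

    A_i is the fraction of motifs carrying the residue at that position;
    T_i is the residue's fraction of all residues in the sequence sample.
    """
    sample = Counter("".join(sample_sequences))
    total = sum(sample.values())
    length = len(motifs[0])
    result = []
    for pos in range(length):
        column = Counter(m[pos] for m in motifs)
        result.append({aa: (count / len(motifs)) / (sample[aa] / total)
                       for aa, count in column.items()})
    return result

# Toy data: two 2-residue motifs and an 8-residue sample.
table = propensities(["AC", "AD"], ["AACD", "CCDD"])
print(table[0]["A"])   # A fills the whole column but only 2/8 of the sample
```

A propensity above 1 indicates a residue that is enriched at that motif position relative to its background frequency in the sample.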

$F = \mathrm{DBRES} + (1 - L)\,\mathrm{DBENT} - T$

where DBRES is the total number of residues in the database on which SCAN was used; L is the length of the motif; and DBENT is the total number of database sequence entries. To chart the change in discrimination power following each database scan, an analysis of true- and false-positive hits must be performed on each new hitlist. Output from this facility, in the form of a scaled list of coordinates denoting fraction of true- and false-positive hits, is compatible with ROC input format.

ROC

A quantitative measure of diagnostic performance is obtained using the ROC (Relative Operating Characteristic) facility (Metz, 1986; Swets, 1988; Parry-Smith, 1990). ROC utilizes the fractional data generated by ANALYSE, together with a value for n, the ratio of the fraction of false-positives to true-positives. With this data, the facility plots a graph of fraction of true-positive hits against fraction of false-positive hits and calculates a numerical value for the power of discrimination, $D_n$. The $D_n$ value is defined as the fractional area occupied under the discrimination curve, calculated from the equation:

$D_n = 2A_c - 1$

where $A_c$ is the area under the discrimination curve. Figure 4 shows a set of typical ROC graphs, in which true-positive proportion is plotted against false-positive proportion: the solid line indicates random discrimination. In this example, a single-component discriminator derived from a structural alignment of α-helix N-terminal sequences has been used to scan


Fig. 4. ROC curves for a single-component discriminator scanning the database with each of ADSP's four scanning methods (the discriminator describes a specific type of α-helix N-terminal sequence). The graph shows the fraction of true-positive hits plotted against the fraction of false-positive hits at each scoring interval in the hitlist: the solid line indicates the random discrimination level. The curve for the NPAIR scan shows the greatest deviation from the random level and thus has the highest value for discriminating power, $D_n$.
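The quantities used by ANALYSE and ROC can be tied together in a small Python sketch. The false-example count follows the stated formula, F = DBRES + (1 - L)DBENT - T, and the discrimination power follows $D_n = 2A_c - 1$. Everything else is an assumption for illustration: the function names are hypothetical, fractions are taken relative to the true and false hits present in the ranked list (so the curve runs from (0, 0) to (1, 1) and both classes must be present), and the area is computed by the trapezoid rule rather than by whatever scaling ADSP applies.

```python
def false_total(dbres, dbent, motif_len, true_total):
    """F = DBRES + (1 - L) * DBENT - T: motif-length windows that are not true."""
    return dbres + (1 - motif_len) * dbent - true_total

def roc_points(ranked_is_true):
    """(false fraction, true fraction) after each hit, best-scoring hit first."""
    n_true = sum(ranked_is_true)
    n_false = len(ranked_is_true) - n_true
    tp = fp = 0
    points = [(0.0, 0.0)]
    for is_true in ranked_is_true:
        if is_true:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_false, tp / n_true))
    return points

def discrimination_power(points):
    """D_n = 2 * A_c - 1, with A_c the trapezoid-rule area under the curve."""
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return 2 * area - 1

perfect = [True, True, False, False]      # all true hits rank above false ones
mixed = [True, False, True, False]
print(discrimination_power(roc_points(perfect)))
print(discrimination_power(roc_points(mixed)))
```

Under these assumptions a ranking that places every true hit above every false hit gives $D_n = 1$, while a curve lying on the random diagonal gives $D_n = 0$.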

the database with each of ADSP's four scanning methods. A qualitative feel for the effectiveness of discrimination is gained by noting the deviation of each curve from the random level: the random line represents the situation where there are equal proportions of true- and false-positive hits at each scoring level in the hitlist. The curve showing greatest deviation from the random line demonstrates the greatest power of discrimination: in this case, the NPAIR method discriminates best for the structural motif under investigation. The discrimination power is calculated for a given length of hitlist and applies only in the context of the database in which that calculation has been made. If all true-positive hits are identified within a scoring band that does not include any false-positive hits, $D_n = 1.0$, i.e. this represents perfect discrimination. ROC is used predominantly in the case of single motifs or features, where discriminating power cannot be related to the ability to recognize several elements of a pattern.

CONVERT

The CONVERT option has been incorporated to enhance the flexibility of the package by providing interfaces to other systems. Facilities are included: (i) to translate automatically generated multiple alignments (notably from CLUSTAL) into SOMAP input format, which allows interactive fine-tuning of such alignments where, for example, disparate sequence lengths


This calculation uses residue frequencies derived by 'analysing' the database, i.e. the sequence set from which motif-sets have been generated. Finally, 'analysing' hitlists requires a list of known true-positive matches to generate data concerning fractions of true- and false-positive hits at each scoring level within a hitlist. Because the total number of true examples, T, is known, this can be used to calculate the total number of false examples, F, using the equation:

[Figure 5: four PLOT panels, (a) to (d). Horizontal axis: residue number (0-200); vertical axis: % score (0-100 for each discriminator element).]

Fig. 5. PLOTs showing the influence of similarity-weighted data and of user-specified weights on discrimination power. The example shows a three-element composite discriminator for the β-α-β motif of NAD-binding sites, scanning an insect alcohol dehydrogenase sequence with: (a) singlet-frequencies (the SINGLE method); (b) corresponding PAM250-weighted frequencies; (c) user-weighted frequencies; and (d) singlet-frequencies (the NSINGLE method). Overall, the form of the plots in (a) and (d) is consistent, indicating matches with the two β-components of the discriminator, but not with the α-component: the improved signal-to-noise ratio renders the NSINGLE method, (d), the better discriminator. By contrast, the background noise in (b) is so high that the true signal is almost completely masked. Noise is also high in (c), but here the true signal has been overwritten by one that has been subjectively manufactured, giving an erroneous impression of good discrimination for the α-component of the discriminator.

or inappropriate gap penalties have caused ambiguities, and is particularly useful for providing a quick, 'rough and ready' starting point for further alignment and analysis; (ii) to convert SOMAP PAM-weighted matrices, and pattern-recognition matrices developed by packages such as LUPES, into table-file format, which allows comparison of the diagnostic performance of discriminators for similar features derived from radically different approaches; and (iii) to translate table-files into conventional matrix form for use with other sequence analysis packages.

Results and Discussion

The choice between frequency and similarity data


Pattern templates versus compound feature discriminators

ADSP is not a method for rapid pattern searching, such as might be offered by a database query language, where interactive speed is paramount; rather, it is a method for systematic development of definitive pattern discriminators for simple and compound features. The objective, and therefore quantifiable, approach renders these discriminators suitable for entry into a database of features, which itself will allow rapid, rational analysis of sequence data in the future. Sequence analysis is moving more and more in the direction of motif or feature definition, as witnessed by the emergence of various compendia or dictionaries of sequence patterns (e.g. Akrigg et al., 1988; Hodgman, 1989; Bairoch, 1991). The best and most popular of these is undoubtedly that of Bairoch (PROSITE), which represents a database of pattern templates, the design of which is radically different from the methods described here.

An example of the application of ADSP can be seen in the development of a potent composite discriminator for the superfamily of G-protein-coupled receptors (T.Attwood and J.Findlay, in preparation). Motif-sets corresponding to the seven putative transmembrane domains were excised from an alignment of 11 opsins and used as the basis for a composite database search. The final converged discriminator identified 53 true-positive family members within the OWL composite sequence database (v.8.1), with a well-defined cut-off for discrimination and showing no other matches, except at the level of noise. The composite discriminator was thus potently diagnostic of all sequences of this type within OWL: the number of true members has grown close to 150 in OWL 13.0. This result can be contrasted with the November 1990 PROSITE entry for G-protein-coupled receptors, which identified 66 hits, 64 of which were indicated as true-positive matches and two as false-positives: in addition, the pattern identified four false-negatives.
Comparison of the actual number of true hits identified by ADSP and PROSITE is irrelevant because identical databases were not scanned. What is significant is that the ADSP discriminators identified no false-


A common, and perhaps justified, criticism of the use of frequency information in database scanning is that the sensitivity of such scans is proportional largely to the number and, to a degree, the type of sequences that contribute to this information; so, for example, the discriminating power of frequency data derived from one or two similar sequences will be very poor. Many approaches have therefore been designed to exploit 'similarity' data, and/or advocate user-manipulation of residue weights within frequency matrices, in order to improve discriminating power. A consequence of using similarity information, particularly such as provided in the PAM250 Mutation Data Matrix, is, however, that background scores are boosted, very often to the detriment of the true signal. Subjective manipulation of residue weights may similarly increase background noise and, at worst, may incorporate spurious detail by allowing synthesis of a signal where none actually exists.

These situations are illustrated in Figure 5, where a three-element composite discriminator has been used to predict the occurrence of the compound feature it describes in a sequence from another family. The result suggested by plots derived from (a) the SINGLE method and (d) the NSINGLE method is that element 3 is predictive, element 1 is weakly predictive and element 2 not at all. By contrast, in the PAM250-weighted scans of plot (b), the increase in noise induced by incorporation of similarity information has all but masked the true signal. Noise is similarly increased in the user-designed 'discriminator' scans shown in (c), but the result now suggests that all three elements predict well: the discrimination of elements 1 and 2 has simply been manufactured to fit the desired picture. Similarity and user-manipulated data may mislead because, inappropriately used, they may compromise signal-to-noise ratio. For this reason, the methods supported in ADSP rely principally on frequency information, singlet or pairwise.
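The idea of scanning with singlet frequency information can be illustrated with a minimal sketch. This is not the SINGLE or NSINGLE algorithm itself, whose details are given elsewhere in the paper; it simply assumes that a window is scored by summing the observed residue counts at each motif position. The function names and the four-residue motif are invented for illustration.

```python
from collections import Counter

def frequency_matrix(motifs):
    """Count residues at each column of a set of equal-length aligned motifs."""
    length = len(motifs[0])
    if any(len(m) != length for m in motifs):
        raise ValueError("motifs must all have the same length")
    return [Counter(m[i] for m in motifs) for i in range(length)]

def scan(sequence, matrix):
    """Slide the matrix along the sequence; return one (score, start) per window."""
    width = len(matrix)
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # A residue never seen in a column contributes 0 (Counter default).
        score = sum(column[residue] for column, residue in zip(matrix, window))
        results.append((score, start))
    return results

# Hypothetical aligned motif from three sequences (illustration only).
motifs = ["GDKL", "GDRL", "GEKL"]
matrix = frequency_matrix(motifs)
hits = scan("AAGDKLAA", matrix)
best_score, best_pos = max(hits)
print(best_score, best_pos)  # 10 2
```

With only one contributing sequence, every column holds a single residue and only near-identical windows score at all, which is the weakness of sparse frequency data noted above.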
The discriminating power sacrificed by relying on frequency data is compensated for by the modified methods for database searching (NSINGLE or NPAIR), the power of which is augmented at the level of hitlist correlation by the incorporation of a systematic method for compound feature analysis. By identifying true matches with a composite probe, this method furnishes the means for rigorous, iterative discriminator refinement. Important attributes of the approach are found in the ability: (i) to discern a cut-off for discrimination, which is clearly preferable to mere reliance on true hits appearing at, or near, the top of a hitlist; (ii) to diagnose the presence of protein subfamilies, i.e. when groups of sequences have not migrated beyond lower columns of the Compound Feature Table at convergence, and are thus characterized by only part of the original signature; and (iii) to quantify discriminator diagnostic performance.

The use of frequency data in ADSP is not exclusive. It is recognized that situations will occur where protein families are not sufficiently well represented in the database to afford the luxury of iterative discriminator refinement using residue frequencies alone. In such circumstances, exploitation of similarity data may provide the only realistic method for investigating distant relationships. Accordingly, SOMAP furnishes the means to output motif-sets in the form of PAM-weighted matrices, which may be used in place of singlet or pairwise table-files. Such matrices, however, must be used in an appropriate manner: for example, it is clear that PAM250 is not a universal discriminator, so different matrices should be applied to different problems.
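Attribute (i) above, a discernible cut-off for discrimination, can be pictured with a small sketch. The heuristic used here, placing the cut-off in the widest gap between adjacent ranked scores, is an assumption made for illustration and is not claimed to be how ADSP locates its cut-off; the scores are invented.

```python
def largest_gap_cutoff(scores):
    """Return (cutoff, gap): the midpoint and width of the widest gap
    between adjacent scores in the ranked hitlist."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two scores")
    # Pair each gap with the index of the higher-scoring neighbour.
    gaps = [(ranked[i] - ranked[i + 1], i) for i in range(len(ranked) - 1)]
    gap, i = max(gaps)
    cutoff = (ranked[i] + ranked[i + 1]) / 2
    return cutoff, gap

# Invented hitlist scores: four strong matches, then noise.
scores = [98, 95, 93, 90, 41, 39, 38, 36]
cutoff, gap = largest_gap_cutoff(scores)
above = [s for s in scores if s > cutoff]
print(cutoff, gap, above)  # 65.5 49 [98, 95, 93, 90]
```

A wide gap relative to the spread of the noise scores corresponds to the well-defined cut-off described in the text; a narrow one signals that true hits merely happen to sit near the top of the hitlist.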


Future directions

Developments are currently in hand to produce an equivalent package for nucleic acids; SOMAP and SCAN have already been modified for this purpose. A comprehensive sequence analysis system for nucleic acids already exists in, for example, the widely distributed GCG package, but this does not provide methods for rigorous feature definition and refinement. It is currently possible to exchange data between these two systems, since ADSP uses NBRF sequence file format and the GCG system provides a facility to convert this into its own format. This is particularly beneficial for users wishing, for example, to take advantage of GCG's nucleic acid translation facilities and then to make use of ADSP's more flexible multiple alignment option.

ADSP is not the final word in protein sequence analysis, but it does provide powerful new approaches to sequence alignment and pattern recognition in a convenient package: in essence, it forms a bridge between the OWL composite sequence database, which it has been designed to explore, and the features database, which it has been designed to construct. Together with full documentation, test sequence and feature databases and appropriate query languages, the system forms part of the SERPENT Sequence Exploration and Retrieval Protein ENgineering Tools (D. Akrigg et al., in preparation) currently accessible via the Daresbury SEQNET service, running under VAX VMS. A commercial adaptation of the software is also being prepared by Oxford Molecular Ltd for a variety of VMS and UNIX platforms.
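Because data exchange rests on the NBRF sequence file format, a minimal reader may help make the format concrete. The sketch assumes the common NBRF/PIR layout: a header line such as `>P1;CODE`, one title line, then sequence lines terminated by `*`. The function name and the example entries are hypothetical, and this is not code from ADSP or GCG.

```python
def parse_nbrf(text):
    """Parse NBRF/PIR-style records into (code, title, sequence) tuples."""
    records = []
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue
        # Header looks like ">P1;CODE"; keep the part after the semicolon.
        code = line[1:].split(";", 1)[-1].strip()
        title = next(lines, "").strip()
        chunks = []
        for seq_line in lines:
            chunks.append(seq_line.replace(" ", "").rstrip("*"))
            if seq_line.rstrip().endswith("*"):
                break  # '*' terminates the sequence
        records.append((code, title, "".join(chunks)))
    return records

# Hypothetical two-entry file, for illustration only.
example = (
    ">P1;TEST1\n"
    "hypothetical opsin fragment\n"
    "MNGTEGPNFY\n"
    "VPFSNKTGVV*\n"
    ">P1;TEST2\n"
    "another hypothetical entry\n"
    "ACDE*\n"
)
records = parse_nbrf(example)
print(records[0][0], len(records[0][2]))  # TEST1 20
```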

References

Abarbanel,R.M., Wieneke,P.R., Mansfield,E., Jaffe,D.A. and Brutlag,D.L. (1984) Rapid searches for complex patterns in biological molecules. Nucleic Acids Res., 12, 263-280.
Akrigg,D., Bleasby,A.J., Dix,N.I.M., Findlay,J.B.C., North,A.C.T., Parry-Smith,D.J., Wootton,J.C., Blundell,T.L., Gardner,S.P., Hayes,F., Islam,S., Sternberg,M.J.E., Thornton,J.E., Tickel,I.J. and Murray-Rust,P. (1988) A protein sequence/structure database. Nature, 335, 745-746.
Bairoch,A. (1991) Prosite: a dictionary of protein sites and patterns. Nucleic Acids Res., 19, Suppl. 2241-2245.
Barton,G.J. and Sternberg,M.J.E. (1990) Flexible protein sequence patterns. A sensitive method to detect weak structural similarities. J. Mol. Biol., 212, 389-402.
Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer,D.F., Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) The protein data bank: a computer-based archival file for macromolecular structures. J. Mol. Biol., 112, 535-542.
Bleasby,A.J. and Wootton,J.C. (1990) Construction of validated, non-redundant composite protein sequence databases. Protein Eng., 3, 153-159.
Cabot,E.L. and Beckenbach,A.T. (1989) Simultaneous editing of multiple nucleic acid and protein sequences with ESEE. Comput. Applic. Biosci., 5, 223-234.
Cockwell,K.Y. and Giles,I.G. (1989) Software tools for motif and pattern scanning: program descriptions including a universal sequence reading algorithm. Comput. Applic. Biosci., 5, 227-232.
Dayhoff,M.O. (1978) A model of evolutionary change in proteins. Matrices for detecting distant relationships. Atlas of Protein Sequence and Structure, NBRF, Washington, DC, Vol. 5(3).
Devereux,J., Haeberli,P. and Smithies,O. (1984) A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res., 12, 387-395.
Faulkner,D.V. and Jurka,J. (1988) Multiple aligned sequence editor. Trends Biochem. Sci., 13, 321-324.
Gribskov,M., McLachlan,A.D. and Eisenberg,D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. USA, 84, 4355-4358.
Higgins,D.G. and Sharp,P.M. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 1, 237-244.
Hodgman,T.C. (1989) The elucidation of protein function by sequence motif analysis. Comput. Applic. Biosci., 5, 1-13.
Lathrop,R.H., Webster,T.A. and Smith,T.F. (1987) Ariadne: pattern-directed inference and hierarchical abstraction in protein structure recognition. Commun. ACM, 30, 909-921.
Lesk,A.M., Levitt,M. and Chothia,C. (1986) Alignment of the amino acid sequences of distantly related proteins using variable gap penalties. Protein Engng., 1, 77-78.
Lipman,D.J., Altschul,S.F. and Kececioglu,J.D. (1989) A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. USA, 86, 4412-4415.
Metz,C.E. (1986) ROC methodology in radiologic imaging. Invest. Radiol., 21, 720-733.
Parry-Smith,D.J. and Attwood,T.K. (1991) A novel approach to multiple sequence alignment. Comput. Applic. Biosci., 7, 233-235.
Parry-Smith,D.J. (1990) Algorithms and data structures for protein sequence analysis. Ph.D. thesis.
Sibbald,P.R. and Argos,P. (1990) Scrutineer: a computer program that flexibly seeks and describes motifs and profiles in protein sequence databases. Comput. Applic. Biosci., 6, 279-288.
Smith,H.O., Annau,T.M. and Chandrasegaran,S. (1990) Finding sequence motifs in groups of functionally related proteins. Proc. Natl. Acad. Sci. USA, 87, 826-830.
Staden,R. (1988) Methods to define and locate patterns of motifs in sequences. Comput. Applic. Biosci., 4, 53-60.
Stockwell,P.A. and Petersen,G.B. (1987) HOMED: a homologous sequence editor. Comput. Applic. Biosci., 3, 37-43.
Swets,J.A. (1988) Measuring the accuracy of diagnostic systems. Science, 240, 1285-1293.
Taylor,W.R. (1986) Identification of protein sequence homology by consensus template alignment. J. Mol. Biol., 188, 233-258.

Received on August 19, 1991; accepted on February 24, 1992



positives, and allow us to predict with confidence that only two of PROSITE's four false-negatives really are false: the other two are genuine negatives. The ADSP composite discriminator was thus more powerful and accurate than the PROSITE template for this particular superfamily, which could reliably distinguish neither between true- and false-positives, nor between true- and false-negatives.

Dictionaries of pattern-templates and databases of sequence features are invaluable to molecular biologists seeking quick functional/structural/evolutionary diagnoses of newly sequenced proteins. All such systems, however, represent a compromise between speed of pattern derivation and accuracy of pattern recognition: templates sacrifice accuracy for speed; iteratively refined discriminators sacrifice speed for accuracy. Templates find their particular strength in being readily applicable to sequence patterns that are family independent (such as glycosylation sites, phosphorylation sites, etc.); refined discriminators, on the other hand, are better suited to sequence features that are family dependent, i.e. where a true set can be recognised (such as enzyme active sites, substrate- and ligand-binding motifs, etc.). We therefore regard the approaches embodied within ADSP as complementary to those of PROSITE, and feel that there is a place for integration of pattern-template and discriminator techniques.
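The PROSITE figures quoted in this section (66 hits: 64 true-positives and two false-positives, plus four reported false-negatives) can be turned into standard diagnostic ratios. The sketch below is only a worked illustration using textbook definitions of precision and sensitivity; the paper does not state that these particular measures were computed, and the function names are illustrative.

```python
def precision(tp, fp):
    """Fraction of reported hits that are true family members."""
    return tp / (tp + fp)

def sensitivity(tp, fn):
    """Fraction of true family members that were reported as hits."""
    return tp / (tp + fn)

# November 1990 PROSITE entry, as quoted in the text.
tp, fp, fn_reported = 64, 2, 4
# The text argues that only two of the four false-negatives are genuine.
fn_revised = 2

print(round(precision(tp, fp), 3))             # 0.97
print(round(sensitivity(tp, fn_reported), 3))  # 0.941
print(round(sensitivity(tp, fn_revised), 3))   # 0.97
```

As the text stresses, such ratios are comparable between methods only when the same database has been scanned.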
