Vol. 8, No. 3, 1992, Pages 283-289

CABIOS

Natural sequence code representations for compression and rapid searching of human-genome style databases

B.Robson(1,2) and P.J.Greaney(1)

(1)Proteus Molecular Design Limited, Proteus House, 48 Stockport Road, Marple, Cheshire SK6 6AB, UK and (2)Department of Biochemistry and Molecular Biology, University of Manchester M13 9PT, UK and Department of Structural Properties of Materials and Biophysics, Danmarks Tekniske Højskole, Lyngby DK-2800, Denmark

© Oxford University Press

Abstract

Numeric descriptions ('bio-informatic descriptions') of amino acid residues have been developed which will be of value whenever the quality and quantity of information in very large (i.e. 'human genome style') gene and protein sequence databases is to be compared or manipulated. These codes are as natural as possible by our criteria (the same principles could be used in revision of the criteria). In particular, in storing and searching large amounts of sequence data, natural codes—which relate to the properties of amino acids—can be combined with existing fast-search algorithms but introduce several advantages. The code can be assigned such that sub-selection of bits leads to compressed databases with residues defined less specifically, by classes of properties. The most compressed representation leads to the specification of a residue as polar or non-polar, while the most extended representation used at present also allows specification of, for example, glyco-asparagine and phosphoserine. Preliminary studies on both a supercomputer and smaller machines suggest a 'worst-case' speeding of ~4.5-fold. For more intelligent searching, coding extensions mixed with the basic sequence data give the sequence data some of the character of a computer program.

Introduction

The storage and analysis of nucleic acid and protein sequences is of considerable importance to molecular biology, pharmacology and molecular medicine (for extensive multi-author reviews, see Bishop and Rawlings, 1981; Doolittle, 1987). In this, rationally based standards are required, as for other chemical systems (Warr, 1989), in order to adhere to principles for rapid query and efficient storage (Martin, 1985). Our laboratory has been actively involved in the storage, information-content analysis, scientific distribution and use of sequence databases from the earliest days (Pain and Robson, 1970; Robson and Pain, 1971; Robson, 1974; Garnier et al., 1978; Gibrat et al., 1991), and in developing rigorous techniques for quantifying or exploring residue and sequence relationships (see, for example, French and Robson, 1983).

We are particularly interested in the more subtle sequence relationships, which we refer to as 'cryptic homologies'. Our studies relate not only to sequence relationships associated with sub-domain folding patterns (see, for example, Fishleigh et al., 1987), but also to using them in a practical, automatic way for protein modelling (Robson et al., 1987). Most particularly, we have been led to representations that seem to us the most convenient in linking and working with all these aspects in a highly integrated type of bio-computation environment (Ball et al., 1990). Our particular current interest, represented by the present report, is in applying these ideas to efficient storage, transmission and scanning of very large protein sequence databases, potentially on human genome scale and larger.

The special problems presented by large protein and nucleic acid databases have long encouraged development of list-sorting or 'hashing' approaches (Dumas and Ninio, 1982), or of prior classification of sequences by the characteristic amino acid triplets they contain (E.Platt and B.Robson, unpublished work). Such notions appear in standard searching algorithms such as FASTP and BLAST (see, for example, Lipman and Pearson, 1985). The present method relates to the efficient choice of the binary representation of 'symbols' for residues rather than to the search algorithm per se, and it can be combined with such methods, though as discussed here it also suggests important further opportunities for efficient searching.

The idea of encoding sequence information in a binary format is not new and in some form or another is an inevitable consequence of the constitution of the digital computer and its storage devices. Further, principles for efficient compression of codes are well known and are illustrated in a simple way by the Morse code: the more probable a symbol is, the fewer bits are used for it, and, relatedly, rare cases may be characterized by special qualifier 'words'. However, such codes are not 'natural', i.e. they relate only to statistical rather than physicochemical properties, and lend themselves less directly to aspects of data manipulation (such as sorting, searching, and abstracting sub-sets or simplified forms of data for preliminary enquiry) where more biological or chemical considerations are important.


In practice, aspects of efficiency are inevitably topic linked, which is to say that the most appropriate coding should be determined by the nature of the subject matter and the relative abundances of the classes of enquiry and data manipulation typical of that subject matter. 'Natural' codings deliberately seek to blur the distinction between the biological objectives and the implementation details.

Such 'natural' considerations of efficient storage or of natural searching are not exploited by any of the major protein sequence databases known to us. Typically, as in the case of the SwissProt protein sequence database distributed by the European Molecular Biology Laboratory in Heidelberg, Germany, amino acid residues are represented by the ASCII character codes for their one-letter IUPAC-approved names. That is, alanine is stored as the ASCII code for A, for example, and apparently almost always as an 8 bit byte. One reason is that an efficient compression code might be difficult to manage and require more complex algorithms for interpretation, decompression, error-checking, updating and editing, and searching. However, there is still no obvious reason for the arbitrary choice of ASCII codes to represent chemical entities, codes which just happen to represent abbreviated forms of their names in English. Indeed, a more natural choice might lend itself both to more efficient storage and to facilitating these data-manipulation operations.

We have been pursuing and working with representations of amino acid residues and nucleotides (and hence their sequences) that are self-consistent and have a number of broader applications in the discipline of 'bio-informatics'. While they are not as efficient for pure storage as truly optimal compressed codings based on statistical properties, they are close to being so, and they relate naturally to the data-manipulation processes mentioned above which are required for most protein engineering, genetic engineering and molecular biology usage. This naturalness compensates for a slight loss of storage efficiency compared with hypothetical idealized codes, in that searching can be performed in a more intelligent, structured, and hence faster, manner. In particular, they allow sequences to be searched at different levels of detail in large 'human genome' style databases.

The key concept is that sequences which can be eliminated from the search at a crude informational level of representation are necessarily also those which would be eliminated at a higher, more detailed level of description. The 'crudest' or lowest level represents amino acids as polar or non-polar. This 'crude' level of information can be scanned and processed much more rapidly than a representation at a more detailed or 'higher' level. The highest level used at present distinguishes, for example, glyco-asparagine from asparagine and phospho-serine from serine (notation and assignments at this level are preliminary proposals only). A match eliminated because asparagine does not match with valine can be pre-empted by observing that a polar residue at that location does not match with a non-polar residue.
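As a concrete illustration of this elimination principle, the following minimal Python sketch (ours, not part of the original implementation; the residue classes follow the first bit of the code in Table I below, and the function names and mismatch thresholds are illustrative only) pre-screens candidate windows at the polar/non-polar level before any detailed comparison is attempted.

# Minimal sketch of the elimination principle. Residues are mapped to their
# level-1 description (1 = polar, 0 = non-polar, the first bit of the amino acid
# code in Table I); a candidate window is compared at this crude level first,
# and only survivors are compared in full detail.

POLAR = set("TSNQBZDEHKR")          # first code bit 1 (polar) per Table I
NONPOLAR = set("WYFLMIVCAPGX")      # first code bit 0 (non-polar)

def level1(seq):
    """Reduce a sequence to its polar/non-polar (1 bit per residue) form."""
    return "".join("1" if aa in POLAR else "0" for aa in seq)

def mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))

def survives_coarse(query, window, max_mismatch):
    # Polarity mismatches are a lower bound on full-detail mismatches, so any
    # window rejected here would also be rejected by the detailed comparison.
    return mismatches(level1(query), level1(window)) <= max_mismatch

def survives_detailed(query, window, max_mismatch):
    return mismatches(query, window) <= max_mismatch

query = "DEVKVVVS"                   # the example segment used later in the text
candidates = ["DEVKVVVS", "NEIKIIIS", "GGGGGGGG"]
coarse_hits = [c for c in candidates if survives_coarse(query, c, max_mismatch=2)]
hits = [c for c in coarse_hits if survives_detailed(query, c, max_mismatch=2)]
print(coarse_hits, hits)   # ['DEVKVVVS', 'NEIKIIIS'] ['DEVKVVVS']

Only the cheap polarity comparison touches every candidate; the detailed comparison is applied to the shrinking hit list alone.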


However, an even more subtle approach, which progressively eliminates sequences from consideration according to progressively finer distinctions, is possible as a natural extension of the polarity example. Intermediate levels relate to intermediate levels of description such as polarity and charge. A preliminary scan for homology, for example, can be carried out on a cruder but physically meaningful description of these sequences, compressed into much less storage. Subsequent scans then focus in only on the more detailed representations. The proposed method allows five different 'focusing' steps, corresponding to five levels of description of amino acids. A pilot version of this five-step approach has successfully been implemented in a unified protein engineering and drug design environment (Ball et al., 1990).

It is important to note that this structured approach to searching of the sequence database does not necessarily depend on directly using the bit code approach. For example, internal key addressing within a standard programming language allows the computer 'to know' that, for example, valine is non-polar without examining any bit code for that amino acid. That is to say, such information is immediately accessible from, by virtue of being implicitly associated with, the codes used. Such approaches can be described as 'virtual' use of the coding concepts developed here, and almost certainly the first programming attempts of interested readers will represent this level of approach. In a sense, the bit code is emulated rather than exploited directly. The virtual method is feasible, if not optimal, because the primary direct involvement of bit manipulations can be to set up the requisite databases rather than to explore them. A sketch of this virtual level of use is given below.

However, the code as actually developed and used is described here, for three reasons. First, the 'bit-theory' approach necessarily provides a structured, underlying theory for any method of this class. Second, there are available hardware systems which can directly carry out bit operations of the required type and in a way that can be addressed from high-level code. Vectorizable supercomputers, for example, often have logical array processors in order to control the corresponding numeric array operations. These can potentially lead to great efficiency. Third, it is convenient to demonstrate that efficient storage and good rates of information flow are possible with, and indeed are fundamentally consistent with, the method developed.

For even more intelligent searching of the database, it is convenient if different parts of sequences can be cross-referred to other files or be data-compressed to different extents. For example, known important features of a sequence, such as a consensus sequence, can be expressed at a level closer to specific amino acid type, or even expressed as a nucleotide base sequence, while the surrounding sequence indicates only polarity. To facilitate this, a form of specification known as 'internal specification' allows each byte to carry information about the level of description per residue or sequence of residues.
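The following short Python sketch (ours; a hedged illustration of the 'virtual' level of use just described, not the authors' code) shows how a standard associative lookup can make the level-by-level properties of a residue immediately accessible without any explicit bit manipulation.

# 'Virtual' use of the coding concepts: a lookup keyed by the one-letter residue
# code returns the properties that the first one, two and three bits of the
# natural code would otherwise encode. Assignments follow Table I; the table is
# abridged to a few residues purely for illustration.

RESIDUE_PROPERTIES = {
    #      level 1       level 2                        level 3
    "V": ("non-polar", "large, strongly hydrophobic", "aliphatic"),
    "F": ("non-polar", "large, strongly hydrophobic", "aromatic"),
    "A": ("non-polar", "small, weakly hydrophobic",   "small non-polar, some polar character"),
    "S": ("polar",     "polar neutral",               "hydroxyl-bearing"),
    "N": ("polar",     "polar neutral",               "acid amide"),
    "D": ("polar",     "polar charged",               "acidic sidechain"),
    "K": ("polar",     "polar charged",               "basic sidechain"),
}

def describe(residue, level):
    """Return the description of a residue at focusing level 1, 2 or 3."""
    return RESIDUE_PROPERTIES[residue][level - 1]

# The program 'knows' that valine is non-polar without touching any bit code.
print(describe("V", 1))   # -> non-polar
print(describe("D", 2))   # -> polar charged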


For greater data compression, for control over transmission, and to control the referencing of subsets of larger, more detailed databases, the sequence data itself has some aspects of a computer program. In particular, it has the flavour of the data tape of the idealized 'Turing machine' and in principle could be further developed as such. By this we mean that unified and efficient codes can act as instructions within the database itself for 'winding the tape' and using the information, rather than relying on a fixed read format (e.g. one byte per data item). In more sophisticated form, this converges on the approach of object-orientated file structures, i.e. higher levels of abstraction which tell the system how the data are structured, and which help in searching and interpreting the database. Such a facility is, in our approach, optional. Although preliminary, the unified notation and codes have been worked out in some detail, and should provide a sound basis for more elaborate approaches. Finally, of particular interest to us is that the general principles can be usefully applied quite generally in studying homology scanning, evolution of sequences and identification of domain folding patterns from sequence, without regard to the data-compression aspect.

Theory and method

The following relates particularly to the efficient storage and recovery of gene and protein sequence data for the purpose of comparisons with a given sequence. An example would be the search for a homologous sequence. It is suggested that the coding principles be adhered to in applications other than searches, for reasons of consistency and standardization. An example of another application would be the storage of sequences for the generation of molecular models. Although the principles described remain valuable irrespective of computer type and word length, it is assumed for simplicity of explanation that the computer employed uses 8 bits/byte. This is the situation in FORTRAN, which still remains the language for which the most widespread vectorizable supercomputer compilers are available. In the event of the machine using a smaller or greater number of bits per byte, pooling or peeling of bytes respectively will be required to maintain the general principles employed below, though the required operations are fairly standard and should be self-evident to programmers. The assumption is also that 8 bits/byte are written when an unformatted write takes place, and read when an unformatted read takes place, as in our implementation unformatted data transfer operations are employed for speed. There is merit in considering the following principles at a variety of programming levels, but as stated above execution is naturally more efficient when there is a correspondence between the binary characteristics of the code and those of the machine, and this is exploited.

A byte of 8 bits is exemplified by 10001100, and the bits can be referenced as bit A, bit B, etc.:

ABCDEFGH
10001100

It is important to note that, for amino acid work, it is convenient to speak of the 'first bit' as the most significant bit and the leftmost as read, both from the point of view of the human reader and of the machine. All the bits in such a byte could represent sequence information, but in such a case other bytes elsewhere must tell what kind of sequence information it is. Such a byte is thus said to be externally specified. Alternatively, some of the bits in the byte could represent sequence information, and other bits in the same byte could describe what kind of sequence information it is. Such a byte is said to be internally specified. A few bytes do not contain sequence information at all and are purely control bytes. An example of a control byte would be one that specifies what kind of sequence is in an externally specified byte. One kind of byte simply contains numeric information for a control byte and is called a control parameter byte.

Internally specified bytes

The bits that specify the nature of the byte appear in the rightmost two, three or four bits of that byte, as appropriate. Using X to indicate an undefined bit within a byte, we have as an example

ABCDEFGH
XXXXXX01

which is an internally specified byte containing six conjugated elements of one-bit code, designated a 6 * 1 bit code for short. This is most typically a section of sequence containing six residues, each residue being of two types, polar or non-polar (1 or 0 in the code defined below). At the other extreme of amino acid representation, 5 bits can be used to specify each amino acid specifically and, since there are only 20 amino acids naturally occurring in proteins, some common biochemical modifications of amino acids as well. Of course, if each amino acid is described fully by 5 bits, then in a byte of 8 bits only one amino acid can be represented. This would be 1 * 5 bit code. Forms of intermediate character, such as 2 * 3 bit code, signify that amino acids are grouped into types. For this to be useful and efficient, the bit representations of the amino acids must be set up in a particular way, described later. For present purposes it suffices to note that there are five types of specification for internally specified bytes, and these in full are as follows.

Code type specification

ABCDEFGH
XXXXXX01   6 * 1 bit code
XXXXXX10   3 * 2 bit code
XXXXXX11   2 * 3 bit code
XXXX0100   1 * 4 bit code
XXXXX000   1 * 5 bit code
XXXX1100   control code (indicates that XXXX is an instruction)
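To make the internal-specification scheme concrete, here is a small Python sketch (our illustration, not code from the paper) that inspects the rightmost bits of a byte and reports which of the above code types it carries.

# Decode the internal specification of a byte according to the table above.
# The rightmost bits tag the byte as a 6*1, 3*2, 2*3, 1*4 or 1*5 bit code,
# or as a control byte.

def code_type(byte):
    """Classify one internally specified byte (an int in 0..255)."""
    if byte & 0b111 == 0b000:        # FGH = 000
        return "1 * 5 bit code"
    low2 = byte & 0b11               # GH
    if low2 == 0b01:
        return "6 * 1 bit code"
    if low2 == 0b10:
        return "3 * 2 bit code"
    if low2 == 0b11:
        return "2 * 3 bit code"
    # GH = 00 but FGH != 000, so FGH = 100: bit E separates data from control
    if byte & 0b1111 == 0b0100:      # EFGH = 0100
        return "1 * 4 bit code"
    return "control byte"            # EFGH = 1100, the 'eleven hundred' byte

For example, code_type(0b01010000) returns '1 * 5 bit code' (the 'unknown residue' byte defined below), while code_type(0b11111100) is reported as a control byte.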

Note that there are n X's in an 'i * j' bit code, where n = i * j. Note also that this is the most efficient way in which a byte may be internally specified. Only in one case, bit E of the 1 * 4 bit code, is a bit apparently wasted. In fact, this bit is set as 0 for the internally specifying bytes, and as 1 to indicate that the byte is a control byte. All control bytes are thus XXXX1100 and are referred to colloquially as 'eleven hundred' bytes.

Checking

To ensure that the bits are being read in the correct ('phase' or) frame, a sequence of 8 zero bits is not deliberately used as data but signals a reading error if encountered in the data. Reading frame checks are also advised at intervals, consisting of the bytes

(1111 1100) (0000 0000) (0000 0000)   reading frame check code

This sequence will (except in reading externally specified bytes) signal an error if read out of phase.

Amino acid code

The amino acid code takes up the leftmost (A-E) bits of an internally specified byte (see Table I). The properties of this code are as follows.

(i) Increasing the number of bits considered from the first one (A) to the first five (A-E) progressively defines the amino acid more precisely. The first bit simply divides the amino acids into two groups, polar and non-polar (1 = polar, 0 = non-polar). The first two divide the amino acids into four groups: large and strongly hydrophobic (00), small and very weakly hydrophobic (01), polar neutral (10), and polar charged (11). The first three signify aromatic (000), aliphatic (001), small non-polar but with significant polar character (010), small polar but with significant non-polar character (011), hydroxyl group bearing (100), the acid amides (101), the acid sidechains (110), and the basic sidechains (111). Adding a fourth bit defines the amino acids almost completely: some are uniquely defined, but a few which are closely related (tyrosine, phenylalanine) have the same 4 bit code. Adding a fifth bit gives 32 possible descriptions by bits, which is more than enough to code the 20 naturally occurring amino acids. The remaining slots are used for modifications which can occur to amino acids after biosynthesis, save for 00000, which is part of the 0000 0000 'check' byte (see above).

(ii) The order of the amino acids in the list describes their polarity. The later in the list an amino acid occurs, the more polar it is. Strictly, this applies only to the first 4 bits, and even here there is some slight departure from ideal order to meet the grouping criteria (paragraph i) and substitution criteria (paragraph iii). However, this change in order is within the variation encountered among the polarity scales of other authors.

(iii) The order in the list is, to an approximation, the degree of relatedness of amino acids from the evolutionary point of view. The closer two amino acids are in the list, the more likely they are to substitute for one another, as judged from comparison of homologous sequences. However, as shown by French and Robson (1983), this is a simplification: this property of substitution cannot be conveyed by a list. A much better representation of this aspect is to consider the list as circular, so that tryptophan is close to arginine. There are several ways to represent this computationally (e.g. use of a modulus). Polarity and evolutionary relatedness of amino acids would nearly amount to the same thing if some such modification were not essential. Note that a deletion is 'glycine-like' on substitution frequency grounds.

Table I. Amino acid bit codes (the ABCDE code precedes the tag digits 000)

Decimal   Binary (ABCDE)   Amino acid
0         00000            (not used for a residue; part of the 0000 0000 check byte)
1         00001            W
2         00010            Y
3         00011            F
4         00100            L
5         00101            M
6         00110            I
7         00111            V
8         01000            C (-S-S-)
9         01001            C (-SH)
10        01010            X (unknown)
11        01011            A
12        01100            hydroxyproline
13        01101            P
14        01110            blank/deletion
15        01111            G
16        10000            T modified (e.g. phosphothreonine)
17        10001            T
18        10010            S modified (e.g. phosphoserine)
19        10011            S
20        10100            N-glycosylated N
21        10101            N
22        10110            Z (not known whether Q or E)
23        10111            Q
24        11000            B (not known whether N or D)
25        11001            D
26        11010            modified glutamate (e.g. pyroglutamate, Gla)
27        11011            E
28        11100            modified histidine (e.g. methylhistidine)
29        11101            H
30        11110            K
31        11111            R
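The progressive-refinement property of Table I can be exercised directly from the 5-bit codes. The sketch below (ours, for illustration; only standard residues are shown and the values are transcribed from Table I) compares two residues at a chosen level simply by comparing the leading bits of their codes.

# 5-bit amino acid codes transcribed from Table I (standard residues only;
# C is taken here as the free -SH form, 01001, rather than the -S-S- form 01000).
# Comparing the first k bits of two codes is exactly a comparison at level k:
# level 1 = polar/non-polar, level 2 = four groups, ..., level 5 = full identity.

AA_CODE = {
    "W": 0b00001, "Y": 0b00010, "F": 0b00011, "L": 0b00100, "M": 0b00101,
    "I": 0b00110, "V": 0b00111, "C": 0b01001, "A": 0b01011, "P": 0b01101,
    "G": 0b01111, "T": 0b10001, "S": 0b10011, "N": 0b10101, "Q": 0b10111,
    "D": 0b11001, "E": 0b11011, "H": 0b11101, "K": 0b11110, "R": 0b11111,
}

def match_at_level(res1, res2, level):
    """True if the two residues agree in their first `level` bits (1..5)."""
    shift = 5 - level
    return (AA_CODE[res1] >> shift) == (AA_CODE[res2] >> shift)

# Phenylalanine and tyrosine agree up to level 4 but differ at level 5,
# while phenylalanine and lysine already differ at level 1 (non-polar vs polar).
print(match_at_level("F", "Y", 4), match_at_level("F", "Y", 5))  # True False
print(match_at_level("F", "K", 1))                               # False

Because the codes are prefix-ordered, a comparison at any level reduces to an integer comparison after a right shift, which is why bit-register or logical-array hardware can exploit the representation directly.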


Unknown, blanks, deletion

The byte 01010000 indicates an unknown amino acid residue. The byte 01110000 indicates a 'blank', i.e. it is to be skipped and not used as part of the information used in making a comparison. This can also be used to record a deletion in an amino acid sequence. Two bytes representing two amino acids in 1 * 5 bit code which are separated by 01110000 will be considered as contiguous, and the 'blank' will not appear in the comparison. The locations of the unknown and of the deletion in the series are such that they lie closest to average properties, and to the residues most associated with deletions, respectively.

Check byte

The 'check byte' of 0000 0000 causes cessation of reading and a warning to be given if encountered. That is, it will not normally be written into a sequence. These significances of the 'unknown', 'blank' and 'check' bytes do not apply to an externally specified byte. This allows all 8 bits to be used for sequence information in the case of externally specified bytes, and this is appropriate since externally specified bytes represent more tightly compressed sequence information.

Control bytes

These are examples only, and are under development in our laboratory. In part this is because the best choices for the meaning of a control byte are those that require less storage for the most frequently used commands, and the relative frequency of usage changes with application and with increasingly sophisticated usage. Control bytes all end with 1100. They do not themselves contain any sequence information. The leftmost 4 bits (ABCD) currently signify the specific control function as follows:

ABCD
0000   sequence separator: signify start or end of sequence, continue to read
0010   signify end of data; stop
0001   signify end of file, concatenate to other file defined elsewhere
0011   signify end of file, read 2 bytes to specify next file name; if the 2 bytes are all null, return to previous file
0100   interpret data following as amino acid sequence
0101   interpret data following as RNA sequence
0110   interpret data following as conformation sequence
0111   read next byte to indicate type of interpretation (reserved for future use)
1000   read next three bytes to obtain the number of externally specified bytes to follow, and read each as 8 * 1 bit code
1001   as 1000, but read each byte as 4 * 2 bit code
1010   as 1000, but read each byte as 2 * 3 bit code
1011   as 1000, but read each byte as 2 * 4 bit code
1100   read next byte for extended control instruction set
1101   read permit only if following 7 bytes match specified code
1110   write permit only if following 7 bytes match specified code
1111   skip next 2 bytes (generally all 0's; read-frame check)

RNA bit codes

AB
00   A   purine
01   G   purine
10   C   pyrimidine
11   U   pyrimidine

Note that the first bit indicates a purine (0) or pyrimidine (1), so that, as in the case of the amino acids, reading more bits from left to right increases the specification. The action of the remaining three bits (CDE) in a 1 * 5 bit code is not defined at this time. Note that since bits GH are sufficient to define the 3 * 2 bit code, each byte can contain a triplet of RNA bases, and thus represent a codon for an amino acid. For example, 11111110 is the codon UUU standing for phenylalanine (F). In this sense, there is an alternative formulation for the amino acids: 11111110 in RNA code is equivalent to 00011000 in amino acid code, both representing phenylalanine indirectly or directly respectively. The relationship between the two data sets is said to be one of 'codon mapping'. Only RNA (rather than DNA) code is defined, since the two are easily interconverted at run time, as are the complementary strand codes.

Relative value of amino acid and RNA codes

Since the amino acid code maps to the RNA code, it may be pertinent to emphasize the basic concepts justifying use of two codes. The first reason (though rare these days) is of course that the DNA/RNA code may not be available; further, several codons correspond to the same amino acid, so that the definition of RNA code from amino acid code is ambiguous. The second is that the arrangement of the amino acid code is different, in order to reflect various, and particularly evolutionary, relationships. It is true that such relationships are to some degree reflected in the RNA code, both because of the 'failsafe' evolution of the genetic code to preserve the general properties of an amino acid on substitution following a change of base, and because evolution reflects the coding relationships. However, the ordering of the amino acid code takes into account more factors, including the direct effects of the amino acid physicochemical properties under the influence of some 'averaged' pressure from Darwinian natural selection. Both RNA and amino acid scanning methods are recommended when full RNA data are available, and note that all three reading frames should be used when the reading frame is not apparent.
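The codon mapping described above is easy to emulate. The following Python sketch (ours, illustrative only; the genetic-code lookup is abridged and uses the standard code together with the 2-bit base values from the table above) unpacks a 3 * 2 bit RNA byte into its codon and translates it.

# Unpack a 3 * 2 bit internally specified RNA byte (bits ABCDEF = three bases,
# bits GH = 10 tag) and translate the codon with the standard genetic code.
# Only a fragment of the genetic code is included, for illustration.

BASE = {0b00: "A", 0b01: "G", 0b10: "C", 0b11: "U"}             # 2-bit RNA base code
GENETIC_CODE = {"UUU": "F", "UUC": "F", "GCU": "A", "GAA": "E"}  # abridged

def decode_rna_byte(byte):
    """Return the codon held in a 3 * 2 bit RNA byte such as 0b11111110."""
    assert byte & 0b11 == 0b10, "not a 3 * 2 bit internally specified byte"
    bases = [BASE[(byte >> shift) & 0b11] for shift in (6, 4, 2)]  # AB, CD, EF
    return "".join(bases)

codon = decode_rna_byte(0b11111110)   # the example byte used in the text
print(codon, GENETIC_CODE[codon])     # -> UUU F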



Results and discussion

Speed advantages depend on the application area

The potential speeding over normal database searches is ~8-fold for 'virtual' representations not directly using machine bit operations, and higher if hardware for specific bit operations is used, but these results depend very much on the nature of the study and need important qualification, as follows. BLAST, the software suite of the National Centre for Biotechnology Information, probably uses the most efficient codes in widespread use for sequence searching, but though several code representations have been explored, none of those known to us is of a 'natural' type. Such efficient codes as are used in these systems often rely on different principles of targeting to relevant sequences, which are expected to be consistent with the 'natural' approach. One example would be a table of all amino acid residue triplets with pointers to sequences in which they occur at least once.

The primary advantage of the 'natural' coding method is structured and efficient searching for homologies. This is also the single most important application of any protein sequence database (von Heijne, 1987). It is possible to demonstrate that, unless injudicious programming leading to very heavy computational overheads is implemented, the approach must on theoretical grounds alone be significantly faster than a standard direct search, and this must be true to varying extents irrespective of the matching/scanning method used. In part, the time saving arises because file reading is the major overhead and the mechanism for exploring sequence relationships is inherent in the data representation. Hence, if 8 bits are coded per byte, and each bit represents an amino acid as polar or non-polar, then a scan on the basis of similarity by polarity must be, and is, ~8 times faster than when 8 bits encode one amino acid. However, the speed gains will inevitably vary with the scanning method and indeed with the length and nature of the sequences for which it is of interest to locate the homology. Further, it is unlikely that a perfect match is required in every application, so there is also the effect of the 'percentage match' or other comparison procedure, and of the 'cut-off' level representing the degree of match below which a sequence is rejected as not matching. However, one may also enter a short sequence in order to identify a sequence or region for checking or editing, or to detect any consensus or functional segments. For generality, we shall use 'input sequence' to represent the protein for which homology is to be searched, or the consensus or marker sequence.

Primary method of assessment

For our primary method of assessment, we attempt to carry out a procedure with some aspects of both application areas, and in such a way that the results will represent a conservative estimate of the speed gains. That is, the bench-mark approach used is at least to a considerable extent representative of the worst cases encountered.
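The ~8-fold figure follows simply from the amount of data that must be moved and compared. As a hedged illustration (ours, not the authors' implementation), the sketch below packs eight polar/non-polar bits into each byte and compares two packed segments with a single XOR and a population count per byte, which is also how bit registers or logical array processors could exploit the representation.

# Pack one polarity bit per residue (1 = polar, 0 = non-polar), eight residues
# per byte, so a level-1 scan reads and compares one-eighth of the data that a
# one-byte-per-residue representation would require.

POLAR = set("TSNQBZDEHKR")   # first-bit assignments as in Table I

def pack_polarity(seq):
    """Pack a sequence into bytes, 8 polarity bits per byte, left to right."""
    bits = [1 if aa in POLAR else 0 for aa in seq]
    out = bytearray()
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8] + [0] * (8 - len(bits[i:i + 8]))   # pad last byte
        byte = 0
        for b in chunk:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)

def polarity_mismatches(packed1, packed2):
    """Count differing polarity bits using XOR and a population count."""
    return sum(bin(b1 ^ b2).count("1") for b1, b2 in zip(packed1, packed2))

a = pack_polarity("DEVKVVVS")      # -> 0xd1, i.e. 11010001, as in the text below
b = pack_polarity("DEVKVVVA")
print(a.hex(), polarity_mismatches(a, b))   # d1 1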

To understand this, recall that the databases representing polarity/non-polarity are searched first, with the 'input' string at the same polar/non-polar level of description. For example, the input sequence could be the segment DEVKVVVS, which becomes 11010001, with 1 for polar and 0 for non-polar. The locations of the files matching, or matching above a specified level of match, are noted by the program and represent the 'hit list'. The corresponding 'hit list' sequences at a higher level of description can then be searched. This results in a further reduced 'hit list'. In principle, speed gains are still possible if certain levels are 'skipped'; in these studies, all levels are explored.

The approach used blends two possible approaches to carrying out a scan. First, one may use a window of, say, 10 residues, which is scanned along the input sequence and tested for match with a window of 10 residues scanned along the database sequences. Second, one may perform a scan based on the approach of Needleman and Wunsch (1970), which allows for the presence of insertions/deletions. Both approaches could either eliminate anything but 100% match or identity between windows, or reject by considering a lower match level, say rejecting below 90% identity. Alternatively, more complex statistical assessments could be employed, e.g. based on the statistics developed by Dayhoff et al. (1978).

Primary tests used a compromise approach and scoring scheme on both counts. A scan window of 20 residues was typically used. The scoring method used was to count 1 for every residue match at the first level of 1 bit/residue, plus 1 for every further bit considered at higher levels. That is, a perfect match at level 1 gives 1; at level 2, 1 + 2 = 3; at level 3, 1 + 2 + 3 = 6; at level 4, 1 + 2 + 3 + 4 = 10; and at level 5, 1 + 2 + 3 + 4 + 5 = 15. For a window of 20 residues, the maximum score is thus 20 x 15 = 300, indicating an identical segment of 20 residues in both sequences. In the first phase of studies, any window match of 50% or more of the maximum possible score at a given level, say 50% x 1 x 20 = 10 at level 1, 50% x 3 x 20 = 30 at level 2, and so on, is considered interesting. A Needleman-Wunsch alignment is then performed on any window so found interesting, with a penalty of -1 to the score for every residue insertion/deletion. The score is reassessed, and the important feature is that a match at a higher level is not tested if it fails at a lower level.

If in contrast this process is repeated by examination only of the more detailed data at level 5, or by testing matches at higher levels regardless of the outcome at a lower level, the process is 4.5 times slower. This is the worst possible case in two senses. (i) This is on a CONVEX 220 supercomputer which already treats the standard scanning method well; some of the code peculiar to our approach has, however, yet to be optimized. (ii) The above acceptance criteria of 50% (with the Needleman-Wunsch penalty of -1) in the elimination procedure are, in fact, extremely generous, since there have to be very strong grounds for not continuing the test at the higher level. Exploring these aspects interactively on a variety of personal computers has allowed almost 8-fold speeding.
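The level-by-level scoring scheme just described can be written down compactly. The sketch below is our reading of that scheme (function names are illustrative; the bit codes are those of Table I): a 20-residue window is scored at successive levels and abandoned as soon as it falls below the 50% threshold for the current level.

# A residue pair contributes k(k+1)/2 at level k if its first k bits agree
# (1, 3, 6, 10 or 15), so a perfect 20-residue window scores 20 * 15 = 300 at
# level 5. A window is dropped at the first level where it scores below 50% of
# that level's maximum, and higher levels are then never tested.

AA_CODE = {  # 5-bit codes (decimal) from Table I, standard residues only
    "W": 1, "Y": 2, "F": 3, "L": 4, "M": 5, "I": 6, "V": 7, "C": 9, "A": 11,
    "P": 13, "G": 15, "T": 17, "S": 19, "N": 21, "Q": 23, "D": 25, "E": 27,
    "H": 29, "K": 30, "R": 31,
}

def level_score(window1, window2, level):
    """Sum of k(k+1)/2 over residue pairs whose first `level` bits agree."""
    shift = 5 - level
    per_match = level * (level + 1) // 2
    return sum(per_match
               for a, b in zip(window1, window2)
               if AA_CODE[a] >> shift == AA_CODE[b] >> shift)

def screen(window1, window2, threshold=0.5):
    """Return the deepest level (0-5) at which the window stays 'interesting'."""
    passed = 0
    for level in range(1, 6):
        max_score = len(window1) * level * (level + 1) // 2
        if level_score(window1, window2, level) < threshold * max_score:
            break
        passed = level
    return passed

Because a window that fails at level k is never examined at level k + 1, the cost of the detailed levels is paid only for the shrinking hit list, which is the source of the speed gains discussed above.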


In practice, different results, generally implying much more directed and faster searches, can be obtained by more judicious rejection criteria at any level. Going through the five levels, the resolution of residue properties and the progressive focus on progressively smaller hit lists proceed as follows:

1. polar/non-polar matches;
2. large hydrophobic/small hydrophobic/polar neutral/charged;
3. presence of specific groups (e.g. aromatic, hydroxyl, amide, acidic, basic);
4. amino acids specified, except that highly conservative relationships (e.g. phenylalanine, tyrosine) are not resolved;
5. amino acids fully resolved, including post-translational modifications.

At any level, it is possible to impose further rejection criteria, such as a test for a secondary structure propensity. Such approaches lead to further substantial speed increases, but lead naturally to the question of whether a match or rejection is justified. In general, the question of quality of match mentioned above is, in fact, a matter for the choice of the user. The method lends itself in a natural way to assessment and definition of quality of match. For example, when a typical protein sequence is the input and homology with it is sought, then it might be reasonable to consider the whole of the target set identified at level 3 as homologous and, by definition, all sequences excluded from this target set as non-homologous. Equally, one might interpret all matches as matters of degree, and speak of the match as being 'at level 3'. In this choice, the sequences are assumed related because residues tend to occur at the same locations as residues with comparable significant groups, e.g. aromatic rings, hydroxyl groups.

On some computers, easy facilities exist for using special bit registers and/or for 'bit shifting' data. Our experience with earlier machines such as a Norsk Data suggests that special 'binary' facilities of this type can in principle result in speeds 100 times faster or more. The real value of searching data in this way is not in relation to speed per se; it lies in the fact that more distant, cryptic relationships can be discovered than would be identified by standard methods, and in the fact that the discovery process is efficient. On a highly parallel computer such as a transputer the method appears capable of still more dramatic enhancement.

The appropriate statistical methods and their implications for statistical significance will be discussed by us elsewhere. We briefly note here, in relation to the question of what constitutes a 'match', that the significance of a relationship is arguably defined by the purpose to which the resulting relationship is to be put. Indeed, a Bayesian approach has been employed to take account of these more subtle aspects, and to find a pragmatic solution which is not too sensitive to the statistical model.

The principal conclusions are, for completeness, stated briefly here, because these considerations are of some importance to interpretation of the method, at least in regard to not overemphasizing any one standard, classical statistical test. In the absence of finding any obvious homology, the highest hit set of proteins encountered can all be considered as potential candidates for a deeper analysis of more subtle aspects of relationship, such as those based on comparable secondary structure tendency or tendency to adopt fold motifs, or higher levels of pattern generally. Neural network approaches might well be used in part here. In practice, methods related to those we have developed for statistical validation represent an important aspect of those used, and will again be described elsewhere. This aspect of exploring more deeply the less-convincing relationships found carries with it a notion of 'risk': if one must carry out a modelling study as best one can, one accepts a bigger risk with less convincing homologies. This risk aspect can, however, be made objective; indeed, it is a natural part of (or arguably an extension of) the Bayesian statistical approach, and therefore this too will be developed elsewhere.

Acknowledgements

This research was financed by the data communication development fund of Proteus International plc. Comments at open discussions with a variety of database staff at the National Library of Medicine, Bethesda, the European Molecular Biology Laboratory, and the Max Planck Institute, Munich, as well as travel funding or facilities provided in whole or part by these institutions for discussion purposes, are greatly appreciated.

References

Ball,J., Fishleigh,R.V., Greaney,P.J., LU., Marsden,A., Pool,J. and Robson,B. (1990) A polymorphic programming environment for the chemical, pharmaceutical and biotechnology industries. In Bawden,D. and Mitchell,E.M. (eds), Chemical Information Systems—Beyond the Structure Diagram. Ellis Horwood, Chichester, pp. 107-123.
Bishop,M.J. and Rawlings,C.J. (1981) Nucleic Acid and Protein Sequence Analysis—A Practical Approach. IRL Press, Oxford.
Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) In Dayhoff,M.O. (ed.), Atlas of Protein Sequence and Structure, Vol. 3, Suppl. 3.
Doolittle,R.F. (1987) Of Urfs and Orfs. University Science Books, California.
Dumas,J.P. and Ninio,J. (1982) Nucleic Acids Res., 10, 197-206.
Fishleigh,R.V., Robson,B., Garnier,J. and Finn,P.W. (1987) FEBS Lett., 214, 219-225.
French,S. and Robson,B. (1983) J. Mol. Evol., 19, 171-175.
Garnier,J., Osguthorpe,D.J. and Robson,B. (1978) J. Mol. Biol., 120, 97-120.
Gibrat,J.-F., Robson,B. and Garnier,J. (1991) Biochemistry, 30, 1578-1586.
Heijne,G. von (1987) Sequence Analysis in Molecular Biology. Academic Press, New York.
Lipman,D.J. and Pearson,W.R. (1985) Science, 227, 1435-1441.
Martin,D. (1985) Advanced Database Techniques. MIT Press, London.
Needleman,S.B. and Wunsch,C.D. (1970) J. Mol. Biol., 48, 443.
Pain,R.H. and Robson,B. (1970) Nature, 227, 62-63.
Robson,B. (1974) Biochem. J., 141, 853-867.
Robson,B. and Pain,R.H. (1971) J. Mol. Biol., 58, 237-256.
Robson,B., Platt,E., Fishleigh,R.V., Marsden,A. and Millard,P. (1987) J. Mol. Graphics, 5, 8-17.
Warr,W.A. (1989) In Warr,W.A. (ed.), Chemical Structure Information Systems. American Chemical Society, Washington, DC.

Received on November 7, 1991; accepted on February 14, 1992
