DNA Sequence - J. DNA Sequencing and Mapping, Vol. 3, pp. 99-105. Reprints available directly from the publisher. Photocopying permitted by license only.

© 1992 Harwood Academic Publishers GmbH. Printed in the United Kingdom.

Indexing the sequence libraries: Software providing a common indexing system for all the standard sequence libraries

RODGER STADEN and SIMON DEAR MRC Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 2QH, UK

We describe a set of programs for creating and using indexes for the distributed forms of the major sequence libraries. The indexes conform to the specification of those distributed on CD-ROM by the EMBL sequence library. The programs create entry name, accession number, author and freetext indexes and a brief directory index. If a suitable application program is given an entry name or accession number, these indexes allow rapid retrieval of sequences or annotation. Similarly, the author and freetext indexes provide the data for extremely fast searching on author names and “keywords”. The indexing programs can create indexes for the EMBL, SwissProt, GenBank, PIR and NRL3D libraries. We also describe the organisation and use of the different sequence libraries and their index files.

KEY WORDS: sequence libraries, keyword searches, author searches, software

INTRODUCTION

There are several major centres for collecting, annotating and distributing nucleotide and protein sequence data. These include EMBL, GenBank, SwissProt, PIR and the DNA Data Bank of Japan. Unfortunately there are also several formats for storing the data, and in addition several methods of indexing. This diversity of libraries and formats creates problems for both users of sequence analysis programs and those who develop them. Not only do users need to search more than one nucleotide and protein library, but they may have to use a different program for each library because each is in a different format. Furthermore, system managers may have to store several copies of the libraries in order to provide versions compatible with the variety of programs they need to support. Finally, and importantly, the majority of programs require a format that is different to that distributed by the sequence libraries, which means that huge (and growing) files need to be reformatted and, at least temporarily, duplicated.

Faced with these problems we decided to employ a standard set of indexes and, even though the various libraries differ, leave their sequence and annotations in their distributed formats. This meant that we needed to design the index formats, write the indexing programs for all the different libraries we wanted to use, and write the application programs to use the indexes in conjunction with the varied sequence and annotation formats. Having written the programs, the result is that we can use all the libraries, including weekly or nightly updates, and we save disk space by not changing the libraries from their distributed formats. The indexes are relatively small. In addition, the application programs allow rapid retrieval of entries from entry name or accession number, and very fast and exhaustive searches on “keywords” and author names.

At the time we were considering this strategy EMBL announced the indexes they were planning to provide for the distribution of the EMBL nucleotide and SwissProt protein libraries on CD-ROM. We decided to adopt the same indexing methods. Detailed information about the structure of the indexes is contained in the document EMBL CDROM (Indices), which is available from the EMBL Data Library. Below we outline the relevant components of the indexes used.

Introduction to the indexes

A library consists of several files known as “divisions”, which each contain the sequences and annotation for some sub-division of the data. Both EMBL and GenBank are divided into 13 divisions (although the groupings are different), SwissProt has only 1 and PIR has 3. The index that provides retrieval of entries based on their entry names has the following contents. For each entry name it records an annotation offset, a sequence offset and a division number.

The position of a character in a file is defined by its “offset”. The offset is the number of bytes from the start of the file, and the first character in a file has offset 0. The annotation offset defines the start of the annotation for the entry, and the sequence offset the position of its first sequence character. The division number (between 1 and 13 for EMBL and GenBank) defines which of the divisions contains the entry. The index is sorted into alphabetical order on entry name. As will be described below, other indexes refer to the records in this entry name index by their record numbers.

The accession number index consists of a target file and a hit file. Each record in the target file contains a target string (here an accession number), the number of entries containing the target string, and the number of the record in the hit file that contains the first entry name that references it. The target file is sorted alphabetically on the target string. The hit files consist entirely of records containing entry record numbers: these are the record numbers in the entry name index. So from an accession number found in the target file we get a pointer to a record in the accession number hit file (the hit record number) and a count of the number of relevant records (the number of hits). In the hit file, records (hit record number) through (hit record number + number of hits - 1) contain the record numbers of entries in the entry name index. These records contain the entry names of the entries that include the accession number. Hence we can use an accession number to locate an entry.

The EMBL CD-ROM also has freetext and author indexes, and they have an identical structure to the accession number index: i.e. they each have a target file and a hit file. The freetext index is an index of all non-trivial words occurring in feature tables, definition lines, title lines, keyword and comment lines. Again the target files are sorted on the target string, and the indexes can be used to locate entries in the same logical way as the accession number index. The final relevant index contained on the EMBL CD-ROM is the brief directory index. This file consists of a record for each entry in the library. Each record includes:

The content of the first three items is obvious, and the description is an 80-character summary describing the entry. Again this file is sorted on entry name. Note this means that the record numbers for entries in the brief directory index are the same as those in the entry name index. Hence the entry record numbers from the accession number, freetext and author hit files point directly to the relevant records in them both.

Each of the index files has a header of 300 bytes which defines the file size, number of records, record size, database name, database release number and database release date, and also includes 256 bytes of free space for future use. Finally, for completeness, we point out that the EMBL CD-ROM also has a set of “pre-index” files which provide a higher level of indexing for target files. We have not found them necessary as our current access and searching times are sufficiently fast without them.
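The two-step lookup described above (target file, then hit file, then entry name index) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The record layouts (`ENTRY_REC`, `TARGET_REC`, `HIT_REC`), the 10-character field widths, the little-endian byte order, the 1-based record numbering and the toy entry names and accession numbers are all assumptions made for illustration; only the 300-byte header size, the sorted target file and the contiguous run of hit records come from the text. The real field widths are defined in the EMBL CD-ROM index specification.

```python
import struct

HEADER_SIZE = 300  # from the text: every index file starts with a 300-byte header

# Hypothetical fixed-width record layouts (assumed, not taken from the EMBL spec).
ENTRY_REC = struct.Struct("<10sIIH")  # entry name, annotation offset, sequence offset, division
TARGET_REC = struct.Struct("<10sII")  # target string, number of hits, first hit record number
HIT_REC = struct.Struct("<I")         # record number in the entry name index


def pad(text, width=10):
    """Encode a string as a space-padded fixed-width field."""
    return text.encode("ascii").ljust(width)


def read_record(data, layout, number):
    """Unpack record `number` (assumed 1-based) from an index file image."""
    start = HEADER_SIZE + (number - 1) * layout.size
    return layout.unpack_from(data, start)


def find_target(target_file, key):
    """Binary search the sorted target file; return (nhits, first_hit) or None."""
    wanted = pad(key)
    lo, hi = 1, (len(target_file) - HEADER_SIZE) // TARGET_REC.size
    while lo <= hi:
        mid = (lo + hi) // 2
        name, nhits, first_hit = read_record(target_file, TARGET_REC, mid)
        if name == wanted:
            return nhits, first_hit
        if name < wanted:
            lo = mid + 1
        else:
            hi = mid - 1
    return None


def lookup(target_file, hit_file, entry_file, key):
    """Return (entry name, annotation offset, sequence offset, division) for each hit."""
    found = find_target(target_file, key)
    if found is None:
        return []
    nhits, first_hit = found
    results = []
    # Hit records first_hit .. first_hit + nhits - 1 hold entry record numbers.
    for hit_no in range(first_hit, first_hit + nhits):
        (entry_no,) = read_record(hit_file, HIT_REC, hit_no)
        name, ann, seq, div = read_record(entry_file, ENTRY_REC, entry_no)
        results.append((name.decode("ascii").rstrip(), ann, seq, div))
    return results


def build(layout, rows):
    """Build an index file image: a blank header followed by packed records."""
    return bytes(HEADER_SIZE) + b"".join(layout.pack(*row) for row in rows)


# Toy data (invented): two entries; one accession number unique, one shared.
entry_file = build(ENTRY_REC, [(pad("ECLACZ"), 0, 512, 2), (pad("HSGLOB"), 4096, 4700, 5)])
hit_file = build(HIT_REC, [(1,), (1,), (2,)])
target_file = build(TARGET_REC, [(pad("J01636"), 1, 1), (pad("X00001"), 2, 2)])

for row in lookup(target_file, hit_file, entry_file, "X00001"):
    print(row)
```

Because the freetext and author indexes are described as having the same target file and hit file structure, the same `lookup` function would serve for them under these assumptions, with a keyword or author name as the key. The offsets returned would then be used to seek directly into the division file identified by the division number.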

Dealing with the different sequence libraries

In order to create the indexes described above for each of the different libraries, our programs have to read through each of their distributed flat
