
databases

Molecular biological databases - present and future

Rainer Fuchs, Peter Rice and Graham N. Cameron

The importance of databases as a research tool in molecular biology is growing steadily, and a wide range of databases relevant to genome research is currently available. However, the design of current databases is inadequate for accurate representation and analysis of the results of large-scale genome mapping and sequencing projects. A new generation of databases is required to master the challenges of the future.

© 1992, Elsevier Science Publishers Ltd (UK)

Literature

Table 1. List of relevant biological databases (a)

Database                                Description / notes on function

Sequence databases
  EMBL/GenBank/DDBJ; GenInfo            DNA and RNA sequences, with detailed annotation
  SWISS-PROT (Ref. 5); PIR              Protein and peptide sequences, with detailed annotation
  VecBase                               Cloning vector sequences
  PROSITE                               Protein sequence motifs
  Kabat database (Ref. 7)               Alignment of immunoglobulin sequences
  tRNA database                         Alignment of tRNA sequences

Sequence feature annotation
  ECD                                   E. coli references, sequences and genetic map positions
  EPD                                   Eukaryotic promoters in EMBL/GenBank/DDBJ
  TFD (Ref. 8)                          Target sites and sequences of eukaryotic transcription factors

Structure
  PDB (Ref. 16)                         Protein, DNA and carbohydrate structure atomic coordinates
  CarbBank                              Complex carbohydrate structures

Mapping
  GDB (Ref. 23); CEPH (Ref. 28); OMIM   Human genetic and physical mapping data
  GBase (b)                             Mouse genetic and physical mapping data

Other
  REBASE (Ref. 29)                      Restriction enzyme data
  ENZYME                                Enzyme EC numbers and reactions
  LiMB (Ref. 30)                        Database of molecular biological databases

(a) This table is not intended to be comprehensive. For a more detailed list see the LiMB database (Ref. 30). In addition, the majority of biological-resource banks (e.g. American Type Culture Collection [ATCC], 12301 Parklawn Drive, Rockville, MD 20852, USA; and Microbial Strain Data Network [MSDN], University of Cambridge, 307 Huntingdon Road, Cambridge CB3 0JX, UK) have associated databases, some describing bank contents, and some content-related informational databases.
(b) The Jackson Laboratory, Bar Harbor, ME 04609, USA.
(c) Available from the EMBL Data Library.

able through a biological resource centre, and the bank or similar centres would also manage the distribution of related strains for comparative analysis. The technical problems of storing and shipping a large number of biological samples can probably be circumvented by the 'sequence-tagged sites' (STS) approach (Ref. 13). (An STS is a short, unique DNA sequence which characterizes a mapping landmark on the genome. Sufficient information is stored in a database to allow one easily to recover the sequence by polymerase chain reaction [PCR]; thus, no access is required to the biological material used to define the tag.)
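The STS idea can be illustrated with a minimal Python sketch. It assumes that a database record holds a primer pair and the expected product length; the record layout, the names `StsRecord` and `amplify_in_silico`, and the toy sequences are all hypothetical and are not taken from any real STS database. The function imitates PCR on a known template only by exact string matching.

```python
from dataclasses import dataclass
from typing import Optional


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


@dataclass(frozen=True)
class StsRecord:
    """Hypothetical database record describing one sequence-tagged site."""
    name: str
    forward_primer: str   # written 5'->3' on the top strand
    reverse_primer: str   # written 5'->3' on the bottom strand
    product_length: int   # expected size of the PCR product in bases


def amplify_in_silico(sts: StsRecord, template: str) -> Optional[str]:
    """Return the top-strand region bounded by both primers, or None.

    Only exact primer matches are considered, and the product must have
    the length stored in the record.
    """
    template = template.upper()
    start = template.find(sts.forward_primer.upper())
    if start < 0:
        return None
    # The reverse primer anneals to the top strand at its reverse complement.
    reverse_site = reverse_complement(sts.reverse_primer)
    end = template.find(reverse_site, start + len(sts.forward_primer))
    if end < 0:
        return None
    product = template[start:end + len(reverse_site)]
    return product if len(product) == sts.product_length else None


# Toy data, invented for illustration only.
example_sts = StsRecord("demo-STS", "ACCATGG", "GATCGTAC", 26)
example_template = "TTG" + "ACCATGG" + "CTAAGGTTCCA" + "GTACGATC" + "GGA"
print(amplify_in_silico(example_sts, example_template))
```

The point of the sketch is that the record itself is only text: anyone holding the primers and a DNA sample can regenerate the landmark without requesting the original clone.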

TIBTECH JAN/FEB 1992 (VOL 10)

Literature databases store the scientific literature in computer form to provide rapid-search facilities and on-line access to abstracts of articles. Other databases maintain standard nomenclatures, restriction enzyme sites and enzyme classifications. Among the most interesting new databases are those derived through two-dimensional gel electrophoresis, which can index all the proteins produced by a bacterial species or a human cell line. The spots on such gels can be identified as expressed from a specific set of clones in a clone library, or as altered by mutation in a specific gene, and thus mapped directly. Differences in protein expression between tissue types can be determined, and the altered proteins traced to clones in a library. Short sequence fragments (from microsequencing) can even be used to identify entries for the same or similar proteins in the protein-sequence databases. Very important in many areas of modern biology are structure databases (in particular the Brookhaven database; Ref. 16), which store the atomic coordinates of both large and small molecules, for use in protein-structure analysis and drug design.
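The use of short microsequencing fragments to find matching protein entries can be sketched as follows. This is a simplified illustration in Python, assuming ungapped comparison with a small number of allowed mismatches; real database searches use more sensitive methods. The accession numbers and sequences are invented for the example.

```python
from typing import Dict, List, Tuple


def count_mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_fragment(
    fragment: str,
    entries: Dict[str, str],
    max_mismatches: int = 0,
) -> List[Tuple[str, int]]:
    """Return (accession, offset) for the first acceptable match in each entry.

    Offsets are 0-based. An entry is reported when some window of the same
    length as the fragment differs from it in at most `max_mismatches`
    positions.
    """
    hits = []
    width = len(fragment)
    for accession, sequence in entries.items():
        for offset in range(len(sequence) - width + 1):
            window = sequence[offset:offset + width]
            if count_mismatches(fragment, window) <= max_mismatches:
                hits.append((accession, offset))
                break  # one hit per entry is enough for this sketch
    return hits


# Hypothetical protein entries keyed by made-up accession numbers.
example_entries = {
    "X0001": "MKTAYIAKQRQISFVKSHFSRQ",
    "X0002": "MKTAYLAKQRNISFVK",
    "X0003": "GSHMLEDPVDAFK",
}

print(find_fragment("AYIAKQR", example_entries))                    # identical proteins
print(find_fragment("AYIAKQR", example_entries, max_mismatches=1))  # similar proteins
```

With no mismatches allowed only the identical entry is reported; allowing one mismatch also returns the closely related second entry, which corresponds to the "same or similar proteins" case described above.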

Linking the databases

Databases in isolation are of limited use. Most databases, therefore, include pointers to related databases so that extra information can be accessed. Figure 1 shows the existing links between the sequence databases and some of these other data collections. Each DNA-coding sequence refers to the protein sequence for further biological detail, and each protein sequence refers to the original DNA sequence and to the structure if it is known. Sequence entries also refer to the genetic and physical map locations, and give details of source strains, gene names and enzyme commission (EC) numbers. Each of these links allows additional information to be retrieved from other specialized databases. The SWISS-PROT database has pointers to all PROSITE motifs found within each sequence entry; PROSITE, in turn, has pointers to all entries in SWISS-PROT which contain the motif, and reports the number of false hits and misses, giving a clear guide to the specificity of the motif. Other databases contain no sequence data, but instead refer to sequence entries in the DNA or protein sequence databases. These pointers use accession numbers: unique identifiers which are always associated with a database entry even though its name may change. Such links from ECD, EPD and the Drosophila map (Ref. 17) to the nucleotide sequence entries are shown in Fig. 1.

Figure 1. Links between databases. [Boxes in the figure: EMBL/GenBank nucleotide sequence database (Refs 2, 3); GDB genome map data (Ref. 23); Drosophila genetic map database (Ref. 17); TFD transcription factors (Ref. 8); SWISS-PROT protein sequence database (Ref. 5); PDB three-dimensional protein structures (Ref. 16); protein sequence motifs; REBASE restriction enzymes.]

Databases for genome projects

The recent initiatives in mapping and sequencing single, complete genomes have highlighted the need for databases that concentrate on a single species, usually a single strain. Most of these projects are, as yet, in their early stages. Figure 2 shows the databases currently available for a single strain (K-12) of the bacterium E. coli. The sequence of more than 30% of its genome is now determined, the genetic map is known in considerable detail, and a restriction map and a complete set of overlapping clones are available. The known E. coli sequence is, of course, contained in the standard nucleotide and protein-sequence databases. Mapping information is available in a detailed genetic linkage map, which is referred to by all the other databases. Physical maps include a restriction map with 7000 sites, and a mapped clone library. The protein content is described by a database of two-dimensional electrophoresis gels. Two additional databases link this information together. The Gene-Protein Index links genes to their protein products and to the two-dimensional gel spots. The ECD links the genetic map to the sequence databases, with directions for the construction of the full chromosomal sequence from overlapping fragments.

Future directions

The amount and complexity of information from genome-scale sequencing projects will generate new requirements for present-day databases, in particular the nucleotide-sequence databases. Although some of these challenges are specific to their operation, most of them are similar for databases of other genome information and broadly applicable.

Data acquisition

It can be predicted that by the year 2000 the nucleotide-sequence collection will have grown to several hundred times its current size. This increase in information not only creates tremendous technical problems of data storage and management of the databanks, but also affects the way data are acquired by them. Traditionally, most information in the sequence databases has been extracted from the scientific literature, but the importance of this route has decreased significantly. Today, sequences from publications account for only about 10% of the current input to the EMBL/GenBank/DDBJ database; the majority of data are submitted directly by the scientists. The growing reluctance of journals to print sequence data increases the likelihood that this trend will continue and that publication of sequences in the traditional manner will cease. Direct data deposition in the public databases is therefore becoming increasingly important.

Genome projects will also open up a completely new route of data acquisition. In the future, the bulk of information [...] will come from project-specific databanks as the primary collection points of these data. The submission of sequences from the ongoing yeast chromosome III and the C. e[...]

[...] the current one, the nucleotide-sequence database would concentrate on building a stable 'backbone' of published and submitted sequence data. Different sets of interpretation can then be linked flexibly to the sequence database and to each other in different, and even contradictory, ways. The creation of these independent 'annotation databases' will be the responsibility of the scientific community and be carried out by scientists who are experts in their field.
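One way to picture the 'backbone' and 'annotation database' arrangement is the following Python sketch. It assumes that backbone entries hold sequence data only and are keyed by accession number, while each annotation database is an independent collection of records pointing at those accession numbers. The names (`Annotation`, `annotations_for`, `annotated_sequence`), the accession number and the data are hypothetical illustrations, not a description of any existing system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    """One interpretation of a region of a backbone sequence entry."""
    accession: str  # stable pointer to the backbone entry
    start: int      # 1-based, inclusive
    end: int        # 1-based, inclusive
    label: str


# Backbone: sequence data only, keyed by accession number (made-up entry).
backbone: Dict[str, str] = {"X00001": "ATGGCCAAGTGA"}

# Independent annotation databases; they may disagree with each other.
annotation_dbs: Dict[str, List[Annotation]] = {
    "lab_a_genes": [Annotation("X00001", 1, 12, "coding region")],
    "lab_b_genes": [Annotation("X00001", 4, 12, "coding region, later start")],
}


def annotations_for(
    accession: str, databases: Dict[str, List[Annotation]]
) -> Dict[str, List[Annotation]]:
    """Collect every interpretation of one backbone entry, grouped by source."""
    views = {}
    for db_name, records in databases.items():
        matching = [r for r in records if r.accession == accession]
        if matching:
            views[db_name] = matching
    return views


def annotated_sequence(sequences: Dict[str, str], note: Annotation) -> str:
    """Return the backbone subsequence that an annotation refers to."""
    return sequences[note.accession][note.start - 1:note.end]


for source, notes in annotations_for("X00001", annotation_dbs).items():
    for note in notes:
        print(source, note.label, annotated_sequence(backbone, note))
```

In this arrangement the two contradictory interpretations coexist without either one altering the backbone entry, and a user can choose which annotation source to follow.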

Of course, researchers will require integrated access to these different types of data collections, and the need for developing appropriate systems which allow the database user to navigate through this network of databases creates an exciting research field and a challenging task for all database producers and software developers.
