Finding protein similarities with nucleotide sequence databases.

FINDING PROTEIN HOMOLOGIES WITH DNA DATABASES

[7] Finding Protein Similarities with Nucleotide Sequence Databases

By STEVEN HENIKOFF, JAMES C. WALLACE, and JOSEPH P. BROWN

Introduction

The relative ease with which DNA can be cloned and sequenced has led to rapid expansion of the nucleotide sequence databases. As a consequence, the amino acid sequences of the large majority of new proteins have been deduced from nucleotide sequences rather than determined directly. Since these sequences are initially deposited in the DNA databases, the protein databases have become mostly secondary sources of information. This trend promises to increase as larger scale nucleotide sequencing projects get under way. Therefore, it is worthwhile to search nucleotide sequence databases for protein similarities, since these databases are more complete and up-to-date.

As illustrated in this chapter, it is advantageous to search these databases for amino acid rather than nucleotide sequence similarities. Therefore, we have adapted amino acid sequence searching procedures to detect similarities within nucleotide sequence databases.1 This involves translating the DNA database sequence in each possible reading frame into protein prior to alignment with the protein sequence used as query. Not only is this simple procedure very sensitive, but it also allows access to sequence data that are not known or suspected to encode proteins, and hence would not appear in protein databases.

Searches using translated database sequences can be more sensitive than nucleotide sequence searches of the same data because (1) the genetic code is degenerate and (2) regularities in proteins can be used as selective criteria. The codon degeneracy problem is best illustrated in the case of serine: Since serine can be encoded in six ways (UCN, AGU, or AGC), a perfect match between serines in two aligned proteins is very likely to be mismatched at the level of nucleotide sequence. In fact, many of these nucleotide mismatches are at all three positions, such as UCA (Ser) versus AGU (Ser).
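The translation step and the serine example can be made concrete with a short sketch. The Python below is an illustration only and is not the NUTSS implementation: the function names (`six_frames`, `translate`, `reverse_complement`, `mismatches`) are hypothetical, the standard genetic code is assumed, and any codon containing a character other than A, C, G, or T is rendered as X.

```python
from itertools import product

# Standard genetic code, with codons enumerated in the base order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(seq):
    """Map frame number (1, 2, 3, -1, -2, -3) to its translation."""
    dna = seq.upper().replace("U", "T")  # accept RNA spelling as well
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


def mismatches(codon_a, codon_b):
    """Count positions at which two codons differ."""
    return sum(x != y for x, y in zip(codon_a, codon_b))


# The six serine codons (TCN, AGT, AGC in the DNA alphabet).
serine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "S")
```

In this sketch, `mismatches("TCA", "AGT")` counts three differences between two codons that both encode serine, which is the situation described above for UCA versus AGU.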
1 S. Henikoff and J. C. Wallace, Nucleic Acids Res. 16, 6191 (1988).

METHODS IN ENZYMOLOGY, VOL. 183. Copyright © 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.

Codon degeneracy is even more of a problem where organisms show different biases in the choice of codons for any particular amino acid. For example, in mammals, CAG (Gln) is preferred over CAA (Gln) about

75% of the time, whereas in yeast this preference is reversed.2 Therefore, the large majority of glutamine matches in homologous proteins between yeast and vertebrates are mismatched at the third position. Translating a database sequence prior to comparison overcomes this difficulty. Translation also allows one to take into account amino acid preferences and conservative substitutions, for example by using the lod (log odds) matrix scoring system.3

There are potential drawbacks to the translation of nucleotide database sequences for searching. Some of these are due to the greater number of comparisons required to search a translated nucleotide sequence database when compared to a protein sequence database. Since all six possible reading frames must be searched, there are at least 6-fold as many comparisons necessary in order to search a coding sequence when compared to the derived amino acid sequence. Furthermore, the nucleotide sequence databases include large amounts of sequence not encoding protein. Therefore, the number of computations involved is typically more than an order of magnitude greater for translated database searching than for protein database searching, with a corresponding increase in computer time. In addition, the greater amount of data requires more disk storage space.

Nevertheless, we have found that even relatively modest personal computers are adequate for translated searches. By 4-fold compression of the nucleotide databases and translation “on-the-fly,” disk storage requirements are not excessive [about 8 megabytes (MB) for EMBL 17 or GenBank 58]. Typically, personal computers are turned off at night, so that even the most exhaustive search can be carried out with minimal cost and inconvenience by using an otherwise idle machine. The most intensive of the searches that we describe here requires about 8 hr on an IBM-compatible personal computer with an Intel 80286 processor or about 4 hr on a “386” machine.
The rapid improvement in technology, the availability of useful general purpose software, and the reduction in hardware prices have put these machines in many laboratories where sequence data are being collected and analyzed. Although nucleotide sequence databases have increased in size over the past several years, the increase in speed and storage space of general purpose personal computers has been even greater.

Another problem for translated searches is the increased level of chance similarities detected when more data are analyzed. This background level of spurious matches reduces sensitivity, making distant relationships potentially difficult to detect. Generally, sensitivity can be increased by extension of an alignment following detection.4,5 However, extension is not always appropriate for increasing sensitivity of translated searches. Therefore, we have relied on an alternative means of increasing sensitivity: identifying a short conserved sequence motif and using it for sensitive detection. In this way, one does not rely on extension of a similarity for detection, although it can be useful for confirmation. We illustrate these methods, individually and in combination, using examples taken from our recent analyses of bacterial activator protein families.

The programs NUTSS (Nucleotide Translation Similarity Searcher) and PATMAT (PATtern MATrix builder), which carry out the procedures described in this chapter, are available on a 360K floppy diskette from the authors on request. These programs are also part of a comprehensive package, GENEPRO, available from Riverside Scientific Enterprises (18332 57th Avenue N.E., Seattle, WA 98155). A 100% IBM-PC compatible computer with a hard disk and 640K of memory is recommended. A version for the Apple Macintosh II is planned. The searching programs utilize uncompressed databases available from GenBank or highly compressed GenBank, EMBL, and NBRF-PIR databases available from Riverside Scientific.

2 T. Maruyama, T. Gojobori, S. Aota, and T. Ikemura, Nucleic Acids Res. 14, r151 (1986).
3 M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, in “Atlas of Protein Sequence and Structure” (M. O. Dayhoff, ed.), Vol. 5, Suppl. 3, p. 345. National Biomedical Research Foundation, Silver Spring, Maryland, 1978.

Methods and Applications

Method 1: Standard Similarity Search Using a Single Amino Acid Sequence as Query

Standard similarity searches are performed by fetching an individual nucleotide sequence, translating each reading frame into protein, comparing that reading frame with the query, repeating the comparison for the next reading frame, and then repeating the entire operation for the next nucleotide sequence. The comparison strategy6 is to align a fixed length of sequence (a window) from the query with the same length from a translated database sequence and calculate a lod (log-odds) score, which measures the likelihood that two aligned amino acids are functionally equivalent.3 The window is then aligned with the next stretch of translated sequence and a lod score calculated. For greater speed, only alignments in which one or more dipeptides match between the query and the translated database sequences are considered. To reduce the likelihood of missing optimal matches, all possible alignments including one or more dipeptide matches are made. Gaps are not allowed. Generally, a window of 30 amino acids is effective.1 In some cases, however, larger windows are useful for more sensitive detection.7 Use of a fixed searching window rather than a flexible one5 allows the typical investigator to evaluate more readily the meaning of a high score, as no hidden decisions have been made by the program.

The top scoring sequences are then visually examined for extension of each detected match using standard computer alignment8 and dot matrix9 methods. Visual inspection at this point takes advantage of the investigator's judgment in deciding whether an alignment reflects common ancestry or a chance similarity. Extension is only one criterion used in this decision; other information, such as function, species, intron-exon structure, and DNA sequence context, can be considered at the same time.

4 R. F. Doolittle, Science 214, 149 (1981).
5 D. J. Lipman and W. R. Pearson, Science 227, 1435 (1985).
6 W. J. Wilbur and D. J. Lipman, Proc. Natl. Acad. Sci. U.S.A. 80, 726 (1983).

Application 1: Detection of LysR Family by Standard Similarity Searching. An example of the standard searching approach is the previous identification of members of a large family of bacterial activator proteins related to LysR.7 Using the Salmonella typhimurium MetR protein as query of GenBank 52 and EMBL 14, and a window of 90, 14 different bacterial sequences were detected with higher lod scores and numbers of matches than the next best matches, which appeared to be chance similarities (Table I). Excluding proteins with similar known functions in different species, this group consisted of five proteins (LysR, NodD, IlvY, CysB, and AmpR) which, like MetR, were known to activate other genes, one protein of unknown function, and two partial open reading frames which were not known to encode proteins. Considering that each of the searches examined the equivalent of about 100,000 translated sequences the size of MetR (276 amino acids), evidence for similarity was very strong.
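The fixed-window comparison of Method 1 can be sketched as follows. This is a simplified illustration rather than the NUTSS code: `best_window` and `identity_score` are hypothetical names, and the identity scorer is only a stand-in for a lookup in a lod matrix such as PAM250, which is not reproduced here. The sketch scores every ungapped window alignment that contains at least one matching dipeptide and returns the best mean score.

```python
from collections import defaultdict


def identity_score(a, b):
    """Stand-in scorer (assumption): 1 for identical residues, else 0."""
    return 1.0 if a == b else 0.0


def best_window(query, subject, window=30, score=identity_score):
    """Return (mean score, query start, subject start) of the best ungapped
    window containing a shared dipeptide, or None if no dipeptide is shared."""
    # Index every dipeptide position in the query.
    positions = defaultdict(list)
    for i in range(len(query) - 1):
        positions[query[i:i + 2]].append(i)

    # Collect (query index, subject index) pairs of matching dipeptides.
    hits = set()
    for j in range(len(subject) - 1):
        for i in positions.get(subject[j:j + 2], ()):
            hits.add((i, j))

    best = None
    seen = set()
    for i, j in sorted(hits):
        # Every window start on this diagonal that still covers the dipeptide.
        for q_start in range(max(0, i + 2 - window), i + 1):
            s_start = q_start + (j - i)
            if s_start < 0:
                continue
            if q_start + window > len(query) or s_start + window > len(subject):
                continue
            if (q_start, s_start) in seen:
                continue
            seen.add((q_start, s_start))
            total = sum(score(query[q_start + k], subject[s_start + k])
                        for k in range(window))
            mean = total / window
            if best is None or mean > best[0]:
                best = (mean, q_start, s_start)
    return best
```

In a translated search, `subject` would be one reading frame of a database entry, and the call would be repeated for each of the six frames. No gaps are introduced, in keeping with the procedure described above.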
TABLE I
BEST DATABASE MATCHES USING S. typhimurium MetR ACTIVATOR PROTEIN AS QUERY (a)

Lod score (b)   Matches   Protein                                               Database (c)
92              24        Escherichia coli LysR activator protein               P, G
91              28        Enterobacter cloacae AmpR activator protein           E
91              24        Alcaligenes eutrophus TfdO ORF                        E
90              31        S. typhimurium ORF upstream of leu operon (LeuO)      G, E
90              27        E. coli IlvY activator protein                        G, E
88              31        E. coli ORF upstream of the leu operon (LeuO) (d)     G, E
87              22        E. coli CysB activator protein                        G, E
86              23        E. coli ORF downstream of the ant operon (AntO)       E
85              23        Rhizobium meliloti NodD activator protein             P, G, E
85              23        Rhizobium leguminosarum NodD activator protein        G, E
85              22        Rhizobium trifolii NodD activator protein             G, E
85              21        Bradyrhizobium sp. NodD activator protein             E
85              20        Rhizobium sp. NodD1 activator protein                 E
85              20        S. typhimurium CysB activator protein                 G, E
85              18        Mouse ubiquitin mRNA, inverted sequence               G
84              21        HSV-a glycoprotein b mRNA, inverted                   G, E
84              18        Rabbit β-myosin heavy chain mRNA, inverted            G, E
84              18        Bovine protein C mRNA, inverted                       E

(a) Window = 90.
(b) For computations, +8 has been added to each PAM250 matrix value (Ref. 3) to eliminate negative values, and the mean log-odds (lod) score of each alignment has been multiplied by 10 and rounded off to the nearest integer.
(c) P, NBRF-PIR 14; G, GenBank 54; E, EMBL 14.
(d) Using R. meliloti NodD protein as query.

7 S. Henikoff, G. W. Haughn, J. M. Calvo, and J. C. Wallace, Proc. Natl. Acad. Sci. U.S.A. 85, 6602 (1988).
8 S. Needleman and C. Wunsch, J. Mol. Biol. 48, 443 (1970).
9 J. Maizel and R. Lenk, Proc. Natl. Acad. Sci. U.S.A. 78, 7665 (1981).

Several criteria were used to demonstrate that all of these predicted proteins are actually related and not the result of chance similarity to MetR. Dot matrix analysis indicated that multiple regions of similarity could be detected between pairs of the proteins. Regions that were similar in one pair tended to be similar in the other pairs. Alignments extended essentially from end to end, suggesting a similar overall fold. The full sequence of one of the incomplete predicted sequences (Escherichia coli LeuO) was determined, extending the alignment to its carboxy terminus. As LysR is thought to have a helix-turn-helix DNA-binding domain between residues 21 and 40, each aligned protein was evaluated for the likelihood of containing this

motif using the parameters of Dodd and Egan.10 In nearly every case, the best predicted region of the protein aligned precisely with residues 21-40 of LysR.7

This detection of nine members of a single family using DNA database searching for protein similarities demonstrates several features of the procedure: (1) Access to a more complete database. Only two of the proteins, LysR and Rhizobium meliloti NodD, were present in the NBRF-PIR database (Table I). These were the only members of the family previously known to be related. (2) Detection of unrecognized open reading frames. In the case of TfdO and LeuO, the published sequences were for genes adjacent to the ones detected. Since this family is usually characterized by divergent transcription, the amino-terminal portion of a family member might fall within the presumed regulatory region of a known gene. (3) Confirmation of detected matches using other criteria. In this case, the use of a predictive scheme for a DNA-binding motif10 was particularly valuable. (4) Sensitivity of the procedure. The overall level of similarity of the proteins of the various family members to MetR ranges from 15 to 24%. To achieve sufficient sensitivity for a standard translated database search, it was necessary to use a searching window of 90. Gaps could confound a search that uses such a large window. The procedure described next is an alternative means of achieving higher sensitivity while using a small searching window.

10 I. B. Dodd and J. B. Egan, J. Mol. Biol. 194, 557 (1987).

Method 2: Matrix Searching

Gribskov et al.11 have demonstrated that greater sensitivity can be obtained in a search using a position-specific scoring matrix derived from related sequences (profile analysis) rather than using a single query. The matrix consists of values for each amino acid at each position reflecting the frequency with which that residue appears at that position among the aligned sequences. As demonstrated below, this procedure can be effective even when there are only two sequences contributing to the matrix. Furthermore, it appears to be advantageous to use a relatively short region to generate the scoring matrix so that the remaining information can be used to confirm candidate matches and so that variable gaps need not be considered in the search. One advantage of searching for a short pattern is that no computational shortcuts are necessary, since every stretch of translated database sequence the size of the pattern can be searched in about the same time that a standard search can be run for a typical query. Another advantage is the conceptual simplicity of the procedure: The precise pattern is decided on beforehand by the investigator, who does not need to use any scoring matrix other than the one he constructs.

A matrix-building program uses aligned amino acid sequences (e.g., Fig. 1a) to construct a scoring matrix (e.g., Fig. 1b). The individual matrix entries can be weighted to compensate for some of the nonrandomness of database sequences. For protein database searches, one can divide each matrix entry by the average frequency with which that amino acid is found in proteins in order to penalize the more common residues, which occur more frequently by chance. For DNA database searches, one can divide each matrix entry by the frequency with which a codon for that amino acid appears in the genetic code in order to penalize chance occurrences for database sequences translated in all frames. For the examples described below, weighting by codon frequency is found to improve performance.
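The matrix construction with codon-frequency weighting can be sketched in a few lines. This is an illustrative reading of the procedure and not the PATMAT source: `build_matrix` is a hypothetical name, groups of closely related sequences are averaged so that each group contributes one value per position, and each position is scaled so that its entries sum to 100. The example input uses only the first two residues of each sequence in Fig. 1a.

```python
from collections import Counter

# Number of codons for each amino acid in the standard genetic code.
CODONS = {"A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
          "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
          "T": 4, "W": 1, "Y": 2, "V": 4}


def build_matrix(groups, weight_by_codons=True):
    """groups: list of groups, each a list of equal-length aligned sequences.
    Returns one dict per position mapping residue -> score (sum = 100)."""
    length = len(groups[0][0])
    matrix = []
    for pos in range(length):
        freq = Counter()
        for group in groups:
            for seq in group:
                # Each group contributes a total of 1/len(groups).
                freq[seq[pos]] += 1.0 / len(group) / len(groups)
        if weight_by_codons:
            weighted = {aa: f / (CODONS[aa] / 64.0) for aa, f in freq.items()}
        else:
            weighted = dict(freq)
        total = sum(weighted.values())
        matrix.append({aa: 100.0 * w / total for aa, w in weighted.items()})
    return matrix


# First two residues of the nine groups in Fig. 1a.
groups = [["AA"], ["AA"], ["AA"], ["AA", "AA"], ["AA"] * 5,
          ["TA", "TA"], ["AA"], ["SA"], ["AA"]]
matrix = build_matrix(groups)
```

With this input, the entry for alanine at position 1 rounds to 81 and the entry for alanine at position 2 is 100. Residues that never occur at a position are simply absent from that position's dictionary, which corresponds to a score of zero.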
11 M. Gribskov, A. D. McLachlan, and D. Eisenberg, Proc. Natl. Acad. Sci. U.S.A. 84, 4355 (1987).

(a)
Protein (organism)         1st residue   Amino acid sequence

MetR (S. typhimurium)      22            AAAVLHQTQSALSHQFSDLEQRLGFRLFVR

LysR (E. coli)             24            AAHLLHTSQPTVSRELARFEKVIGLKLFER

TfdO (A. eutrophus)        21            AARRLHISQPPVTRQIHALEQHLGVLLFER

LeuO (E. coli)             22            AAHVLGMSQPAVSNAVARLKVMFNDELFVR
LeuO (S. typhimurium)      22            AAHTLGMSQPAVSNAVARLVVMFNDVLFVR

NodD (R. leguminosarum)    26            AARSINLSQPAMSAAISRLRDYFRDDLFIM
NodD1 (R. meliloti)        29            AARRINLSQPAMSAAIARLRTYFGDELFSM
NodD (R. trifolii)         26            AARSINLSQPAMSAAIGRLRAYFNDELFLM
NodD (Bradyrhizobia)       25            AARKINLSQPAMSAAIARLRSYFRDELFTM
NodD2 (R. meliloti)        26            AARRVKLSQPAMSAAIARLRTYFGDELFSM

CysB (E. coli)             22            TAEGLYTSQPGISKQVRMLEDELGIQIFSR
CysB (S. typhimurium)      22            TAEGLYTSQPGISKQVRMLEDELGIQIFAR

AmpR (E. cloacae)          26            AAIELNVTHSAISQHVKTLEQHLNCQLFVR

IlvY (E. coli)             21            SARAMHVSPSTLSRQIQRLEEDLGQPLFVR

AntO (E. coli)             26            AAEALYLTPQTITGQIRALEDALQAKLFKR

(b) [Scoring matrix: positions 1-30 (rows) by amino acid (columns); the matrix values are not legible in this copy.]

FIG. 1. Alignment (a) and scoring matrix (b) of LysR family members within a highly conserved region. The names used for putative proteins, such as LeuO, AntO, and TfdO, are adopted for heuristic purposes and do not connote any functional relationship to the leu, ant, or tfd operons.

After this optional weighting, each position is normalized so that the sum

of scores is 100. Stop codons can be penalized with negative values if desired. In addition to weighting of residues, optional grouping of closely related sequences is allowed. Grouping allows the contributions of multiple closely related sequences to be averaged prior to the calculation of a matrix entry. In this way, multiply represented subfamilies contribute only a single value to a scoring matrix that is otherwise composed of distantly related sequences.

The resulting matrix is then used to score each stretch of translated sequence equal to the size of the window by assigning the value found in the matrix for each residue at each position. The sum of values for the whole stretch is the pattern score. Pattern scores can be thought of as equivalent to the number of matches in a simple alignment multiplied by 100, except that fractional matches are allowed. The searching program notes the single best score for each reading frame of a database entry and reports the 200 best pattern scores in a search, along with the 200 aligned sequences.

Extension and inspection of the top scoring matches is carried out as for standard similarity searches. Confirmed new members can be added to the list of aligned sequences and the scoring matrix recalculated. In this way, starting with a crude scoring matrix derived from only two proteins, a more refined matrix can be constructed, allowing for an even more sensitive search. The following three applications illustrate the practical use of these searching tools and reveal several previously unknown relationships among bacterial activator proteins.

Application 2: Expansion of LysR Family by Matrix Searching. It was apparent from visual inspection of the multiply aligned LysR family members7 that the most highly conserved ungapped region corresponded to MetR residues 22 through 51, which includes part of the predicted helix-turn-helix DNA-binding motif.
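The pattern score described above can be sketched as a sliding sum over one translated reading frame. This is a minimal illustration under stated assumptions, not the authors' program: `pattern_score` and `best_pattern_score` are hypothetical names, the matrix is represented as one dictionary per position, and a window that runs past the end of the sequence is scored on the residues that remain, so that a partial match at the end of an entry can still be reported.

```python
def pattern_score(matrix, segment):
    """Sum the matrix value of each residue at its position.
    Residues absent from a position score zero; zip stops at the
    shorter of the matrix and the segment."""
    return sum(column.get(residue, 0.0)
               for column, residue in zip(matrix, segment))


def best_pattern_score(matrix, protein):
    """Return (best score, start index) over all window starts in one
    translated reading frame; start is None if nothing scores above 0."""
    best_score, best_start = 0.0, None
    width = len(matrix)
    for start in range(len(protein)):
        score = pattern_score(matrix, protein[start:start + width])
        if score > best_score:
            best_score, best_start = score, start
    return best_score, best_start


# Toy three-position matrix (hypothetical values, each position sums to 100).
toy_matrix = [{"A": 100.0}, {"A": 100.0}, {"L": 60.0, "I": 40.0}]
```

With `toy_matrix`, the sequence `GGAALKK` gives a best score of 260 at index 2, while `GGAA`, in which the entry ends before the third position, still gives 200 at index 2. A full search would call `best_pattern_score` once per reading frame of each database entry and keep the highest scoring entries.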
Figure 1a shows this aligned stretch of 30 amino acids, where true homologs, representing proteins of similar function in different species, are grouped. A scoring matrix weighted according to codon frequency was generated (Fig. 1b). Consider, for example, the first matrix entry, 81 for alanine at position 1, present in 7 of 9 groups. This entry is calculated by first dividing the fraction of groups that have alanine at position 1 by the fractional occurrence of alanine codons in the genetic code (7/9 ÷ 4/64 = 12.44). Similar calculations are carried out for serine and threonine, each of which occurs in one group in position 1. For serine, the corresponding value is 1.19 (1/9 ÷ 6/64) and for threonine, 1.78 (1/9 ÷ 4/64). Matrix entries are calculated by normalizing these three values so that their sum is 100. For example, the matrix entry for alanine is 12.44 ÷ (12.44 + 1.19 + 1.78) × 100 = 81. It is worth noting that an alanine match is given a relatively low score in the lod scoring matrix of Dayhoff et al.,3 in part because alanine is not

usually a conserved residue. However, in the case of this particular region of 30 amino acids, alanines in the first and second positions are among the most highly conserved residues in the protein. Using a matrix that is customized to take this peculiarity into account helps to increase the sensitivity of the search. Also, we do not allow “conservative replacements,” but rather assign values of zero for all residues that do not occur at a position in any of the known family members: What is a conservative replacement in most situations might not be in the particular position being searched. For example, the highest scoring substitution in the lod scoring matrix is between phenylalanine and tyrosine.3 Nevertheless, their chemical properties are quite different; for example, substitution of a phenylalanine for the active site tyrosine of a tyrosine kinase will inactivate the enzyme. Our procedure focuses on the most highly conserved regions of a protein where such special residues are often found. Furthermore, a scoring procedure that makes fewer assumptions allows easier interpretation of the results.

The searching program scored all possible segments of 30 amino acids derived from GenBank 58 and EMBL 17 translated in all six reading frames using the scoring matrix shown in Fig. 1b. Since there are about 20 million bases in each database, each search involves about 40 million comparisons (20 million bases per frame ÷ 3 bases per amino acid × 6 frames = 40 million amino acid positions that can be aligned with position 1 in the scoring matrix). The output (Fig. 2) shows the top 28 scores along with the aligned stretch of 30 amino acids for each corresponding database sequence. As expected, the sequences used to construct the matrix were among the top scores, ranging from 1319 for Alcaligenes eutrophus TfdO down to 1093 for Rhizobium leguminosarum NodD.
EMBL ID    Sequence entry                  Frame   Window   Alignment
AETFDA     A. eutrophus tfdA operon        -3      1319     AARRLHISQPPVTRQIHALEQHLGVLLFER
M16964     P. putida clc operon            -2      1314*    AARRLHISQPPITRQIQALEQDIGWLFER
STMETR     S. typhimurium metR gene         3      1280     AAAVLHQTQSALSHQFSDLEQRIGFRLFVR
ECLEUP     E. coli leu operon              -1      1250     AAHVIGMSQPAVSNAVARLKVMFNDELFVR
ECGALLYS   E. coli lysR gene                1      1223     AAHLLHTSQPTVSRELARFEKVIGLKLFER
STLEUP     S. typhimurium leu operon       -1      1221     AAHTLGMSQPAVSNAVARLVVMFNDVLFVR
ECAMPR     E. cloacae ampR gene             1      1218     AAIELNVTHSAISQHVKTLEQHLNCQLFVR
ECCYSB     E. coli cysB gene                1      1203     TAEGLYTSQPGISKQVRMLEDELGIQIFSR
STCYSB     S. typhimurium cysB gene         3      1202     TAEGLYTSQPGISKQVRMLEDELGIQIFAR
EAALDS     E. aerogenes aldC operon        -1      1188*    AAKALGISQPPLSQQIKRLEEEVGTPLFRR
ECILVYC    E. coli ilvY gene               -2      1186     SARAMHVSPSTLSRQIQRLEEDLGQPLFVR
ECRPSTB    E. coli rpsT-ant region         -2      1179     AAEALYLTPQTITGQIRALEDALQAKLFKR
M21093     P. aeruginosa trpI gene          2      1172*    AAEELHVTHGAVSRQVRLLEEDLGVALFGR
P.MNDD     R. meliloti nodD gene           -2      1154     AARRINISQPAMSAAIARLRTYFGDELFSM
M18972     R. japonicum nodD1 gene          2      1128     AARSINISQPAMSAAIARLRTYFGDDLFTM
RMNODD2    R. meliloti nodD2 gene           2      1126     AARRVKLSQPAHSAAIARLRTYFGDELFSM
RTNODG     R. trifolii nodD gene            2      1111     AARSINLSQPAMSAAIGRLRAYFNDELFLM
BSNOD1     Bradyrhizobium nodD gene        -1      1100     AARKINLSQPAMSAAIARLRSYFRDELFTM
RLNOD      R. leguminosarum nodD gene       1      1093     AARSINLSQPAMSAAISRLRDYFRDDLFIM
M18971     R. japonicum nodD2 gene          2      1077     AARSINLSQPAMSAAITRLRTYFRDELFTM
ECOTDC     E. coli tdc operon               1      1072*    AAKELGLTQPAVSKIINDIEDYFGVELWR
AETFDCD    A. eutrophus tfdCD operon       -1      1038*    AAQRNHISQPPLTRQIQALERDIGAKL
RSNODD1    Rhizobium sp. nodD1 gene         1      984      ASRRINLSQPANSAAITRLRTYFRDELFTM
PFASPA     P. fluorescens aspA gene        -1      944*     AAERRFVTQPAFSRRIRSLEAAIGLTLVN
M19460     P. putida catBC operon          -1      856*     AAELLHIAQPPLSRQISQLE
STILVYCR   S. typhimurium ilvC upstream    -3      820      SARANHVSPSTLSRQIQRLEED
CEMYUNC    C. elegans major MHC gene        2      750      TAFFLLKSQVLMSTMQERSEEGIGLRLSGR
-          Human 28S rRNA gene             -1      731      AAPGSGSSVRHMSRAPRGGDSALGSSLFTR

FIG. 2. Results of a pattern search using the scoring matrix of Fig. 1b. Asterisks indicate new LysR family members.

The highest score for what is clearly a chance occurrence is 750 for an open reading frame derived from the opposite strand of the Caenorhabditis elegans myosin heavy chain gene. Of the 11 new sequences scoring above 750, 3 are NodD sequences from other sources, and 1 is an incomplete sequence encoding S. typhimurium IlvY. The latter has a relatively low score (820) because the coding sequence for only 22 amino acids was present in the database entry. This illustrates how the searching algorithm will align translated sequence with the scoring matrix out to the end of a database entry, allowing for detection of sequences that do not include the entire match. The remaining 7 matches represent new members of the LysR family. One match is to a predicted sequence of 77 amino acids upstream of and oppositely oriented to the Pseudomonas putida clcABD operon: Residues 21-50 score 1314. This corresponds to the amino terminus of the ClcR activator protein (A. Chakrabarty, personal communication). The second match is to an ORF of 144 amino acids upstream of and oppositely

oriented to the Enterobacter aerogenes aldC operon: Residues 21-50 score 1188. No information is available concerning its function. The third match is to the Pseudomonas aeruginosa TrpI activator protein, which is known to regulate the trpBA operon.12 Residues 26-45 score 1172. The fourth match is to a protein of unknown function in the E. coli tdc operon. Residues 27-46 score 1072. Extension of the alignment beyond the searching window further indicates that this is a LysR family member. The fifth match is to an incomplete ORF of 46 amino acids upstream of and oppositely oriented to the A. eutrophus tfdCDEF operon: Residues 21-46 (26 amino acids) score 1038. This ORF shows 65% identity to the first 46 amino acids of the ORF upstream of and oppositely oriented to the A. eutrophus tfdA operon, a previously detected member of the LysR family.7 The relationships between the putative ORFs upstream of TrpI, clcABD, tfdA, and tfdCDEF and various known members of the LysR family have been noted by others.12,13 The sixth match is to an incomplete ORF of 51 amino acids upstream of and oppositely oriented to the Pseudomonas fluorescens aspA operon: Residues 23-51 (29 amino acids) score 944. No information is available concerning its function. The seventh match is to an ORF of 39 amino acids upstream of and oppositely oriented to the P. putida catBC operon: Residues 20-39 (20 amino acids) score 856. This corresponds to the amino terminus of the CatR activator protein (A. Chakrabarty, personal communication).

12 M. Chang, A. Hadero, and I. P. Crawford, J. Bacteriol. 171, 172 (1989).
13 E. J. Perkins, G. W. Bolton, M. P. Gordon, and P. F. Lurquin, Nucleic Acids Res. 16, 7200 (1988).

One way to test the assertion of common ancestry for the new sequences is to use each ORF as query in a standard similarity search. When this was done, each of the 7 new members detected several other members of the LysR family as best matches using a window of 30 (data not shown). Furthermore, each new sequence shows characteristics found in several known LysR family members: (1) the location of this region of similarity about 20 amino acids from the likely amino terminus, (2) a high score using the helix-turn-helix DNA-binding predictive scheme10 in the region that aligns with the other proteins (data not shown), and (3) divergent transcription from a promoter region for a known gram-negative bacterial operon in 6 of 7 cases. These features further support the assertion of common ancestry. Since only 20 predicted amino acids were necessary in one case to detect a distant relationship, it is clear that the matrix searching strategy is very sensitive. In addition, the absence of high scoring chance occurrences demonstrates that the strategy also is extremely selective.

Application 3: Successive Searching Allows Expansion of AraC Family. The previous example demonstrated the ability to detect short sequence similarities by a single round of matrix searching. The following application demonstrates the use of successive standard similarity and matrix searches for detecting, confirming, and expanding a family. The procedure is summarized in Fig. 3 (S. Henikoff, unpublished results).

Step 1. The E. coli AraC protein regulates the araBAD operon encoding structural components for arabinose metabolism. A standard similarity search of GenBank 58 was carried out using E. coli AraC as query.
The resulting matches included known AraC proteins from three other bacteria and a striking similarity (31% identity) to an incomplete ORF (110 amino acids) upstream of the Streptomyces lividans xp55 gene (Xp55O). This previously unrecognized ORF (T. Eckhardt, personal communication) would correspond to the carboxy-terminal portion of a presumptive member of the AraC family.

Step 2. Inspection of the alignment revealed a segment of 43 amino acids near the carboxy terminus that is most similar between the AraCs and Xp55O (data not shown). A scoring matrix was constructed and used to search GenBank 58 and EMBL 17. The six highest scoring matches were E. coli RhaR (1679), MelR (1573), transposon Tn10 TetD protein (1375), M5 protein (1364), RhaS (1354), and PhoO, an incomplete and apparently unrecognized ORF upstream of and divergent from the E. coli phoM operon (1354). The best spurious match was 1216.

[Figure 3: flow diagram of the five steps.]
1. Standard similarity search using E. coli AraC: best match is to S. lividans Xp55O.
2. Pattern search using a matrix derived from the most similar ungapped region: best matches are to E. coli RhaS, E. coli MelR, E. coli M5, Tn10 TetD, E. coli RhaR, and E. coli PhoO.
3. Pattern search using the expanded matrix: best match is to P. putida XylS.
4. Standard similarity searches using individual proteins: best match using XylS is to E. coli Ada; best match using Ada is to E. coli Ogt.
5. Independent test: helix-turn-helix predictions: precise alignment with the AraC motif.

FIG. 3. Expansion of the AraC family. Solid boxes indicate alignment with the 43-amino acid window used in the matrix searches. Predicted helix-turn-helix regions are shaded.


The overlapping genes encoding the rhamnose regulatory proteins RhaS and RhaR regulate and are oppositely oriented to the rhaBAD operon encoding structural components for rhamnose metabolism.¹⁴ The similarity of RhaR and RhaS to one another and to AraC has been noted previously.¹⁴ MelR resembles AraC in that it is a sugar-sensitive regulatory protein encoded divergently from a gene that it regulates.¹⁵ This relationship appears not to have been noted previously. The other three predicted proteins also were not previously reported to be members of a family and are of unknown function.¹⁶⁻¹⁸

Step 3. A new scoring matrix including the six new sequences was constructed and used to search GenBank 58 and EMBL 17. The highest scoring match (958 versus 907 or less for likely spurious matches) was P. putida XylS, a known bacterial activator protein.¹⁹

Step 4. Standard similarity searches were carried out for each of the new sequences. Using XylS as query, the best match was to the E. coli Ada protein, a known regulatory protein that includes a carboxy-terminal O6-methylguanine methyltransferase domain.²⁰,²¹ Using Ada as query, an excellent match was to an unrecognized reading frame upstream of the E. coli nirR gene. In this case, the similarity was confined to the methyltransferase portion of Ada. This sequence coincides with that for the E. coli ogt gene²² except for seven differences, including two frameshifts which are probably errors in the database entry.

Step 5. An independent test of the hypothesized relationships among these proteins was carried out. AraC is a helix-turn-helix DNA-binding protein.¹⁰ The predicted 20-amino acid DNA-binding motif begins 38 amino acids upstream of the segment used to construct the scoring matrix. The helix-turn-helix predictive scheme¹⁰ was carried out for each of the proteins.
For RhaS, M5, TetD, PhoO, XylS, and Ada, the highest scoring stretch of 20 amino acids aligns precisely with the predicted helix-turn-helix region of AraC (data not shown). Since the segment detected in the matrix searches does not overlap the predicted helix-turn-helix region, this is an independent test of similarity. Therefore, these diverse proteins are likely to resemble AraC in structure and function. Table II summarizes known features of the nine AraC family members.

¹⁴ J. F. Tobin and R. F. Schleif, J. Mol. Biol. 196, 789 (1987).
¹⁵ C. Webster, K. Kempsell, I. Booth, and S. Busby, Gene 59, 253 (1987).
¹⁶ G. Braus, M. Argast, and C. F. Beck, J. Bacteriol. 160, 504 (1984).
¹⁷ E. H. Kemp, N. P. Minton, and N. H. Mann, Nucleic Acids Res. 15, 3924 (1987).
¹⁸ M. Amemura, K. Makino, H. Shinagawa, and A. Nakata, J. Bacteriol. 168, 294 (1986).
¹⁹ S. Inouye, A. Nakazawa, and T. Nakazawa, Gene 44, 235 (1986).
²⁰ Y. Nakabeppu, H. Kondo, S. Kawabata, S. Iwanaga, and M. Sekiguchi, J. Biol. Chem. 260, 7281 (1985).
²¹ B. Demple, B. Sedgwick, P. Robins, N. Totty, M. D. Waterfield, and T. Lindahl, Proc. Natl. Acad. Sci. U.S.A. 82, 2688 (1985).
²² P. M. Potter, M. C. Wilkinson, J. Fitton, F. J. Carr, J. Brennand, D. P. Cooper, and G. P. Margison, Nucleic Acids Res. 15, 9177 (1987).
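The successive searching of Application 3, in which a matrix is rebuilt from newly accepted members and the database is searched again, can be sketched as a loop. This is an illustration only: the frequency-based scoring, the fixed acceptance `threshold`, and all names below are assumptions, whereas in the text the cut-off is judged against the best spurious match in each search.

```python
from collections import Counter


def column_frequencies(segments):
    """Per-position residue frequencies for equal-length ungapped segments."""
    n = len(segments)
    return [
        {aa: c / n for aa, c in Counter(s[i] for s in segments).items()}
        for i in range(len(segments[0]))
    ]


def best_hit(freqs, protein):
    """Best (score, 0-based start) of the matrix slid along one protein."""
    width = len(freqs)
    hits = [
        (sum(col.get(aa, 0.0) for col, aa in zip(freqs, protein[i:i + width])), i)
        for i in range(len(protein) - width + 1)
    ]
    return max(hits) if hits else (0.0, -1)


def expand_family(seed_segments, database, threshold, max_rounds=10):
    """Repeat: build matrix, search, add every hit scoring >= threshold."""
    segments = list(seed_segments)
    members = []
    for _ in range(max_rounds):
        freqs = column_frequencies(segments)
        width = len(freqs)
        new = []
        for name, protein in database.items():
            if name in members:
                continue
            score, start = best_hit(freqs, protein)
            if score >= threshold:
                new.append((name, protein[start:start + width]))
        if not new:
            break  # no further members detected
        for name, segment in new:
            members.append(name)
            segments.append(segment)
    return members


# Hypothetical toy database: p2 is only detected after p1 joins the matrix.
toy_db = {"p1": "GGACDEYGG", "p2": "GGACWKYGG", "p3": "GGGGGGGGG"}
one_round = expand_family(["ACDEF"], toy_db, threshold=2.5, max_rounds=1)
all_rounds = expand_family(["ACDEF"], toy_db, threshold=2.5)
```

The toy case mirrors the pattern of Steps 2 and 3, where XylS scored above the spurious matches only once the matrix had been expanded with the six new sequences.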

Application 4: Multiple Searches to Detect Complex Relationships among Families. The following application further demonstrates the ability to detect short sequence similarities, where failure to extend an alignment is due to the limited nature of the homologous region or to frameshifts in database entries. In addition, this application shows how multiple searches can be used to detect complex relationships.

The LuxR protein controls genes for bioluminescence in Vibrio fischeri.²³,²⁴ Using the predicted amino acid sequence of LuxR as query in a standard similarity search of GenBank 58 and EMBL 17 with a window of 30, we detected a striking similarity to a 28-kDa protein of suspected regulatory function upstream of the E. coli sequence encoding UvrC.²⁵ Alignment of the two amino acid sequences essentially end to end (Fig. 4a) shows that they are identical at 24% of aligned residues with the insertion of four small gaps. Visual inspection of this alignment indicates that the region of greatest ungapped similarity begins at residue 184 in LuxR and 182 in UvrC-28K and extends for 43 amino acids. A scoring matrix derived from this aligned stretch was used to search GenBank 58 and EMBL 17. After LuxR, five bacterial sequences scored the highest (Fig. 5a). The highest scoring sequence of these five encodes GerE, a Bacillus subtilis protein involved in regulation of spore formation. The 74-amino acid GerE protein aligns with the carboxy-terminal one-third of LuxR and UvrC-28K. The next highest scoring sequence encodes E. coli UhpA, a known regulatory protein. Only the carboxy-terminal one-third of UhpA aligns. The next highest scoring sequence is derived from a region of unknown function downstream of the Pseudomonas aeruginosa trpAB gene (TrpO). Examination of the sequence just upstream in a different reading frame reveals a striking similarity to E. coli UhpA. Deletion of 19 bases leads to the alignment of TrpO with UhpA (Fig. 4b). The two sequences show 26% identity of aligned residues with only four gaps. Subsequent reexamination of the nucleotide sequence data obtained for this region suggests an intact 205-amino acid ORF (I. Crawford, personal communication). The next highest scoring sequence is the R. meliloti FixJ protein, an activator protein that was previously shown to be related to UhpA (30%

²³ J. Engebrecht and M. Silverman, Nucleic Acids Res. 15, 10455 (1987).
²⁴ J. H. Devine, C. Countryman, and T. O. Baldwin, Biochemistry 27, 837 (1988).
²⁵ S. Sharma, T. F. Stark, W. G. Beattie, and R. E. Moses, Nucleic Acids Res. 14, 2301 (1986).

TABLE II
FEATURES OF PREDICTED BACTERIAL PROTEINS RELATED TO AraC

Predicted protein               Aligned            Divergent       Regulatory   Function
                                helix-turn-helix   transcription   protein
AraC (various species)          Yes                Yes             Yes          Regulates arabinose metabolism
Xp55O (Streptomyces lividans)   Yes                ?               ?            Unknown (unrecognized ORF)
RhaS (E. coli)                  Yes                Yes             Yes          Regulates rhamnose metabolism
RhaR (E. coli)                  No                 Yes             Yes          Regulates rhamnose metabolism
MelR (E. coli)                  No                 Yes             Yes          Regulates melibiose metabolism
M5 (E. coli)                    Yes                ?               ?            Possibly regulates polysaccharide biosynthesis
TetD (Tn10)                     Yes                Yes             ?            Unknown
PhoO (E. coli)                  Yes                Yes             ?            Unknown (unrecognized ORF)
XylS (P. putida)                Yes                Yes             Yes          Activates xylene-degradative functions
Ada (E. coli)                   Yes                ?               Yes          Activates response to mutagenic alkylation

[Figure 4: residue-level sequence alignments. (a) LuxR versus 28K; (b) TrpO versus UhpA.]

FIG. 4. (a) Alignment between V. fischeri LuxR and E. coli UvrC-28K (28K). Colons denote identity, and single dots denote conservative replacements. Pluses indicate the location of the 43-amino acid scoring window. (b) Alignment between translated sequence downstream of Pseudomonas aeruginosa trpAB (TrpO) and E. coli UhpA. The indicated insertion is necessary to bring the former sequence into frame.
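Identity figures such as "24% of aligned residues with the insertion of four small gaps" can be computed from a gapped pairwise alignment as sketched below. The sketch assumes that gapped columns are excluded from the denominator, which is one reading of "aligned residues"; the function names and the toy alignment are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues among columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gapped columns are not counted as aligned residues
        aligned += 1
        identical += (a == b)
    if aligned == 0:
        raise ValueError("no aligned residues")
    return 100.0 * identical / aligned


def gap_runs(aligned, gap="-"):
    """Number of separate gaps (runs of gap characters) in one aligned row."""
    runs, in_gap = 0, False
    for ch in aligned:
        if ch == gap and not in_gap:
            runs += 1
        in_gap = (ch == gap)
    return runs


# Toy alignment: 6 aligned columns, 4 identical, one gap in the first row.
toy_identity = percent_identity("MKV-LIT", "MRVALIS")
toy_gaps = gap_runs("MKV-LIT") + gap_runs("MRVALIS")
```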

(a) Scoring matrix derived from alignment of LuxR and UvrC-28K

EMBL entry ID  Sequence                     Frame  Score  Window
ECUVRC         E. coli uvrC operon            2    3404   SKREKEILRWTAEGKTSAEIAMILSISENTVNFHQKNMQKKIN
M19039         V. fischeri luxR operon       -3    3196   TKREKECLAWACEGKSSWDISKILGCSKRTVTFHLTNAQMKLN
M17642         B. subtilis gerE gene          2    1723   TKREREVFELLVQDKTTKEIASELFISEKTVRNHISNAMQKLG
M17102         E. coli uhpA gene              1    1496   TKRERQVAEKLAQGMAVKEIAAELGLSPKTVHVHRANLMEKLG
PATRPAB        P. aeruginosa trpAB operon     3    1426   SDREMTVLQYLANGNTNKAIAQQLFLSEKTVSTYKSRIMLKLN
J03174         R. meliloti fixLJ operon       2    1365   SERERQVLSAVVAGLPNKSIAYDLDISPRTVEVHRANVMAKMK
ECUVRC         E. coli uvrC operon            1    1350   SERELQIMLMITKGQKVNEISEQLNLSPKTVNSYRYRMFSKLN
DMTHB1         D. melanogaster HB1            2    1243   QGKRMLILKLRKEGKTYKDIQKTLKCSAKMVSNAIKYKWKPEN
KARCSA         K. aerogenes rcsA gene         2    1220   SKTESNMLQMWMAGHGTSQISTQMNIKAKTVSSHKGNIKKKIQ
MMHOMMH3       Mouse MH-3 homeo box gene      1    1202   LELEKEFHFNRYLTRRRIEIAHTLCLSERQVKIWFQNRRMKWK
ECMALT         E. coli malT gene              3    1193   TQREWQVLGLIYSGYSNEQIAGELEVAATTIKTHIRNLYQKLG

(b) Scoring matrix derived from alignment of LuxR, UvrC-28K, GerE, UhpA, TrpO, FixJ, and UvrC-23K

EMBL entry ID  Sequence                     Frame  Score  Window
ECUVRC         E. coli uvrC operon            2    2657   NFSKREKEILRWTAEGKTSAEIAMILSISENTVNFHQKNMQKKINA
M17642         B. subtilis gerE gene          2    2459   SLTKREREVFELLVQDKTTKEIASELFISEKTVRNHISNAMQKLGV
M19039         V. fischeri luxR operon       -3    2397   DLTKREKECLAWACEGKSSWDISKILGCSKRTVTFHLTNAQMKLNT
M17102         E. coli uhpA gene              1    2007   PLTKRERQVAEKLAQGMAVKEIAAELGLSPKTVHVHRANLMEKLGV
PATRPAB        P. aeruginosa trpAB operon     3    1824   SLSDREMTVLQYLANGNTNKAIAQQLFLSEKTVSTYKSRIMLKLNA
J03174         R. meliloti fixLJ operon       2    1712   TLSERERQVLSAVVAGLPNKSIAYDLDISPRTVEVHRANVMAKMKA
ECUVRC         E. coli uvrC operon            1    1511   SLSERELQIMLMITKGQKVNEISEQLNLSPKTVNSYRYRMFSKLNI
ECMALT         E. coli malT gene              3    1351   PLTQREWQVLGLIYSGYSNEQIAGELEVAATTIKTHIRNLYQKLGV
KARCSA         K. aerogenes rcsA gene         2    1234   SLSKTESNMLQMWMAGHGTSQISTQMNIKAKTVSSHKGNIKKKIQT
MMHOMMH3       Mouse MH-3 homeo box gene      1    1160   QVLELEKEFHFNRYLTRRRIEIAHTLCLSERQVKIWFQNRRMKWKK
DMTHB1         D. melanogaster HB1            2    1113   YSQGKRMLILKLRKEGKTYKDIQKTLKCSAKMVSNAIKYKWKPENR

FIG. 5. Results of successive searches for similarities to LuxR.
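The Frame column of Fig. 5 reflects translation of each database entry in all six reading frames before scoring. A minimal sketch of such a translation follows, using the standard genetic code. The labelling of frames as 1, 2, 3 for the given strand and -1, -2, -3 for its reverse complement is an assumption; the exact numbering convention behind the figure is not stated in this section.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only; unrecognized codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return {frame: protein} for frames 1, 2, 3 and -1, -2, -3."""
    dna = dna.upper().replace("U", "T")
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


toy_frames = six_frames("ATGAAACGTTAA")
```

Each of the six translations can then be scanned with the scoring matrix, and the best-scoring window is reported together with the frame in which it was found.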


identity), aligning essentially from end to end.²⁶ The next highest scoring sequence is a 23K protein of unknown function just downstream from UvrC-28K and upstream of UvrC.²⁵ Alignment of this sequence with UhpA, FixJ, and TrpO shows, respectively, 25, 22, and 25% identical aligned residues (data not shown). Therefore, these four predicted proteins are members of a single family. However, they do not align from end to end with the LuxR-UvrC-28K pair. Rather, the carboxy-terminal portions of all six proteins are similar and align with the short GerE protein, but the amino termini differ between the two groups.

Two known bacterial activator proteins, Klebsiella aerogenes RcsA and E. coli MalT, score among the best spurious matches in this search. When the scoring matrix is expanded to include GerE, UhpA, TrpO, FixJ, and UvrC-23K and the window extended by 3 amino acids, RcsA scores 74 points above the best spurious matches (Fig. 5b). Likewise, MalT improves to 191 points above the best spurious matches. The similarity is confined to the carboxy terminus of MalT, a 901-amino acid protein. Alignment of MalT with GerE is particularly striking, with 33% identical residues (data not shown). In addition, standard similarity searches using both MalT and GerE detect another likely homolog, a partial reading frame downstream of and oppositely oriented to the B. subtilis gerA operon (GerO). This 58-amino acid incomplete sequence shares 26 identical residues with GerE and 20 with MalT (data not shown). Therefore, a total of ten diverse proteins, including LuxR and five other known activator proteins, are each identified as having a similar carboxy-terminal domain (Fig. 6).

Previous work has shown that the amino termini of UhpA and FixJ are related to a family of activator proteins that includes E. coli OmpR, SfrA, and PhoB, Agrobacterium tumefaciens VirG, S. typhimurium CheY and CheB, B. subtilis Spo0F and Spo0A, R. leguminosarum DctD, and Klebsiella pneumoniae NtrC.²⁶ These similarities are seen for the amino-terminal portions of UhpA and FixJ. Therefore, the proteins that align from end to end with UhpA appear to be composites different from LuxR: the amino-terminal portion of each is related to the family of activator proteins that includes Spo0F, while the carboxy-terminal portion is related to the family that includes GerE (Fig. 6).

Visual examination of the multiply aligned regulatory proteins related to OmpR and Spo0F²⁶ reveals that the most strongly conserved ungapped region aligns with residues 77-120 of OmpR. A scoring matrix of this aligned stretch was used to search GenBank 58 and EMBL 17. The results are shown in Fig. 7. Scores for members of the family used to construct the

²⁶ M. David, M.-L. Daveran, J. Batut, A. Dedieu, O. Domergue, J. Ghai, C. Hertig, P. Boistard, and D. Kahn, Cell 54, 671 (1988).
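The margins quoted for RcsA and MalT can be checked against the scores of Fig. 5b with a few lines of arithmetic. In this sketch the two eukaryotic homeo box entries are treated as the spurious matches; that labelling is inferred from the quoted margins rather than stated explicitly, and the function name is hypothetical.

```python
# Scores from Fig. 5b (expanded matrix, window extended by 3 amino acids).
scores_5b = {
    "UvrC-28K": 2657, "GerE": 2459, "LuxR": 2397, "UhpA": 2007,
    "TrpO": 1824, "FixJ": 1712, "UvrC-23K": 1511, "MalT": 1351,
    "RcsA": 1234, "Mouse MH-3 homeo box": 1160, "D. melanogaster HB1": 1113,
}
# Assumption: these two entries are the best spurious matches in this search.
spurious = {"Mouse MH-3 homeo box", "D. melanogaster HB1"}


def margin_over_spurious(scores, spurious_names):
    """Score of each non-spurious entry minus the best spurious score."""
    best_spurious = max(scores[name] for name in spurious_names)
    return {
        name: score - best_spurious
        for name, score in scores.items()
        if name not in spurious_names
    }


margins = margin_over_spurious(scores_5b, spurious)
```

With these values `margins["RcsA"]` is 74 and `margins["MalT"]` is 191, in agreement with the text.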


FIG. 6. Schematic diagram indicating the complex relationship among several bacterial regulatory proteins and homologs of unknown function. Different regions of similarity among aligned segments are indicated by different shadings. Numbers indicate protein size in amino acids.

EMBL entry ID  Sequence                          Frame  Score  Window
ECPHOB         E. coli phoB gene                   1    1903   VVMLTARGEEEDRVRGLETGADDYITKPFSPKELVARIKAVMRR
ECOMPB         E. coli ompR operon                 3    1891   IIMVTAKGEEVDRIVGLEIGADDYIPKPFNPRELLARIRAVLRR
ECDYE          E. coli dye gene (SfrA)             2    1664   LMFLTGRDNEVDKIIGLEIGADDYITKPFNPRELTIRARNLLSR
M16775         B. subtilis phoP gene               1    1662   ILMLTAKDEEFDKVIGLELGADDYMTKPFSPREVNARVKAILRR
ATVIR          A. tumefaciens virG gene            3    1607   IIISGDRLEETDKVVALELGASDFIAKPFSIREFLARIRVALRV
ECGLN          E. coli glnALG operon (NtrC)        2    1578   VIIMTAHSDLDAAVSAYQQGAFDYLPKPFDIDEAVALVERAISH
KPNTRC         K. pneumoniae ntrC gene             1    1578   VIIMTAHSDLDAAVSAYQQGAFDYLPKPFDIDEAVALVDRAISH
ATVIRBCG       A. tumefaciens virG gene            1    1565   IIISGARLEEADKVIALEIGATDFIAKPFGTREFLARIRVALRV
J03174         R. meliloti fixLJ gene              2    1542   SIVITGHGDVPMAVEAMKAGAVDFIEKPFEDTVIIEAIERASEH
ECPHOM         E. coli phoM operon                 1    1536   VLFLTARSEEVDRLLGLEIGADDYVAKPFSPREVCARVRTLLRR
BSSPO0AA       B. subtilis spo0A gene              3    1518   VINLTAFGQEDVTKKAVDIGASYFILKPFDMENLVGHIRQVSGN
ECCHE3         E. coli cheRBYZ operon              1    1373   VLMVTAEAKKENIIAAAQAGASGYVVKPFTPATLEEKINKIFEK
BSSPO0F        B. subtilis spo0F gene              1    1355   VIIHTAYGELDMIQESKELGALTHFAKPFDIDEIRDAVKKYLPL
ECCHEY         E. coli cheY gene                   2    1351   VIMVTAEAKKENIIAAAQAGASGYVVKPFTAATLEEKLNKIFEK
M15810         R. meliloti ntrC gene               3    1239   VLVMSAQNTFMTAIKASEKGAYDYLPKPFDLTELIGIIGRALAE
M17102         E. coli uhpA gene                   1    1213   TIMISVHDSPALVEQALNAGARGFLSKRCSPDELIAAVHTVATG
STPGTA         S. typhimurium pgtA gene            3    1159*  ILLITGHGDVPMAVDAVKKGAWDFLQKPSIRAKLLILIEDALRQ
M14227         B. parasponiae ntrBC homologue      2    1133   VIVMSAQNTFMTAIRPSERGAYEYLPKPFDLKELITIVGRALAE
RCNIFR12       R. capsulata nifR genes             2    1119   VIVISAQNTINTAIQAAEADAYDYLPKPFDLPDLMKRAARALEL
ECUVRC         E. coli uvrC operon                 1    1108*  IIMLTVHTENPLPAKVMQAGAAGYLSKGAAPQEVVSAIRSVYSG
RCFBC          R. capsulata fbc (petABC) operon    2    1036*  DPAFDRPRRDPRAIEGLEAGADDYLPKPFEPKNFCCASMRSCGA
ECCHE3         E. coli cheRBYZ operon              2    1028   HVSSLTGKGSEVTLRPLELGAIDFVTKPQIGIREGMLAYNEMIA
SV5PVA         Simian virus 5 V and P mRNA         2     903   VRKVPPQKDLTGLKITLEQLAKDCISKPKMREEYLLKINQASSE
BSSPOlA        B. subtilis spoIIA gene             3     889   SIHETVYENDGDPITLLDQIADNSEEKWFDKIALKEAISDLEER
RCPETG         R. capsulata petABC operon          3     886   DPASDRARRNPASDRGARGRGDDYLPKPFEPNELLLRINAILRR
RCPETG         R. capsulata petABC operon          1     829   ILLLTARGETPRAIEGLEAGAMTTCPNRSSRTNFCCASTRSCGG

FIG. 7. Results of a search using a scoring matrix derived from 11 proteins related to OmpR.²⁶ Asterisks indicate new members of the family.


matrix ranged from 1903 for PhoB down to 1028 for CheB. In addition, other proteins that were previously shown to be members of the family scored high: PhoP, a B. subtilis homolog of PhoB, a 28K protein encoded at the E. coli phoM operon, and both the B. parasponiae and Rhodopseudomonas capsulata homologs of NtrC. The UvrC-23K protein discussed above also scored high. The S. typhimurium PgtA protein, which regulates phosphoglycerate transport, scored 1223. PgtA aligns with NtrC in both its amino- and carboxy-terminal domains.

The next highest score is for a database sequence upstream of the R. capsulata petABC operon. In addition, two other matches to this same operon, sequenced in a different laboratory, are among the highest scoring spurious matches. These latter two matches are shifted in frame with respect to one another by one base. All three matches are derived from the same region of the petABC operon. Since the two database sequences represent independent determinations of nearly identical DNA sequences, it is possible to deduce an approximation of the correct sequence, assuming that they encode a homolog of this protein family. Figure 8 shows that, when these adjustments are made, a coding sequence (PetO) can be identified that is clearly homologous to OmpR, extending from the beginning of each database entry to the carboxy terminus. Several different sets of frameshifts are necessary in each of the two sequences. That these are

[Figure 8: residue-level alignment of OmpR with the translated RCFBC and RCPETC entries, with the positions of +1 bp and -1 bp frameshift adjustments marked.]

FIG. 8. Alignment between E. coli OmpR and the two translated database entries (RCFBC and RCPETC) of the region upstream of the Rhodopseudomonas capsulata petABC operon. The several frameshifts indicated are necessary to obtain a consistent coding sequence aligning with OmpR. Pluses indicate the region detected in the pattern search.
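The effect of a frameshift on a translated search can be illustrated with a toy sequence: after a single inserted base, the amino-terminal part of the protein is still found in one reading frame, while the remainder appears in another. The sketch below is only an illustration of this effect, not the procedure used to derive Fig. 8; the sequences and function names are hypothetical, and only the three forward frames are considered.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons only; unrecognized codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def frames_containing(dna, peptide):
    """Forward frames (1, 2, 3) whose translation contains the peptide."""
    return [
        offset + 1
        for offset in range(3)
        if peptide in translate(dna[offset:])
    ]


correct = "ATGAAACGTGGTCTGGAAATT"           # encodes M K R G L E I
shifted = correct[:9] + "C" + correct[9:]   # one spurious base after codon 3

before_shift = frames_containing(shifted, "MKR")    # upstream of the error
after_shift = frames_containing(shifted, "GLEI")    # downstream of the error
```

Here the peptide upstream of the inserted base is found in frame 1 and the peptide downstream of it in frame 2, which is why a short scoring window can still detect a homolog within an entry whose full-length translation is disrupted.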


sequencing errors is the only plausible interpretation in this case, since none of the frameshifts coincide between the two sets of data. In contrast, within extensive regions known to code for the petABC proteins, the two sets of data differ only with respect to nucleotide substitution. It should be noted that in both petABC entries, the sequence from only one DNA strand was obtained for the region in question; in fact, the high error rate of these data has already been acknowledged by the investigators.²⁷ In both cases, two frameshifts occurred within the searching window (Fig. 8). Clearly, even very uncertain sequence data can be of great value when present in nucleotide sequence databases. Perhaps investigators should be encouraged to deposit uncertain sequence (duly noted) into the databases. In summary, Fig. 6 shows the complex relationships among these and other bacterial regulatory proteins determined in this and in previous studies.¹,²⁶,²⁸,²⁹

Summary

In this chapter we describe strategies for the searching of translated nucleotide sequence databases. By applying standard searching techniques developed for protein databases,⁶ we have found that previously unrecognized homologies can be detected. In addition, we have shown that extremely high sensitivity can be obtained using the scoring matrix strategy¹¹ for short regions of similarity. The latter approach is particularly effective for detecting homologs found at the ends of sequences and within data of poor quality. These individual methods are demonstrated for the LysR family of bacterial activator proteins. Successive applications of these methods allow for sensitive detection of complex relationships, as demonstrated for the AraC family and for the complex LuxR-OmpR-NtrC families of bacterial activator proteins.
Although our examples are drawn from bacterial sequences, these methods are likewise effective for higher eukaryotic genomic sequences, where protein-coding sequences are usually interrupted by introns. This should be particularly important in the future, since much of the expected increase in nucleotide sequence databases is likely to come from eukaryotic genomic sequencing projects.

²⁷ E. Davidson and F. Daldal, J. Mol. Biol. 195, 13 (1987).
²⁸ S. C. Winans, P. R. Ebert, S. E. Stachel, M. P. Gordon, and E. W. Nester, Proc. Natl. Acad. Sci. U.S.A. 83, 8278 (1986).
²⁹ B. T. Nixon, C. W. Ronson, and F. M. Ausubel, Proc. Natl. Acad. Sci. U.S.A. 83, 7850 (1986).
