© 1992 Oxford University Press

Nucleic Acids Research, Vol. 20, No. 13 3471 -3477

Escherichia coli promoters: neural networks develop distinct descriptions in learning to search for promoters of different spacing classes

Michael C. O'Neill

Department of Biological Sciences, University of Maryland-Baltimore County, Baltimore, MD 21228, USA

Received February 10, 1992; Revised and Accepted June 1, 1992

ABSTRACT

Back-propagation neural networks were trained to recognize promoter sequences of each of the three major spacing classes found in E. coli. These networks were trained with the object of maximizing their ability to generalize while maintaining the level of false positive identifications at a fraction of 1 percent. These objectives were generally met. Networks for the 16 base spacing class captured between 78 and 100% of previously unseen promoters in different tests; networks for the 17 base class identified 97% of the test promoters; networks for the 18 base class identified 79% of the test promoters. A tandem poll of networks for all three spacing classes produced a cumulative false positive level of less than 0.5%. In each case, the weight matrices used by the networks in their classification were analyzed to determine the relative weight assigned to the occurrence of a given base at a given position within the promoter. In this fashion, an approximate description of the network's definition of the promoter can be obtained.

INTRODUCTION

It has proven surprisingly difficult to arrive at an accurate definition of what constitutes a procaryotic promoter sequence. With more than two hundred examples in hand, there is no process or program which will, with certitude, pinpoint the location of a new promoter given only sequence data. The problem has been complicated by the existence of three major spacing classes which are divergent in sequence (1), by the frequent compromise of promoter sequence in accommodating overlapping regulatory elements, and by the extreme range of efficiency evidenced in the known promoter set (2). Straightforward consensus search methods have, to date, proven inadequate to this challenge, as have many statistically based descriptions. Recently, I have reported that back-propagation neural networks (3,4) could be trained to provide an improved definition and search function for the 17 base class promoters (5). In the work presented here that description is still further refined in the case of the 17 base spacing class promoters and is extended, as well, to include the promoters of the 16 and 18 base classes. The weight matrices of the trained networks were analyzed to determine the relative static weighting of each base at each position of the promoter sequence.

METHODS

Networks

The neural networks employed in this study were all designed, trained, and tested using NeuralWare II Professional™ software. A general consideration of the problem of applying neural networks to problems in sequence analysis has been provided in an earlier work (5). All networks in this study were three layer back-propagation networks. Within that description, the specific network architecture varied according to the details provided below. In all cases, the input sequence was coarse-coded in binary form: 0001=A, 0010=C, 0100=G, 1000=T. A 58 base input sequence is therefore represented by 232 binary characters, with one input neuron for each character; a 20 base input is represented by 80 characters and requires 80 input neurons. A 'desired' output value is supplied with each input sequence: 1.0 for a promoter input and 0.0 for a non-promoter input. When a trained network is used to search a test sequence, an output value of 0.90 or greater is scored as a promoter unless otherwise specified.
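The 4-bit-per-base input coding and the 0.90 output cutoff described above can be sketched as follows; the function names are illustrative, not taken from the original software.

```python
# Sketch of the binary input encoding (0001=A, 0010=C, 0100=G, 1000=T),
# one input neuron per binary character, and the 0.90 scoring cutoff.

CODES = {"A": [0, 0, 0, 1], "C": [0, 0, 1, 0],
         "G": [0, 1, 0, 0], "T": [1, 0, 0, 0]}

def encode(seq):
    """Map a DNA string to the flat binary input vector."""
    bits = []
    for base in seq:
        bits.extend(CODES[base])
    return bits

def is_promoter(score, cutoff=0.90):
    """Score a network output value against the 0.90 cutoff."""
    return score >= cutoff

# A 58 base input yields 232 binary inputs; a 20 base input yields 80.
assert len(encode("A" * 58)) == 232
assert len(encode("A" * 20)) == 80
```
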

16 Base class

The networks trained for promoters of the 16 base spacing class all employed a 58 base input sequence requiring 232 input neurons. Two distinct network sets were trained for this promoter class. One training set was used for the first 4 networks and a second training set was used for the second set of 4 networks. All promoter sequences were taken from previous listings (6,7). For the first 4 networks the promoter sequences used in training were: rpoB, malK, malEGF, aroH, recA, tyrT, tnaA, rrnDP1, and rrnEXP2. This promoter set was expanded, as described previously (5), by permuting the bases at non-critical positions: 1-4, 7, 10, 12-14, 21-28, 32, 34-36, 48, 50, 51, and 55-58. This results in 1008 'promoters'. Two sets of 4000 random sequences, 50% in A+T, were prescreened using previously described promoter-search programs (8) to eliminate incidental promoter sequences. The promoter set was repeated 4-fold and combined with the 2 non-promoter sets as the training set. Inputs were selected at random from this collection. The likelihood was thus 1/3 that a given input would be a promoter and 2/3 that it would be a non-promoter sequence. The test set, for performance evaluation after training, included the following promoters: S10, ampC, leutRNA, rrnABP1, rrnGP1, rrnEP1, rrnXP1, rrnABP2, and rrnGP2. The networks were trained with a learning coefficient of 0.9 and a momentum of 0.6. Temperature during training was employed as indicated. Network 16-1 had 10 hidden layer neurons. It was trained on a total of 851 input sequences. Network 16-2 had 22 hidden layer neurons. It was trained on a total of 1700 input sequences with a temperature of 40° applied throughout training. Network 16-3 had 30 hidden layer neurons. It was trained on a total of 10,225 input sequences with a temperature of 40° applied for the first 5000 inputs. Network 16-4 had 30 hidden layer neurons. It was trained on a total of 1600 input sequences with a temperature of 40° applied throughout training.
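The training-set expansion by single-position permutation at the listed non-critical positions can be sketched as below; the promoter sequence used is an illustrative stand-in, but the position list is the one given above, and the counts reproduce the stated 1008 'promoters' (9 promoters × 28 positions × 4 bases).

```python
# Sketch of expanding a promoter by permuting one non-critical
# position at a time; substituting each of the 4 bases at each of
# the 28 free positions gives 112 variants per promoter.

BASES = "ACGT"

def expand(promoter, free_positions):
    """Return one variant per (position, base) pair at the free
    (non-critical) 1-based positions; other positions are untouched."""
    variants = []
    for pos in free_positions:
        for base in BASES:
            s = list(promoter)
            s[pos - 1] = base
            variants.append("".join(s))
    return variants

# The non-critical positions listed in the text.
free = ([1, 2, 3, 4, 7, 10] + [12, 13, 14] + list(range(21, 29)) +
        [32] + [34, 35, 36] + [48, 50, 51] + list(range(55, 59)))
assert len(free) == 28
assert len(expand("A" * 58, free)) == 112   # x 9 promoters = 1008
```
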

The promoters to be used in training the second set of networks were chosen by the following mechanism. A network was trained for approximately 5000 inputs, using random sequence plus all of the 16 base spacing class promoters except one. The network was then asked to identify the remaining promoter as a promoter. This was done for each of the 18 promoters in the class. Seven of the promoters failed this test at least twice. These promoters, aroH, rpoB, ampC, malEGF, malK, tnaA, and recA, together with rrnABP1, were used as the training promoters for the second set. These promoters were permuted and combined with the same two random sequence sets used above for the first set. All 4 networks had 15 hidden layer neurons. Temperature was not employed during training. Network 16-5 was trained on a total of 16,613 input sequences. Network 16-6 was trained on a total of 8,850 input sequences. Network 16-7 was trained on a total of 15,345 input sequences. Network 16-8 was trained on a total of 11,571 input sequences. The promoter test set for these networks included: S10, tyrT, leutRNA, rrnGP1, rrnDP1, rrnEP1, rrnXP1, rrnABP2, rrnGP2, and rrnDEXP2.

17 Base class

The networks for the 17 base spacing class promoters were produced by very brief retraining of the 4 networks previously reported (5). Two additional promoters, the tet promoter of pBR322 and the λ cin promoter, were added to the 39 promoter set used earlier. This set included: malT, trpP2, fol, uvrBP3, lexA, rplJ, lacd, trp, bioB, spot42, M1RNA, tufB, supBE, araC, thr, uvrBP1, oriL, glnS, str, spc, rpoA, T7A1, λ PR, λ PRM, P22ant, gnd, T7B, T5N25, T5 28, T5 26, T5H207, T5DE20, T7C, 434PR, 434PRM, P22mnt, Tn10Pin, Tn10Pout, and Tn5neo. This set of 41 'true' sequences was expanded by permuting all possible single base changes in positions other than those known to harbor promoter point mutations. The rule was that one base change could be made in any promoter at any position except for positions in the -35 region, -4/+2 bases, or in the -10 region, -5/+2 bases. In this expansion, there would be 132 derivatives of each original promoter sequence, each altered at a single position relative to the wild type promoter. This was done on the assumption that single point mutations which have not shown up in studies to date are unlikely to have a significant negative effect on the sequence; this could result in some non-promoter sequences being included in the promoter group. The 'false' input group included 4000 random sequences which were 60% in A+T and about 100 examples of derivatives of the P22ant promoter,

representing all possible pairwise permutations of double promoter-down mutations (15). The random sequences were prescreened, using previously described promoter-search programs (8). Input sequences from this combined set were supplied to the network in random order, with 'true' inputs assigned a desired output of 1.0 and 'false' inputs a value of 0.0. These networks all had 15 hidden layer neurons. Network 17-1, which had been trained on 165,000 inputs previously, was trained for an additional 2000 inputs with the modified training set. Network 17-2, which had been trained on 100,000 inputs, was trained for an additional 25 inputs. Network 17-3, which had been trained on 136,000 inputs, was trained for an additional 300 inputs. Network 17-4, which had been trained on 130,000 inputs, was trained for an additional 100 inputs. The set of test promoters was: T7A3, T7D, λ pL, λ pO, λ pR', P22Prm, phiXA, fdX, pBRbla, pBRP1, pBRprimer, colE1P1, rsfprimer, r100RNAI, Tn5IR, bioP98, λ c17, λ L57, cya, divE, Mu PE, p15primer, pColvirP1, pyrEP1, mp, rpmH2p, rpmH3p, rpsLP2, rpmB, T7B, ompC, fdII, carABP1.

18 Base Class

None of more than 30 attempts to train a network for this class using a 58 base input succeeded. The input was then reduced to the 20 positions showing the highest information content (9) within the set of 18 base spacing class promoters: positions 2, 6, 7, 10, 13, 15, 16, 17, 18, 19, 22, 30, 34, 39, 40, 42, 44, 53, 57, and 58. All of the networks trained for this class had 80 input neurons and 10 hidden layer neurons. The training promoters were not permuted. The 4000 random sequences, 65% A+T, were prescreened for promoter sequences as in the other cases. The promoter subset was duplicated to equal the random sequence subset in number. A randomly selected input from the training set thus had an equal chance of being a promoter or a random sequence input.
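Selecting the highest-information positions from an aligned promoter set, in the spirit of the information content measure of reference 9, can be sketched as follows; the alignment data are illustrative, and the small-sample correction of the original measure is omitted.

```python
# Sketch of picking the k most informative alignment columns by
# per-position information content (2 bits minus Shannon entropy).
import math
from collections import Counter

def position_information(column):
    """Information content, in bits, of one alignment column."""
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 2.0 - entropy

def top_positions(alignment, k):
    """Return the 1-based positions of the k most informative columns."""
    columns = list(zip(*alignment))
    scored = [(position_information(col), i + 1) for i, col in enumerate(columns)]
    scored.sort(reverse=True)
    return sorted(pos for _, pos in scored[:k])

# A fully conserved column carries 2 bits; a uniform column 0 bits.
assert position_information("TTTT") == 2.0
assert abs(position_information("ACGT")) < 1e-9
```
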
Network 18-1 used galP2, deoP1, uvrBP2, lpp, alaS, lacP1, trpR, his, ilvGEDA, and oriR in its training set. This network was trained on a total of 9855 inputs. The training set for networks 18-2, 18-3, and 18-4 included the promoters above plus hisJ, bioA, araBAD, P22PR, and phiXD. These were trained on a total of 6291, 2700, and 1200 inputs, respectively. The promoters in the training set for 18-5 were: uvrBP2, lpp, alaS, trpR, his, ilvGEDA, oriR, araBAD, P22PR, glyA, motA, and tyrT/140. This network was trained on a total of 1958 inputs. The promoter test set for these networks would include bioA, hisJ, araBAD, P22PR, phiXD, phiXB, T7A2, carABP2, dspD, dnaA-1p, dnaK-P1, glyA, htpR-P2, motA, ssb, sucAB, tonB, trxA, and tyrT/109, less any of these which were in the training set.

Programs

Although these networks were produced with proprietary software, the weight matrices which constitute the trained networks can easily be converted to a format suitable for use in a generic back-propagation kernel. I hope to be able to provide network access to these trained neural nets for use in promoter searches through the NCI Advanced Scientific Computing Laboratory at Frederick Cancer Research and Development Center by the time of publication.
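A generic feed-forward kernel of the kind the exported weight matrices could be run in might look like the sketch below: one hidden layer of sigmoid units feeding a single sigmoid output, as in a trained three layer back-propagation network. The matrix shapes and names are assumptions, not the NeuralWare export format.

```python
# Minimal generic kernel for evaluating a trained three layer
# back-propagation network from its weight matrices.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """w_hidden[i][j]: weight from input j to hidden unit i;
    b_hidden[i]: hidden unit bias; w_out[i]: weight from hidden
    unit i to the single output; b_out: output bias."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

# Tiny illustrative network: the output score is bounded in (0, 1).
score = forward([1, 0], [[2.0, -1.0], [0.5, 0.5]], [0.0, -1.0],
                [1.5, -0.5], 0.2)
assert 0.0 < score < 1.0
```
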

RESULTS

In the course of this study, more than 100 distinct networks were trained to various degrees in the bid to develop one or more networks which could search out promoters of the three major spacing classes with high efficiency while producing no more than practical levels of false positives.

[Table 1 (16 base class networks; caption lost in extraction). Recoverable test-promoter entries: Set 1 networks 16-1 through 16-4, 7/9 each (cutoffs all > 0.90 to 0.95); Poll(>=2), 7/9. Set 2 networks 16-5 through 16-8, one at 9/10 (all > 0.90) and three at 10/10 (all > 0.98); Poll(>=2), 10/10; Poll(4/4), 9/10. The pBR322 hit-coordinate columns could not be reconstructed.]

Table 2. Performance of 17 base class networks on test promoters and pBR322. The lowest score among the captured promoters is listed in each case. The coordinates of hits in pBR322 are shown for scores greater than the lowest test promoter score in each case. Coordinates in the counter-clockwise (ccw) case are measured counter-clockwise from the EcoRI site. Sites of known promoters are underlined.

[Table 2 body scrambled in extraction. Recoverable test-promoter entries: the four networks recognized between 32/35 and 34/35 test promoters (two at 34/35, one at 33/35, one at 32/35; all listed scores > 0.99); Poll(>=2), 34/35; Poll(4/4), 32/35. The pBR322(cw) and pBR322(ccw) coordinate columns could not be reconstructed.]

Seven of these networks failed, in at least 2 successive training attempts, to identify the missing promoter. These seven promoters plus one example of a ribosomal RNA promoter were used as the training set for a new network. This is clearly a biased training set which has the built-in advantage that the most difficult cases have been removed from the test group. However, it also provides a disparate training set with substantially lower information content than the first training set. Table 1A shows the results of 4 separate networks trained on the first training set (described in detail in the Methods). Each of the networks recognized the same 7 of the 9 promoters in the test group (missing ampC and S10). A search of pBR322 (10,11) in both directions, employing the worst test-promoter score as a cutoff, produced agreement by at least two networks on only 3 sites; lowering the cutoff to 0.9 produced a total of 8 hits. Table 1B shows the results of 4 separate networks trained on the second training set. Three of the networks recognized all 10 of the test promoters; the 4th network recognized 9 of the 10. A poll of the 4 networks would find all of the test promoters and only a single site in pBR322 at the cutoff level of the test promoters; lowering the cutoff to 0.90 resulted in 2 sites in each direction of the pBR322 search.
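The held-out screening used to pick these difficult promoters (train on all class members but one, then ask the network about the one left out) can be sketched as follows; `train` and `score` are hypothetical stand-ins for the actual training software.

```python
# Sketch of the leave-one-out screen for hard training promoters:
# a promoter is 'hard' if it fails the held-out test in at least
# `attempts` successive training runs.

def hard_promoters(promoters, train, score, attempts=2, cutoff=0.90):
    failed = []
    for held_out in promoters:
        rest = [p for p in promoters if p != held_out]
        misses = 0
        for _ in range(attempts):
            net = train(rest)                 # hypothetical training call
            if score(net, held_out) < cutoff: # hypothetical scoring call
                misses += 1
        if misses >= attempts:
            failed.append(held_out)
    return failed

# Toy stand-ins: a 'network' that only recognizes what it was shown.
train = lambda seqs: set(seqs)
score = lambda net, s: 1.0 if s in net else 0.0
assert hard_promoters(["p1", "p2", "p3"], train, score) == ["p1", "p2", "p3"]
```
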


I have previously described 4 networks trained to recognize promoters of the main 17 base spacing class (5). These networks averaged about 78% recognition on a large, previously unseen test set (36 promoters) with false positives in the 0.1% range. Further analysis of those promoters which were not being correctly identified indicated that they shared some common qualities. If this signalled a subgroup not represented in the training set, including 1 or 2 members in the training set might have a significant effect on performance. To this end, the training set was increased from 39 to 41 promoters with the addition of tet and λ cin. Each of the 4 former networks was trained briefly (see Methods) with the new training set. The results are seen in Table 2. Two networks miss a single promoter in the 34 promoter test set; the other networks miss 2 and 3, respectively. A poll of the 4 finds 33 of the 34 promoters. The poll in the search of pBR322 with the same cutoffs produces 11 hits (cw) and 12 hits (ccw), including 4 known 17 base promoters, for a false positive level of 0.2% or less. Thus the poll achieves a 97% recognition rate at the cost of doubling the false positive

level.

The 18 base spacing class promoters were notably refractory to the training methods which had succeeded with the 2 other spacing groups. The original Hawley-McClure list of 13 promoters was expanded with the addition of 4 phage promoters and 8 additional promoters from the Harley-Reynolds list; these additions were prescreened with my pseudo-information content promoter program (8) to confirm that they were good candidates for the 18 base spacing class. Nonetheless, in over 30 distinct trials with many different choices of training sets and network configurations, none was successful. Only after going to a reduced input set, using the 20 positions having the highest information content for the promoter set, was I able to obtain partially satisfactory results with this spacing group. Table 3 summarizes the results with five different networks, trained with a 20 base input from a 12 to 15 promoter training set. The individual networks scored between 68 and 83% on the promoter test sets. A poll requiring the agreement of at least 3 of the 5 networks identified 11 of 14 test promoters (the supernumerary promoters in two of the test groups were excluded by definition) while producing 7 hits in each direction of the pBR322 search, including the single known promoter of the 18 base spacing class. This corresponds to 79% promoter recognition with a false positive level of 0.15% or less.

The training sets employed above relied heavily on the Hawley-McClure promoter compilation. Given a complete set of trained networks, one might then use the much larger Harley-Reynolds compilation as an extended test group. There are many differences in the indexing of particular promoters with respect to starting base in the two compilations. No effort to adjudicate these was made here. All 152 promoters from the Harley-Reynolds list defined over the requisite 58 base positions were used. The following results were obtained in a search of this database with four 16 base class networks, four 17 base class networks, and five 18 base class networks in tandem. Majority polls identified 22 promoters of the 16 base class, 74 of the 17 base class, and 38 of the 18 base class; the additional promoters were: aroF, P22PR, uncI, and uvrD for the 16 base class; aroG, colE1-C, dnaA-2p, Fplas-traM, hisS, htpR-P1, ilvIH-P4, livJ, metA-P1, metA-P2, and Tn2661bla-Pa for the 17 base class; CloDFrnaI, NR1rnaC, p15rnaI, pBRrnaI, R1RNA2, RSFrnaI, Tn10tetA, and Tn10tetR for the 18 base class. In 16 cases, the same promoter was recognized in two spacing classes. Thus 118 distinct promoters were recognized

Table 3. Performance of 18 base class networks on test promoters and pBR322. The lowest score among the captured promoters is listed in each case. The coordinates of hits in pBR322 are shown for scores greater than 0.90. Sites of known promoters are underlined.

Network      Test Promoters Recognized/Total
18-1         11/14, all > 0.90
18-2         10/14, all > 0.91
18-3         11/14, all > 0.92
18-4         13/19, all > 0.95
18-5         14/17, all > 0.92
Poll(>=3)    11/14
Poll(5/5)    10/14

[The pBR322(cw) and pBR322(ccw) coordinate columns were scrambled in extraction and are not reproduced.]

out of 152 in the input; removing the 52 promoters used in the training sets yields 66 of 100 promoters, for a recognition rate of 66% over all three major spacing classes. It should be noted that this is a minimum value in that an incorrect indexing choice in the original listing can prevent proper recognition; gal and hisJ, for example, were entered in the Harley-Reynolds list as in the 16 base class, whereas they would have been recognized had they been entered as 18 base spacing class promoters.
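The majority polls used throughout these searches can be sketched as a simple quorum over the networks' output scores for one candidate site, as in the Poll(>=2) and Poll(>=3) rows of the tables.

```python
# Sketch of polling several trained networks at one candidate site:
# the site is accepted when at least `quorum` networks score it at
# or above the cutoff.

def poll(scores, quorum, cutoff=0.90):
    """scores: one output value per network for the same site."""
    return sum(s >= cutoff for s in scores) >= quorum

assert poll([0.95, 0.91, 0.40, 0.20], quorum=2) is True
assert poll([0.95, 0.40, 0.20, 0.10], quorum=2) is False
```
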

The following reasoning was used to partially 'decompile' the weight matrices of the trained networks to obtain a given network's averaged promoter profile. The weight matrix of a trained network, in the absence of a specific input sequence, can only provide a 'static' estimate of its preferences. Here one is interested in the relative contribution of a given base at a given position in the overall assessment of whether the site is a promoter or not. Let us consider the example of a 'T', code 1000 (see 5), at the first position of the 58 base promoter sequence. In the trained network, this first input neuron is connected by a weighted line, carrying the weight w11, to the first hidden layer interneuron. The relative effect on the first interneuron of having this first input neuron active is taken as w11 divided by the sum of all weights, including the bias, on the input lines to this interneuron. The relative influence of this first interneuron, in turn, upon the output neuron is the product of the weight on the line connecting them, wo1, and the activation state of the first interneuron (see 4), divided by the sum of the weight-activation products of all the interneurons; however, since this divisor is a constant for a given trained network, it can be factored out of a consideration of the relative influence of interneurons on the output. In the absence of a complete input, the activation state of the first interneuron is undefined; in its place we will substitute an 'averaged' activation, using the sum of its input weights, as defined by the trained network, in its sigmoidal transform function. The influence of 'T' in the first position is mediated not only by the first interneuron but by all interneurons, since the network is fully connected (there are weighted lines from the first input neuron to each of the interneurons). Thus the function above must be summed over all interneurons for each input position in turn.
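The averaging procedure just described can be sketched as below; the matrix shapes and names are assumptions about how the exported weights would be laid out, not the actual file format.

```python
# Sketch of the static 'decompilation': the relative weight of input
# position j sums, over interneurons i, the fractional input weight
# times the output weight times an 'averaged' sigmoid activation.
import math

def relative_input_weights(w_in, bias, w_out):
    """w_in[i][j]: input-to-interneuron weights; bias[i]: interneuron
    bias; w_out[i]: interneuron-to-output weight. Returns one relative
    weight per input position j."""
    n_inputs = len(w_in[0])
    rel = [0.0] * n_inputs
    for i, row in enumerate(w_in):
        total = sum(row) + bias[i]                   # sum of input weights + bias
        activation = 1.0 / (1.0 + math.exp(-total))  # 'averaged' activation
        for j, w in enumerate(row):
            rel[j] += (w / total) * w_out[i] * activation
    return rel

weights = relative_input_weights([[1.0, -0.5], [0.5, 0.5]],
                                 [0.5, 0.0], [1.0, -1.0])
assert len(weights) == 2
```
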
For a given input position j, the relative weight of an input at j is

$$\sum_i \frac{w_{ij}\, w_{oi}}{\Bigl(\sum_{j'} w_{ij'} + B_i\Bigr)\Bigl[\,1 + \exp\bigl(-\bigl(\sum_{j'} w_{ij'} + B_i\bigr)\bigr)\Bigr]}$$

where j indexes the input neurons and i indexes the interneurons. The averaged weight of each base at each position of the promoter was determined by this procedure for each network and then summed over all the networks used for a particular spacing class. The result specifies a relative weight, either positive or negative, for each of the four possible bases at each position of the promoter site. This is shown in Figure 1. Figures 1A and 1B show the two different network sets for the 16 base spacing class; Figure 1C shows the new networks for the 17 base class; and Figure 1D shows the incomplete 20 position result from the reduced set networks of the 18 base class. This procedure produced the following optimum sequences for the three spacing classes: two for the two training groups of the 16 base class, the old and new sets of the 17 base class, and the incomplete 20 position set of the 18 base class.

16 base class:
CGTCTAAAGAAGGATTGAAACAGCRTCCGGATTCCGTATAATGCGCTCCACTGGGGTG (1)
CGACAAAAGAACGATTGACAGGGCGTSYGGTTTCCGTATAATGGSMTCCACCGGACAG (2)

17 base class:
ACTACAATCAMTGGTTGACAAAGKTATCGAGGTGTKRTATAATWKCGGCCYAACAGAY (1)
ACKACAATCAMTGGTTGACAAAGTTATCGAGGTGTKRTATAATWKCGGCCYAAMAGAC (2)

18 base class (20 selected positions only; positional layout lost in extraction):
A AA C T TTGAC W A T TA M T T TT

Inasmuch as this procedure can only evaluate the static averaged behavior of the trained networks, it was worth asking whether the suggested sequences are really close to the network ideal by employing them as test sequences in the trained networks. When

this was done, the suggested 'best' sequences for the 16 and 17 base spacing classes received higher scores than all of the promoter sequences actually used to train the networks, with a single exception; str tied with the proposed sequence in network 17-4 which scored both sequences as 1.0. An analysis of Figure 1 suggests common features for the two classes for which there is complete information: 1) There is more negative discrimination than positive; the mean value for each of the 4 bases is negative across the 58 position total for both the 16 and 17 base classes. This is more extreme in the 17 base class where only 2 A's, 1 C, 1 G, and 4 T's carry large positive weights. 2) In almost all instances, a strong positive choice is unshadowed, that is, there is strong discrimination against the same base occurring either a position before or a position after the chosen position (the double T of the -35 sequence is the notable exception). This has the effect of protecting against ambiguity in the spacing class. 3) The optimum sequences for the two spacing classes share a remarkable symmetry arrangement. In the 16 base class sequence, positions 5 to 22 and 37 to 54 represent a direct repeat (50% homologous); the intervening 14 base section (the spacer less positions 21 and 22) is homologous to the complement of positions 42 to 55 (57%). In the 17 base class sequence, positions 3 to 22 and 38 to 57 represent a direct repeat (60% homologous): the intervening 15 base section (the spacer less positions 21 and 22) is homologous to the complement of positions 40 to 54 (60% homologous). Here, again (see 1), there is the suggestion that current promoter form may have evolved by means of the repeated use of a single smaller sequence element.

DISCUSSION

The tables provided above demonstrate a very high level of promoter detection by these trained neural networks with low levels of false positive identification. Still higher figures have been reported for other neural network formulations over the past few years (12-14). However, these other reports share the feature that they achieve high capture rates at the expense of high levels of false positives in the range of 2 to 6%. Such a method would, at best, produce 85 'promoters' from a search in either direction of the sequence of pBR322 or 960 'promoters' in a search of lambda. A rather similar result could be more simply obtained by a search protocol which defined a promoter as a sequence having at least three of the six consensus bases for the -35 and -10 sequences, separated by 16, 17, or 18 bases (2). From a practical point of view, a method with this level of false positives is useless. As noted in earlier work, Figure 1 suggests that the optimum promoter sequence is spacing class dependent, with differences which extend even to the secondary bases of the -10 sequence. The 16 base spacing class appears to make much greater use of bases outside the -35 and -10 regions than do the other classes. Even though the data set for the 18 base class is relatively incomplete, it can be seen that the weighting pattern for this class bears little resemblance to those of the other classes. The sequence changes between the earlier networks for the 17 base class and the current improved networks for that class are trivial. The improved performance must arise from rather subtle changes in the overall weighting pattern which shows a slight positive shift in most positions. The standard approach to training networks for the 18 base spacing class failed to produce networks which could generalize;


[Figure 1 appears here: four panels (A, B, C, D) plotting relative base weight against sequence position.]
Figure 1. The relative weight of each base at each of 58 promoter positions was determined from a static analysis of the weight matrices of the trained networks for the 16 and 17 base spacing classes and for 20 positions of the 18 base spacing class. The results for networks 16-1 to 16-4 are shown in Fig. 1A; the results for networks 16-5 to 16-8 are shown in Fig. 1B; the results for networks 17-1 to 17-4 are shown in Fig. 1C; the results for networks 18-1 to 18-5 are shown in Fig. 1D. The locations of the -35 and -10 regions are indicated.

the networks generally had no problem learning the training set, but then did very poorly on naive test sets. This sustained failure in efforts to train networks with the full-length 58 base sequences of the 18 base spacing class remains unexplained. With a back-propagation network, there is no significant theoretical advantage in moving to a reduced input. The contrast of this experience with that obtained with the other classes suggests the possibility that the database for this class is corrupted. In that instance, a reduced input set might reduce the conflict among inputs, allowing a more general solution. It may also be relevant that, in the screening of the Harley-Reynolds list, 13 of the 16 double classifications involved the 18 base class. One might question whether the networks' descriptions of the major promoter types are specific enough to predict the effects of mutations with accuracy. There is a recent report in which

the complete set of -35 and -10 single base mutations in the P22ant promoter was tested in vivo (15). When this sequence set is tested in a trained network for the 17 base spacing group, the network agrees with their ordering of the best and the worst bases for 10 of the 12 positions but is not able to reproduce their order for the intermediate bases. It should be recalled that the networks were not trained with quantitative input data; they were tasked to make a qualitative distinction. It may, however, be possible to provide quantitative inputs in the form of in vitro measurements of promoter strength (2). Networks so trained might, in turn, be able to predict the effects of specific mutations on subsequent measurements of promoter strength in vitro. These still will not predict in vivo behavior (16,17), a capability which must await the understanding and incorporation of additional affective elements to complete the description of the promoter.


REFERENCES

1. O'Neill, M.C. (1989). J. Biol. Chem. 264, 5522-5530.
2. Mulligan, M.E., Hawley, D.K., Entriken, R., and McClure, W.R. (1984). Nucl. Acids Res. 12, 789-800.
3. Werbos, P. (1974). Ph.D. thesis, Harvard University.
4. Rumelhart, D.E., Hinton, G.E., and Williams, R.J. (1986). In 'Parallel Distributed Processing: Explorations in the Microstructures of Cognition', pp. 318-362. MIT Press, Cambridge, MA.
5. O'Neill, M.C. (1991). Nucl. Acids Res. 19, 313-318.
6. Hawley, D.K. and McClure, W.R. (1983). Nucl. Acids Res. 11, 2237-2255.
7. Harley, C.B. and Reynolds, R.P. (1987). Nucl. Acids Res. 15, 2343-2361.
8. O'Neill, M.C. (1989). J. Mol. Biol. 207, 301-310.
9. Schneider, T.D., Stormo, G.D., Gold, L., and Ehrenfeucht, A. (1986). J. Mol. Biol. 188, 415-431.
10. Sutcliffe, J.G. (1978). Cold Spring Harbor Symp. Quant. Biol. 43, 77-90.
11. Peden, K.W.C. (1983). Gene (Amst.) 22, 277-280.
12. Ezhov, A.A., Kalambet, Y.A., and Cherny, D.I. (1989). Studia Biophys. 129, 183-192.
13. Lukashin, A.V., Anshelevich, V.V., Amirikyan, B.R., Gragerov, A.I., and Frank-Kamenetskii, M.D. (1989). J. Biomolec. Structure and Dynamics 6, 1123-1133.
14. Demeler, B. and Zhou, G. (1991). Nucl. Acids Res. 19, 1593-1599.
15. Moyle, H., Waldburger, C., and Susskind, M.M. (1991). J. Bact. 173, 1944-1950.
16. Brunner, M. and Bujard, H. (1987). EMBO J. 6, 3139-3144.
17. Knaus, R. and Bujard, H. (1988). EMBO J. 7, 2919-2923.
