View Article Online / Journal Homepage / Table of Contents for this issue

Published on 01 January 1992. Downloaded by Northeastern University on 26/10/2014 00:05:30.

Faraday Discuss., 1992,93, 269-280

Modelling the Structure and Function of Enzymes by Machine Learning Michael J. E. Sternberg" and Richard A. Lewis Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, Lincoln's Inn Fields, London WC2A 3PX, UK

Ross D. King Department of Statistics, Strathclyde University, Glasgow G1 1XH, UK

Stephen Muggleton Turing Institute, George House, 36 North Hanover Street, Glasgow Gl 2AD, UK

A machine learning program, GOLEM, has been applied to two problems:

(1) the prediction of protein secondary structure from sequence and (2) modelling a quantitative structure-activity relationship in drug design. GOLEM takes as input observations and combines them with background knowledge of chemistry to yield rules expressed as stereochemical principles for prediction. The secondary structure prediction was explored on the a / a class of proteins; on an unrelated test set it yielded 81 % accuracy. The rules from GOLEM defined patterns of residues forming a-helices. The system studied for drug design was the activities of trimethoprim analogues binding to E. coli dihydrofolate reductase. The GOLEM rules were a better model than standard regression approaches. More importantly, these rules described the chemical properties of the enzyme-binding site that were in . broad agreement with the crystallographic structure.

One major school of approaches to modelling theoretically the inter-relationships of chemical formula, three-dimensional structure and function of enzymes can be termed knowledge-based and aims to derive empirical rules from a series of observations. This paper will be considering two modelling problems. The first is the prediction of the secondary structure of a protein from its amino acid sequence.'-6 The second is modelling a quantitative structure-activity relationship (QSAR) of a series of drugs inhibiting an In both topics, the simplest method is deriving probabilities from frequency of occurrence273or by linear regression,'09" but recently neural network^^*'^-'^ have been used. The main problem with these approaches is that numerical values, such as the weights of a neural network, are obtained and this provides little insight into the chemistry of the problem. In this paper we report the use of a machine learning program GOLEM14 to these topics. Rules are derived that yield predictions that are better than alternative approaches. In addition the rules provide an understandable description in terms of the chemistry of the system.

GOLEM GOLEMI4 is a program for machine learning by inductive logic pr~gramming.'~The general scheme is shown in Fig. 1 and closely resembles that of the standard scientific method. Observations are collected from the outside world and are coded as facts and negative counter-examples. These are combined with background knowledge that in 269

View Article Online

270

Modelling by Machine Learning

Published on 01 January 1992. Downloaded by Northeastern University on 26/10/2014 00:05:30.

Test accuracy of rules on observations

Rules

Background knowledge

Add rules accepted to background knowledge

Fig. 1 Machine learning by GOLEM

both applications describes the relevant chemical principles. The observations and background knowledge are then combined by GOLEM to form an inductive hypothesis which is a rule. These rules are computed by generalisation of the facts. GOLEM first takes at random two examples from the observations. Then GOLEM computes the set of chemical properties that are common to both examples. These properties are then made into a rule which will be true for both examples. The accuracy on the rule is then evaluated on the remaining data. This is repeated several times (typically 10) and the best rule kept from these two examples. Then a further example is then chosen at random. This rule is then tested on further observations and the process repeated. If the rule continues to be accurate it is stored and added to the background knowledge. In GOLEM, the observations, background knowledge and rules are represented in the first-order predicate calculus in the computer language PROLOG. Predicate calculus is expressive enough for most mathematical concepts but is readily comprehensible and therefore makes an ideal representation for machine learning. GOLEM is written in the language C and runs on SUN Sparcstations and VAXes.

Secondary Structure Prediction Background The disparity between the number of known protein sequences (>25 000) and the number of experimentally determined structures (around 500) continues to increase, highlighting the need for computer algorithms to predict protein structure from sequence (see review'6717).Prediction of secondary structure (a-helices and &sheets) from the local sequence is a major component of the solution to the folding problem and has applications both for the general analysis of sequences and in model-building studies by homology ( e.g. retroviral proteaset8 and cytochrome P450I9). The earliest statistical methods achieved ca. 60 % accuracy for a three state (a,/3 and coil) prediction.* Recently, agproaches using large data sets3 and/or sophisticated analyses such as neural networks yielded only marginal improvements of up to 65 % accuracy. Similarly a study of sequence patterns by Rooman and Wodak' gave a prediction of ca. 60 % accuracy, which is comparable to the earlier pattern approach of Lim.' One method of improving prediction is to restrict the algorithm to a particular . ~ class of structures (a/a,alp, P I P ) . Recently neural networks by Kneller et ~ 1 have been applied to the a / a class and yielded 76 % accuracy. However, a major problem

View Article Online

Published on 01 January 1992. Downloaded by Northeastern University on 26/10/2014 00:05:30.

M. J. E. Sternberg et al.

27 1

with neural networks is that one obtains a large set of numbers that do not provide insight into principles of folding. Furthermore, in our comparison of the accuracy of predicting &turns, neural networks'z were only marginally more accurate than a simple statistical treatment.20 An alternative approach is the rule-based method of machine learning that can derive intelligible residue patterns based on the chemical properties of the residues. Previously King and Sternberg6 employed a top-down machine learning approach that yielded 60% accuracy. Here we report on the use of GOLEM, a bottom-up approach, on the a / a class of proteins.

Method Sixteen proteins determined to high-resolution and with the a/a domain structure were selected for the data set from the Brookhaven data bank.21 The training set was: 155C (cytochrome c550), 1CC5 (cytochrome c5), lCCR (cytochrome c), lCRN(crambin), 1CTS (citrate synthase), 1ECD (erythrocruorin), 1HMQ (hemerythrin), 1MBS (myoglobin), 2B5C (cytochrome b5), 2C2C (cytochrome c2), 2CDV (cytochrome c3), 3CPV (parvalbumin). The test proteins used were: 156B (cytochrome b562), 1BP2 (phospholipase A2),351C (cytochrome c551) and residues 1-108 in 8PAP (papain). The proteins selected were non-homologous. Secondary structure was assigned using the Kabsch and Sanderz2algorithm. The observation for input into GOLEM was the location in the protein of the a-helices. This was coded as alpha(protein-name,position). For example, alpha( 155C, 110) states that residue 110 in protein 155C is in an a-helix. The background information contains extensive knowledge of the chemistry of the protein. The simplest information is the primary structure such as: position( 155c, 110, V). that states residue at position 110 in 155C is a valine. GOLEM does not have built-in arithmetic and information about the sequential relationship of residues has to be provided explicitly. Thus the fact: octf(l9,20,21,22,23,24,25,26,27) describes the sequential positions of nine residues 19-27. Similarly alpha_triplet(5,6,9). and alpha-pair( 5 , 9 ) . encode the location on one face of the a-helix of the three residues 5, 6 and 9 and the pair 5 and 9. The key to the prediction is the classification of residue types into groups according to the following chemical properties: hydrophobic, very-hydrophobic, hydrophilic, positive, negative, neutral, large, small, tiny, polar, aliphatic, aromatic, hydrogen-bond-donor, hydrogen-bond-acceptor, not-aromatic, small-or-polar, not-polar, aromatic-or-very-hydrohobic, either-aromatic-or-aliphatic, not-proline, not-lysine. These facts are encoded as, for example, small(p). In addition, information is given about the relative sizes (Ltv) and hyrophobicities (Lth) of the residues in the form: Ltv(X, Y)

View Article Online

Published on 01 January 1992. Downloaded by Northeastern University on 26/10/2014 00:05:30.

272

Modelling by Machine Learning

Fig. 2 Predicted and observed secondary structures. For the four proteins in the testing set, the figure gives first the sequence, then the actual secondary structure and finally the prediction. H denotes a residue within an a-helix, - in a coil

meaning X is smaller in volume than Y. The scale for size was taken from Schulz and S~hirmer*~ and for hydrophobicity from E i ~ e n b e r g . ~ ~ The above data provided GOLEM with the observations and background information. During the training, GOLEM was allowed to misclassify 10 facts to allow some noise. For a rule to be accepted, it had to be more than 70% accurate and cover at least 3 YO of the data. Learning was stopped once no more rules were obtained. The rules obtained led to a speckled prediction with individual residues not predicted as helical but within a run of helical residues. A bootstrapping procedure was followed where the predictions made by GOLEM were used as background knowledge and GOLEM learnt how to smooth the data. This was repeated twice and the resultant rules generalised by hand to include symmetry. Resuits

The accuracy of the algorithm is measured as the percentage of residues whose state is correctly predicted. The initial GOLEM rules on the training and testing sets were 75 and 72 % accurate. After smoothing the values were increased to 78 and 81%. The better improvement on the testing set rather than the training data reflects the standard error in each of these values of ca. 2%. The value of around 80% for a two state (a-helix or coil) can be compared with the recent work by Kneller et aL4 on neural networks on a/a proteins that was 76%accurate. Thus GOLEM yields results that are slightly better than those of other recent approaches. Fig. 2 gives the results of the predictions on the four testing proteins compared to the true structure. In general helical regions are well located though, as in other schemes, some short helices can be missed.

-

M.J. E. Sternberg et al.

View Article Online

273

internal

Published on 01 January 1992. Downloaded by Northeastern University on 26/10/2014 00:05:30.

hydrophobic

not-aromatic,

notg

Fig. 3 Helix prediction rule. A typical rule for helix prediction displayed as a helical wheel

GOLEM generated 28 rules which defined patterns of residue relationships that lead to the formation of ar-helices. For example the rule in Fig. 3 defines the properties of nine sequential residues which, if obeyed, lead to the central residue being in a helix. Interestingly proline is disallowed at several positions, in keeping with its disruption of the helix; however, it is allowed at other locations. Clearly more detailed analyses are required to decide if these rules are indeed obtaining new stereochemical principles of protein folding.

Drug Design Background A standard problem in the rational design of a potent drug from a lead compound involves modelling the inter-relationship between the chemical structures of a series of compounds and their activities (see One standard method uses the Hansch equation'0*'' in which the properties of the substituents are linked by an empirical equation to their observed activity. More recently, neural network~'~ have also been applied to model QSAR. However, neither approach yields insight into the chemistry of the druglreceptor interaction. We report here the use of GOLEM to model a QSAR. In developing a methodology for drug design, it is helpful to study a system in which there is three-dimensional information for the drug/receptor complex. An ideal system (Fig. 4) is the activities of trimethoprim analogues on dihydrofolate reductase (DHFR) as crystallographic information is available?5926This system has also been studied by the Hansch analysis2' and so the results from machine learning can be directly contrasted with this alternative methodology. Method

The study was performed with a training set of 44 trimethoprim analogues [Table l(a)] from Hansch et al?' and a testing set of 11 further cogeners [Table 1( b)] from Roth et aZ?8*29Biological activities are measured as log (l/Ki) where Ki is the equilibrium constant for the association of the drug to DHFR.

View Article Online

Published on 01 January 1992. Downloaded by Northeastern University on 26/10/2014 00:05:30.

274

Modelling by Machine Learning

R5

Fig. 4 ( a )Structure of trimethoprim analogues. ( b ) A diagram of the interaction of trimethoprim Faint stippling indicates that the residue lies below the with DHFR from X-ray ~tructures.’~*~~ plane of the phenyl ring, darker stippling that the atoms are above

To apply GOLEM to the QSAR of the trimethoprim series, the activities of pairs of drugs are compared. Pairs where the activity are equal or within the margin of experimental error (given in the papers) are discarded. The output is a set of rules which decides which of a pair of drugs has higher activity. These paired comparisons are then converted to a ranking by the method of David.30 The observations supplied to GOLEM are the paired examples of greater activity, e.g. great(d20, d15). which states that drug no. 20 has higher activity than drug no. 15. The background facts are the chemical structure of the drugs and the properties of the substitutents. Chemical structure is represented in the form: struc(d35, NO;?,NHCOCH3, H). which states that drug no. 35 has: NO2 substituted at position 3, NHCOCH3 substituted at position 4 and no substitution at position 5. In addition, if either position 3 or 5 has no substitution, as in drug no 35, the position with no substitution is assumed to be position 5 (this is used in ref. 27). The additional background information was the chemical properties of the substituents. The properties were assigned heuristically and were chosen to make the approach generally applicable to drug-design problems. The particular properties are: polarity, size, flexibility, hydrogen-bond donor, hydrogen-bond acceptor, ?T donor, ?T acceptor, polarizability and u effect. Each of the 24 substituents (except

View Article Online

275

M. J. E. Stemberg et al. Table 1 Predicted and observed activity of trimethoprim analogues

Published on 01 January 1992. Downloaded by Northeastern University on 26/10/2014 00:05:30.

(4 X

log 1/ Kiapp

3,5-(oH)2 4-O(CH&CH3 4-O(CH2)5CH3 H 4-NO2 3-F 3-O(CH2)&H3 3-CH20H 4-NH2 3,5-(CH,OH), 4-F 3-O(C H2)6CH3 4-OCH2CH20CH3 4-Cl 3$4-(OH)2 3-OH 4-CH3 3-OCH2CH20CH3 3-CH20(CH2)3CH3 3-OCH2CONH2 4-OCF3 3-CHtOCH3 3-C1 3-CH3 4-N(CH3)2 4-Br 4-OCH3 3-O(CH2)3CH3 3-O(CH2)&H3 4-O(CH2)3CH3 4-NHCOCH3 3-OSOZCH3 3-OCH3 3-Br 3-NO2,4-NHCOCH3 3-OCH2C6HS 3-CF3 3,4-(OCH2CH2OCH3)2 3-1 3-CF3, 4-OCH3 3,4-(OCH3)2 3,5-(OCH3)2, 4-O(CH2)20CH3 3,5-(OCH3)2 3,4,5-(OCH3)3

3.04 5.60 6.07 6.18 6.20 6.23 6.25 6.28 6.30 6.3 1 6.35 6.39 6.40 6.45 6.46 6.47 6.48 6.53 6.55 6.57 6.57 6.59 6.65 6.70 6.78 6.82 6.82 6.82 6.86 6.89 6.89 6.92 6.93 6.96 6.97 6.99 7.02 7.22 7.23 7.69 7.72 8.35 8.38 8.87

observed rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20.5 20.5 22 23 24 25 27 27 27 29 30.5 30.5 32 33 34 35 36 37 38 39 40 41 42 43 44

rank by GOLEM 17 4.5 4.5 1 7.5 6 15 2 7.5 3 9.5 16 20 12 18 13 9.5 21 24 14 19 28 30.5 30.5 22.5 11 22.5 29 26 27 25 33 36 37 34 35 32 39 38 41.5 41.5 43 40 44

rank by Hansch 2 4 10 6.5 20.5 6.5 6.5 16 3 23 6.5 11 20.5 17.5 1 9 17.5 34 35.5 13 22 29.5 29.5 27.5 27.5 24 26 32 14 15 12 25 38 35.5 37 31 19 40 33 39 41 43 42 44

hydrogen) was given an integer value for each of these properties ( i e . hydrogen with no substitution had no properties)). This was represented using different predicates for each property and value, e.g. polar(Br, polar3). states that Br has polarity of value 3.

View Article Online

276

Modelling by Machine Learning Table 1 (continued)

Published on 01 January 1992. Downloaded by Northeastern University on 26/10/2014 00:05:30.

(b)

X

log 1/ Kiapp 7.56 7.74 7.87 7.87 8.42 8.57 8.82 8.82 8.85 8.87 8.87

observed rank 40 43 44.5 44.5 48 49 50.5 50.5 52 54 54

1 2 3.5

3.5 5 6 7.5 7.5 9 10.5 10.5

rank by GOLEM 52.5 40 40 40 44 52.5 47.5 52.5 52.5 52.5 47.5

9 2 2 2 4 9 5.5 9 9 9 5.5

rank by Hansch 45.5 44 41 45.5 53 51 54 42 48.5 50 47

4.5 3 1 4.5 10 9 2 11 7 8 6

X gives the substituents. The observed value of the affinity is expressed as log 1/ Kiapp.The first 44 drugs [Table l ( a ) ] were used in the training set and the observed rank ranges from 1 to 44. The final 11 drugs [Table l(b)] are the test set and the first number is the rank in the 55 drugs and the second the rank for the 11.

GOLEM does not have in-built arithmetic and so, as with the protein work, information was explicitly given about the relative values of these properties for the substituent groups. For example: polar(OCF3, polar4). polar(CH,OH, polar2). great-polar( polar4, polar2). together state that OCF, is more polar than CH20H. Finally information was given about the relative values of the properties compared to fixed values, e.g. greatO-polar( polar1 ). states that a polarity of 1 is greater than a polarity of 0. The input to GOLEM was 871 observations and 2976 items of background information. The run time was ca. 30 cpu min on a SUN SparcStation l to generate one rule.

Results Nine rules were obtained that predicted the relative activity of two drugs in terms of the chemical properties of the substituents (see Table 2). The rules are in the form of Prolog clauses but can readily be interpreted into English. The rank values predicted for the 44 drugs in the training set of GOLEM and by the application of the Hansch equation are given in Table 1. The results are presented graphically in Fig. 5 . The Spearman rank correlation with the observed order is 0.92 for GOLEM and 0.79 for the Hansch equation. The difference in these rank correlations

View Article Online

M. J. E. Sternberg et al.

277

Table 2 Rules for QSAR derived by machine learning

Published on 01 January 1992. Downloaded by Northeastern University on 26/10/2014 00:05:30.

Rule 3.1 - (coverage 119/0 train: 105/0 test) great(A, B) :-struc(A, D, E, F), struc(B, h, C, h), h-donor(D, h-dono), pi-donor( D, pi-donl), flex(D, G,), less4,flex(G). Drug A is better than drug B if drug B has no substitutions at positions 3 and 5 and drug A at position 3 has hydrogen donor=O and drug A at position 3 has w-donor= 1 and drug A at position 3 has flexibility < 4. Rule 3.2 - (coverage 244/71 train; 248/4 test) great(A, B) :-struc(A, C, D, E), struc(B, F, h, G), not except3.2(A, B). except3.2(A, B) :- struc(A, C, D, h), struc(B, E, h, F), h-donor(E, h-don0). Drug A is better than drug B if drug B has no substitution at position 4 unless drug A has no substitution at position 5 and drug B at position 3 has hydrogen donor=O. Rule 3.3 - (coverage 102113 train: 33/0 test) great(% B) :-struc(A, G, H, I), struc(B, C, h, D), pi-donor(C, pi-dono), polar( C, E), greatO-polar( E), h-acceptor( C, F), greatO-h-acc( F). Drug A is better than drug B if drug B has no substitutions at position 4 and drug B at position 3 has w-donor = 0 and drug B at position 3 has polarity > 0 and drug B at position 3 has hydrogen acceptor 7 0. Rule 3.4 - (coverage 129/2 train: 126/0 test) great(A, B) :-struc(A, C, D, E), struc(B, G, h, h), h-donor(C, h-dono), pi-donor(c, pi-donl), flex(C, F), less4_flex(F), polarisable(G, H), less3_polari(H). Drug A is better than drug B if drug B has no substitutions at position 4 and 5 and drug B at position 3 has polarisability < 3 and drug A at position 3 has hydrogen donor = 0 and drug A at position 3 has w-donor = 1 and drug A at position 3 has flexibility < 4. Rule 4.1 - (coverage 289/72 train: 99/0 test) great(A, B) :-struc(A, D, E, F), struc(B, h, C, h), not except4.l(A, B). except4.1(A, B) :-struc(B, h, C, h), size (C, size3). except4.1(A, B) :- struc(B, h, C, h), size(C, size2), h-acceptor(C, h-accl). Drug A is better than drug B if drug B has no substitutions at position 3 and 5 unless drug B at position 4 has size=3 or drug B at position 4 has size = 2 and hydrogen aceptor = 1. Rule 4.2 - (coverage 187/2 train: 193/2 test) great(% B) :-struc(A, E, F, G), struc(B, C, D, h), not-h(E), polar(f, polar2). Drug A is better than drug B if drug B has no substitution at position 5 and drug A has a substitution at position 3 and drug A at position 4 has polarity = 2. The rules are first given as Prolog clauses (in which ':-' is a definition and a ',' is logical 'and') and then in English. The rules have been classified into those primarily relating to substituent 3 (rules 3.1-3.4); to substituent 4 (4.1 and 4.2). The rules given are the more general and three other more specific are not given.

View Article Online

Modelling by Machine Learning A

Published on 01 January 1992. Downloaded by Northeastern University on 26/10/2014 00:05:30.

50

A

d

40 A

B

I

0 0 00 0 0 0::

30

u

.CI

O

o 0

10

O

o 0

0

0

I

u

03

u

i

OO O

0 '

.A*"

50

O

A

A . A

40 0 0

A

E

Btj %

0

& 20

0

0 0

0

0

0

0

co co

30 0 0

0

0 0

I

om

I

0

lo*

20

30

40

50

true rank

Fig. 5 Scattergram of the observed rank us. that predicted by GOLEM ( a ) and by Hansch ( b ) . 0, Train; A, test.

was shown to be significant by Fisher's z tran~formation.~' The value z = 2.2 is significant at the 5% level and almost significant at the 1 % level (0.985). A better test of a prediction method is its performance on data not used in developing the algorithm. The structure and activity of eleven new trimethoprim analogues was used as a test set for the Hansch equation and the machine learning rules. The results for the 11 drugs not used for training are given in Table 1( 6 ) . These values were obtained by relating the activity of each of these drugs to the true order of the 44 compounds used in the training set. The rank correlation for the 11 drugs by GOLEM was 0.46 compared to 0.42 for the Hansch method (Fig. 6). The Fisher z-value is 0.10, which is not significant reflecting the similar rank correlations obtained on a small test set. Thus on the test set, the machine learning effectively is as accurate as the regression approach of Hansch. Both methods predict well that the test drugs have high activity (Fig. 6).

View Article Online

Published on 01 January 1992. Downloaded by Northeastern University on 26/10/2014 00:05:30.

M. J. E. Sternberg et al.

279

A key aim of the use of GOLEM was to derive rules that provide an understanding of the stereochemistry of the interaction of the drugs with the enzyme DHFR. Thus the GOLEM rules should prove a similar description of the interactions to that obtained crystallographically.25’26Fig. 4( b) shows a diagram of the binding of trimethoprim to E. coli DHFR in the ternary complex26with NADPH. The phenyl ring of trimethoprim is effectively buried, being sandwiched in a hydrophobic cleft between Phe-3 1, Leu-28 and Met-20 on one side and Leu-54, Ile-54, Ser-49 and with the NADPH cofactor on the other. The environments of the 3-, 4- and 5- substituents vary. Both metu positions (corresponding to 3- and 5-substitutions) are constrained in size by the proximity of atoms of the enzyme and the cofactor. In contrast the para (or 4-) position is exposed to solvent. To obtain the general features of the DHFR binding site from the rules in Fig.5, we examined rules that applied to more than 100 pairings of the training set. From rules 3.1 to 3.4, the general properties of a favourable substituent at the 3-position are: wdonor of 1; neither a hydrogen-bond donor nor an acceptor; flexibility

Modelling the structure and function of enzymes by machine learning.

A machine learning program, GOLEM, has been applied to two problems: (1) the prediction of protein secondary structure from sequence and (2) modelling...
1MB Sizes 0 Downloads 0 Views