J. Mol. Biol. (1990) 216, 441-457

Machine Learning Approach for the Prediction of Protein Secondary Structure

Ross D. King
Turing Institute, George House, 36 North Hanover Street, Glasgow G1 2AD, U.K.

and

Michael J. E. Sternberg†
Laboratory of Molecular Biology, Department of Crystallography, Birkbeck College, Malet Street, London WC1E 7HX, U.K. and Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, P.O. Box 123, 44 Lincoln's Inn Fields, London WC2A 3PX, U.K.

(Received 14 December 1989; accepted 18 July 1990)

PROMIS (protein machine induction system), a program for machine learning, was used to generalize rules that characterize the relationship between primary and secondary structure in globular proteins. These rules can be used to predict an unknown secondary structure from a known primary structure. The symbolic induction method used by PROMIS was specifically designed to produce rules that are meaningful in terms of the chemical properties of the residues. The rules found were compared with existing knowledge of protein structure: some features of the rules were already recognized (e.g. the amphipathic nature of α-helices). Other features are not understood, and are under investigation. The rules produced a prediction accuracy for three states (α-helix, β-strand and coil) of 60% for all proteins, 73% for proteins of known α domain type, 62% for proteins of known β domain type and 59% for proteins of known α/β domain type. We conclude that machine learning is a useful tool in the examination of the large databases generated in molecular biology.

1. Introduction

† Author to whom all correspondence should be addressed at the Imperial Cancer Research Fund.

© 1990 Academic Press Limited

The secondary structure prediction problem, despite its importance, is still unsolved (Cohen et al., 1986; Gibrat et al., 1987; Rooman & Wodak, 1988; Qian & Sejnowski, 1988). This paper describes PROMIS (Protein Machine Induction System), a machine learning program designed to aid the process of acquiring new knowledge about the problem. PROMIS differs from other machine learning approaches to secondary structure prediction in that its emphasis is both on acquiring humanly comprehensible prediction rules and on maximizing prediction accuracy. PROMIS is a test of the idea of using machine learning as a tool to aid working molecular biologists form theories from data. Such tools are needed because it is often difficult for humans to perceive patterns in data, even though strong patterns exist; for example, a pattern may be obscured by a large amount of data.

Machine learning techniques can be divided into two broad classes: non-symbolic learning methods, such as neural networks, and symbolic methods, to which PROMIS belongs. Non-symbolic methods are closely related to statistical methods (Angus, 1989). A neural net learning system consists of a network of non-linear processing units that have adjustable connection strengths (Rumelhart et al., 1986). Learning consists of altering the weights of connections between the units in response to a training signal that provides information about the correct classification in input terms (primary and secondary
structure). The goal is to find a good input-output mapping; this mapping can then be used to predict the test set. Qian & Sejnowski (1988) and Holley & Karplus (1989) applied neural networks to the general problem of predicting secondary structure, achieving accuracies of 64.3% and 63%, respectively. They both experimented with different designs of networks and reached about the same conclusions about the best design (Spackman, 1989). McGregor et al. (1989, 1990) applied neural networks to the specific problem of predicting β-turns and achieved a 3% improvement over existing methods. The prediction accuracies of the neural networks are as good as those of any other methods. However, using neural nets for learning has the large disadvantage that the predictions made have little explanation in humanly comprehensible terms (i.e. they are not easily related to the chemical properties of the residues). Instead, the predictions are based on the values of a large number of adjustable numerical parameters. This means that it is difficult to understand how a neural network works, and it is difficult to relate its predictions to other knowledge about proteins. In general, neural networks are good if all we want is a simple prediction, but are bad if an explanation of the prediction is required. In protein secondary structure prediction we would like to know why a protein formed a particular secondary structure; this makes the problem inherently unsuitable for a solely neural network approach. In contrast, symbolic learning methods are based on manipulating symbols, not just numbers, and can produce results that are humanly comprehensible. Symbol manipulation is the basis for most artificial intelligence research, and symbolic learning techniques have been successful in a number of previously intractable problems and are now widely applied in industry (Leech, 1986; Bratko et al., 1988).
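To make the non-symbolic approach concrete, the following is a minimal sketch of window-based learning, not the multi-layer networks of Qian & Sejnowski: a single-layer perceptron trained on invented residue windows and helix/non-helix labels (all data and names here are illustrative only).

```python
# A toy non-symbolic learner: a perceptron that maps a fixed window of
# residues to a helix (+1) / non-helix (-1) label by adjusting weights
# in response to a training signal, as described in the text.

RESIDUES = "acdefghiklmnpqrstvwy"

def encode(window):
    """One-hot encode a window of one-letter residue codes."""
    x = [0.0] * (len(window) * len(RESIDUES))
    for i, r in enumerate(window):
        x[i * len(RESIDUES) + RESIDUES.index(r)] = 1.0
    return x

def train(examples, epochs=20):
    """Perceptron learning: update weights only on misclassified windows."""
    n = len(examples[0][0]) * len(RESIDUES)
    w = [0.0] * n
    for _ in range(epochs):
        for window, label in examples:
            x = encode(window)
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            if pred != label:
                w = [wi + label * xi for wi, xi in zip(w, x)]
    return w

def predict(w, window):
    x = encode(window)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

# Invented training windows: leucine-rich windows labelled helix,
# proline/glycine-rich windows labelled non-helix.
data = [("lll", 1), ("lkl", 1), ("ppp", -1), ("gpg", -1)]
w = train(data)
```

The learned knowledge lives entirely in the numerical weight vector `w`, which illustrates the comprehensibility problem discussed above: nothing in `w` states a chemical reason for the prediction.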
Apart from PROMIS, NTC (New Term Constructor) is the only other symbolic machine learning program that has been applied to the problem of predicting protein secondary structure (Seshu et al., 1988). NTC starts with simple user-supplied terms (patterns of residues), from which it gradually forms new terms that are useful in the domain, i.e. that have high discriminatory power and co-operate well with the other terms in prediction. The main way NTC differs from PROMIS is that in NTC comparatively little attention is paid to the biological aspects of the problem and no attempt is made to acquire explicit knowledge about protein folding. NTC achieved a prediction accuracy of 60.6%.

In this paper, we have applied a symbolic learning method (PROMIS) to the secondary structure prediction problem. This algorithm, using the database of proteins of known conformation, learns general rules that characterize the relationship between primary and secondary structure in the database. These rules can then be used to predict an unknown secondary structure from a known primary structure. The algorithm is designed to produce rules that are intelligible in terms of
chemistry theory. These rules can then be compared with existing knowledge to gain a better understanding of protein structure.

2. Methods

(a) Data

The set of proteins used came from the Brookhaven Data Base (Bernstein et al., 1977). The proteins used were selected to remove homologous proteins. In addition, only one polypeptide chain was selected from any protein. Selection of the data was done to make the data as unbiased as possible. The secondary structure of the proteins was objectively designated using a modified algorithm from Kabsch & Sander (1983), which automatically assigns secondary structure from co-ordinates. All types of secondary structure that were not α-helix or β-strand were considered as coil. The example proteins were split into a training set and a test set. This was done randomly to give a roughly 7 : 3 division. The proteins used in the training set were (Brookhaven codes): P155C, P1ALP, P1AZA, P1CAC, P1CC5, P1CCR, P1CRN, P1CTS, P1CTX, P1ECD, P1GPD, P1HIP, P1HMQ, P1INS, P1LZM, P1MBS, P1NXB, P1RHD, P1SN3, P1SRX, P2APP, P2B5C, P2C2C, P2CDV, P2CHA, P2GRS, P2LYZ, P2PAB, P2SNS, P2SOD, P2SSI, P2STV, P2TAA, P2TBV, P3ATC, P3CPV, P3FABH, P3PGK, P3PGM, P3TLN, P4ADH, P4LDH and P4RSA. There were 8024 residue positions; 2161 were in α-helices, 1466 were in β-strands and 4397 were in coils. The proteins used in the test set were: P156B, P1ABP, P1ACX, P1BP2, P1OVO, P1PYP, P1SBT, P1TIM, P2ADK, P2CNA, P2FD1, P2MDH, P351C, P3DFR, P3RXN, P4FXN, P5CPA and P8PAP. There were 3283 positions in the test set; 917 were in α-helices, 668 were in β-strands and 1698 were in coils. The proteins were classified on the basis of domain type. If a protein contained more than one domain type (e.g. had an α domain and a β domain), then the protein was split and each domain considered as a separate protein. The α domain type proteins in the data were: P155C, P156B, P1BP2, P1CC5, P1CCR, P1CRN, P1CTS, P1ECD, P1HMQ, P1MBS, P2B5C, P2C2C, P2CDV, P351C and P3CPV.
There were 2028 residue positions; 1065 were in α-helices, 55 were in β-strands and 908 were in coils. The β domain type proteins in the data were: P1ACX, P1ALP, P1AZA, P1CTX, P1NXB, P1OVO, P1SN3, P2APP, P2CHA, P2CNA, P2PAB, P2SOD, P2SSI, P2STV, P2TBV and P3FABH. There were 2852 residue positions; 162 were in α-helices, 1059 were in β-strands and 1631 were in coils. The α/β domain type proteins in the data were: P1ABP, P1GPD, P1RHD, P1SBT, P1SRX, P1TIM, P2ADK, P2GRS, P3ATC, P3DFR, P3PGK, P3PGM, P4FXN and P5CPA. There were 4404 residue positions; 1352 were in α-helices, 669 were in β-strands and 2382 were in coils. The α+β domain type proteins in the data were: P1CAC, P1HIP, P1INS, P1LZM, P1PYP, P2FD1, P2LYZ, P2SNS, P3RXN, P3TLN and P4RSA. There were 2023 residue positions; 499 were in α-helices, 351 were in β-strands and 1173 were in coils.

(b) Machine learning

(i) General description

The learning problem is: given the proteins of known primary and secondary structure, find generalized relationships between the existing primary and secondary
structure that can be used to predict an unknown secondary structure from a known primary structure. The inputs of PROMIS are: data giving the corresponding primary and secondary structure of proteins whose conformation is known from crystallography, and information on the relative chemical and physical properties of the different residues. The output of the program is a series of if-then prediction rules: "if a specified primary structure occurs then predict a particular secondary structure". Learning is carried out by a search of the possible general relationships/rules in the database to find relationships/rules that describe the data well. The search starts with the simplest possible rule: if any residue, then the residue is in a given type of secondary structure (e.g. every residue forms an α-helix). This rule is then systematically altered to find more powerful rules. This process is repeated at each stage of the search, with the most powerful rules being altered to find yet more powerful rules, which are in turn altered themselves. The search ends when no more powerful rules can be found. The search is guided by an evaluation function that defines how well a rule describes the data. This evaluation function is a compromise between the accuracy of the rule and the number of examples covered. The final rule is in the following form: if a particular pattern of residues occurs, then the residues are all in a particular type of secondary structure. In the pattern description, each residue position is defined by a particular set of physical-chemical properties: e.g. if the 1st residue is positive in charge, the next negative and the 3rd positive, then the 3 residues are in an α-helix.

(ii) Search

In this work, machine learning is based on the idea of searching for powerful generalizations of examples (Mitchell, 1982; Dietterich & Michalski, 1983; Carbonell & Langley, 1987).
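The if-then rule form described above can be sketched in a few lines: a rule's condition is a sequence of residue classes (sets of one-letter codes) and its conclusion is one secondary-structure state. The class contents and the fragment below follow the paper's own examples; the function and variable names are illustrative.

```python
# Sketch of the PROMIS rule representation: condition = sequence of
# residue classes, conclusion = a secondary-structure state.

ALL = set("acdefghiklmnpqrstvwy")   # the class "all" (every residue)
POSITIVE = set("rkh")
NEGATIVE = set("de")

# The search starts from the simplest possible rule: any residue is
# predicted to be in a given state (e.g. every residue forms a helix).
initial_rule = ([ALL], "helix")

# The example rule "[positive negative positive] -> helix":
candidate = ([POSITIVE, NEGATIVE, POSITIVE], "helix")

def matches(rule, fragment):
    """True if each residue of the fragment belongs to the residue
    class at the corresponding position of the rule's condition."""
    classes, _state = rule
    return len(fragment) == len(classes) and all(
        residue in cls for residue, cls in zip(fragment, classes))

# The primary fragment [h d r] matches the candidate rule, so all
# three residues are predicted to be helical.
assert matches(candidate, "hdr")
```

Learning then consists of systematically altering such rules (generalizing, specializing and lengthening the condition) and keeping the alterations that score best on the training proteins.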
Given (1) an initial rule (initial search state: a rule for predicting protein secondary structure from primary structure); (2) a method of evaluating the predictive power of a rule (evaluation function: based on the examples of proteins with known primary and secondary structure); (3) a method of altering rules into new candidate rules (state operators: generalizations and specializations based on background knowledge about chemistry, residue structure, etc.): find, by search through the data, a sequence of actions that will transform the initial rule into the rule with the highest evaluation function that can be found (the goal state: the most powerful inductive generalization).

(iii) Rules

The rules used in PROMIS take as input patterns of residues and produce as output predicted types of secondary structure. In the simplest form of this type of rule, where no classification of the residues is used, only 3 or fewer residues can be used to specify a pattern of residues. This is because larger patterns of amino acids do not occur frequently enough in the database to allow statistical analysis. Rooman & Wodak (1988) used this form of rule in their work on secondary structure prediction rules. In PROMIS, to overcome the limitation of only being able to specify 3 residues, the residues are grouped into classes and these classes are used to specify residue patterns. This scheme allows more complex patterns to be described, as each class specifies several residues. Each class contains residues that share a particular property, or a chemically sensible conjunction or disjunction of properties. The classification used in this work is based on that
of Taylor (1986). It is a compromise between the physical-chemical properties of the individual amino acids and the frequency of mutation between one amino acid and another (which is also a measure of the relatedness of residues). For example, the class "charged" contains the residues {d,e,r,k,h} ({} indicates a set), the class "hydrophilic" contains the residues {s,n,d,e,q,r}, and the class "charged or hydrophilic" contains the residues {s,n,d,e,q,r,k,h}. The logical operators "or", "and" and "not" are used in describing class properties. A "minus" notation is also used in the descriptions, as this is easier to understand than strict logical formalism: for example, the class "all minus k-p" contains all possible residues except k and p. The classes used in PROMIS are given in Table 1. Using these classes, the rules used in PROMIS are of the form: if the sequence of classes [Cc Dc Ec] occurs, then the residues are all in the secondary structure type S (i.e. a sequence of residues [w x y] occurs ([] indicates a sequence), where the 1st position residue w belongs to the class Cc, the 2nd position residue x belongs to the class Dc, and the 3rd position residue y belongs to the class Ec). For example, the rule [positive negative positive] → secondary structure type A, with positive ≡ {h,k,r} and negative ≡ {d,e}, means that the primary sequence [h d r] is predicted by the rule to have the corresponding secondary structure [A A A]. This type of rule resembles that used by Cohen et al. (1983). It is an implicit assumption of this form of rule, consisting of a sequence of chemical properties, that what is important in forming secondary structure from primary structure is not the particular residue at each position, but some specific chemical property or combination of properties.

(iv) Rule evaluation

The goal of the search is to find powerful rules for converting primary structure into secondary structure.
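Anticipating the evaluation function defined in this section, here is a minimal sketch of how a rule is scored against the database: the rule is slid along a primary sequence, every position under a matching window is predicted, and the counts P (correct), N (incorrect) and M (not predicted) feed the evaluation (P - N)/(P + N + M). The classes C1-C3 are invented so that the rule reproduces the worked example given later in this section.

```python
# Scoring a rule against a stretch of the database (sketch only).

def covered_positions(classes, sequence):
    """Positions predicted by sliding the rule along the sequence:
    wherever every class matches, all positions under the rule are
    predicted to be in the rule's secondary-structure state."""
    covered = set()
    k = len(classes)
    for i in range(len(sequence) - k + 1):
        if all(sequence[i + j] in classes[j] for j in range(k)):
            covered.update(range(i, i + k))
    return covered

def evaluate(classes, state, sequence, actual):
    covered = covered_positions(classes, sequence)
    P = sum(1 for i in covered if actual[i] == state)  # correct
    N = len(covered) - P                               # incorrect
    M = len(sequence) - len(covered)                   # not predicted
    return P, N, M, (P - N) / (P + N + M)

# Worked example from the text: primary [f g h h g h] with actual
# structure [A A A B B B]; an (invented) rule covering the first four
# positions gives P = 3, N = 1, M = 2 and an evaluation of 0.333.
C1, C2, C3 = set("fg"), set("gh"), set("h")
P, N, M, score = evaluate([C1, C2, C3], "A", "fghhgh", "AAABBB")
# -> P = 3, N = 1, M = 2, score = 0.333
```

In PROMIS the counts are of course accumulated over every protein in the training set, not a single fragment.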
To do this, a definition of what is meant by a powerful rule is needed: an evaluation function. In addition, a method of assessing statistical significance is needed to ensure that the rules found are real regularities in the data and not just chance coincidences. There are 2 main factors in producing an evaluation function for a rule: accuracy and coverage. Coverage defines how general the rule is and can be defined as (P + N)/(P + N + M), where P is the number of correctly predicted positions, N is the number of incorrectly predicted positions and M is the number of positions not predicted. Accuracy defines how correct the rule is and can be defined as P/(P + N). The existence of a large amount of noise in the data (associated with the restrictions inherent in our data representation) means that the best rules should not be expected to be 100% accurate. It is also expected that no single rule will have 100% coverage. The evaluation function used is (P - N)/(P + N + M); it is a combination of measures of accuracy and coverage. The justification for its use is that it increases with correct predictions, decreases with incorrect predictions and is normalized for a given example set. To find the evaluation of a rule, the sections of primary

Table 1
The residue classification used by PROMIS

Class name  Residues  Occ.
All  {a,e,q,d,n,l,g,k,s,v,r,t,p,i,m,f,y,c,w,h}  15
Hydrophobic  {h,w,y,f,m,l,i,v,c,a,g,t,k}  5
Small  {p,v,c,a,g,t,s,n,d}  1
Positive  {r,k,h}  2
Charged  {d,e,r,k,h}  0
Polar minus y  {t,s,n,d,e,q,r,k,h,w}  0
Tiny  {a,g,s}  0
Aliphatic  {l,i,v}  4
Aromatic  {h,w,y,f}  1
Negative  {d,e}  0
Charged minus h  {d,e,r,k}  1
Hydrophilic minus positive  {s,n,d,e,q}  0
Hydrophilic  {s,n,d,e,q,r}  0
Charged or hydrophilic  {s,n,d,e,q,r,k,h}  1
Charged or hydrophilic or p  {s,n,d,e,q,r,k,h,p}  3
(Polar minus aromatic) or charged or p  {p,t,s,n,d,e,q,r,k,h}  1
Polar or p  {p,t,s,n,d,e,q,r,k,h,w,y}  3
(Polar minus aromatic) or charged  {t,s,n,d,e,q,r,k,h}  1
((Polar minus aromatic) minus positive) or p  {p,t,s,n,d,e,q}  0
(Polar minus aromatic) minus positive  {t,s,n,d,e,q}  0
(Small and polar) or p  {p,t,s,n,d}  0
Small and polar  {t,s,n,d}  0
Small and hydrophilic  {s,n,d}  0
Tiny or (small and polar)  {a,g,t,s,n,d}  0
Tiny or (small and polar) or p  {p,a,g,t,s,n,d}  2
Tiny or (negative and hydrophilic) or t  {a,g,t,s,n,d,e,q}  1
Tiny or (negative and hydrophilic) or t or p  {p,a,g,t,s,n,d,e,q}  2
Tiny or (polar minus aromatic)  {a,g,t,s,n,d,e,q,r,k}  2
Tiny or (polar minus aromatic) or p  {p,a,g,t,s,n,d,e,q,r,k}  7
Tiny or polar  {a,g,t,s,n,d,e,q,r,k,h,w,y}  4
(Small minus p) or polar  {v,c,a,g,t,s,n,d,e,q,r,k,h,w,y}  6
Small or polar  {p,v,c,a,g,t,s,n,d,e,q,r,k,h,w,y}  10
(Small minus p) or (polar minus aromatic)  {v,c,a,g,t,s,n,d,e,q,r,k}  5
Small or (polar minus aromatic)  {p,v,c,a,g,t,s,n,d,e,q,r,k}  2
(Small minus p) or hydrophilic  {v,c,a,g,t,s,n,d,e,q,r}  4
Small minus p  {v,c,a,g,t,s,n,d}  0
(Small and hydrophobic) or tiny  {v,c,a,g,t,s}  1
Small and hydrophobic  {v,c,a,g,t}  0
(Small minus polar) minus p  {v,c,a,g}  0
Small minus polar  {v,c,a,g,p}  1
Small minus hydrophilic  {v,c,a,g,p,t}  0
Aliphatic or (small minus hydrophilic)  {l,i,v,c,a,g,p,t}  0
Aliphatic or (small minus polar)  {l,i,v,c,a,g,p}  1
(Aliphatic or (small and hydrophobic)) minus t  {l,i,v,c,a,g}  1
Aliphatic or (large minus polar)  {f,m,l,i,v}  5
Very-hydrophobic  {f,m,l,i,v,a,g}  1
Very-hydrophobic or p  {p,f,m,l,i,v,a,g}  0
Very-hydrophobic or t  {f,m,l,i,v,a,g,t}  3
Very-hydrophobic or t or p  {p,f,m,l,i,v,a,g,t}  6
Very-hydrophobic or t or k  {f,m,l,i,v,a,g,t,k}  0
Very-hydrophobic or (small minus p-c) or k  {f,m,l,i,v,a,g,t,k,s,n,d}  2
Very-hydrophobic or (small minus c) or k  {f,m,l,i,v,a,g,t,k,s,n,d,p}  3
Hydrophobic or small  {h,w,y,f,m,l,i,v,c,a,g,t,k,s,n,d,p}  2
Hydrophobic or (small minus p)  {h,w,y,f,m,l,i,v,c,a,g,t,k,s,n,d}  11
Hydrophobic or p  {h,w,y,f,m,l,i,v,c,a,g,t,k,p}  1
Aromatic or very-hydrophobic  {h,w,y,f,m,l,i,v,c,a,g}  3
Aromatic or aliphatic or m  {h,w,y,f,m,l,i,v}  9
Aromatic or m  {h,w,y,f,m}  2
Large minus negative  {r,k,h,w,y,f}  0
Large and polar  {q,e,r,k,h,w,y}  1
Large minus aliphatic  {q,e,r,k,h,w,y,f,m}  3
Large  {q,e,r,k,h,w,y,f,m,l,i}  6
Hydrogen-bond-donors  {w,y,h,t,k,c,s,n,q,r}  0
Hydrogen-bond-acceptors  {y,t,c,s,d,e,n,q}  1
All minus p  {a,e,q,d,n,l,g,k,s,v,r,t,i,m,f,y,c,w,h}  9
(Not hydrophobic) and (not p)  {e,q,d,n,s,r}  0
Not (hydrogen-bond-donors or hydrogen-bond-acceptors)  {a,l,g,v,p,i,m,f}  0
Not charged  {a,q,n,l,g,s,v,t,p,i,m,f,y,c,w}  9
All minus k-p  {a,e,q,d,n,l,g,s,v,r,t,i,m,f,y,c,w,h}  11
Proline  {p}  0
Glycine  {g}  0

Residues gives the set of residues found in a particular class. Occ. is the number of times this class appears in the rules.

and secondary sequence are collected where the rule applies in the database. The sequences of actual secondary structure are then compared with the predicted secondary structure to count how many positions were correctly predicted and how many were incorrectly predicted. For example: if the primary sequence [f g h h g h] is found to have the secondary structure [A A A B B B] in the database, and is predicted to have the secondary structure [A A A A X X] by a rule (X = no prediction made), then the number of correctly predicted positions is 3, the number of incorrectly predicted positions is 1, the number of positions not predicted is 2, and the evaluation of the rule is 0.333.

It is important that any relationship between primary and secondary structure that is found in the database should be statistically significant. This is because the rule is to be used to predict unknown secondary structure in the future, and thus must represent a real relationship, not one that arises through the chance existence of particular primary and secondary structures within the database. As a heuristic for finding significant rules, a threshold test is used in PROMIS. This involves introducing a threshold number of positions that a rule must cover before it is considered to be significant. For example, if the threshold is set at 100, then the number of correctly predicted plus incorrectly predicted positions must be >100. A similar method is used in RULEGEN from Meta-DENDRAL (Buchanan & Feigenbaum, 1981) and in the work of Rooman & Wodak (1988).

(v) Control of search

A consequence of the ability to describe more complex patterns of primary sequence is that it is no longer possible to search all possible rules exhaustively, as there are too many. Therefore, a form of heuristic search is used. Heuristics are rules of thumb that aid the search and are normally, but not always, correct; e.g.
try searching where you have already found good rules. The form of heuristic search used was hill-climbing beam search. The term hill-climbing search means that you always move to a state with a higher evaluation function and never go back to a previous state; the search behaves like a person climbing a hill by only going uphill. This has the advantage that you do not have to remember everywhere you have been, but the disadvantage that it is possible to be caught at a local maximum, as a climber going only uphill can be caught on a side peak and never reach the main summit, since reaching it would mean going downhill. For this reason, a beam (collection) of different rules is used in parallel to reduce the danger of all the rules being caught at local maxima, just as a party of climbers each taking different paths would have a better chance of getting a member to the summit than a lone climber would have on his own (Bisiani, 1987). Beam search is used in many symbolic machine learning programs (e.g. CN2 by Clark & Niblett, 1987).

The representation of the residues in various classes is equivalent to a generalization hierarchy, with the classes being partially ordered (Michalski & Stepp, 1983). This is because some classes are more general than others: e.g. the class charged is more general than the class positive. The most general class (the root node of the hierarchy) is the class of all residues ("all"). The hierarchy is not fully ordered because a class can often be generalized in several ways. The representation of the residues in a generalization hierarchy suggests the method of induction known as climbing a generalization tree (Michalski, 1983). This is based on generalizing by moving up the hierarchy. This can be represented more formally thus: if the rule [Bs] → A exists, where Bs and Cs are sets, Cs is higher in the hierarchy than Bs and A is a secondary conformation type, then the following inductive inference can be made: [Cs] → A.

[Figure 1 (diagram not reproduced). Generalizations and specializations of the class charged: for example, charged generalizes to charged or hydrophilic and to (polar minus aromatic) or charged, and specializes to positive, negative and charged minus h.]
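The tree-climbing and string-lengthening operators can be sketched as follows, with a toy three-edge hierarchy standing in for the full partial order over Table 1 (class names and edges here are illustrative, not the hierarchy actually used by PROMIS):

```python
# Search operators over a toy generalization hierarchy:
# parent = one step more general, child = one step more specific.

PARENT = {"positive": "charged", "negative": "charged", "charged": "all"}
CHILDREN = {}
for child, parent in PARENT.items():
    CHILDREN.setdefault(parent, []).append(child)

CLASSES = ["all", "charged", "positive", "negative"]  # names only; contents as in Table 1

def neighbours(rule):
    """New candidate rules: generalize or specialize one class by one
    step in the hierarchy, or lengthen the rule at either end."""
    condition, state = rule
    out = []
    for i, name in enumerate(condition):
        if name in PARENT:                      # climb the tree (generalize)
            out.append((condition[:i] + [PARENT[name]] + condition[i + 1:], state))
        for child in CHILDREN.get(name, []):    # descend the tree (specialize)
            out.append((condition[:i] + [child] + condition[i + 1:], state))
    for name in CLASSES:                        # extend at either end
        out.append(([name] + condition, state))
        out.append((condition + [name], state))
    return out

# e.g. [positive] -> helix generalizes to [charged] -> helix (Fig. 1),
# and extends to [positive negative] -> helix, among others.
start = (["positive"], "helix")
assert (["charged"], "helix") in neighbours(start)
```

In PROMIS these operators supply the candidate rules that the hill-climbing beam search evaluates at each stage.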

A possible example of this is [positive] → A becomes [charged] → A (Fig. 1). Specialization can easily be carried out with the use of the generalization tree, by simply reversing generalization and moving down the tree. A method of increasing the length of the string is also needed. This can be achieved by adding a new class to either end of the string of classes, a form of specialization specific to string induction. A possible example of this is: [positive] → A becomes [positive negative] → A. To reduce the amount of search, the operators used were restricted to (1) lengthening one end of a rule at a time by adding a new class to the rule's condition; (2) using the generalization tree operators to generalize or specialize one class of the rule's condition at any one time. All the rules were found starting with the beam containing only the rule [all] → required secondary structure, where all matches every type of primary structure and the required secondary structure is either α-helix, β-strand or coil. At each stage in the search, the search operators are used to generate new rules from the beam set; i.e. the beam set rules are generalized, specialized and extended to produce new rules, some of which are hoped to be better than the original rules. These new rules are then evaluated by examining how well they do on the training data. Any rule from the beam store that does not produce a better rule is stored, as it has reached a local maximum and cannot be improved. The new beam set is then selected. This process is repeated until the beam set is empty (no more paths to try) or a set number of iterations have passed without a rule entering the store with the highest found evaluation (no more progress is being made). The beam size was 10, made up as follows: first, the highest-scoring rule according to the evaluation function; then (if possible) 3 rules selected to be different in at least
2 places from this rule; finally (if possible) the next highest-scoring 6 rules, making 10 in all.

[Figure 2. Flowchart of the PROMIS algorithm: input a beam of rules; form new rules from the old beam by generalization, specialization and extension; evaluate the new rules; store any rule in the old beam that cannot be improved; select the best rules and form them into the new beam set; on termination, output the rule with the highest evaluation in the store.]

(c) Experiments

A number of different experiments were carried out with PROMIS to determine the usefulness of the algorithm and the best way to apply it to protein structure prediction. The most basic experiment was simply to apply PROMIS to the complete database. This resulted in finding rules for predicting α-helix, β-strand and coil structure that are
general for the whole database. After PROMIS had been tried on the complete database, the database was split up into separate databases consisting of proteins of only one type of domain, as it might be easier to find rules if the example proteins were more homogeneous in structure. Four new databases were formed: for α domain proteins, β domain proteins, α/β domain proteins and α+β domain proteins. PROMIS was applied to each of the individual databases. Experiments were then tried to find variations of successful rules that have different accuracies. This would highlight the important constant features
of a rule across the range of accuracies. For example, if a successful rule is found with accuracy 65%, then variations of the rule might be sought with accuracy >80%; the resulting rule is likely to cover fewer examples, but it will show what features are important in making the rule more accurate. Conversely, if a lower accuracy is set, then the most general features of the rule will tend to show through, possibly making the rule more comprehensible. Experiments were then tried using a form of threshold logic in the representation of rules. This explores whether a rule matches a primary structure even if not all the class positions of the rule match the primary structure exactly. It was hoped that by changing the representation in this way more powerful rules would be found: rules that cover more examples of a particular secondary structure while still retaining high accuracy and human comprehensibility. Mistake matching implies a model of secondary structure in which not every position in a primary sequence is vital in forming the secondary structure, and in which any one position can be substituted without certain loss of the corresponding secondary structure. Finally, the rules found by PROMIS were combined to produce a complete secondary structure prediction method. This prediction method takes as input the primary structure of a protein and produces as output a prediction of the protein's secondary structure.

(d) Implementation

PROMIS was implemented in Quintus Prolog. The induction experiments were carried out on SUN 2 and 3 computers. As PROMIS uses generate-and-test control and Prolog is inefficient, the search was computationally expensive and took about 1 h for each generate and test cycle on a Sun 3/75.

3. Results

(a) Experiment 1

PROMIS was used to find general rules for predicting secondary structure from primary structure. Six such rules were found: three for α-helices, two for β-strands and one for coil. These rules were found to have good accuracy and to cover a large amount of the database (Table 2). All the rules found can be interpreted to show agreement with accepted knowledge about the chemistry and structure of proteins; they may also contain clues about
previously unrecognized regularities in protein structure. The only knowledge about proteins that was coded into PROMIS was about residue classes. The higher-level features that can be recognized in the rules, such as the amphipathicity in the α-helix rules and the hydrophobic cores in the β-strand rules, were found empirically by PROMIS.

(i) Structure of the rules

The general rules found are illustrated in Figure 3. The three rules for predicting α-helices are (1a), (2a) and (3a); these rules are displayed on helical wheel diagrams in Figures 4, 5 and 6. The three α-helix rules are amphipathic, with hydrophobic and hydrophilic residues at positions n, n+3 and n+4, or at positions n+3 and n+4 (Schiffer & Edmundson, 1967). A separation of three positions between a pair of hydrophobic or hydrophilic residue classes (n and n+3, or n and n-3) also occurs (Lim, 1974). Corresponding to this separation between hydrophobic and hydrophilic residue classes, it is found in the rules that residue classes that have no preference tend to be sandwiched between the hydrophobic and hydrophilic faces. The residue proline (symbol p) is excluded from most classes involved in the α-helix prediction rules. This is to be expected, as proline disrupts α-helix formation. The residue lysine (symbol k) has a positive charge and is often found to be excluded from residue classes at the start of a rule (Chou & Fasman, 1974). This may be because the start of an α-helix is usually slightly negative. The two rules for predicting β-strands are (1b) and (2b); these rules are displayed with a rotation of 180° per residue, which produces the two faces of the β-strand (Figs 7 and 8). Both β-strand rules show a high concentration of hydrophobic residue classes, which favours β-strand formation. Three hydrophobic residues in sequence are less frequently observed in an α-helix, as the three residues would be almost equally spaced around the helical wheel, leaving no space for a hydrophilic face; hydrophobic residues are unlikely to occur in coil secondary structure, as coils occur mainly on the outside of the molecule, where hydrophobic residues are not energetically favoured. The

Table 2
Individual evaluation of general protein rules found

                 On training data                    On test data
Rule   Evaluation  % Covered  % Correct    Evaluation  % Covered  % Correct  % Freq
1a     0.0339      17         79            0.0079     13         56         28
2a     0.0147      12         64           -0.0006      8         49         28
3a     0.0242      12         78            0.0155      7         83         28
1b     0.0067      16         57            0.0049     17         54         20
2b     0.0014      10         52           -0.0006      3         48         20
1t     0.1639      78         62            0.1167     81         58         52

Evaluation is the evaluation function described in Methods. % Covered is the amount of secondary structure of the type predicted that is covered. % Correct is the accuracy of the prediction for the positions covered. % Freq (% Correct based on frequency of secondary structure) gives the frequency of the predicted type of secondary structure in the test set.
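The % Covered and % Correct columns of the tables can be computed directly from a rule's per-residue predictions. A minimal sketch, assuming a three-state encoding ('H' helix, 'E' strand, 'C' coil); the function and variable names are ours, not PROMIS's:

```python
def coverage_and_accuracy(predicted, actual, target):
    """Per-rule coverage and accuracy as defined in the table legends.

    predicted: per-residue rule output, e.g. 'H' where the rule fires, None elsewhere
    actual:    per-residue secondary structure from the X-ray assignment
    target:    the secondary-structure type the rule predicts ('H', 'E' or 'C')
    """
    # residues of the target type that the rule correctly covers
    covered = sum(1 for p, a in zip(predicted, actual) if p == target and a == target)
    total_target = sum(1 for a in actual if a == target)   # all residues of the target type
    fired = sum(1 for p in predicted if p == target)       # all positions where the rule fired
    pct_covered = 100.0 * covered / total_target if total_target else 0.0
    pct_correct = 100.0 * covered / fired if fired else 0.0
    return pct_covered, pct_correct

# toy example: the rule fires on 4 positions, 3 of them genuinely helical
pred   = ['H', 'H', None, 'H', 'H', None]
actual = ['H', 'H', 'C',  'C', 'H', 'H']
cov, acc = coverage_and_accuracy(pred, actual, 'H')  # covers 3 of 4 helix residues; 3 of 4 firings correct
```

The Evaluation column depends on the evaluation function described in Methods and is not reproduced here.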

R. D. King and M. J. E. Sternberg

(1a) → α-Helix:
All minus k_p
Small or polar
Small or polar
Charged minus h
Aliphatic or (large minus polar)
All minus p
(Small minus p) or polar
Hydrophobic
Aromatic or very-hydrophobic
Tiny or polar
(Small minus p) or polar

(2a) → α-Helix:
All minus k_p
All minus k_p
All minus k_p
All minus k_p
(Small minus p) or (polar minus aromatic)
All minus k_p
All
Hydrophobic or (small minus p)
(Polar minus aromatic) or charged
Aromatic or m
Aromatic or aliphatic or m
All minus p
All minus p
Hydrophobic or p

(3a) → α-Helix:
All
(Small minus p) or (polar minus aromatic)
All
(Aliphatic or (small and hydrophobic)) minus t
Tiny or (negative and hydrophilic) or t
(Small minus p) or (polar minus aromatic)
Very-hydrophobic
Large

(1b) → β-Strand:
Small or polar
Aliphatic or (large minus polar)
Aromatic or aliphatic or m
Aliphatic or (large minus polar)
Hydrophobic or (small minus p)
Positive
All minus p
All

(2b) → β-Strand:
Hydrophobic
(Small and hydrophobic) or tiny
Aliphatic
Aliphatic
Small or (polar minus aromatic)
Very-hydrophobic or (small minus p_c) or k
Hydrophobic or (small minus p)

(1t) → Coil:
All
Tiny or (small and polar) or p
All
Tiny or (polar minus aromatic) or p

Figure 3. Rules found using the complete database. The ordering of each rule goes from top to bottom, with the 1st position in sequence at the top and the last in sequence at the bottom. The identity of each rule is in parentheses.
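A rule of this form is an ordered list of residue classes, one per sequence position; it fires at a position when each residue in the window belongs to its class. A minimal sketch of the matching step, with illustrative class definitions that are much coarser than the paper's full taxonomy (the min_match parameter anticipates the threshold-logic variant discussed in section (c)):

```python
# Illustrative residue classes (one-letter codes); the paper's taxonomy
# (tiny, small, polar, aliphatic, aromatic, ...) is richer than shown here.
CLASSES = {
    "hydrophobic": set("avilmfwyc"),
    "polar":       set("stnqyhkrde"),
    "all_minus_p": set("acdefghiklmnqrstvwy"),  # all residues except proline
}

def rule_fires(rule, window, min_match=None):
    """True if the sequence window matches the rule.

    rule:      list of class names, one per position (N- to C-terminal)
    window:    lowercase residue string of the same length
    min_match: if set, at least this many positions must match
               (threshold-logic variant); by default all positions must match
    """
    if len(window) != len(rule):
        return False
    hits = sum(1 for cls, res in zip(rule, window) if res in CLASSES[cls])
    need = len(rule) if min_match is None else min_match
    return hits >= need

rule = ["all_minus_p", "polar", "hydrophobic"]
rule_fires(rule, "akv")               # all three positions match
rule_fires(rule, "akp")               # fails: proline is not hydrophobic here
rule_fires(rule, "akp", min_match=2)  # threshold variant: 2 of 3 suffice
```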

rule for coils is 1t. It includes two hydrophilic residue classes; these classes favour the outside of a protein, where coils occur.

(ii) Position of rule occurrences

The position of the regions of secondary structure predicted by the rules was examined. Table 3 illustrates the results with rule 1a on the test set. Details for the other rules on the test and training sets can be obtained from the authors. From these diagrams, none of the rules appears

to be biased towards one particular end (N or C) of the type of secondary structure it predicts; e.g. no rules are found that are specific for the beginning of α-helices. The rules tend to occur in the middle of a helix, with errors mostly occurring at either end with equal likelihood. The most common error concerns where a predicted secondary structure region ends. However, it is interesting that when an α-helix rule fails completely to predict an α-helix, that region is very commonly a β-strand; and conversely, when a β-strand rule fails to predict a

Machine Learning and Secondary Structure Prediction

[Helical wheel diagram for rule 1a, residue classes arranged on internal and external faces.]

Figure 4. Helical wheel plan of rule 1a. The hydrophobic residues are arranged in the classic sequence n (5), n+3 (8), n+4 (9); the same is true for the hydrophilic residues n (7), n+3 (10), n+4 (11) and n−4 (3), n−3 (4), n (7).
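A standard helical wheel places residue n at an angle of about 100° × n around the helix axis (3.6 residues per turn); the same construction with 180° per residue gives the two-faced β-strand displays of Figures 7 and 8. A sketch of the angle calculation (the function name is ours):

```python
def wheel_angle(position, degrees_per_residue=100.0):
    """Angle (degrees) of a residue on a helical-wheel style plot.

    100 deg/residue corresponds to an alpha-helix (3.6 residues per turn);
    180 deg/residue gives the alternating two faces of a beta-strand.
    """
    return (position * degrees_per_residue) % 360.0

# alpha-helix: n, n+3 and n+4 land close together, on one face
[wheel_angle(n) for n in (0, 3, 4)]         # 0, 300 and 40 degrees
# beta-strand: alternate residues fall on opposite faces
[wheel_angle(n, 180.0) for n in (0, 1, 2)]  # 0, 180 and 0 degrees
```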

β-strand, that region is an α-helix. There appear to be two classes of secondary structure: α-helices and β-strands in one, and coil in the other.

(b) Specific protein domain type rules

(i) α Domain rules

PROMIS was used to find specific prediction rules for α domain proteins (Fig. 9). Five rules were found, four for α-helices and one for coils. The rules for α-helices were much more powerful, both in


Figure 7. Display of rule 1b showing the position of the classes; it is displayed with a rotation of 180° per residue.


Figure 5. Helical wheel plan of rule 2a. The hydrophilic residues are in a linked pair n (5), n+4 (9); a separation of 4 places brings the residues close together on the face of the helix. The hydrophobic residues are in a triplet, n−4 (10), n−3 (11), n (14); the residues n (8), n+3 (11) are in a linked pair (a separation of 3 is not as good as that of 4, but is much better than 1, 2, 5 or 6).


higher accuracy and higher coverage, than the α-helix rules found in the general database (Table 4). It is not clear how much this greater success is simply due to a greater percentage of α-helix structure in the α domain database, and how much is due to a greater homogeneity of the α-helices. Rule a1t is not successful, having


Figure 6. Helical wheel plan of rule 3a. The hydrophobic residues are in a linked pair, n (4), n+3 (7); the hydrophilic residues are in triplets, n (2), n+3 (5), n+4 (6) and n−4 (5), n−3 (6), n (9). There is a possible charged pair interaction between residues at positions 9 and 5, or 9 and 6. The positive charge is situated towards the end of the helix, which is to be expected because of the helix dipole.


Figure 8. Display of rule 2b showing the position of the classes; it is displayed with a rotation of 180° per residue.


Table 3
The occurrences of rule 1a in the test set

Name    Pos   N       Rule          C
p1ABP    19   epwfq   tewkfadkagk   dlgfe
p1TIM   179   tatpq   qaqevheklrg   wlkth
p1TIM   183   qqaqe   vheklrgwlkt   hvsda
p2ADK   146   dneet   ikkrletyyka   tepvi
p2MDH   299   pindf   srekmnetake   laeee
p4FXN    11   wsgtg   ntekmaeliak   giles
p1TIM   108   esdel   igqkvahalae   glgvi
p351C    26   kmvgp   aykdvaakfag   qagae
p2ADK    24   sgkgt   qcekivqkygy   thlst
p2ADK   101   prevk   qgeeferkigq   ptlll
p3DFR    75   aqgav   vvhdvaavfay   akqhl
p1TIM   235   vggas   lkpefvdiina   kh
p1TIM    50   iyldf   arqkldakigv   aaqnc
p1TIM   201   sdava   vqsriiyggsv   tggnc
p3DFR    34   yfraq   tvgkimvvgrr   tyesf
p1SBT    38   sgids   shpdlkvagga   smvps
p1TIM     1           aprkffvggnw   kmngk
p1TIM    54   farqk   ldakigvaaqn   cykvp
p2CNA    57   iiyns   vdkrlsavvsy   pnada
p5PAP   153   ifvgp   cgnkvdhavaa   vgygp
The Name is the short standard identification name of the protein; Pos is the sequence position of the first amino acid residue covered by the rule (the 6th residue from the left in each row); Primary is the protein primary structure; Secondary is the corresponding secondary structure, taken from the X-ray structure and marked per residue as α-helix, β-strand or coil; Rule is the sequence covered by the rule; N is the 5 residue positions before the rule occurrence and C is the 5 residue positions after the rule. Rule 1a predicts the sequence covered by the rule to be all α-helix.

lower than average accuracy in the test set. The α-helix prediction rules can be split into two groups: rule a1a, which has high coverage and relatively low accuracy (covering 57% of all α-helical residues with 67% accuracy in the test set); and rules a2a, a3a and a4a, which have high accuracy and relatively low coverage (rule a4a covers 13% of the database with 100% accuracy in the test set). This grouping is also reflected in the amphipathicity of the rules, with rule a1a having only one position defined to be hydrophobic, and the other rules having more.

(ii) β Domain rules

PROMIS was used to find β domain-specific rules for predicting secondary structure from primary structure (Fig. 10). Four rules were found, three for β-strands and one for coils. The rules for β-strands were of about the same power as those found in the general database (Table 5).
However, rule b1b has a higher accuracy and coverage in the training set, and a higher coverage in the test set, than any of the general rules. The reason for the relative lack of success in finding β-strand prediction rules, compared with rules for α-helices, is not clear. Perhaps the β domain database is not as homogeneous as the α domain database. The rule for predicting coils is quite successful; it has constant accuracy and coverage over the training and test sets. The three β-strand prediction rules have high concentrations of hydrophobic residue classes, which favours β-strand formation; rule b3b appears to be overlong and complicated, which may be an artifact of the PROMIS

algorithm. Rule b1t has two hydrophilic positions in sequence, which favours coil and disrupts β-strand secondary structure.

(iii) α/β Domain rules

PROMIS was used to find α/β domain-specific rules for predicting secondary structure from primary structure (Fig. 11). Six rules were found: four for α-helices, one for β-strands and one for coils. Two of the α-helix rules, ab1a and ab4a, are more powerful than any of the general rules and have both high accuracy and coverage (Table 6). It is not yet clear how useful the other α-helix rules are, as they did not occur frequently enough in the test set to make a complete judgment. The β-strand rule is more powerful than any general β-strand prediction rule. The most interesting rule for α/β domains is that for β-strands (rule ab1b). It has four hydrophobic classes in a row, which form a β-strand with two hydrophobic faces. This agrees with the idea of β-strands in α/β domains normally being buried within the protein and having hydrophobic environments on both faces (it was found in the occurrences of the rule that if three aliphatic residues occur in a row, the accuracy of the rule is considerably improved). This rule should be contrasted with the β-strand rules found for β domain proteins.
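The observation that a run of three aliphatic residues within an occurrence of rule ab1b improves its accuracy can be expressed as a simple post-filter on rule firings. A hedged sketch (the function name and the exact aliphatic membership are ours):

```python
ALIPHATIC = set("avil")  # illustrative membership; the paper's class may differ

def has_aliphatic_run(window, run_length=3):
    """True if the window contains run_length consecutive aliphatic residues."""
    run = 0
    for res in window:
        run = run + 1 if res in ALIPHATIC else 0
        if run >= run_length:
            return True
    return False

has_aliphatic_run("gvilk")  # v-i-l is a run of three
has_aliphatic_run("gvkil")  # longest run is only two
```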
(iv) α + β Domain rules

PROMIS was used to find α + β domain-specific rules for predicting secondary structure from primary structure; five rules were found. These rules


(a1a) → α-Helix:
(Small minus p) or (polar minus aromatic)
(Small minus p) or polar
All minus k_p
All minus k_p
(Small minus p) or polar
All minus k_p
Aromatic or aliphatic or m
All minus p
(Small minus p) or (polar minus aromatic)

(a2a) → α-Helix:
All
Not charged
Large
Charged or hydrophilic or p
Aromatic or aliphatic or m
Aromatic or very-hydrophobic
Not charged
Small or polar
Hydrophobic

(a3a) → α-Helix:
Hydrogen-bond-acceptors
Aromatic or aliphatic or m
Aliphatic or (large minus polar)
Small or polar
Tiny or (polar minus aromatic)
Tiny or polar

(a4a) → α-Helix:
(Small minus p) or hydrophilic
Very-hydrophobic or t or p
Very-hydrophobic or t
All minus k_p
Large
Aliphatic or (large minus polar)
Charged or hydrophilic or p

(a1t) → Coil:
All
Very-hydrophobic or (small minus c) or k
(Polar minus aromatic) or charged or p
Tiny or (polar minus aromatic) or p
Very-hydrophobic or small or k
Small
Hydrophobic or small
Polar or p
All
Very-hydrophobic or small or k

Figure 9. Rules found using the α domain protein database. The ordering of each rule goes from top to bottom, with the 1st position in sequence at the top and the last in sequence at the bottom. The identity of each rule is in parentheses.

Table 4
Individual evaluation of α domain protein rules found

                 On training data                    On test data
Rule   Evaluation  % Covered  % Correct    Evaluation  % Covered  % Correct  % Freq
a1a    0.2661      73         76            0.1490     57         67         52
a2a    0.0837      18         89            0.0817     16         97         52
a3a    0.0645      13         93            0.0241      5         92         52
a4a    0.0645      13         96            0.0673     13         100        52
a1t    0.1321      55         68           -0.0577     21         38         45

Evaluation is the evaluation function described in Methods. % Covered is the amount of secondary structure of the type predicted that is covered. % Correct is the accuracy of the prediction for the positions covered. % Freq (% Correct based on frequency of secondary structure) gives the frequency of the predicted type of secondary structure in the test set.


(b1b)

(b2b)

Tiny or polar
All

(Small minus p) or hydrophilic
Hydrophobic or (small minus p)

All minus p
Very-hydrophobic or t or p
Small or (polar minus aromatic)
Large
Aliphatic
Charged or hydrophilic
(Small minus p) or polar

β-Strand

β-Strand

(b3b)

(b1t)

Hydrophobic or (small minus p)

All
Tiny or (small and polar) or p
Tiny or (polar minus aromatic) or p

Aromatic or aliphatic or m
Hydrophobic or (small minus p)
Aromatic or aliphatic or m

Neutral

Small or polar
Small or polar
Very-hydrophobic or t or p
Aromatic
Very-hydrophobic or t
(Small minus p) or hydrophilic
(Small minus p) or hydrophilic

Coil

β-Strand

Figure 10. Rules found using the β domain protein database. The ordering of each rule goes from top to bottom, with the 1st position in sequence at the top and the last in sequence at the bottom. The identity of each rule is in parentheses.

were found to be unsuccessful in prediction and appeared over-complex in structure and over-fitted to the training set. Most of the poor performance of the rules seems to stem from the failure of PROMIS to find strong, unambiguous regularities in the database. This may be because the secondary structure of α + β proteins is less homogeneous than that of other types of domain. This lack of homogeneity may have a number of causes: the domains tend to be small and atypical of proteins in general, they often have cystine bonds, and there is relatively little α-helix or β-strand structure.

(c) Variations on a theme

It was recognized that the rules produced by PROMIS were complex in form and difficult to interpret. For this reason, experiments were carried out to try to improve the comprehensibility of the rules produced. Variations of the rules were searched for using an alternative rule evaluation method. In this method, a threshold accuracy was set and variations of the original rule with accuracies above this threshold were sought; rules above the threshold accuracy were then evaluated by coverage. A range of threshold accuracies was set

Table 5
Individual evaluation of β domain protein rules found

                 On training data                    On test data
Rule   Evaluation  % Covered  % Correct    Evaluation  % Covered  % Correct  % Freq
b1b    0.0814      47         66           -0.0014     24         50         40
b2b    0.0275      11         75            0.0014      6         51         40
b3b    0.0261      10         76            0.0085      5         42         40
b1t    0.2420      74         69            0.2176     78         67         56

Evaluation is the evaluation function described in Methods. % Covered is the amount of secondary structure of the type predicted that is covered. % Correct is the accuracy of the prediction for the positions covered. % Freq (% Correct based on frequency of secondary structure) gives the frequency of the predicted type of secondary structure in the test set.


(ab1a) → α-Helix:
Small or polar
Hydrophobic or (small minus p)
All
Small or polar
All
Very-hydrophobic or (small minus p_c) or k
Charged or hydrophilic or p
Hydrophobic
Aromatic or aliphatic or m
Large minus aliphatic
Tiny or polar
All

(ab2a) → α-Helix:
All minus k_p
Hydrophobic
Tiny or (polar minus aromatic) or p
Tiny or (polar minus aromatic) or p
Aliphatic or (small minus polar)
Small minus polar
Positive
All minus p
Hydrophobic or (small minus p)

(ab3a) → α-Helix:
Very-hydrophobic or t or p
Tiny or (negative and hydrophilic) or t or p
Large minus aliphatic
Not charged
Hydrophobic or (small minus p)
Not charged
Large
Aromatic or m

(ab4a) → α-Helix:
All minus p
Not charged
Hydrophobic or (small minus p)
Tiny or (polar minus aromatic) or p
Aromatic or very-hydrophobic
Large
Large and polar
Polar or p
Not charged
All minus p
Small or polar
(Small minus p) or polar

(ab1b) → β-Strand:
Very-hydrophobic or t or p
Aliphatic
Aromatic or aliphatic or m
Very-hydrophobic or t

(ab1t) → Coil:
Hydrophobic or small
Tiny or (polar minus aromatic) or p
All
Polar or p

Figure 11. Rules found using the α/β domain protein database. The ordering of each rule goes from top to bottom, with the 1st position in sequence at the top and the last in sequence at the bottom. The identity of each rule is in parentheses.

Table 6
Individual evaluation of α/β domain protein rules found

                 On training data                    On test data
Rule   Evaluation  % Covered  % Correct    Evaluation  % Covered  % Correct  % Freq
ab1a   0.0535      35         69            0.0427     17         77         36
ab2a   0.0224      16         82           -0.0039      4         46         36
ab3a   0.0164      11         88            0.0000      0          0         36
ab4a   0.0115      12         89            0.0078     16         67         36
ab1b   0.0035      26         52            0.0246     39         61         17
ab1t   0.2235      81         66            0.0647     85         54         47

Evaluation is the evaluation function described in Methods. % Covered is the amount of secondary structure of the type predicted that is covered. % Correct is the accuracy of the prediction for the positions covered. % Freq (% Correct based on frequency of secondary structure) gives the frequency of the predicted type of secondary structure in the test set.


and variations of the original rule found at different accuracies. The aim of these experiments was to highlight the important constant features of a rule across a range of accuracies (varying the accuracy is like using a contrast control). In all the rules there was an inverse relationship between correctness of prediction and coverage: the higher the accuracy, the lower the coverage. There was also, in general, a direct relationship between accuracy and complexity of the rule. For α-helix prediction rules, it was found that with selection for higher accuracies the amphipathicity of the rules became more pronounced, with hydrophobic classes specializing to become more hydrophobic, hydrophilic classes specializing to become more hydrophilic, and undefined regions between hydrophobic or hydrophilic regions becoming defined as either hydrophobic or hydrophilic. With selection for lower accuracies, the reverse effects were found. Changing the accuracy of rules for β-strands produced no interesting results. For coil rules, rule 1t: [(all) (tiny or (small and polar) or p) (all) (tiny or (polar minus aromatic) or p)] was transformed into the high-accuracy rule [(tiny or (negative and hydrophilic) or t or p) (all) (proline)], which has 87% accuracy and 13% coverage in the test set. The inclusion of proline is interesting, as it is known to break up α-helices and β-strands because it cannot form main-chain hydrogen bonds. Experiments were also carried out using a form of threshold logic in the representation of rules. In this form of rule, only a threshold number of positions in

the rule needs to match a primary structure before the rule "fires" and predicts the secondary structure of the sequence. A threshold of n − 1 positions to fire (one less than the length of the rule) was used. It was hoped that the more complex matching scheme would allow the rules to be more easily understood. Such rules were found to be less successful than normal rules in describing and predicting α-helices and β-strands, but more successful in describing and predicting coils. This difference in success is probably due to the structural form of the different secondary structure types. In α-helices and β-strands, every position is important, and the inclusion of an incompatible residue at a position may disrupt the whole protein structure, e.g. a hydrophilic residue within the internal face of an α-helix. In a coil, every position is not so vital, and residues can be added without disrupting the whole structure of the protein; e.g. it is known that mutations in proteins tend to occur in coils (Thornton et al., 1988). This difference in structural nature between coils and the other types of secondary structure means that threshold logic rules are badly suited to describing α-helices and β-strands but well suited to describing coils (Fig. 12). The advantage that threshold logic rules have for describing coils is most noticeable in α domain proteins, with the threshold logic rule having a coverage of 57% and an accuracy of 67% in the test set, compared with a coverage of 21% and an accuracy of 38% for the normal rule. It was found that threshold rules were easier to understand than normal rules, as they contain fewer terms and more

(1tm) → Coil:
Glycine
(Small and polar) or p
All
Tiny or (polar minus aromatic) or p

(a1tm) → Coil:
All minus p
All
Tiny or (polar minus aromatic) or p
Glycine
(Small and polar) or p
Polar or p
Polar or p
Small or polar

(b1tm) → Coil:
Tiny or (small and polar) or p
Tiny or (polar minus aromatic) or p
Proline

(ab1tm) → Coil:
All
Tiny or (polar minus aromatic) or p
Glycine
((Polar minus aromatic) minus positive) or p

Figure 12. Threshold logic rules found for coil regions of secondary structure. The identity of each rule is in parentheses: rule 1tm is from the general database, rule a1tm is from the α domain database, rule b1tm is from the β domain database and rule ab1tm is from the α/β domain database. The ordering of each rule goes from top to bottom, with the 1st position in sequence at the top and the last in sequence at the bottom.

specific classes. The reason for this is that the complexity of the rule is transferred to the more complex matching method. Full details of the experiments using variations in accuracy and threshold logic are available from the authors.

(d) Secondary structure prediction

The rules found by PROMIS were combined to produce a complete secondary structure prediction method. This prediction method takes as input the primary structure of a protein and produces as output a prediction of the protein's secondary structure. This produced test accuracies

Table 7
The prediction results using the different rule sets on individual proteins in the training set

Identity   Rules 1   Rules 5
P155C      86        84
P1AZA      59        67
P1CAC      75        72
P1CC5      43        84
P1CCR      74        85
P1CRN      50        50
P1CTS      52        51
P1CTX      69        69
P1ECD      54        79
P1GPD      62        70
P1HIP      73        76
P1HMQ      65        89
P1INS      48        48
P1LZM      64        66
P1MBS      48        45
P1NXB      58        55
P1RHD      71        73
P1SN3      69        69
P1SRX      83        79
P2APP      48        45
P2B5C      58        44
P2C2C      63        78
P2CDV      73        73
P2CHA      76        78
P2GRS      65        71
P2LYZ      73        67
P2PAB      46        62
P2SNS      65        63
P2SOD      74        81
P2SSI      64        65
P2STV      54        59
P2TAA      75        74
P2TBV      70        75
P3ATC      61        57
P3CPV      70        70
P3FABH     61        75
P3PGK      68        75
P3PGM      70        78
P3TLN      58        53
P4ADH      60        61
P4LDH      59        59
P4RSA      65        58

Rules 2: 84. Rules 3: 67, 84, 85, 70, 79, 79, 79, 70, 89. Rules 4: 84, 65, 73, 77, 83, 56, 58, 78, 73, 78, 71, 62, 81, 71, 70, 75, 70, 70, 75, 75, 78.

Rule set 1 consists of: 1a, 2a, 3a, 1b, 2b and 1tm. Rule set 2 consists of: a1a, a2a, a3a, a4a, 1a, 2a, 3a and a1tm. Rule set 3 consists of: b1b, b2b, b3b, 1b, 2b and b1tm. Rule set 4 consists of: ab1a, ab2a, ab3a, ab4a, ab1b, 1a, 2a, 3a, 1b, 2b and ab1tm. Rule set 5 consists of the combined rules.


of 60% for all proteins, 73% for α domain type proteins, 62% for β domain type proteins and 59% for α/β domain type proteins. There are a number of problems in combining rules: the rules are not all of equal quality, some rules are more accurate than others, and it is possible for rules to overlap (agreeing or disagreeing). These are general problems, and there is at present no universally accepted way of combining evidence from rules (Pearl, 1987). In PROMIS it was decided to use a numerically based method of combining the rules. Each rule has an associated probability for each of the three secondary structure types. The probabilities were taken from the accuracy of the rule on the training data (no account was taken of the number of examples of a rule, i.e. the coverage, and no attempt was made to optimize these parameters). Experiments were carried out with a number of combinations of the rules found (Tables 7 and 8). The combinations chosen for testing on the test data were those that performed best on the training data. Great care was taken to ensure this, as there may exist rule sets that perform less well on the training set but better on the test set, and to use such rules would be to flatter the learning system, unless a way is found to recognize them before use on the test data (e.g. cross-validation; Watkins, 1987). An iterative prediction method was developed for protein secondary structure prediction. This used knowledge about hierarchical constraints in secondary structure prediction. An initial prediction is made of the secondary structure of the protein under investigation. From this initial prediction, the domain type of the protein is predicted. Rules that are specific for particular domains are then applied to modify the original prediction. This iterative

Table 8
The prediction results using the different rule sets on individual proteins in the test set

Identity   Rules 1   Rules 5
P156B      32        83
P1ABP      64        66
P1ACX      56        56
P1ALP      54        61
P1BP2      49        46
P1OVO      55        59
P1PYP      60        56
P1SBT      53        55
P1TIM      55        55
P2ADK      57        68
P2CNA      58        68
P2FD1      94        88
P2MDH      49        44
P351C      55        71
P3DFR      57        53
P3RXN      85        75
P4FXN      54        58
P5CPA      71        63
P5PAP      66        58

Rules 2: 83. Rules 3: 66, 55, 61, 65, 66, 55, 55, 68, 68. Rules 4: 71, 53, 58, 63.

See the legend to Table 7.
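The numerical combination of rule evidence described above can be sketched as follows. The paper states only that each rule carries per-state probabilities taken from its training-set accuracies and that these are combined numerically; the per-residue summation below is therefore our assumption, and all names are ours:

```python
def combine(firings, length):
    """Combine overlapping rule firings into a per-residue prediction.

    firings: list of (start, rule_len, probs), where probs maps each state
             'H'/'E'/'C' to the rule's training-set probability for it.
    The per-residue summation of probabilities is an assumption; residues
    covered by no rule default to coil ('C'), also an assumption.
    """
    scores = [{"H": 0.0, "E": 0.0, "C": 0.0} for _ in range(length)]
    for start, rule_len, probs in firings:
        for i in range(start, min(start + rule_len, length)):
            for state, p in probs.items():
                scores[i][state] += p
    return [max(s, key=s.get) if any(s.values()) else "C" for s in scores]

# two overlapping firings: a helix rule over 0-5 and a strand rule over 3-8
combine([(0, 6, {"H": 0.79, "E": 0.10, "C": 0.11}),
         (3, 6, {"H": 0.20, "E": 0.57, "C": 0.23})], 10)
```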



scheme is very simple, but it was constructed to show the possibilities of using the rules developed by PROMIS in more sophisticated hierarchical prediction methods. The initial prediction of class type is based on the predictions of rule set 1. The rules for deciding domain type are: if Bn > An, then the protein is predicted to be of β domain type; if An >= Bn and An > 0 and Bn < 4, then it is predicted to be an α domain protein (An is the number of residues predicted to have α-helix secondary structure; Bn is the number of residues predicted to have β-strand secondary structure). After the initial prediction, a second prediction was made on the basis of the domain type classification. This was carried out using the highest accuracy rule sets (from the training set) for the different types of domain: rule set 2 was applied to the proteins predicted to have α domains, rule set 3 was applied to the proteins predicted to have β domains, and rule set 4 was applied to the remaining proteins. This produced a final accuracy of 67% on the training set and 60% on the test set. The accuracy of PROMIS is comparable with the best other general secondary structure prediction methods; e.g. Gibrat et al. (1987) produced an accuracy of 63% on their test set. Exact comparison with other methods is not possible because of differences in data sets and in testing protocols. There is a need for a large-scale comparative test of modern prediction methods to discover the relative merits of the different techniques; previous comparisons, such as that of Nishikawa (1983), are now out of date.

4. Discussion and Conclusions

PROMIS has applied machine learning to the database of proteins of known primary and secondary structure to find a number of rules for predicting protein secondary structure. These rules are expressed in terms of physical and chemical properties of residues.
Individually, these rules have high accuracy and coverage, and when combined they produce an overall predictive accuracy of 60%. Although the rules are complex, they are more humanly comprehensible than the output of any other automatic method, such as that produced by statistical analysis or neural networks. The improved comprehensibility of the rules makes it possible to see that the rules are consistent with the known broad principles of protein structure. For example, it is possible to recognize in the rules such features as the sidedness of hydrophobicity. However, such recognizable features are often not explicit and occur jumbled together with other features. In addition, the rules contain features that are difficult to interpret and are not understood. The important question is whether these unexplained complicated features are important in protein structure determination or whether they are all spurious. Some of these features are undoubtedly caused by over-fitting; this means that the rules model features of the

proteins used in the training data that are not generally true for proteins, making the rules more complicated than necessary. Over-fitting leads to over-optimistic estimates of the accuracies of the concepts formed, as the rules correctly predict features in the training data that are not present in the test set (Watkins, 1987). This explains part of the decrease in accuracy and coverage found when moving from the training set to the test set. However, over-fitting may not be the sole cause of the extra complexity of the rules, and some of the features in the rules may be important in structure formation. It is important to be able to distinguish such features. There seem, therefore, to be three causes for the complexity of the rules: the combination in a single rule of several different understood features of protein structure formation, making it difficult to distinguish the individual features; spurious features caused by over-fitting; and features important in structure formation but not understood. Our results suggest that to produce rules with both high accuracy and coverage, a certain complexity is needed. The factors involved in producing secondary structure may be so complex that simple patterns of residues can never predict secondary structure successfully. The evidence for this is: no simple rule has ever been found to have both high accuracy and coverage; the variation experiments (3(c)) suggest that higher accuracy rules are more complicated than lower accuracy rules; and the rules can be modified to make them more complex and yet more successful (e.g. if sequential occurrences of glycine are excluded from rule 1a, then the number of errors made by the rule is reduced (Fig. 12); such a rule could not be represented in PROMIS and was, therefore, never found). Rooman & Wodak (1988) have analysed the known protein structures for patterns of up to three residues with given separations.
They found only a few patterns that were highly specific for a given secondary structure. Accordingly, the estimate that around 1500 protein structures with their sequences are required for an accurate prediction. Our study raises the question of whether specifying only three residues at given 15ositions is sufficient to determine secondary structure, irrespective of the size of the database. The conclusions from our study are tentative and the work has highlighted improvements required in the approach. At present, the different features important in forming secondary are all combined together in individual rules; e.g. the rules tended to incorporate aspects of the preferences of residues at the termini into a general motif. This occurs because the rules only used one-step reasoning (if the conditions of the rule are met, then secondary structure is predicted). In future work, more complicated reasoning processes will be included, using a combination of rules and metarules. For example, specific rules will be sought for the beginning, middle and ends of secondary structures and these rules combined using meta-rules. This arrangement has the advantage that it separates out the important individual features from how

Machine Learning and Secondary Structure Prediction they combine together to produce secondary structure. It should also make the rules easier to understand and make it easier to distinguish over-fitting. This arrangement should also make it easier to extend the work in a natural way to prediction of super-secondary structure and beyond. Such an extension is starting to look possible, due to the emergence of new more powerful generations of learning algorithms and better databases. However, even at this early stage, it seems clear that symbolic machine learning can be applied to the problem of predicting protein secondary structure with some SUCCESS.
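The proposed combination of rules and meta-rules can be illustrated with a short sketch. The residue classes and rule bodies below are invented for illustration only (they are not rules induced by PROMIS); what the sketch shows is the control structure, in which separate hypothetical rules for the start, interior and end of a helix are combined by a meta-rule into a single prediction.

```python
# Illustrative sketch only: the residue classes and rule bodies below are
# invented for demonstration and are NOT the rules learned by PROMIS.

HELIX_FAVOURING = set("AELM")  # hypothetical helix-forming residues

def n_cap_rule(window):
    """Hypothetical rule for the start (N terminus) of a helix."""
    return window[0] in "NDST"            # residues assumed to cap helix starts

def interior_rule(window):
    """Hypothetical rule for the helix interior: helix formers, no proline."""
    return "P" not in window and sum(r in HELIX_FAVOURING for r in window) >= 2

def c_cap_rule(window):
    """Hypothetical rule for the end (C terminus) of a helix."""
    return window[-1] in "GNH"            # residues assumed to cap helix ends

def meta_rule(sequence, width=4):
    """Meta-rule: predict helix ('a') over any region whose first window
    satisfies the N-cap rule, whose middle window satisfies the interior
    rule, and whose last window satisfies the C-cap rule; default is coil."""
    prediction = ["c"] * len(sequence)
    for start in range(len(sequence) - 3 * width + 1):
        begin = sequence[start:start + width]
        middle = sequence[start + width:start + 2 * width]
        end = sequence[start + 2 * width:start + 3 * width]
        if n_cap_rule(begin) and interior_rule(middle) and c_cap_rule(end):
            for i in range(start, start + 3 * width):
                prediction[i] = "a"
    return "".join(prediction)

print(meta_rule("GGSAELMAELMGNGGG"))
```

The point of the arrangement is that each component rule can be inspected, tested and corrected for over-fitting separately, while the meta-rule alone encodes how the components combine to produce a secondary structure prediction.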

Apart from predicting protein secondary structure, machine learning may have other applications in molecular biology. The amount of information being uncovered by molecular biology and available in databases is already so great that it is beyond human ability to comprehend it all; and with the development of faster sequencing and the proposed sequencing of the human genome, the amount of information is set to increase by several orders of magnitude: "the flow of information is threatening to become a confusing flood" (Maddox, 1988). Machine learning may be able to help control the flood by automatically generating knowledge from databases (Haiech et al., 1986; King, 1987; Lathrop et al., 1987). Machine learning is ideal for this because it can handle and assimilate information found in databases much faster and more accurately than a human; it also has the advantage of being much more flexible than traditional statistical methods at uncovering patterns in data, as it can exploit and reason with knowledge about the subject. Many databases exist containing important knowledge (e.g. Brookhaven, PIR, EMBL, GenBank), and the problem is now to extract the information from the mass of obscuring detail. Machine learning holds the promise of being able to do this.

We thank Donald Michie and the rest of the Turing Institute machine learning group for their helpful advice. R.D.K. was supported by grants from the S.E.R.C. and the Department of Computer Science at the University of Strathclyde.

References

Angus, J. E. (1989). Int. J. Neural Networks, 1, 42-47.
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Jr, Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535-542.
Bisiani, R. (1987). In Encyclopedia of Artificial Intelligence, pp. 56-58, Wiley Interscience, New York.
Bratko, I., Mozetic, I. & Lavrac, N. (1988). In Machine Intelligence 11, pp. 435-454, Clarendon Press, Oxford.
Buchanan, B. G. & Feigenbaum, E. A. (1981). In Readings in Artificial Intelligence, pp. 313-322, Tioga Publishing Company, Palo Alto.
Carbonell, J. G. & Langley, P. (1987). In Encyclopedia of Artificial Intelligence, pp. 464-488, Wiley Interscience, New York.
Chou, P. Y. & Fasman, G. D. (1974). Biochemistry, 13, 211-222.
Clark, P. & Niblett, T. (1987). In Progress in Machine Learning, pp. 11-30, Sigma Press, Wilmslow, England.
Cohen, F. E., Abarbanel, R. M., Kuntz, I. D. & Fletterick, R. J. (1983). Biochemistry, 22, 4894-4905.
Cohen, F. E., Abarbanel, R. M., Kuntz, I. D. & Fletterick, R. J. (1986). Biochemistry, 25, 266-275.
Dietterich, T. G. & Michalski, R. S. (1983). In Machine Learning: An Artificial Intelligence Approach, pp. 41-81, Tioga Publishing Company, Palo Alto.
Gibrat, J. F., Garnier, J. & Robson, B. (1987). J. Mol. Biol. 198, 425-443.
Haiech, J., Quinqueton, J. & Sallantin, J. (1986). In EWSL 86: Proceedings of the European Working Session on Learning, Université de Paris-Sud, Paris.
Holley, L. H. & Karplus, M. (1989). Proc. Nat. Acad. Sci., U.S.A. 86, 152-156.
Kabsch, W. & Sander, C. (1983). Biopolymers, 22, 2577-2637.
King, R. D. (1987). In Progress in Machine Learning, pp. 230-250, Sigma Press, Wilmslow, England.
Lathrop, R. H., Webster, T. A. & Smith, T. F. (1987). Commun. ACM, 30, 909-921.
Leech, W. J. (1986). Advan. Instrument. 41, 169-173.
Lim, V. I. (1974). J. Mol. Biol. 88, 857-872.
Maddox, J. (1988). Nature (London), 333, 11.
McGregor, M. J., Flores, T. P. & Sternberg, M. J. E. (1989). Protein Eng. 2, 521-526.
McGregor, M. J., Flores, T. P. & Sternberg, M. J. E. (1990). Protein Eng. 3, 459-460.
Michalski, R. S. (1983). In Machine Learning: An Artificial Intelligence Approach, pp. 83-134, Tioga Publishing Company, Palo Alto.
Michalski, R. S. & Stepp, R. E. (1983). In Machine Learning: An Artificial Intelligence Approach, pp. 331-363, Tioga Publishing Company, Palo Alto.
Mitchell, T. M. (1982). Artif. Intell. 18, 203-226.
Nishikawa, K. (1983). Biochim. Biophys. Acta, 748, 285-299.
Pearl, J. (1987). In Encyclopedia of Artificial Intelligence, pp. 48-65, Wiley Interscience, New York.
Qian, N. & Sejnowski, T. J. (1988). J. Mol. Biol. 202, 865-884.
Rooman, M. J. & Wodak, S. J. (1988). Nature (London), 335, 45-49.
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. (1986). In Parallel Distributed Processing, vol. 1, pp. 318-362, MIT Press, Cambridge, MA.
Schiffer, M. & Edmundson, A. B. (1967). Biophys. J. 7, 121-135.
Seshu, R., Rendell, L. & Tcheng, D. (1988). In Proceedings of the First International Workshop on Change of Representation and Inductive Bias, pp. 293-305, North American Philips Corp., U.S.A.
Spackman, K. A. (1989). In Proceedings of the Sixth International Workshop on Machine Learning, pp. 160-163, Morgan Kaufmann, San Mateo.
Taylor, W. R. (1986). J. Theoret. Biol. 119, 205-221.
Thornton, J. M., Sibanda, B. L., Edwards, M. S. & Barlow, D. J. (1988). BioEssays, 8, 63-69.
Watkins, C. J. C. H. (1987). In Progress in Machine Learning, pp. 79-87, Sigma Press, Wilmslow, England.

Edited by S. Brenner
