The insertion-deletion geometries and the side-chain conformations provide a basis for approaching the most difficult aspect of homology modeling, that of correctly predicting the differences in more global aspects of the structure of the target and the template protein. The global fold is qualitatively preserved throughout a structure-function family. However, the individual members of the family may differ significantly in terms of the relative rotation and translation of their secondary structure elements (Fig. 9). Chothia and Lesk84 have shown that these relative displacements increase as sequence homology decreases. For the homology range of 25-40% in which protein modeling is most frequently attempted, the global backbone rms deviation between the homologous structures tends to be between 1.2 and 3.0 Å. These differences are induced by the insertion and/or deletion of residues in one sequence with respect to another and by the secondary structure packing differences resulting from residue mutations. Initial homology models consist of the target sequence mounted on a known template backbone. Once "correct" main-chain and side-chain conformations can be predicted in the presence of such backbone coordinate differences, additional algorithms need to be developed to optimize the secondary structure packing and, consequently, produce models with quantitatively correct global structure. It is these models which would be excellent starting points for structure-function studies, inhibitor-drug design, or larger-scale bioengineering problems. The methodologies presented in this chapter provide a procedure for the critical steps of generating "correct" initial main-chain and side-chain conformations. Full global optimization of the resulting homologous structure remains as a task for the future.

84 C. Chothia and A. M. Lesk, "Computer Graphics and Molecular Modelling." Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, 1986.

[10] Neural Networks for Protein Structure Prediction

By L. HOWARD HOLLEY and MARTIN KARPLUS

Introduction

The prediction of the structure of a protein from its amino acid sequence remains one of the most challenging problems in biochemistry. Although X-ray crystallography has revealed the complexity of the three-dimensional structure of individual proteins, the comparison of results for many proteins has demonstrated that there are underlying regularities in the structures. In particular, only a small number of secondary structural elements (α helices, β sheets, and turns) encompass 75% or more of the amino acids. As the database of X-ray structures has grown, regularities have also been observed in the ways in which these secondary structure elements are combined. This has led to a rough taxonomy of protein folds.1 These observations suggest that the problem of understanding and predicting protein structures is best approached not by direct calculation but rather by interpretation of the available data. Further, it appears likely that structure prediction can be divided into two simpler problems: (1) learning the rules which govern the formation of helices, sheets, and turns and (2) understanding the ways in which these elements can be packed together.

Substantial effort has been devoted since the 1960s to the prediction of secondary structure.2 Over this time period there has been a small but steady growth in the accuracy of secondary structure prediction methods. An assessment3 of three widely used pre-1983 methods revealed prediction accuracy for three states (helix, sheet, and coil) of 49-56%. The best modern methods have accuracies of 63-65%. Among these are statistical methods4,5 and, more recently, methods based on pattern recognition with a neural network.6-8 Neural networks have now been used for turn prediction,9 prediction of surface exposure of amino acids,10 prediction of the disulfide bonding state of cysteines,11 and prediction of backbone distance constraints.12 In this chapter we describe the neural network method and review results obtained by applying the method to globular proteins. We also discuss efforts to improve prediction accuracy and the future prospects for the method.

1 J. S. Richardson, Adv. Protein Chem. 34, 167 (1981).
2 G. D. Fasman (ed.), "Prediction of Protein Structure and the Principles of Protein Conformation." Plenum, New York, 1989.
3 W. Kabsch and C. Sander, FEBS Lett. 155, 179 (1983).
4 J. Gibrat, J. Garnier, and B. Robson, J. Mol. Biol. 198, 425 (1987).
5 V. Biou, J. F. Gibrat, J. M. Levin, B. Robson, and J. Garnier, Protein Eng. 2, 185 (1988).
6 N. Qian and T. J. Sejnowski, J. Mol. Biol. 202, 865 (1988).
7 H. Bohr, J. Bohr, S. Brunak, R. Cotterill, B. Lautrup, L. Norskov, O. Olsen, and S. Petersen, FEBS Lett. 241, 223 (1988).
8 L. H. Holley and M. Karplus, Proc. Natl. Acad. Sci. U.S.A. 86, 152 (1989).
9 M. McGregor, T. Flores, and M. Sternberg, Protein Eng. 2, 521 (1989).
10 S. Holbrook, S. Muskal, and S. Kim, Protein Eng. 3, 659 (1990).
11 S. Muskal, S. Holbrook, and S. Kim, Protein Eng. 3, 667 (1990).
12 H. Bohr, S. Brunak, R. Cotterill, H. Fredholm, B. Lautrup, and S. Petersen, FEBS Lett. 261, 43 (1990).

Neural Network Methodology

The neural network model has its origins in efforts to model the computation which takes place in the nervous system. The main elements of this model are a large number of identical units (neurons) and a pattern of connections among these units. Each unit, operating independently and in parallel, receives inputs from other units to which it is connected. These inputs are modulated by the connection strengths between units. The unit integrates these inputs and generates an output according to a threshold. This output is then propagated to other units of the network. Early research13,14 established that learning can take place in such a model through the adaptive modification of connection strengths and thresholds (collectively called network weights). In many applications of the neural network model, including the present one, fidelity to actual biological processes is irrelevant. Rather, the neural network can be regarded as a computational device capable of automatically learning from training examples to reproduce a mapping from a set of inputs to a set of outputs. Once a network has been trained (i.e., optimized) for a given training set, it can serve as a straightforward procedure, without known bias, for studying new data that become available. Whether the mappings learned by the neural network are useful for predictions on novel examples can only be determined empirically. This depends on the details of the neural network, the encoding of the inputs and outputs, and the underlying structure of the problem domain from which the training and test examples are drawn.

Topology and Computation

An example of a neural network used in the present study is given in Fig. 1. The boxes represent the computational units, and the lines connecting these units are connections along which output signals flow from one unit to input at another unit. In general, flow in neural networks can be in any direction and between any pair of units. Neural networks for secondary structure prediction, however, are exclusively of the feed-forward variety, in which the units are arranged in layers and signals flow in one direction from an input layer through zero or more intermediate layers to an output layer. The learning procedure for feed-forward neural networks permits connections to skip layers in the forward direction, but no connections are allowed within a layer or backward to earlier layers.

13 D. O. Hebb, "The Organization of Behavior." Wiley, New York, 1949.
14 F. Rosenblatt, "Principles of Neurodynamics." Spartan, New York, 1962.

[Figure 1 diagram: a sliding 17-residue window (positions -8 to +8 around the central residue) of an amino acid sequence feeds an input layer of 17 × 21 units, which connects to a hidden layer of 2 units and an output layer of 2 units labeled Helix and Sheet.]

FIG. 1. Neural network topology. Each of the 17 blocks shown in the input layer represents a group of network inputs used to encode the amino acid at the corresponding window position. Each group consists of 21 inputs, one for each possible amino acid at that position plus a null input used when the moving window overlaps the end of the amino acid sequence. Thus, for a given window in the amino acid sequence, 17 of the 357 network inputs are set to 1 and the remainder set to 0. A block in the hidden layer or in the output layer represents a single unit. Prediction is made for the central residue in the input window.

[Figure 2 diagram: inputs y_i arrive at unit k on connections with strengths W_ik; the unit forms the net input x_k and produces the output y_k, which is passed to the next layer.]

x_k = \sum_i W_{ik} y_i + b_k        y_k = 1 / (1 + e^{-x_k})

FIG. 2. Network computation. During forward propagation through each layer of the neural network, the computation illustrated above takes place at each hidden unit and at each output unit. The products of outputs from the preceding layer, y_i, with the connection strengths, W_ik, are summed over all inputs to the unit. The resulting sum is adjusted by the bias for the unit, b_k. The output of unit k is then generated according to the given formula and propagated to the next layer of the network. Unit outputs are in the range 0.0 to 1.0. Connection strengths may be either positive or negative.
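The computation of Fig. 2 can be written in a few lines; the following is a minimal sketch (the function name and the example numbers are illustrative choices of ours, not part of the original method).

```python
import numpy as np

def unit_output(y_prev, w_k, b_k):
    """Output of unit k given the outputs y_i of the preceding layer (Fig. 2)."""
    x_k = np.dot(w_k, y_prev) + b_k        # weighted sum of inputs plus the bias b_k
    return 1.0 / (1.0 + np.exp(-x_k))      # sigmoidal activation; output lies in (0, 1)

# A strongly excitatory net input drives the unit toward 1.0, an inhibitory one toward 0.0.
print(unit_output(np.array([1.0, 0.0, 1.0]), np.array([2.0, -1.0, 3.0]), -0.5))  # ~0.99
```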

Thus, these networks are different from those used, for example, to solve certain optimization problems.15,16 The computation which takes place in each unit is illustrated in Fig. 2. Each unit sums its inputs from earlier layers according to the pattern of connections. Each input is modulated by the connection strength between the two units. The connection strengths, W_ik, and threshold biases, b_k, in Fig. 2 are the adjustable parameters of the neural network model. (For convenience we bring the biases, b_k, into the matrix W_ik by adding an additional input unit to each layer of the network. This unit is connected to each unit in the next layer and has a constant input value of 1. The connection strength between this additional unit and unit k of the next layer is assigned the value b_k.) The integrated input is used to generate an output according to an activation function, as shown at right in Fig. 2. The particular form of this function may vary as long as certain requirements are satisfied. First, the learning rule (described below) used in the present work to train the network requires that the activation function be continuous and differentiable. Second, for neural networks to gain additional generality from intermediate layers (discussed below), this function must be nonlinear.17 The particular sigmoidal function chosen supplies a switchlike behavior for the units as inputs swing from inhibitory (negative) to excitatory (positive).

15 J. Hopfield, Proc. Natl. Acad. Sci. U.S.A. 79, 2554 (1982).
16 J. Hopfield and D. Tank, Science 233, 625 (1986).
17 D. Rumelhart, G. Hinton, and R. Williams, in "Parallel Distributed Processing: Volume 1" (D. Rumelhart and J. L. McClelland, eds.), p. 318. MIT Press, Cambridge, Massachusetts, 1986.

The input and output layers are dictated by the encoding chosen for the particular problem. The optimum number and size of intermediate layers and the pattern of connections must be determined empirically for the particular application. The simplest neural networks have no intermediate layers; inputs are directly connected to outputs. These two-layer networks have been extensively studied.18 For such networks the outputs are simply computed by multiplying the vector of inputs by the matrix of network weights, followed by transformation with the activation function given in the equations of Fig. 2. However, these simple networks are capable of realizing only a limited set of mappings, in which similar inputs are mapped to similar outputs. Thus, for example, a two-layer network cannot solve the parity problem. For this problem the inputs consist of a pattern of 1's and 0's. The desired network output is 1 if the input pattern contains an odd number of 1-bits and 0 otherwise. In this case a similar input differing by only one bit must map to a dissimilar output. Consider a given input which changes from 0 to 1. We require a contribution from this input to the sum of Fig. 2 such that the network output changes from 0 to 1 in some cases and from 1 to 0 in other cases, depending on the other inputs to the network. In a two-layer network, however, only a single weight is available to express these opposite effects.

The parity problem can be solved by a network with intermediate layers.17 These intermediate layers, sometimes called hidden layers, serve to recode the inputs into an internal representation from which the output can be derived. A parity problem with m binary inputs is solved by a network having a single hidden layer of m units. Each input is connected to each hidden unit, and all the outputs of the hidden units are connected to the single network output unit which produces the parity. Each hidden unit "counts" the number of 1-bits as follows: hidden unit j comes on (has an output near 1.0) when j or more 1-bits are present in the input. So, for example, if there are three 1-bits in the input, hidden units 1, 2, and 3 come on. The other hidden units are off (having outputs near 0.0). This is possible since each hidden unit has its own threshold (bias) and its own set of weights for the network inputs. The weights between the hidden layer and the output layer are of equal magnitude but alternating sign. Thus, the weighted sum of the hidden unit outputs will be nonzero and turn the network output unit on only when an odd number of hidden units are on, that is, when there are an odd number of 1-bits in the input.

18 M. Minsky and S. Papert, "Perceptrons." MIT Press, Cambridge, Massachusetts, 1969.
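As an illustration of the parity construction just described, the following minimal sketch builds such a network by hand and checks it on all 4-bit patterns; the weight magnitude (w = 20) and the code organization are our own choices, not taken from the chapter.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def parity_net(m, w=20.0):
    """Hand-built m-input parity network with one hidden layer of m units."""
    W_in = np.full((m, m), w)                   # every input excites every hidden unit equally
    b_hid = -w * (np.arange(1, m + 1) - 0.5)    # hidden unit j turns on when >= j bits are set
    W_out = w * np.array([1.0 if j % 2 == 1 else -1.0 for j in range(1, m + 1)])
    b_out = -0.5 * w                            # output unit on only for an odd count of active hidden units
    def forward(x):
        h = sigmoid(x @ W_in + b_hid)
        return sigmoid(h @ W_out + b_out)
    return forward

net = parity_net(4)
for bits in itertools.product([0, 1], repeat=4):
    assert round(float(net(np.array(bits, dtype=float)))) == sum(bits) % 2
print("4-bit parity reproduced for all 16 input patterns")
```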

Network Training

After making initial choices of network topology and input and output encoding, a neural network is specialized to perform a particular mapping between inputs and outputs in a training phase. In the training phase a set of input/output pairs is presented to the network, and adjustments are made to the network weights, W_ik, to reproduce the desired mappings. The procedure used to adjust weights for networks with hidden layers, the generalized delta rule, is a relatively recent development in neural network research.17,19 Network training may be expressed as a minimization problem. Suppose that at some stage of the network training the total output error for the set of training cases is defined as in Eq. (1), where O_{j,c} is the observed output on unit j for training case c and D_{j,c} is the desired output:

E = \sum_c \sum_j (O_{j,c} - D_{j,c})^2        (1)

Training consists of minimizing the total output error with respect to the network weights by gradient descent. The generalized delta rule provides a procedure for recursively computing the partial derivatives of the output error with respect to the network weights, starting with the output layer and propagating backward to the input layer. (See Rumelhart et al.17 for additional details.) Thus, training proceeds as follows. First, the network is initialized with small random weights, in our case values in the range -0.1 to 0.1. Then, the network weights are adjusted in several cycles over the training cases. In each training cycle each case is presented to the network, inputs are propagated forward through the network, and contributions to the gradient are propagated backward. At the end of the set of training cases small adjustments are made to the network weights according to the modified steepest descent cycle19 given in Eq. (2). The adjustment made at step t is a fraction, ε, of the accumulated gradient smoothed with a fraction, α, of the previous step:

\Delta W_{ik}(t) = -\epsilon \, \partial E / \partial W_{ik} + \alpha \, \Delta W_{ik}(t - 1)        (2)

Typical values are ε = 0.001 and α = 0.9. Training is halted when the reduction in E becomes sufficiently small. In the results below we have considered the network converged when the fractional change in E from the last training cycle is less than 2 × 10^-4.

19 D. Rumelhart, G. Hinton, and R. Williams, Nature (London) 323, 533 (1986).

In the case of a two-layer network, the training procedure leads toward a global minimum, but in the case of networks with hidden layers the possibility exists that the network will become trapped in a local minimum.17 Extensive experience with a variety of problems indicates that this is not an issue in most cases and that the global minimum is generally found by the given optimization procedure, provided the network is sufficiently large.19

We have described above the general properties of neural networks. Now we address the specialization of neural networks to the problem of protein secondary structure prediction. This involves specifying the input and output encoding, the network topology, and the database for network training and testing. In what follows we describe the implementation used by us; others have employed similar schemes.
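To make the procedure of Eqs. (1) and (2) concrete, the following is a minimal sketch of batch training for a network with one hidden layer. The layer sizes, the toy data, and the function names are placeholders of ours; only the update rule, the initial weight range, and the convergence criterion follow the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(X, D, n_hidden=2, eps=0.001, alpha=0.9, tol=2e-4, max_cycles=10000, seed=0):
    """Batch training by the generalized delta rule with momentum [Eqs. (1) and (2)]."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], D.shape[1]
    # Small random initial weights in the range -0.1 to 0.1; the last row of each matrix
    # holds the biases, carried by a constant input of 1 appended to each layer.
    W1 = rng.uniform(-0.1, 0.1, (n_in + 1, n_hidden))
    W2 = rng.uniform(-0.1, 0.1, (n_hidden + 1, n_out))
    dW1, dW2 = np.zeros_like(W1), np.zeros_like(W2)
    Xb = np.hstack([X, np.ones((len(X), 1))])
    E_prev = np.inf
    for _ in range(max_cycles):
        H = sigmoid(Xb @ W1)                        # forward propagation, hidden layer
        Hb = np.hstack([H, np.ones((len(H), 1))])
        O = sigmoid(Hb @ W2)                        # forward propagation, output layer
        E = np.sum((O - D) ** 2)                    # total output error, Eq. (1)
        if abs(E_prev - E) / E < tol:               # convergence: fractional change < 2e-4
            break
        E_prev = E
        delta_out = 2.0 * (O - D) * O * (1.0 - O)   # backward propagation of the gradient
        delta_hid = (delta_out @ W2[:-1].T) * H * (1.0 - H)
        dW2 = -eps * (Hb.T @ delta_out) + alpha * dW2   # Eq. (2): gradient step plus momentum
        dW1 = -eps * (Xb.T @ delta_hid) + alpha * dW1
        W2 += dW2
        W1 += dW1
    return W1, W2

# Toy usage on the parity (XOR) mapping; placeholder data, not the protein training set.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train(X, D)
```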

Input and Output Encoding

For protein secondary structure prediction the input is derived from a sliding window in the amino acid sequence, as shown in Fig. 1. Prediction is made for the central residue in the window. Each amino acid in the input window is encoded in binary form in a group of 21 network inputs: one for each of the 20 amino acid types and one padding position to use when the window overlaps the ends of the protein sequence. Thus, for a window size of 17 amino acids there are 17 groups of 21 inputs. In each group of 21 inputs the input corresponding to that amino acid is set to 1, and all other inputs are set to 0. Thus, for a window size of 17, exactly 17 of the inputs are set to 1 and the rest are set to 0. Alternative input encodings are possible, and one of these is discussed below.

The output of the neural network is a secondary structure assignment for the central residue in the input window. To facilitate a comparison with other secondary structure prediction methods,3 we initially consider three secondary structure states: helix, sheet, and coil. We encode these states as shown in Fig. 1. The network has two outputs, a helix output and a sheet output, and the desired result for a helix or sheet residue is that the corresponding output is 1.0 and the other output is 0.0. For coil the desired outputs are both 0.0. Actual network outputs are numbers in the range 0.0 to 1.0, and these are converted to predictions with the aid of a single cutoff value. If both outputs are below the cutoff, coil is assigned. Otherwise, the state corresponding to the highest output is assigned. The cutoff is optimized for the training examples, and in the present application a value of 0.37 was found to be best.

We have used the DSSP program20 to assign secondary structure to proteins of known structure. This program assigns a residue to one of eight classes based on the observed pattern of hydrogen bonds. These classes are α helix (H), 3₁₀ helix (G), π helix (I), β strand (E), isolated β bridge (B), β turn (T), bend (S), and coil (C). We collapse all states other than H and E into coil. In the DSSP assignments the minimum helix is four contiguous residues, and the minimum sheet strand is two residues. Thus, as a final refinement in interpreting the output data, we convert predicted stretches shorter than these minima to coil. Similar input and output encodings have been used by others6 and are compared in detail below.

20 W. Kabsch and C. Sander, Biopolymers 22, 2577 (1983).
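A minimal sketch of this encoding and decoding follows; the one-letter alphabet ordering, the function names, and the minimum-length cleanup routine are our own illustrative choices, while the window size, the 21-input groups, and the 0.37 cutoff come from the text.

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")   # 20 types; index 20 is the null (padding) input
WINDOW = 17
HALF = WINDOW // 2

def encode_window(sequence, center):
    """One-hot encode the 17-residue window around `center` as 17 x 21 = 357 inputs."""
    x = np.zeros((WINDOW, 21))
    for k, pos in enumerate(range(center - HALF, center + HALF + 1)):
        if 0 <= pos < len(sequence):
            x[k, AMINO_ACIDS.index(sequence[pos])] = 1.0
        else:
            x[k, 20] = 1.0                    # window overlaps a chain terminus
    return x.ravel()

def decode_outputs(helix_out, sheet_out, cutoff=0.37):
    """Convert the two network outputs for one residue into a three-state assignment."""
    if helix_out < cutoff and sheet_out < cutoff:
        return "C"                            # coil
    return "H" if helix_out >= sheet_out else "E"

def enforce_minimum_lengths(states, min_helix=4, min_sheet=2):
    """Convert predicted helix/sheet stretches shorter than the DSSP minima to coil."""
    states = list(states)
    i = 0
    while i < len(states):
        j = i
        while j < len(states) and states[j] == states[i]:
            j += 1
        run, length = states[i], j - i
        if (run == "H" and length < min_helix) or (run == "E" and length < min_sheet):
            states[i:j] = ["C"] * length
        i = j
    return "".join(states)
```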

Results

Protein Database

In the initial application we have used the 62-protein database assembled for the evaluation of pre-1983 prediction methods3,20 to facilitate a comparison with these methods. We have divided this database into a training set and a prediction set. The training set consists of 48 proteins (8315 residues) with a composition of 26% helix, 20% sheet, and 54% coil. (As noted above, in the three-state classification coil includes not only portions of the structure normally thought of as coil but also well-defined turn regions.) The prediction set contains the remaining 14 proteins (2441 residues) and has a composition of 27% helix, 20% sheet, and 53% coil.

Evaluation of Prediction Accuracy

In comparing secondary structure prediction methods there are two prominent problems. The first is a lack of agreement on standard definitions of secondary structure. This problem has largely been addressed by the widespread usage of the DSSP program. A second problem has been the use of a single figure of merit, the percentage of correct predictions. This figure of merit can exaggerate the accuracy of methods.21 For example, suppose that 50% of residues are assigned coil, 25% helix, and 25% sheet. Random guesses distributed in this fashion would be expected to be correct 38% of the time (0.50^2 + 0.25^2 + 0.25^2). An algorithm which always guesses "coil" will be correct 50% of the time, substantially higher than random! To avoid these problems we follow the suggestion of Matthews22 and report the correlation coefficients for each state. In addition we report the percent correct of those predicted in each state. This quality index is a direct measure of the probability that a given helix prediction is actually a helix.

21 G. Schulz and R. Schirmer, "Principles of Protein Structure," p. 123. Springer-Verlag, New York, 1978.
22 B. Matthews, Biochim. Biophys. Acta 405, 442 (1975).
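For concreteness, these two quality measures can be computed per state from the prediction counts as in the sketch below; the function names are ours, and the correlation coefficient is the standard definition from the Matthews reference.

```python
import math

def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient for one state (e.g., helix vs. not helix)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def percent_correct_of_predicted(tp, fp):
    """PC: of the residues predicted to be in a state, the percentage actually observed there."""
    return 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
```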

TABLE I
NEURAL NETWORK TRAINING AND PREDICTION ACCURACY FOR THREE STATES

Quality measure                          Training set (48 proteins)    Prediction set (14 proteins)
Percent correct (three-state total)                68.5                          63.2
Correlation coefficients
  Cα                                                0.49                          0.41
  Cβ                                                0.47                          0.32
  Ccoil                                             0.43                          0.36
Percent correct of predicted
  PC(α)                                             65.3                          59.2
  PC(β)                                             63.4                          53.3
  PC(coil)                                          71.1                          66.9

Training and Prediction Results

We have obtained the best results with the network shown in Fig. 1. This network uses an input window of 17 residues, a hidden layer of 2 units, and an output layer of 2 units. The effect of varying these parameters is discussed below. The network was trained on the 48-protein database for 100 cycles and then used to make predictions for the 14-protein database. Results are given in Table I. Prediction accuracy for β strands is lower than that for helix or coil. This is typical of secondary structure methods and may be due to the fact that β sheet formation is really a tertiary structure interaction which brings two or more strands together and thus is only partially manifested as a local sequence tendency. We also observed during training that the network learns to recognize β strand residues far more slowly than helix residues. This may indicate that the sequence patterns which give rise to β strands are more varied and thus less well represented in the training set than those for helices.

Effect of Input Window Size

The effect of varying the window size is shown in Table II. Whereas the percentage of correct predictions rises slowly with window size and appears to reach a maximum at 17, an interesting change is seen in the correlation coefficients. The largest increase in the sheet correlation coefficient occurs between a window size of 3 and 5.

TABLE II
EFFECT OF INPUT WINDOW SIZE ON PREDICTION ACCURACY

                                   Correlation coefficients
Window size    Percent correct    Cα      Cβ      Ccoil
3                   60.0          0.34    0.21    0.29
5                   60.6          0.32    0.29    0.33
7                   59.6          0.31    0.27    0.32
9                   62.3          0.37    0.33    0.35
11                  61.6          0.38    0.31    0.33
13                  62.7          0.38    0.33    0.37
15                  62.9          0.41    0.32    0.35
17                  63.2          0.41    0.32    0.36
19                  62.6          0.39    0.33    0.35
21                  62.9          0.39    0.31    0.35

Helix correlation coefficients increase most between a window size of 7 and 9. In each case this may be due to the inclusion of residues one turn away in the secondary structure, so that the side chains of the added residues are on the same side of the structure as the residue to be predicted. Sheet correlation coefficients reach a maximum with a window of 9 residues (4 residues on either side), whereas helix correlation coefficients continue to rise slowly. This may reflect the longer average length of helices compared to β strands.20

Effect of Hidden Layer Size

Prediction accuracy as a function of hidden layer size is shown in Table III. The most striking observation from these data is that a neural network with no hidden units performs close to the optimum. A similar observation has been made by others,6 where it has been interpreted to mean that little "second-order" information is available to be learned by the neural network. In other words, the secondary structure effects in the local sequence are additive contributions of individual amino acids at various window positions acting on the backbone of the protein. Side chain-side chain interactions are second-order effects which would require a hidden layer to detect. Another possible explanation is that the training set has too few instances of each of the putative pairwise patterns for the network to learn these second-order contributions.

TABLE III
EFFECT OF HIDDEN LAYER SIZE ON TRAINING AND PREDICTION ACCURACY

                        Percent correct
Hidden layer size    Training set    Prediction set
0                        68.4             62.3
2                        68.5             63.2
5                        81.5             60.9
10                       89.9             59.5
20                       90.9             59.3

There are 1200 possible interactions (20 × 20 × 3 states) for the network to learn from the approximately 8000 training cases. Thus, on average only about 4 instances of each are available in the training set. As shown in Table III, training accuracy grows with increased hidden layer size. Indeed, unless the training set contains identical contradictory examples, a neural network of sufficient size could reproduce the training set to any desired accuracy. On the other hand, prediction accuracy falls. A network with 20 hidden units has over 7000 weights, a number close to the 8315 residues in the training set. With this many free variables the network can "memorize" the details of the training set. However, this close matching of the details of the training set is accompanied by a reduced capacity to generalize.
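The weight count quoted above can be checked directly; a minimal sketch of the arithmetic (the variable names are ours, and whether the thresholds are counted among the "weights" is our assumption):

```python
# Parameters of the 17 x 21 input network with 20 hidden units and 2 output units
n_in, n_hidden, n_out = 17 * 21, 20, 2
connections = n_in * n_hidden + n_hidden * n_out   # 7140 + 40 = 7180 connection strengths
biases = n_hidden + n_out                          # one threshold per hidden and output unit
print(connections + biases)                        # 7202, i.e., "over 7000" adjustable parameters
```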

Comparison with Other Methods

Prediction accuracy for the 14 proteins is compared to three widely used methods in Table IV. The method of Robson and co-workers23 has recently been revised.4 The early method used statistics based on single amino acid frequencies over a window extending eight residues on either side of the residue to be predicted. In the revised method frequencies of amino acid pairs are considered; one member of the pair is the residue to be predicted, and the other member ranges over the input window. A database of 68 protein chains, screened to include only structures with resolution better than 2.8 Å, crystallographic R factor less than or equal to 0.25, and sequence homology less than 50%, was used by these authors to evaluate the revised method.4 Each protein to be predicted was first removed from the database. Statistics were then derived from the remaining proteins and used to predict the removed protein. Prediction accuracy for three states in the revised method is 63%.

23 J. Garnier, D. Osguthorpe, and B. Robson, J. Mol. Biol. 120, 97 (1978).

TABLE IV
NEURAL NETWORK PREDICTIVE ACCURACY COMPARED TO OTHER METHODS

                                  Percent correct predictions b
Brookhaven
identification a    Residues    Chou c    Robson d    Lim e    Neural net
1GPD                  333         47         55         58         66
4ADH                  374         39         44         52         53
2GRS                  461         45         49         48         65
2SOD                  151         56         72         64         69
1LH1 f                153         52         69         50         71
1CRN                   46         37         33         44         54
1OVO                   56         48         54         77         65
2SSI                  107         51         63         54         65
1CTX                   71         68         65         69         72
1MLT                   26         42         42         46         39
1NXB                   62         50         61         60         71
2ADK                  194         52         73         65         69
1RHD                  293         55         54         54         65
2PAB                  114         46         42         40         44
Totals g             2441         48         55         54         63

a All coordinate sets are from the Brookhaven Protein Data Bank; F. C. Bernstein, T. F. Koetzle, G. J. B. Williams, E. F. Meyer, M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi, J. Mol. Biol. 112, 535 (1977).
b Comparative data are taken from Table 1(b) of W. Kabsch and C. Sander, FEBS Lett. 155, 179 (1983).
c P. Chou and G. Fasman, Biochemistry 13, 222 (1974).
d J. Garnier, D. Osguthorpe, and B. Robson, J. Mol. Biol. 120, 97 (1978).
e V. Lim, J. Mol. Biol. 88, 873 (1974).
f Coordinate set 1LH1 is substituted for the obsolete 1HBL in the earlier work.
g Totals are computed by weighting each protein by the number of residues.

To compare with this method we have followed a similar procedure. We have removed one chain at a time from the set of 68, trained a neural network on the remaining 67, and made predictions for the removed protein. To limit the amount of computation, we examined the first 20 protein chains from the database (the first 20 proteins in Table 9 of Ref. 4). For these 20 proteins the neural network three-state prediction accuracy is 63.2% versus 64.4% for the revised Robson method (detailed data not shown). If statistics are derived for the revised Robson method without removing each protein, then accuracy for the 68 protein chains rises to 69.7%. We note the similar value reported in Table I for our training accuracy.
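A minimal sketch of this leave-one-out procedure is given below; train_fn and predict_fn are placeholders standing in for the network training and prediction steps described earlier, and the data structure is an assumption of ours.

```python
def jackknife_accuracy(proteins, train_fn, predict_fn):
    """Leave-one-out evaluation: train on all chains but one, then predict the chain left out.

    `proteins` maps a chain identifier to (sequence, observed_states).
    """
    correct = total = 0
    for name in proteins:
        training_set = {k: v for k, v in proteins.items() if k != name}
        model = train_fn(training_set)
        sequence, observed = proteins[name]
        predicted = predict_fn(model, sequence)
        correct += sum(p == o for p, o in zip(predicted, observed))
        total += len(observed)
    return 100.0 * correct / total
```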

Qian and Sejnowski6 (QS) used a neural network approach that is very similar to the one employed independently by us.8 There are a number of differences, however. First, although the input encoding is identical, an optimum input window of 13 residues was found versus 17 residues in our results. This difference is not so great as might be supposed since, as shown in Table II, prediction accuracy improves only slightly as the window increases from 13 to 17. Also, in the QS networks coil is explicitly encoded with a third output. Secondary structure is assigned to the highest output, in contrast to our use of two outputs and a cutoff value below which coil is predicted. Best results were obtained by QS with a hidden layer of 40 units versus our best results with only two hidden units. A network of this size is in the region where we have identified that "memorization" occurs and prediction accuracy falls (Table III). This apparent discrepancy seems to be due to a difference in convergence criteria. Whereas we monitor the total output error on the training set and stop training when this has become asymptotic according to the criterion stated above, the QS method monitors prediction performance on the test set and stops training when prediction performance is at a maximum. Although it is perfectly appropriate to use a sample test set in this fashion, we feel that the requirements of a blind test demand that final prediction results be made on a set of proteins excluded completely from the training procedure. Finally, in the QS approach outputs are refined by cascading the outputs of the network into a second network. This second network was trained with a window of 13 residues with three network inputs for each window position; the inputs to this network are the outputs from the first network. Outputs of this second network are the same as the first network: the helix, sheet, and coil outputs for the central residue in the window. The idea is that this second network can "clean up" the first network outputs by eliminating unlikely or impossible stretches of secondary structure, for example, one sheet residue in the middle of a helix. The "cleanup" function performed by this second network is duplicated in part by our external rule requiring helix and sheet segments to be at least the minimum lengths dictated by the Kabsch and Sander definitions. The rule improves our accuracy by about 1.5%, similar to the 1.6% gain from the second network. Despite these differences, the final results of the QS approach and ours are remarkably similar, as shown in Table V.

Bohr et al.7 also applied neural networks to secondary structure prediction. These authors have used a neural network with input encoding (20 binary inputs per residue) that is nearly identical to ours to predict two-state secondary structure assignments. The networks have an input window of 51 residues, a hidden layer of 40 units, and an output layer of 2 units (helix and nonhelix). They report a prediction accuracy of 73% with a helix correlation coefficient of 0.38. They also report that inferior results were obtained with a window size of fewer than 51 residues.

TABLE V
COMPARISON OF NEURAL NETWORK PREDICTIONS

                                       Qian and Sejnowski a    Holley and Karplus b    Qian and Sejnowski a
Quality index                             (one network)           (one network)      (two cascaded networks)
Percent correct (three-state total)            62.7                     63.2                    64.3
Correlation coefficients
  Cα                                           0.35                     0.41                    0.41
  Cβ                                           0.29                     0.32                    0.31
  Ccoil                                        0.38                     0.36                    0.41

a Data from N. Qian and T. J. Sejnowski, J. Mol. Biol. 202, 865 (1988).
b Data from L. H. Holley and M. Karplus, Proc. Natl. Acad. Sci. U.S.A. 86, 152 (1989).

It is difficult to reconcile this observation with our results (Table II) and those of Qian and Sejnowski6 on the effect of window size. To make a direct comparison, we have made helix predictions using a network with an input window of 17 residues like that in Fig. 1, except that there is a single output which is set to 1.0 for helix and 0.0 otherwise. The network was trained on the 48-protein database given above and tested on the set of 14 proteins. The optimum cutoff value for the training set is 0.50 in this case. We obtain a prediction accuracy of 78% with a helix correlation coefficient of 0.39.

Attempts to Improve Prediction Accuracy

A number of attempts have been made to improve prediction accuracy for the neural network method.6,8,24 We summarize some of these below.

Alternative Input Encoding

In the results described above, we have encoded the amino acid sequence directly according to amino acid type in the belief that other features useful for secondary structure prediction are implicitly available in the amino acid sequence and could, in principle, be discovered by neural networks with hidden layers. Nevertheless, we have also tried describing the amino acid sequence by physicochemical properties. Each amino acid is encoded in a set of 9 binary inputs divided into three groups of three inputs each. The first group classifies amino acids according to one of three charge states. The second group classifies amino acids as hydrophobic, hydrophilic, or moderately polar according to the criteria of Rose et al.25 The final classification is based on backbone flexibility and treats Gly and Pro as two separate classes and groups all other amino acids into the remaining class. Thus, for any amino acid three of the nine inputs are set to 1 and the remaining inputs are set to 0. An input window size of 15 is chosen for this study. The hidden layer and output layer are as shown in Fig. 1. Training and predictions are on the proteins reported above. This network achieved a predictive accuracy of 61.1% with correlation coefficients of Cα = 0.37, Cβ = 0.27, and Ccoil = 0.37. This result is somewhat remarkable considering the omitted detail. For example, the network cannot distinguish Ala, Val, Ile, Leu, Met, Cys, Phe, and Trp since these are all uncharged hydrophobic residues with default backbone flexibility.

24 D. G. Kneller, F. E. Cohen, and R. Langridge, J. Mol. Biol. 214, 171 (1990).

Kneller et al.24 have also used hydrophobicity in a recent attempt to improve prediction accuracy. Since it is well known that helices and sheets are often characterized by distinct patterns of hydrophobic and hydrophilic alternation, Kneller et al. have supplied this information to the neural network in the form of two additional inputs which are set to the helix and sheet hydrophobic moments at the central residue as defined by Eisenberg et al.26 The helix hydrophobic moment is computed for an input window of 13 residues assuming the helical repeat of 3.6 residues per turn, and the sheet moment is computed assuming a repeat of 2.0. Since these inputs are derived solely from the input sequence without reference to the actual secondary structure, they may be supplied as inputs both for training and for testing. In these studies the network is otherwise identical to that of Qian and Sejnowski.6 The addition of these two inputs results in a prediction gain of only 1%. Similar attempts to improve results by recoding inputs with physicochemical properties are reported by Qian and Sejnowski.6 Despite the failure of these efforts to provide a significant improvement in prediction accuracy, it is possible that some other input encoding may be found which will be more effective than simple amino acid sequence.
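The hydrophobic moment used as an extra input can be computed as in the following sketch; the function name is ours, and the per-residue hydrophobicity values (e.g., a consensus scale) must be supplied by the caller rather than being reproduced here.

```python
import math

def hydrophobic_moment(hydrophobicities, residues_per_turn):
    """Magnitude of the hydrophobic moment of a sequence window.

    Use residues_per_turn = 3.6 for the helix moment and 2.0 for the sheet moment,
    as in the 13-residue windows described above.
    """
    delta = 2.0 * math.pi / residues_per_turn          # angular step between successive residues
    s = sum(h * math.sin(delta * n) for n, h in enumerate(hydrophobicities))
    c = sum(h * math.cos(delta * n) for n, h in enumerate(hydrophobicities))
    return math.hypot(s, c)
```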

Alternative Output Encoding

In the results above we have measured three-state prediction accuracy in order to facilitate a comparison with other methods. Four-state predictions, involving helix (H), sheet (E), turn (T), and coil, are shown in Table VI.

25 G. Rose, A. Geselowitz, G. Lesser, R. Lee, and M. Zehfus, Science 229, 834 (1985).
26 D. Eisenberg, R. M. Weiss, T. C. Terwilliger, and W. Wilcox, Faraday Symp. Chem. Soc. 17, 109 (1982).

TABLE VI
NEURAL NETWORK TRAINING AND PREDICTION FOR FOUR STATES

Quality measure                         Training set (48 proteins)    Prediction set (14 proteins)
Percent correct (four-state total)                61.4                          50.3
Correlation coefficients
  Cα                                               0.52                          0.35
  Cβ                                               0.52                          0.31
  Cturn                                            0.25                          0.13
  Ccoil                                            0.38                          0.26
Percent correct of predicted
  PC(α)                                            64.9                          52.1
  PC(β)                                            63.5                          48.2
  PC(turn)                                         46.9                          28.0
  PC(coil)                                         60.4                          53.2

In these measurements the 48-protein training set and 14-protein prediction set were used as above. In the Kabsch and Sander classifications 12% of the residues in the training set are classified as turn. A network similar to that shown in Fig. 1 is used except that there are three hidden units and three output units: helix, sheet, and turn. As before, coil is predicted when all three outputs are below the cutoff value, 0.37. Four-state prediction accuracy is expected to be lower than three-state accuracy, but even when turn and coil data are collapsed together the resulting three-state prediction accuracy is only 60.7%. Thus, including turns has not improved the power of the network to discriminate helix or sheet from coil. Indeed, the turn correlation coefficients are quite poor, and the raw statistics indicate that turns are strongly confused with coil and somewhat confused with helix. This result may be due in part to the Kabsch and Sander turn classifications.20 Generally, turn (T) is assigned to residues bracketed by i, i + n hydrogen bonds (n = 3, 4, or 5) which do not repeat for at least two successive residues; otherwise these residues will be counted as α helix, 3₁₀ helix, or π helix. Hence, some turns can be expected to be similar to helices. Also, when more than one classification is possible for a given residue, turns are assigned after all other types except the "bend" type (S). In our 62-protein database we have stretches of turn residues as short as one and as long as five residues. It may be that a more restricted definition of turns would be more readily learned by the neural network. In addition, the relatively poor performance of the network on turns may be due in part to lumping all turns together despite rather different amino acid preferences for the various types.27

This has been taken into account by McGregor et al. in a neural network for turn recognition which predicts turns according to type.9 These authors use the turn database assembled by Wilmot and Thornton,27 with a somewhat different definition of turns than that of Kabsch and Sander. A neural network is used with an input window of 4 residues (80 binary inputs), a hidden layer of 8 units, and 4 outputs: type I turn, type II turn, other turn, and nonturn. After training, this network is able to assign 71% of the residues correctly.

It may be asked whether the Kabsch and Sander classifications are optimal for secondary structure prediction. Two alternatives have been considered. In the first we use one of the earliest methods of secondary structure classification, that based on the backbone dihedral angles φ and ψ. In the second we consider a classification based on the bond angles and dihedral angles of the virtual Cα-Cα bonds of an α-carbon backbone. In each case we match the Kabsch and Sander classifications as closely as possible with the size and placement of the angle regions. We have also optimized these regions for training accuracy within the overall constraint that the proportions of residues classified as helix, sheet, and coil remain roughly the same as those of the Kabsch and Sander classifications. In no case were we able to improve prediction accuracy over the results given in Table I, and in most cases results were a few percent lower (detailed data not shown). Best results were obtained when classifications were augmented by a rule which requires at least four contiguous helix residues and at least two contiguous sheet residues, as in the Kabsch and Sander classifications. In part, these experiments are motivated by the idea that the lower performance of the neural network on β strands may reflect that the network has learned to recognize an extended region which will be counted as coil in the Kabsch and Sander classifications when the hydrogen-bonded partner is not available and sheet when it is. From these limited results, however, it seems that secondary structure definitions which are not based on hydrogen bonds are somewhat less easily recognized by the neural network.

27 C. M. Wilmot and J. M. Thornton, J. Mol. Biol. 203, 221 (1988).

Future Prospects

The basic results reviewed above show a three-state prediction accuracy for the neural network method of about 63%. What are the prospects for improving this result? Are tertiary structure influences limiting our ability to improve prediction accuracy?

We have approached these questions by trying to discover correlates of prediction accuracy and have discussed some of these earlier.8 The most important observation is that the magnitude of the network output is correlated with prediction accuracy. When predictions are filtered to include only helix and sheet predictions with the highest output values and coil predictions with the lowest output values, prediction accuracy rises substantially: for example, the 31% "strongest" predictions are 79% accurate. One interpretation of these observations is that the neural network has captured the local tendency of the protein sequence and that when this tendency is strong, as revealed by the magnitude of the network outputs, the local tendency predominates in determining the final structure. Sequences less strongly biased by a local tendency are more apt to be strongly modulated by tertiary structure. Hence, the network accuracy is lower for these sequences. This idea is potentially useful in folding simulations, where one could bias chain segments to helix or sheet conformations in proportion to the magnitude of the network outputs.
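One plausible way to implement such a filter is sketched below; the ranking rule is our own reading of "strongest" (high outputs for helix/sheet calls, low outputs for coil calls), and the function name and defaults are illustrative only.

```python
import numpy as np

def strongest_predictions(helix_out, sheet_out, fraction=0.31, cutoff=0.37):
    """Return indices of the `fraction` of residues with the most confident predictions."""
    helix_out, sheet_out = np.asarray(helix_out), np.asarray(sheet_out)
    top = np.maximum(helix_out, sheet_out)
    # Helix/sheet calls are stronger the larger the winning output; coil calls are
    # stronger the smaller both outputs are.
    strength = np.where(top >= cutoff, top, 1.0 - top)
    n_keep = int(round(fraction * len(strength)))
    return np.argsort(strength)[::-1][:n_keep]
```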

Pentapeptides

An opportunity for studying the relative importance of local sequence versus tertiary structure is provided by identical stretches of amino acid sequence in unrelated proteins. Because of the limited size of the structure database, the longest stretches we can expect in significant numbers are pentapeptides. Structure conservation in pentapeptide pairs has been examined,28,29 and low structure conservation has been observed. This has been interpreted to imply a dominant role for the influence of tertiary structure. We have examined 87 pentapeptide pairs from 75 proteins screened to include only refined structures of resolution better than 2.6 Å (manuscript in preparation). The chains compared were restricted to those having only a single pentapeptide pair in common. Three-state secondary structure is conserved in 46% of the residues of the pentapeptide pairs (39% would be expected at random, given the secondary structure distribution of the pentapeptide residues), confirming the low structure conservation which had been observed previously. A neural network like that shown in Fig. 1 was trained on the 13,037 residues of these 75 protein chains, excluding the pentapeptide sequences themselves. Prediction accuracy on 2050 residues from 17 additional protein chains is 64.1%. Predictions have been made for the pentapeptide residues excluded from training. Prediction accuracy on these residues is 61.9%.

28 W. Kabsch and C. Sander, Proc. Natl. Acad. Sci. U.S.A. 81, 1075 (1984).
29 P. Argos, J. Mol. Biol. 197, 331 (1987).

Since the input window of 17 residues extends beyond the pentapeptides to include inputs from the flanking sequences for these predictions, the helix and sheet outputs will reflect the varying sequence context. Among these pentapeptide pairs are 11 pairs in which three or more residues are helix in one member of the pair and sheet in the other. Since identical information is supplied to the neural network from the pentapeptide itself, it is of interest to determine whether the conformational change can be predicted by the influence of the differing flanking sequences. We have calculated the average helix and sheet neural network outputs for the residues which change state. In 8 of 11 cases these residues have both a higher average helix output and a lower average sheet output in the helix context than in the sheet context. Two additional pairs have essentially identical values in both contexts. In only one case are the changes in helix and sheet outputs opposite to those expected. Thus, in the majority of these cases the flanking sequences influence the conformational change in the observed direction. These data do not exclude tertiary structure effects, but they do indicate that the sequence context provides one factor to help in explaining the observed lack of secondary structure conservation.

Specialization to Protein Classes

It has been shown that if data are available from circular dichroism or other measurements which allow an estimate of the helix and sheet content, improvements can be made in secondary structure predictions.23,30 A related approach has been followed for neural networks, where it has been shown that neural networks trained exclusively on proteins belonging to a single structural class are more accurate on members of the same class.24 The structural classes used are those proposed by Levitt and Chothia31: all-α, all-β, α/β (helices and sheets in alternation), and α + β (helix and sheet segregated in separate domains). Neural networks were trained exclusively on members of each of the first three classes. Prediction accuracies are 79% for the all-α class, 70% for the all-β class, and 64% for the α/β class. Even though the first two classes reduce the problem to two-state prediction, the authors argue that the improvements are more than would be expected on this basis alone. Of course, to obtain this improvement in a predictive context, a procedure must be available to make the initial structural classification. We propose that the improvements observed by these authors might be recovered in a general method by using a neural network trained on all proteins independent of class to make the initial classification.

30 P. Y. Chou, in "Prediction of Protein Structure and the Principles of Protein Conformation" (G. D. Fasman, ed.), p. 549. Plenum, New York, 1989.
31 M. Levitt and C. Chothia, Nature (London) 261, 552 (1976).

Our results lend themselves to such an application since one could consider only the "strong" predictions described earlier for the classification; for this purpose precise helix and sheet boundaries are not important.

More General View of Structure Prediction

As mentioned above, the ultimate justification for secondary structure prediction is as a step toward tertiary structure prediction. It is doubtful that secondary structure information alone is sufficient for this next step when accuracies are limited to 63%. However, secondary structure predictions are only one class of geometric information that might be extracted from protein sequences automatically by the neural network method. Other kinds of geometric information might also be useful in the buildup to tertiary structure. As an example, Bohr et al.12 have made an ambitious attempt to train a neural network to predict α-carbon distance constraints for a trypsin mutant. They report a 96.6% accuracy in predicting whether or not α carbons within a 61-residue window are closer than 8 Å. However, since the training set contains other trypsins with a "significant" degree of homology, it is not clear how the results should be interpreted. A more generally applicable result has been obtained by Holbrook et al.10 They have used a neural network and a window of length 7 to predict the solvent exposure of amino acids. Training to recognize a three-state classification of residues as buried, intermediate, or exposed yields a prediction accuracy of 54%. (An accuracy of 34% would be expected for random guessing based on the observed distribution of the three states.) It is expected that the segregation of residues into buried versus exposed acts as an important constraint on the three-dimensional structure of proteins.

Conclusion

From the original success of the neural network method in secondary structure prediction6-8 and from the other applications which have appeared since then, it is clear that neural networks are useful tools for the determination of aspects of protein structure which can be learned from amino acid sequence. Neural networks may also turn out to be useful in a broader context for the investigation of other aspects of protein folding. An advantage of the neural network method for such studies is that one can search for relationships (e.g., between sequence and structure) without assuming a particular theoretical framework in advance.
