PROTEINS: Structure, Function, and Genetics 14:372-381 (1992)

Limits on α-Helix Prediction With Neural Network Models
S. Hayward and J. F. Collins
Biocomputing Research Unit, Institute of Cell and Molecular Biology, Edinburgh EH9 3JR, Scotland

ABSTRACT
Using a backpropagation neural network model we have found a limit for secondary structure prediction from local sequence. By including only sequences from whole α-helix and non-α-helix structures in our training and test sets (sequences spanning boundaries between these two structures were excluded), it was possible to investigate directly the relationship between sequence and structure for α-helix. A group of non-α-helix sequences that was disrupting overall prediction success was indistinguishable to the network from α-helix sequences. These sequences were found to occur at regions adjacent to the termini of α-helices with statistical significance, suggesting that potentially longer α-helices are disrupted by global constraints. Some of these regions spanned more than 20 residues. On these whole structure sequences, 10 residues in length, a comparatively high prediction success of 78% with a correlation coefficient of 0.52 was achieved. In addition, the structure of the input space, the distribution of β-sheet in this space, and the effect of segment length were also investigated. © 1992 Wiley-Liss, Inc.

Key words: secondary structure prediction, input space, parallel processing

INTRODUCTION
Attempts to predict protein secondary structure from amino acid sequence are based on the premise that for globular proteins, the global structure, and consequently secondary structure, is determined by the amino acid sequence of the protein.1 However, efforts to predict secondary structure to a satisfactory level using structural data usually rely further on the assumption that the structure of the local polypeptide chain is determined solely by the local sequence. If, however, regions exist where global constraints prevent the formation of the most favored conformation at the local level, then these secondary structure prediction methods will mispredict such regions.
Despite the increasing size of the structural data bank, and the application of various and novel methods, including the recent use of neural network models,10-14 no result better than 65% for the prediction of the three structures α-helix, β-sheet, and coil has been reported.'-lo This suggests that this apparent limit on prediction success is not caused by insufficient local sequence information, or the inadequacies of the methods. In particular Qian and Sejnowski," who also used a neural network model, have suggested that no method based solely on local sequence information will improve on their results. However, in the light of Rooman and Wodak's work,15 where some tripeptide patterns of high predictive power were found, the question of whether local sequence is sufficient for accurate secondary structure prediction is still open to debate.

In this paper we concentrate on predicting α-helices. The discovery by Kabsch and Sander16 of 6 pentapeptides found both in α-helix and β-sheet structures shows that for 5-residue segments there is no precise mapping between sequence and the α-helix structure. By excluding sequences that span boundaries between the secondary structures, our approach bears more directly on whether the mapping between sequence and the α-helix structure is also ill defined for longer residue segments.

MATERIALS AND METHODS
Neural Network Model
The networks used were all feedforward layered networks (see Fig. 1), trained using the backpropagation algorithm. Backpropagation is a steepest descent algorithm that minimizes the error between an input pattern's actual output and its associated target output. For details see Rumelhart et al.,17 or previous papers on secondary structure prediction with neural networks.10-14 In addition to networks with one layer of hidden nodes, a single layer network was used. In this case the input layer is connected directly to the output layer. A single layer network can only be used successfully on problems that are linearly separable." That is, for a problem with two sets of input patterns, the two sets must be separable by a decision plane in the input space. For problems with sets bounded by more complicated hypersurfaces, a network with hidden nodes must be used (see Fig. 2).

Received September 6, 1991; revision accepted February 7, 1992.
Address reprint requests to Dr. S. Hayward, Department of Chemistry, Faculty of Science, Kyoto University, Kitashirakawa, Sakyo-ku, Kyoto 606, Japan.

Fig. 1. Topology of a feedforward layered network with 11 input nodes, 3 hidden nodes, and a single output node.

Fig. 2. Top: Linearly separable region solvable by a single layer network. Bottom: A more complicated division of the input space solvable only by a network with hidden nodes.

Fig. 3. Error (+) and correlation coefficient of test set prediction (o) plotted against training cycle (x-axis: cycles) for a network with 5 hidden nodes.
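The distinction between linearly separable and non-separable problems can be illustrated with a toy example. The sketch below is not the network program used in this work: it swaps in the classical perceptron learning rule on a single threshold unit (rather than backpropagation on a sigmoid unit), and the function names and the two-input toy problems are assumptions chosen for illustration only.

```python
def unit_output(weights, bias, pattern):
    """Output of a single threshold unit: 1 if the weighted sum exceeds 0, else 0."""
    total = sum(w * x for w, x in zip(weights, pattern)) + bias
    return 1 if total > 0 else 0


def train_single_layer(patterns, targets, max_cycles=100):
    """Train a single-layer network with the perceptron rule (illustrative stand-in).

    Stops early once every pattern is learnt; returns (weights, bias).
    """
    weights = [0.0] * len(patterns[0])
    bias = 0.0
    for _ in range(max_cycles):
        errors = 0
        for pattern, target in zip(patterns, targets):
            delta = target - unit_output(weights, bias, pattern)
            if delta != 0:
                errors += 1
                weights = [w + delta * x for w, x in zip(weights, pattern)]
                bias += delta
        if errors == 0:
            break
    return weights, bias


def fraction_learnt(patterns, targets, weights, bias):
    """Fraction of patterns whose output matches the target."""
    hits = sum(unit_output(weights, bias, p) == t for p, t in zip(patterns, targets))
    return hits / len(patterns)


PATTERNS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND_TARGETS = [0, 0, 0, 1]   # separable by a single decision plane
XOR_TARGETS = [0, 1, 1, 0]   # not separable by any single decision plane

w_and, b_and = train_single_layer(PATTERNS, AND_TARGETS)
w_xor, b_xor = train_single_layer(PATTERNS, XOR_TARGETS)
```

In this toy setting the single unit learns every AND pattern but can never learn all four XOR patterns, whatever the number of training cycles. The second problem needs a network with hidden nodes.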

The Training and Test Sets
In the work by Qian and Sejnowski," 106 proteins were selected from the Brookhaven data bank. Of these, 15 were selected for testing the prediction ability of the network and had little homology with the remaining 91 that were used for training. We used the same sets for training and testing. The primary sequences of these proteins, along with their Kabsch and Sander secondary structure assignments, were separated into α-helix, β-sheet, and coil segments by a sliding window N residues in length. In contrast to all the other methods using neural networks to predict α-helix,10-12,14 we did not include boundaries between α-helix, β-sheet, and coil regions. The β-sheet and coil sequences were grouped together as non-α-helices. Sequences spanning boundary regions of β-sheet and coil were not included so that β-sheet prediction in the context of non-α-helix prediction could be easily assessed.

The input patterns were coded from the N residue segments from the sliding window. We assigned to each of the 20 amino acids N input nodes, corresponding to the length of the window. If amino acid x were present at position i in the window, then x's ith node would output 1, otherwise 0. With this coding scheme the network needs 20 × N input nodes. A single output node was used to indicate whether the sequence at the input was α-helix (from here on simply referred to as helix) or nonhelix. The target output was chosen to be 1 for helix and 0 for nonhelix. In training and testing a threshold of 0.5 was used. An output of greater than 0.5 from a test set sequence was taken as a prediction for helix, an output of less than 0.5, a prediction for nonhelix. For the training set, any pattern that achieved its target to a tolerance of 0.5 was taken as being learnt.
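The coding scheme described above can be sketched as follows. This is a minimal illustration, not the original program: the ordering of the amino acids, the function names, and the treatment of an output of exactly 0.5 (which the text leaves undefined) are assumptions.

```python
# One-letter codes of the 20 amino acids; this particular ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_window(segment):
    """Code an N-residue segment as 20 * N binary inputs.

    Each amino acid owns N consecutive input nodes; the node for amino
    acid x at window position i is 1 if x occurs at position i, else 0.
    """
    n = len(segment)
    inputs = [0] * (len(AMINO_ACIDS) * n)
    for i, residue in enumerate(segment):
        x = AMINO_ACIDS.index(residue)  # ValueError for non-standard letters
        inputs[x * n + i] = 1
    return inputs


def predict_label(output, threshold=0.5):
    """Helix if the output exceeds the threshold, otherwise nonhelix.

    An output exactly equal to the threshold is counted as nonhelix here (assumption).
    """
    return "helix" if output > threshold else "nonhelix"


example = encode_window("VLDAFTQGLK")  # a 10-residue window gives 200 inputs
```

For a window of 10 residues the input vector has 200 components, exactly 10 of which are set to 1.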

Assessment of Performance
In assessing the overall performance of the network, two measures were used:

1. The percentage of correct predictions is defined by

$$\text{percentage correct} = 100 \times \frac{p_h + p_n}{N} \qquad (1)$$

where N is the total number of residues, p_h the number of correctly predicted helix residues, and p_n the number of correctly predicted nonhelix residues.

2. The correlation coefficient is defined as

$$C = \frac{p_h p_n - q_h q_n}{\sqrt{(p_n + q_h)(p_n + q_n)(p_h + q_h)(p_h + q_n)}} \qquad (2)$$

where
p_h = number of correctly predicted helix residues,
p_n = number of correctly predicted nonhelix residues,
q_h = number of incorrectly predicted helix residues,
q_n = number of incorrectly predicted nonhelix residues.

This measure is equal to 1.0 for totally correct prediction, 0.0 for totally random predictions, and −1.0 for totally false prediction.
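The two measures can be computed directly from the four counts. The following is a small sketch with assumed function names. Returning 0.0 when the denominator vanishes is a convention adopted here, not something stated in the text.

```python
import math


def percent_correct(p_h, p_n, total):
    """Percentage of correct predictions: 100 * (p_h + p_n) / N."""
    return 100.0 * (p_h + p_n) / total


def correlation_coefficient(p_h, p_n, q_h, q_n):
    """Correlation coefficient from correct (p) and incorrect (q) counts.

    The denominator is symmetric in q_h and q_n, so the value does not
    depend on which kind of error is assigned to which symbol.
    """
    denominator = math.sqrt(
        (p_n + q_h) * (p_n + q_n) * (p_h + q_h) * (p_h + q_n)
    )
    if denominator == 0:
        return 0.0  # assumed convention for degenerate cases
    return (p_h * p_n - q_h * q_n) / denominator
```

With no incorrect predictions the coefficient is 1.0, and with no correct predictions it is −1.0, in agreement with the limits quoted above.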

Network Program
The network program used is general purpose and command driven. It is implemented on a T800 transputer array connected in the torus configuration with the weight matrices divided amongst the slaves. The matrix-vector multiplications involved in both forward and backward passes are carried out in parallel. All the training patterns are loaded for the forward pass, the total error calculated, and the weights updated according to the backpropagation scheme. This process forms one training cycle. Here the program was run on a 17 transputer array consisting of 16 slaves and 1 master. Weights and thresholds were assigned random starting values in the range −0.6 to 0.6. A single run on a network with 200 input nodes, 5 hidden nodes, and a single output node, for example, would take approximately 15 min to complete 100 training cycles with 2,322 patterns.

RESULTS AND DISCUSSION
Balancing the Training Data
Initially a window length of 10 residues was chosen (unless otherwise stated this window size is used throughout this work). With this window size the training set consisted of 1,161 helix and 2,839 nonhelix sequences, 130 of which were β-sheet. The test set, constructed from the test set proteins, consisted of 244 helix and 597 nonhelix sequences, 25 of which were β-sheet. Training a 5 hidden node network with this training set produced a maximum prediction value of 77%, with a correlation coefficient of 0.468. However, by balancing the training set, that is, including only as many nonhelix sequences as helix sequences (making a total of 2,322 helix and nonhelix examples), a prediction success of 78% with a correlation coefficient of 0.52 was achieved. For helix prediction alone, the larger unbalanced training set achieved 63%, whereas the balanced training set achieved 78%. This means that the overabundance of nonhelix examples in the training set is detrimental to the correct prediction of helix examples in the test set. If we take the correlation coefficient as a more reliable measure of overall prediction success, then the balanced training set is clearly superior. For this reason all the training sets used in this work were balanced.

Fig. 4. Percentage of nonhelices correctly predicted plotted against percentage of nonhelices learnt (+), and percentage of helices correctly predicted plotted against percentage of helices learnt (o) (x-axis: percentage learnt).
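Balancing of this kind can be sketched in a few lines. How the nonhelix sequences retained in the balanced set were chosen is not stated, so random subsampling, the fixed seed, and the function name below are all assumptions.

```python
import random


def balance_training_set(helix, nonhelix, seed=0):
    """Return a shuffled list of (sequence, target) pairs with equal class sizes.

    Targets follow the text: 1 for helix, 0 for nonhelix. The larger class
    is randomly subsampled (assumption) down to the size of the smaller one.
    """
    rng = random.Random(seed)
    size = min(len(helix), len(nonhelix))
    balanced = [(seq, 1) for seq in rng.sample(helix, size)]
    balanced += [(seq, 0) for seq in rng.sample(nonhelix, size)]
    rng.shuffle(balanced)
    return balanced


# Placeholder sequence labels standing in for the real 10-residue windows.
helix_examples = ["H%d" % i for i in range(1161)]
nonhelix_examples = ["N%d" % i for i in range(2839)]
training_set = balance_training_set(helix_examples, nonhelix_examples)
```

With 1,161 helix and 2,839 nonhelix sequences this yields 2,322 patterns, half of each class, matching the size of the balanced set quoted above.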

Prediction During Training
Figure 3 shows how the error and the prediction success on the test set behave during training for a network with 5 hidden nodes. The error initially decreases rapidly and then more slowly. As the network switches over from the fast learning phase to the slow learning phase, prediction success peaks. During slow learning prediction success decreases. After 200 cycles of learning 99% of the training set had been successfully learnt. Figure 4 shows test set prediction success for helix and nonhelix plotted against, respectively, the percentage of helix and nonhelix learnt. During the rapid learning phase the test set prediction success for the two structures has a linear relationship with the percentage learnt. During the slow learning phase the prediction success for nonhelix remains relatively constant, but for helix it decreases significantly. As already stated, at the peak of prediction success, a typical result for this network was 78% correct, with a correlation coefficient of 0.52. Different sets of random starting values for the weights had little effect on the correlation coefficient at the peak (±0.02). This high value for prediction is comparable with that of Kneller et al.14 for helix prediction on all-α proteins.

Figure 5 is an analogous plot to Figure 3, but for a single layer network. Here the correlation coefficient does not decrease appreciably. The error decreases to approximately the same value at which the network with hidden nodes switches from the fast learning phase to the slow learning phase, but the single layer network is unable to decrease the error any further and it remains constant during further training. This is due to a number of sequences that the network is unable to learn.

Fig. 5. Error (+) and correlation coefficient of test set prediction (o) plotted against training cycle (x-axis: cycles) for a single layer network.
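Locating the peak of prediction success amounts to evaluating the test set after every training cycle and remembering the best cycle. The sketch below shows one way to organize this. The two callables `train_one_cycle` and `evaluate` are hypothetical placeholders for a real training step and a real test set evaluation, and nothing here reproduces the original transputer program.

```python
def train_and_keep_peak(train_one_cycle, evaluate, n_cycles):
    """Run n_cycles training cycles and record the peak test correlation.

    train_one_cycle() -> (total_error, weights) performs one full cycle.
    evaluate(weights) -> correlation coefficient on the test set.
    Returns ((peak_cycle, peak_correlation, peak_weights), history),
    where history holds (cycle, error, correlation) for every cycle.
    """
    best = None
    history = []
    for cycle in range(1, n_cycles + 1):
        error, weights = train_one_cycle()
        correlation = evaluate(weights)
        history.append((cycle, error, correlation))
        if best is None or correlation > best[1]:
            best = (cycle, correlation, weights)
    return best, history
```

The recorded history corresponds to the two curves plotted in Figures 3 and 5. Because the peak is located using the test set itself, the value at the peak is best read as an upper estimate of what this architecture can achieve on these data.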

Fig. 6. Effect of size of training set and number of hidden nodes on test set prediction success (x-axis: fraction of whole training set). The ×s denote results from a single layer network, the +s from a network with 5 hidden nodes, and the os from a network with 20 hidden nodes. Below training set size 0.5 each network was trained on 3 different training sets derived from the whole training set (2,322 patterns). At training set size 0.5 each network was trained on the 2 halves of the whole training set. At training set size 0.66 each network was trained on two training sets having one-half of their patterns in common.

Effect of Training Set Size and Number of Hidden Nodes
Figure 6 shows the effect of the size of the training set for networks with 0, 5, and 20 hidden nodes. Each point in this figure is the correlation coefficient at its maximum value during training. Two main conclusions can be drawn from Figure 6. First, as others have found, prediction success does not depend to any discernible extent on the number of hidden nodes. In fact a single layer network does as well as those with hidden nodes. This indicates that either the formation of helix is not dependent on correlations between residues in the window, or that the training set is of insufficient size for the network to generalize upon higher order features. Second, the prediction success plateaus at a correlation coefficient well below the value of 1.0 that corresponds to perfect prediction.

The Weight Values
Given that the prediction success does not depend on the number of hidden nodes, one can easily analyze the weights of the single layer network to determine the preference of each residue for the helix structure in the 10 possible window positions. Our results are in broad agreement with those of previous workers in this field.10,11,14 The weight values for histidine, however (see Fig. 7), show that it is antihelix at the start of the window and prohelix at the end. A possible reason for this is that the histidine ring with an excess of positive charge has a favorable electrostatic interaction with the C-terminus of the helix, which has an excess of negative charge due to the helix dipole.22 All hydrophobic residues without polar groups are prohelix apart from proline.

Fig. 7. Weight values from a single layer network plotted against window position for histidine (x-axis: position in window).


Anomalous Sequences
At the peak of Figure 3 about 91% of helix sequences and 80% of nonhelix have been successfully learned. For the test set around 80% of helix and 75% of nonhelix sequences were correctly predicted. The correlation coefficient was 0.52. For the training set unlearned sequences were separated from those that were successfully learned. For a network trained on only the successfully learned sequences and tested on the original test set, prediction success does not decrease but increases slightly to a correlation coefficient of 0.54 with further training. This shows that it is the unlearned sequences that cause the decrease in prediction success. The unlearned sequences were the same as those that the single layer network was unable to learn, and the successfully learned sequences could also be learnt to 100% by a single layer network. This means that the successfully learned helix sequences are separated from the successfully learned nonhelix sequences by a simple decision plane and that the network with hidden nodes is behaving essentially as a single layer network during the fast learning phase.

Therefore the unlearned and falsely predicted nonhelix sequences are located among the successfully learned and predicted helix sequences (referred to as majority helix sequences from here on) on the opposite side of the decision plane to the successfully learned and predicted nonhelix sequences (referred to as majority nonhelix sequences from here on). These nonhelix sequences will be referred to as anomalous nonhelix sequences. Analogously, the unlearned and falsely predicted helix sequences are located among the majority of nonhelix sequences on the opposite side of the decision plane to the majority helix sequences. These will be referred to as anomalous helix sequences. The ability of the network with hidden nodes to learn the anomalous sequences in the training set during the slow learning phase causes the decrease in prediction success. The single layer network is unable to learn these sequences and consequently shows a peak in prediction success.

In order to discover whether it is possible for a network to distinguish the anomalous nonhelix sequences from the majority helix sequences among which they are located, a new training set was constructed from the 236 anomalous nonhelix sequences and 236 majority helix sequences from the original training set. A new test set was also constructed from the 153 anomalous nonhelix sequences and the 207 majority helix sequences from the original test set. At the peak in Figure 3 these four sets of sequences all had outputs greater than 0.5. A 5 hidden node network was trained to output 1 for the majority helix sequences and 0 for anomalous nonhelix sequences. This network was able to learn 98% of the training data. When tested on the test set with the target output of 1 for the majority helix sequences and 0 for the anomalous nonhelix sequences, only 49% of sequences were correctly recognized, the correlation coefficient being −0.017. This means there exist a number of nonhelices, both coil and β-sheet, with sequences that are indistinguishable to the network from the majority of helix sequences.

If one trains a network to distinguish two groups of randomly generated patterns and then tests it on another two groups of randomly generated patterns, one typically gets this result; that is, provided the network is large enough, all the training set can be learned and the correlation coefficient is 0.0 on testing. One also gets the same result if one trains a network with the same parameters, but replaces the anomalous nonhelix sequences in the new training set with majority helix sequences from the original training set and the anomalous nonhelix sequences in the new test set with majority helix sequences from the original test set, i.e., one tries to distinguish helix from helix. This result, then, suggests that the anomalous nonhelix sequences are intrinsically identical to the majority helix sequences.

However, according to Baum and Haussler,23 a fully connected network with one hidden layer and W weights, trained on fewer than W/ε training patterns, will fail on a finite number of occasions to correctly classify more than a fraction of 1 − ε of the test patterns. So if a prediction success of 90% is required, then roughly 10 times as many training patterns as weights will be needed. The network used here had roughly 1,000 weights and was trained on roughly 500 patterns. If there exists an intrinsic difference between anomalous nonhelix sequences and majority helix sequences, then provided this network is able to learn a training set 20 times as large (10,000 structures), it will be able to accurately distinguish anomalous nonhelix sequences from majority helix sequences.
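The arithmetic behind this estimate is short enough to spell out. The sketch below uses only the rough W/ε rule of thumb quoted in the text, not the full bound. The function name is an assumption, and exact fractions are used to avoid floating point rounding.

```python
import math
from fractions import Fraction


def patterns_needed(n_weights, error_fraction):
    """Rule-of-thumb number of training patterns, W / epsilon, rounded up."""
    return math.ceil(Fraction(n_weights) / Fraction(str(error_fraction)))


required = patterns_needed(1000, 0.1)   # roughly 1,000 weights, 90% success
scale_up = Fraction(required, 500)      # relative to roughly 500 patterns used
```

For roughly 1,000 weights and a required success of 90% (ε = 0.1) this gives 10,000 patterns, which is 20 times the roughly 500 patterns available for this experiment.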
If, however, there exists no intrinsic difference between anomalous nonhelix sequences and majority helix sequences, larger training sets will require ever larger networks and the correlation coefficient will always remain 0.0 on testing. The most plausible physical explanation for our result is that the anomalous nonhelix sequences are indeed intrinsically identical to majority helix sequences because they are potential helix sequences that have been prevented from forming helices due to global constraints during the formation of the tertiary structure. As mentioned in the Introduction this is true for a window length of 5 residues.16 Our result is a possible generalization of this for a window length of 10 residues.

The parallel experiment to distinguish the anomalous helix sequences from the majority nonhelix sequences had a different outcome. The new training set consisted of the 104 anomalous helix sequences and 104 majority nonhelix sequences from the original training set. The new test set consisted


of the 37 anomalous helix sequences and the 444 majority nonhelix sequences from the original test set. These four sets of sequences had outputs less than 0.5 at the peak in Figure 3. The result was that, overall, the network correctly recognized 68% of sequences with a correlation coefficient of 0.25. So this network could, to some extent, distinguish the anomalous helix sequences from majority nonhelix sequences. At the peak in Figure 3 the average output value for the anomalous helix sequences in both the training and test sets was 0.27 with a standard deviation of 0.17. Most anomalous helix sequences, therefore, had an output value greater than 0.1. This threshold was chosen to divide the majority nonhelix sequences in both the training and test sets into those with output values greater than 0.1 and those with output values less than 0.1. A network with 5 hidden nodes was unable to distinguish the anomalous helix sequences from the majority nonhelix sequences with outputs greater than 0.1 (correlation coefficient = −0.13), but was able to achieve a much better result (correlation coefficient = 0.22) in distinguishing the anomalous helix sequences from the majority nonhelix sequences with outputs less than 0.1. Although this result is probably not surprising, it helps to make clear the relation between anomalous helix sequences and majority nonhelix sequences in the input space. That is, anomalous helix sequences are distributed in an effectively random fashion among the majority nonhelix sequences, but become less common in regions corresponding to lower output values. Physically, as anomalous helix sequences sometimes contain very antihelix residues such as glycine, they may belong to helices that are positioned within the overall structure of the protein such that helix formation is strongly favored. The greater the number of antihelix residues in the sequence, however, the less likely it will be that it is found in the helix conformation. This is the reason why there are fewer anomalous helix sequences with output values of less than 0.1.

With our coding scheme each sequence is represented by a point at a vertex of a 200-dimensional hypercube. However, from the results above, a schematic two-dimensional picture of the distribution of helix and nonhelix sequences in the input space can be drawn. It is illustrated in Figure 8. The shaded areas represent nonhelix sequence regions, the white areas helix sequence. The horizontal line at the bottom of the figure can be thought of as an axis representing the output values at the peak of prediction success, which vary from 0 at the far left of the picture to 1 at the far right. The vertical thick black straight line represents the decision plane that is established after the fast learning phase and corresponds to an output value of 0.5. The anomalous nonhelix sequences are represented by the shaded regions to the right of the decision line, and are distributed effectively randomly among the majority helix sequences in accordance with the result that they were indistinguishable from majority helix sequences. The anomalous helix sequences, represented by the white regions to the left of the decision line, however, are not distributed equally throughout the space occupied by the majority nonhelix sequences, as they were found to be more distinguishable from majority nonhelix sequences with outputs less than 0.1 than from those with outputs greater than 0.1 (this division is represented by the vertical thick white line in the figure).

Fig. 8. Schematic two-dimensional representation of input space for helix and nonhelix sequences. Shaded areas show nonhelix regions, white areas helix regions. See text for further explanation.

Figure 3 can now be explained in detail as follows. After the decision plane has been established, the network then begins to learn the anomalous sequences during the slow learning phase. Bounding the anomalous nonhelix sequences creates regions of nonhelix prediction in the majority helix region. These regions of nonhelix prediction are effectively randomly distributed and are therefore more likely to coincide with actual helix than anomalous nonhelix sequences in the test set, so contributing to a decrease in helix prediction success. The regions of helix prediction established by the anomalous helix sequences in the majority nonhelix region are unlikely to coincide with the anomalous helix sequences in the test set and will make little or no contribution to helix prediction success, but instead help to disrupt nonhelix prediction. Thus overall there will be a disruption of both helix and nonhelix prediction during the slow learning phase, although, due to the smaller number of anomalous helix sequences, nonhelix prediction will be less affected (see Fig. 4). The peaking behavior in Figure 3 is frequently seen with neural networks trained by


the backpropagation algorithm and is often called overlearning. In this case it has been shown that it is the learning of around 20% of nonhelix sequences that causes the decline in overall prediction success. It has also been shown that these sequences are located throughout the space occupied by 90% of helix sequences on the other side of a hyperplane to the other 80% of nonhelix sequences. Therefore the structure in the division between helix and nonhelix sequences in the input space not only explains the mechanism behind this “overlearning,” but is in itself of significance.

Figure 6 also has a simple explanation. The networks with hidden nodes do not improve upon the single layer network as they also establish the same decision plane as the single layer network. This plane gives maximum prediction success for the reason explained above. The fact that the correlation coefficient is well below 1 is largely then due to the anomalous sequences in the test set.

These results put a limit on the success of helix prediction. One can, however, achieve some certainty in prediction, but at a cost. It was already mentioned that most of the anomalous helix sequences have output values of greater than 0.1. In Figure 8 this fact is represented by the small number of white regions to the left of the white line. Therefore, whenever a test sequence has an output value of less than 0.1, one can be fairly sure that the sequence belongs to the nonhelix structure. The cost, however, is that one will be very uncertain to which structure the sequence belongs if the output value is greater than 0.1. Using 0.1 then as the decision threshold, 94% of helix sequences were correctly predicted, but only 50% of nonhelix. Previously these two figures were 80 and 75%. As now only 6% of helices are predicted as nonhelix, this represents an increase in the accuracy of nonhelix prediction, together with a loss of helix prediction accuracy. This method will be of use if one requires high nonhelix prediction accuracy.
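The trade-off made by moving the decision threshold can be examined with a short routine such as the one below. It is a sketch under stated assumptions: the function name is hypothetical, the toy outputs are invented purely to exercise the code and are not data from this work, and an output exactly equal to the threshold is counted as a nonhelix prediction.

```python
def rates_at_threshold(outputs, targets, threshold):
    """Fractions of helix and nonhelix sequences correctly predicted.

    outputs: network output values between 0 and 1.
    targets: 1 for helix, 0 for nonhelix.
    Returns (helix_fraction_correct, nonhelix_fraction_correct).
    """
    helix = [o for o, t in zip(outputs, targets) if t == 1]
    nonhelix = [o for o, t in zip(outputs, targets) if t == 0]
    helix_correct = sum(1 for o in helix if o > threshold)
    nonhelix_correct = sum(1 for o in nonhelix if o <= threshold)
    return helix_correct / len(helix), nonhelix_correct / len(nonhelix)


# Invented toy values, for illustration only.
toy_outputs = [0.90, 0.70, 0.30, 0.15, 0.05, 0.08, 0.20, 0.60]
toy_targets = [1, 1, 1, 1, 0, 0, 0, 0]
at_half = rates_at_threshold(toy_outputs, toy_targets, 0.5)
at_tenth = rates_at_threshold(toy_outputs, toy_targets, 0.1)
```

Lowering the threshold can only move sequences from the nonhelix prediction to the helix prediction. The fraction of helix correctly predicted therefore cannot fall and the fraction of nonhelix correctly predicted cannot rise, which is the pattern seen in the figures quoted above.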

β-Sheet Prediction
There are relatively few examples of β-sheet 10 residues in length: 130 in the training set, and only 25 in the test set. Partly for this reason the network has greater difficulty in predicting β-sheet correctly. As coil and helix examples far outnumber β-sheet examples, the decision boundary is largely determined by the division between helix and coil. Using data from both the training and test sets and determining the percentages of all three structures in the three regions divided by the original 0.5 decision plane (0.5 threshold, thick black line in Fig. 8) and the 0.1 decision plane (0.1 threshold, white line in Fig. 8), one can get some idea of how these three structures are distributed in relation to each other in the input space. Figure 9 illustrates this. β-sheet has a distribution in the input space that lies roughly between helix and coil and has a greater preference than either for the region between the two planes. In addition a large percentage of β-sheet sequences, 38%, are anomalous nonhelix sequences, compared with 15% for coil.

Fig. 9. Distribution of helix (o), β-sheet (o), and coil (△) in the three regions of the input space defined by the two decision planes (see Fig. 8): Region 1 to the left of the 0.1 decision plane, Region 2 between the 0.1 and 0.5 decision planes, Region 3 to the right of the 0.5 decision plane.
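The percentages plotted in Figure 9 follow from assigning each sequence to a region according to its output value. A minimal sketch is given below. The function names, the structure labels, and the handling of outputs lying exactly on a decision plane are assumptions.

```python
def region_of(output, low=0.1, high=0.5):
    """Region 1: output below 0.1; Region 2: between 0.1 and 0.5; Region 3: above 0.5.

    Values exactly on a plane are placed in Region 2 (assumption).
    """
    if output < low:
        return 1
    if output <= high:
        return 2
    return 3


def region_percentages(outputs, structures):
    """For each structure, the percentage of its sequences in Regions 1, 2 and 3."""
    counts = {}
    for output, structure in zip(outputs, structures):
        counts.setdefault(structure, [0, 0, 0])[region_of(output) - 1] += 1
    return {
        structure: [100.0 * c / sum(per_region) for c in per_region]
        for structure, per_region in counts.items()
    }
```

The percentage of a nonhelix structure found in Region 3 is, by the definitions given earlier, its percentage of anomalous nonhelix sequences.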

Prediction for Boundary Residues
In other neural network approaches to predicting helix,10-12,14 it is the residue situated at the center of the window for which a prediction is made. In contrast we have trained and tested the network on its ability to learn and predict the structure of sequence segments known in advance to be wholly helix or nonhelix. Predicting the structure of a sequence segment known to be wholly helix or nonhelix is equivalent to predicting the structure of the central residue in the window, whereby the central residue is always at least (N − 1)/2 residues from a boundary. For the purposes of argument the window length N is taken to be an odd number of residues, although our conclusion will also apply to N being even. Helix prediction using windows not restricted to exclude boundary regions resulted in a lower correlation coefficient, 0.38 (ref. 12, using much longer windows), than the value of 0.52 obtained with this restriction using a window length of 10. This indicates that sequences spanning boundaries between the two structures are particularly difficult to predict.
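The restriction to whole-structure windows can be made concrete with a short routine. In the sketch below the per-residue assignment string and its three-letter alphabet ('H' for helix, 'E' for β-sheet, 'C' for coil) are hypothetical conventions chosen for illustration, as is the function name.

```python
def whole_structure_windows(sequence, assignment, n=10):
    """Return (segment, structure) for every window lying wholly in one structure.

    sequence:   one-letter amino acid sequence.
    assignment: string of the same length with one structure code per
                residue ('H', 'E' or 'C' here; hypothetical codes).
    Windows that span a boundary between two structures are skipped.
    """
    if len(sequence) != len(assignment):
        raise ValueError("sequence and assignment must have equal length")
    windows = []
    for start in range(len(sequence) - n + 1):
        states = set(assignment[start:start + n])
        if len(states) == 1:
            windows.append((sequence[start:start + n], states.pop()))
    return windows


toy_sequence = "ACDEFGHIKLMNPQRSTVWYACDEF"
toy_assignment = "C" * 5 + "H" * 12 + "E" * 8
toy_windows = whole_structure_windows(toy_sequence, toy_assignment, n=10)
```

A structure of length L contributes L − N + 1 windows when L is at least N and none otherwise, so in the toy example only the 12-residue helix yields windows of length 10, three in all.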

Effect of Window Size
All the work described above was done using a window length of 10 residues. To see how window size affects these results, window sizes 7 and 13 were also used. Figures 10 and 11 show the results of parallel runs to that of Figure 3 for both window sizes.

379

LIMIT ON HELIX PREDICTION WITH NEURAL NETWORK o.6 600

---rl 1

500

0.54

1 0 5

1

c

-

0

2

.-0

'p

a, 0

z.

- 0.42

L

3

0.4

0

a,

0 - 0.38

2

0.25

-

t

03

20

40

60

80

120

100

140

1

O.Zo

160

Fig. 10. Error (+) and correlation coefficient of test set prediction ( 0 ) plotted against training cycle for a network with 5 hidden nodes with data from a window size 7.

'

+

'

~

t

r

r '

A

1000

2000

3000

4000

5000

number of training patterns

cycles

150

H

f

0'3

A 0

41

0.35

_. (D 3

200

E t

1

0-5

0

0

L

4-4

a, ._

T

- 0.46

?

3

'

'

'

'

Fig. 12. Effect of size of training set on test set prediction Success forwindow sizes: 7 ( 0 ) ,10 (A), and 13 (0).Error bars are two standard deviations in length.

Fig. 11. Error (+) and correlation coefficient of test set prediction ( 0 ) plotted against training cycle for a network with 5 hidden nodes with data from a window size 13.

For both window sizes the same peaking in the correlation coefficient is observed. Table I shows the percentages of helix and nonhelix learned at the peak of prediction success for all three window sizes. The proportion of unlearned, or anomalous, sequences is greatest for window size 7 and lowest for window size 13. Figure 12 shows the effect of the size of the training set on prediction success for the three window sizes. Again no dependence on the number of hidden nodes was found. Window size 13 does better than 10 for the same number of sequences in the training set and 7 does worse than both. The difference in performance is far greater between 7 and 10 than it is between 10 and 13.

TABLE I. Results for Windows 7, 10, and 13

                 Training set
Window size      Helix      Nonhelix
7                87%        71%
10               91%        80%
13               94%        90%

Further Analysis of Anomalous Nonhelix Sequences

One possible explanation for the anomalous nonhelix sequences is that they are caused by errors in the structural data. Indeed, one region containing anomalous nonhelix sequence was found to be really in the helix conformation in an improved determination of the lysozyme structure (2LZM) in a more recent edition of the Brookhaven data bank. However, regions of anomalous nonhelix sequence for window size 10 were found in around 70% of the proteins used in training and testing, suggesting that errors in the structural data are not to blame. To investigate this further, only those proteins whose structures were determined to a resolution of better than 2.5 Å were selected for training and testing. Also, proteins whose sequences in Brookhaven showed a large degree of discrepancy with their sequences in the protein sequence database NBRF (ref. 24) were rejected. Training with this set also showed the telltale peaking behavior, thus demonstrating that anomalous nonhelix sequence is probably not caused by errors in the structural data.

A further important feature is that as the predictive window translates the protein sequence one residue at a time, windows of anomalous nonhelix sequence (false helix prediction) are often predicted consecutively. In the case of the window 10 residues in length, regions existed where more than 10 consecutive false predictions occurred, i.e., spanning more than 20 residues. What is more, many of these regions are falsely predicted helix with all three window sizes used.

The distribution of anomalous nonhelix sequences in the coil conformation was also tested statistically using the window length 10. The hypothesis that windows of anomalous nonhelix sequence in the coil conformation are more likely to occur near the C-termini of helices was upheld at the 5% significance level up to the fourth residue from the C-terminal. The hypothesis that windows containing anomalous nonhelix sequence in the coil conformation are more likely to occur near the N-termini of helices was upheld at the 10% significance level up to the fourth residue from the N-terminal. Although there were a greater number of windows of anomalous nonhelix sequence at the C-termini of helices than at the N-termini, this difference was not found to be significant.

Figure 13 shows such a region for sickle cell hemoglobin with the sequence VLDAFTQGLKHLDDLKGNFAQ (1HDS, β1 strand only). It spans 21 residues and lies between two helices. From its structure it is tempting to conclude that this region arose when a potentially longer helix was prevented from forming. Indeed, in the β2 strand of this protein the residues VLDAF and LKGN are in the helix conformation. Figure 14 shows a region of anomalous nonhelix sequence located at the C-terminus of a helix in arabinose binding protein. It has the sequence DMKVIAVDDQFVNAKGK. Figure 15 shows another region of anomalous nonhelix sequence located at the N-terminus of a helix in cytochrome c3 and has the sequence SLEFRDKANAKDIKLVES.

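
The relation between consecutive false helix predictions and the length of the region they cover can be made concrete with a small sketch. It assumes one boolean prediction per window position and one boolean label saying whether that window is actually helix; the function names and data layout are hypothetical and are not taken from the authors' software. A run of n consecutive windows of length w covers n + w - 1 residues, so 12 consecutive windows of length 10 cover 21 residues, the length of the hemoglobin region in Figure 13.

```python
def false_helix_runs(predicted_helix, actual_helix):
    """Find runs of consecutive windows falsely predicted as helix.

    predicted_helix[i] and actual_helix[i] are booleans for the window
    starting at residue i. Returns a list of (first_window, run_length).
    """
    n = min(len(predicted_helix), len(actual_helix))
    runs = []
    start = None
    for i in range(n):
        is_false_helix = predicted_helix[i] and not actual_helix[i]
        if is_false_helix and start is None:
            start = i  # a new run begins here
        elif not is_false_helix and start is not None:
            runs.append((start, i - start))  # the run ended at window i - 1
            start = None
    if start is not None:
        runs.append((start, n - start))  # run reaches the last window
    return runs


def residues_spanned(run_length, window):
    """Residues covered by run_length consecutive windows of a given size."""
    return run_length + window - 1


# Hypothetical example: three consecutive false helix windows.
predicted = [False, True, True, True, False, True]
actual = [False, False, False, False, False, True]
runs = false_helix_runs(predicted, actual)
spans = [residues_spanned(length, 10) for _, length in runs]
```

Keeping the run detection separate from the span arithmetic makes it easy to repeat the count for window sizes 7, 10, and 13.
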
The result that most of the anomalous nonhelix sequences in the coil conformation occur in structures adjacent to actual helices explains why boundary regions are difficult to predict and is independent evidence for the hypothesis that anomalous nonhelix sequences are from potential helix forming structures. That is, potentially longer helices have been prevented from forming by global constraints, in particular the requirement of placing hydrophobic residues in the interior of the protein. It has been reported by Kabsch and Sander that the ends of α-helices are often to be found in an overwound structure, such as a 3-turn or 3-helix, or the underwound 5-turn. One could speculate further that the formation of these structures is a likely response when a potentially longer helix is prevented from forming by long range constraints.

Fig. 13. Backbone of 1HDS, hemoglobin (sickle cell, β1 strand only). Highlighted is a region of anomalous nonhelix sequence in the coil conformation between residues 66 and 86, inclusive, on the β1 strand. Consecutive prediction of helix occurs as the window translates the sequence VLDAFTQGLKHLDDLKGNFAQ. It is situated between two helices.

Fig. 14. Backbone of 1ABP, arabinose binding protein. Highlighted is a region of anomalous nonhelix sequence in the coil conformation between residues 82 and 98 inclusive. Consecutive prediction occurs as the window translates the sequence DMKVIAVDDQFVNAKGK. It is situated at the C-terminus of a helix.

Fig. 15. Backbone of 1CY3, cytochrome c3. Highlighted is a region of anomalous nonhelix sequence in the coil conformation between residues 69 and 87 inclusive. Consecutive prediction occurs as the window translates the sequence SLEFRDKANAKDIKLVES. It is situated at the N-terminus of a helix.

CONCLUSION

Our main result is that anomalous nonhelix sequences were found to be indistinguishable from majority helix sequences. Although a large increase in the amount of data may reveal a difference in these two sets, the most plausible physical explanation is that the mapping of sequence to α-helix structure is ill defined for local sequence because of global constraints. This result has a precedent in the finding of Kabsch and Sander for a window length 5 residues, and is a possible generalization of their result for longer windows. This view is supported further by our finding that anomalous nonhelix sequence in the coil conformation occurs predominantly in regions adjacent to real helices, suggesting perhaps that a longer helix forms in a partially folded state and is subsequently distorted by structural adjustments late in the folding process. The structure of some of these regions supports this interpretation. Regions of anomalous nonhelix sequence have been found with windows up to 13 residues in length and sometimes span more than 20 residues in consecutive prediction as the window translates the sequence. These findings put any method of secondary structure prediction based solely on local information with the data bank at its present size in serious doubt, and suggest that no method based solely on local information should ever be relied upon to give satisfactory results.

REFERENCES

1. Anfinsen, C.B. Principles that govern the folding of protein chains. Science 181:223-230, 1973.
2. Chou, P.Y., Fasman, G.D. Conformational parameters for amino acids in helical, β-sheet, and random coil regions calculated from proteins. Biochemistry 13:211-222, 1974.
3. Chou, P.Y., Fasman, G.D. Prediction of protein conformation. Biochemistry 13:222-245, 1974.
4. Lewis, P.N., Gō, N., Gō, M., Kotelchuck, D., Scheraga, H.A. Helix probability profiles of denatured proteins and their correlation with native structures. Proc. Natl. Acad. Sci. U.S.A. 65:810-815, 1970.
5. Lim, V.I. Structural principles of the globular organization of protein chains: A stereochemical theory of globular protein secondary structure. J. Mol. Biol. 88:857-872, 1974.
6. Lim, V.I. Algorithms for prediction of α-helices and β-structural regions in globular proteins. J. Mol. Biol. 88:873-894, 1974.
7. Garnier, J., Osguthorpe, D.J., Robson, B. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 120:97-120, 1978.
8. Gibrat, J.F., Garnier, J., Robson, B. Further developments of protein secondary structure prediction using information theory. J. Mol. Biol. 198:425-443, 1987.
9. Kabsch, W., Sander, C. How good are predictions of protein secondary structure? FEBS Lett. 155:179-182, 1983.
10. Qian, N., Sejnowski, T.J. Predicting the secondary structure of globular proteins using neural network models. J. Mol. Biol. 202:865-884, 1988.
11. Holley, L.H., Karplus, M. Protein secondary structure prediction with a neural network. Proc. Natl. Acad. Sci. U.S.A. 86:152-156, 1989.
12. Bohr, H., Bohr, J., Brunak, S., Cotterill, R.M.J., Lautrup, B., Nørskov, L., Olsen, O.H., Petersen, S.B. Protein secondary structure and homology by neural networks (The α-helices in rhodopsin). FEBS Lett. 241:223-228, 1988.
13. McGregor, M.J., Flores, T.P., Sternberg, M.J.E. Prediction of β-turns in proteins using neural networks. Protein Engineer. 2:521-526, 1989.
14. Kneller, D.G., Cohen, F.E., Langridge, R. Improvements in protein secondary structure prediction by an enhanced neural network. J. Mol. Biol. 214:171-182, 1990.
15. Rooman, M.J., Wodak, S.J. Identification of predictive sequence motifs limited by protein structure data base size. Nature (London) 335:45-49, 1988.
16. Kabsch, W., Sander, C. On the use of sequence homologies to predict protein structure: Identical pentapeptides can have completely different conformations. Proc. Natl. Acad. Sci. U.S.A. 81:1075-1078, 1984.
17. Rumelhart, D.E., Hinton, G.E., Williams, R.J. Learning internal representations by error propagation. In: "Parallel Distributed Processing: Explorations in the Microstructure of Cognition," Vol. 1. Rumelhart, D.E., McClelland, J.L. (eds.). Cambridge, MA: MIT Press, 1986: 318-362.
18. Minsky, M., Papert, S. "Perceptrons: An Introduction to Computational Geometry." Cambridge, MA: MIT Press, 1969.
19. Kabsch, W., Sander, C. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577-2637, 1983.
20. Richards, G.D. Implementation and capab... feed-forward networks. Ph.D. thesis, Edinburgh University, 1990.
21. Richards, G.D., Tollenaere, T. Documentation for Rhwydwaith Version 2.1. Edinburgh Computing Service Note ECSP-UG-7, 1989.
22. Sali, D., Bycroft, M., Fersht, A.R. Stabilization of protein structure by interaction of α-helix dipole with a charged side chain. Nature (London) 335:740-743, 1988.
23. Baum, E., Haussler, D. What size net gives valid generalization? Neural Comp. 1:151-160, 1989.
24. Lesk, A.M., Boswell, D.R., Lesk, V.I., Lesk, V.E., Bairoch, A. A cross-reference table between the protein data bank of macromolecular structures and the national biomedical research foundations-Protein identification resource amino acid sequence data bank. Protein Seq. Data Anal. 2:295-308, 1989.
25. Udgaonkar, J.B., Baldwin, R.L. NMR evidence for an early framework intermediate on the folding pathway of ribonuclease A. Nature (London) 335:694-699, 1988.
26. Roder, H., Elöve, G.A., Englander, S.W. Structural characterization of folding intermediates in cytochrome c by H-exchange labelling and proton NMR. Nature (London) 335:700-704, 1988.
