PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS

[2]

[2] Hydrophobic Potentials from Statistical Analysis of Protein Structures

By CHARLES E. LAWRENCE and STEPHEN H. BRYANT

Introduction

The hydrophobic effect has been viewed as a key factor in protein folding since the publication of Kauzmann's seminal review in 1959. 1 Since the first protein crystal structure was solved, the "hydrophobic effect" has been apparent from the finding of a substantial overabundance of nonpolar residues within the interior. However, its role in protein stability has remained a debated topic to this day. 2-4 Furthermore, quantitative assessment of the contribution of hydrophobic interactions to protein stability remains in a primitive state compared to other components, for example, electrostatics. 5 This circumstance is not surprising given that the hydrophobic effect is a systems effect, stemming from the interaction of the protein with surrounding water molecules. Because of the difficult theoretical problems presented by this system, considerable effort has been focused on empirical approaches. Analyses of structural features thought to arise from hydrophobic interaction provide a summary description, and perhaps insight into its underlying physical basis.

The focus of this chapter is on statistical analysis of protein crystal data. First we present a framework for the statistical analysis of these data as a means to parameterize empirical potential functions. Then we review empirical studies of hydrophobicity from the perspective offered by this framework, with an eye to assessing progress to date toward quantitative descriptions of hydrophobic interactions.

1 W. Kauzmann, Adv. Protein Chem. 14, (1959).
2 H. Bloemendal, Y. Marcus, A. H. Sipkes, and G. Somsen, Int. J. Pept. Protein Res. 34, 405 (1989).
3 P. L. Privalov and S. J. Gill, Adv. Protein Chem. 39, 191 (1988).
4 K. A. Dill, Biochemistry 29, 7134 (1990).
5 S. C. Harvey, Proteins 5, 78 (1989).

METHODS IN ENZYMOLOGY, VOL. 202
Copyright © 1991 by Academic Press, Inc. All rights of reproduction in any form reserved.

Statistical Framework

Several observations suggest that substructure distributions in proteins follow a Boltzmann-like probability law, such that the relative frequencies of exchangeable substructures are related log-linearly to the difference in their contributions to conformer stability. The log frequencies with which the various residue types are accessible to solvent are correlated with water/organic phase partition coefficients of amino acid analogs. 6-8 Recent analysis has demonstrated a quantitative relationship of log ion-pair frequency with calculated electrostatic potential. 9 From the theoretical side, a Boltzmann-like probability law is suggested by the success of statistical-mechanical treatments of conformer equilibration in predicting the general features of proteins. 10,11 These theories are based on an assumption that substructure contributions to stability are additive, and they predict a Boltzmann-like or maximum entropy substructure distribution in a conformer ensemble. It has also been argued that accumulated point mutations produce an "evolutionary ensemble" with respect to substructures in the current database. 12

Existence of a Boltzmann-like probability law provides the basis for a general statistical methodology which may be used to derive apparent energetic parameters from structural data. In the remainder of this section we provide a mathematical statement of this model. Its application to ion-pair substructures has been discussed in greater detail elsewhere. 9

A Boltzmann-like probability model suggested by statistical mechanical theories 10-13 may be written

$$\frac{P(S^i)}{P(S^0)} = \exp[-\beta\,\delta\mu(S^i, S^0)] \tag{1}$$

Here $P(S^i)$ and $P(S^0)$ are probabilities of observing substructures of types $i$ and 0, and $\delta\mu(S^i, S^0)$ is the change in conformer free energy resulting from their exchange. $\beta$ is a positive constant analogous to the term $1/kT$ in the Maxwell-Boltzmann equation. $S^i$ and $S^0$ refer in general to the chemical and positional parameters of two or more amino acid residues, where nonbonded interaction contributes significantly to conformer stability.

6 J. Janin, Nature (London) 277, 491 (1979).
7 G. D. Rose, A. R. Geselowitz, G. J. Lesser, R. H. Lee, and M. H. Zehfus, Science 229, 834 (1985).
8 C. Lawrence, I. Auger, and C. Mannella, Proteins 2, 153 (1987).
9 S. H. Bryant and C. E. Lawrence, Proteins 9, 108 (1991).
10 K. A. Dill, Biochemistry 24, 1501 (1985).
11 I. M. Lifshitz, A. Y. Grosberg, and A. R. Khokhlov, Rev. Mod. Phys. 50, 683 (1978).
12 O. G. Berg and P. H. von Hippel, J. Mol. Biol. 193, 723 (1987).
13 S. Miyazawa and R. L. Jernigan, Macromolecules 18, 534 (1985).

When $i$ and 0 differ with respect to a subset of structural parameters, Eq. (1) may be written

$$\frac{P(R^i \mid X, \Theta)}{P(R^0 \mid X, \Theta)} = \exp[-\beta\,\delta\mu(R^i, R^0 \mid X, \Theta)] \tag{2}$$
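As a numerical illustration, the Boltzmann-like relation of Eq. (1) can be inverted to give an apparent potential difference from two observed frequencies. The following is a minimal sketch rather than the authors' procedure: the function name, the counts, and the choice of $\beta = 1$ (so that energies are expressed in units of $1/\beta$) are assumptions made for the example.

```python
import math


def apparent_potential_difference(count_i, count_0, beta=1.0):
    """Invert Eq. (1): delta_mu(S^i, S^0) = -(1/beta) * ln[P(S^i) / P(S^0)].

    The ratio of observed counts is used as an estimate of the
    probability ratio; both counts must come from the same database.
    """
    if count_i <= 0 or count_0 <= 0:
        raise ValueError("both counts must be positive")
    return -math.log(count_i / count_0) / beta


# Hypothetical counts: type i observed 50 times, reference type 0 observed 200 times.
delta_mu = apparent_potential_difference(50, 200)
```

With these hypothetical counts the rarer substructure is assigned a positive apparent free energy relative to the reference, equal to ln 4 in units of $1/\beta$.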

In Eq. (2), $R^i$ and $R^0$ are the structural parameters which distinguish $i$ and 0, and $X$ are structural parameters describing a protein environment where this substitution may occur. $R^i$ and $R^0$ may refer, for example, to the chemical type of a reference residue, and $X$ to relative coordinates and types of neighbor residues. 14-18 It is useful to consider such "conservative" exchanges because they are known to occur in proteins, and hence may be described by Eq. (1). Furthermore, the dependence of potential difference on environmental parameters may be known from chemical physics or model systems. The energetic cost of substituting valine for threonine, for example, should depend on accessibility to water. 6-8 The parameters $\Theta$ are employed to denote explicitly the dependence of these frequencies on energetic parameters.

Given observed substructure frequencies, we wish to derive apparent energetic parameters $\Theta$. For this purpose it is useful to adopt a data analysis strategy based on maximum likelihood estimation. 19 This procedure finds best-fit parameter values and in addition provides access to the likelihood ratio test statistic. The "best fit" is achieved by finding the parameter values that maximize the likelihood, the "chance" of the observations that have been collected. The likelihood ratio statistic may be used to evaluate goodness of fit, define confidence limits on $\Theta$, and determine whether alternative potentials and/or substructure definitions lead to better agreement of observed and predicted substructure frequencies.

Given that a substructure of type $i \in \{0, 1, \ldots, I\}$ occurs at a particular site, it is easily shown from Eq. (2) that individual substructure probabilities may be written

$$P(R^i \mid X, \Theta) = \exp[-\beta\,\delta\mu(R^i, R^0 \mid X, \Theta)]/Z, \qquad i = 0, 1, \ldots, I \tag{3a}$$

where $Z = \sum_{i=0}^{I} \exp[-\beta\,\delta\mu(R^i, R^0 \mid X, \Theta)]$. For continuous characteristics this becomes

$$f(R \mid X, \Theta) = \exp[-\beta\,\delta\mu(R \mid X, \Theta)]/Z \tag{3b}$$

where $Z = \int \exp[-\beta\,\delta\mu(R \mid X, \Theta)]\,dR$.

14 S. H. Bryant and L. M. Amzel, Int. J. Pept. Protein Res. 29, 46 (1987).
15 S. K. Burley and G. A. Petsko, Adv. Protein Chem. 39, 125 (1988).
16 D. J. Barlow and J. M. Thornton, J. Mol. Biol. 168, 867 (1983).
17 M. Sundaralingam, Y. C. Sekharudu, N. Yathindra, and V. Ravichandran, Proteins 2, 64 (1987).
18 K. A. Thomas, G. M. Smith, T. B. Thomas, and R. J. Feldmann, Proc. Natl. Acad. Sci. U.S.A. 79, 4843 (1982).
19 M. Kendall and A. Stuart, "The Advanced Theory of Statistics, Volume 2: Inference and Relationship." Macmillan, New York, 1979.
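The normalization in Eq. (3a) can be sketched numerically. The example below is illustrative only: the function name and the potential differences are hypothetical, $\beta$ is set to 1, and the reference type 0 is given a potential difference of zero by construction.

```python
import math


def substructure_probabilities(delta_mu, beta=1.0):
    """Eq. (3a): P(R^i | X, Theta) = exp(-beta * delta_mu_i) / Z.

    delta_mu[i] is the potential difference for exchanging the
    reference type 0 for type i, so delta_mu[0] should be 0.0.
    """
    # Subtracting the minimum leaves the ratios unchanged and avoids overflow.
    shift = min(delta_mu)
    weights = [math.exp(-beta * (d - shift)) for d in delta_mu]
    z = sum(weights)  # partition sum Z over types 0..I
    return [w / z for w in weights]


# Hypothetical potential differences for types 0, 1 and 2 in one environment.
probs = substructure_probabilities([0.0, 1.0, -0.5])
```

Dividing by $Z$ guarantees that the probabilities sum to 1, while the ratio of any probability to that of the reference type still satisfies Eq. (2).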


Equations (3a) and (3b) constrain substructure probabilities to sum to 1. Type 0 is designated arbitrarily as a reference state, such that the potential difference corresponds to an exchange of 0 for $i$. Given $N_i$ substructure observations of each type, the log likelihood for a data set $S = \{R_j^i, X_j^i;\ i = 0, 1, \ldots, I,\ j = 1, 2, \ldots, N_i\}$ may now be written

$$\log L = \sum_{i=0}^{I} \sum_{j=1}^{N_i} \ln[P(R_j^i \mid X_j^i, \Theta)] \tag{4}$$

The values $\Theta_{\max}$ which maximize this expression are the maximum likelihood estimates of the unknown energetic parameters. Substituting $P(R^i \mid X, \Theta)$ from Eq. (3a) into Eq. (4) and illustrating the maximization explicitly yield

$$\max_{\Theta} \log L = \max_{\Theta} \sum_{i=0}^{I} \sum_{j=1}^{N_i} \{-\beta\,\delta\mu(R_j^i, R^0 \mid X_j^i, \Theta) - \ln[Z(\Theta)]\} \tag{4a}$$
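To make the maximization in Eq. (4a) concrete, the sketch below fits a two-type model in which the potential difference is assumed to depend linearly on a single environmental variable $x$, $\delta\mu = \theta_0 + \theta_1 x$, with $\beta = 1$. The linear form, the Newton-Raphson iteration, the function name, and the data are all assumptions chosen for illustration; the chapter does not prescribe a particular functional form or optimizer.

```python
import math


def fit_two_state_potential(xs, ys, iterations=25):
    """Maximise Eq. (4a) for types {0, 1} with delta_mu = theta0 + theta1 * x.

    xs: environmental variable for each observation (hypothetical, e.g. burial).
    ys: 1 if the observed type is 1, 0 if it is the reference type 0.
    From Eq. (3a), P(type 1 | x) = exp(-delta_mu) / (1 + exp(-delta_mu)).
    """
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(theta0 + theta1 * x))  # P(type 1 | x)
            # Gradient of log L with respect to (theta0, theta1).
            g0 += p - y
            g1 += (p - y) * x
            # Hessian of log L (negative definite, so log L is concave).
            w = p * (1.0 - p)
            h00 -= w
            h01 -= w * x
            h11 -= w * x * x
        det = h00 * h11 - h01 * h01
        # Newton-Raphson step: theta <- theta - H^{-1} g.
        theta0 -= (h11 * g0 - h01 * g1) / det
        theta1 -= (-h01 * g0 + h00 * g1) / det
    return theta0, theta1


# Hypothetical data: 10 sites at each of three x values, with 2, 5 and 8
# occurrences of type 1, respectively.
xs = [0.0] * 10 + [0.5] * 10 + [1.0] * 10
ys = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
theta0, theta1 = fit_two_state_potential(xs, ys)
```

For this toy data set the fitted model reproduces the observed proportions (0.2, 0.5, 0.8) exactly, which gives $\theta_0 = \ln 4$ and $\theta_1 = -2\ln 4$.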

The maximum likelihood problem of Eq. (4a) is thus closely related to the maximum entropy problem, but instead of maximizing across frequencies for fixed values of the parameters and observed averages, the maximization is over the unknown parameters given observed individual frequencies.

For a goodness-of-fit test, data for each substructure type are separated into $B$ bins according to the value of $X_j$. Intervals should be chosen such that there are meaningful differences among observed substructure counts in each bin, $n_{i,j}$, $j = 1, \ldots, B$. Predicted counts $e_{i,j}$, $j = 1, \ldots, B$, are obtained by summing Eq. (3a) over observations in each bin, using maximum likelihood estimates of the parameters. The likelihood ratio test statistic may now be written

$$\mathrm{LRTS} = 2\left[\sum_{i=0}^{I} \sum_{j=1}^{B} n_{i,j} \ln\left(\frac{n_{i,j}}{e_{i,j}}\right)\right] \tag{5}$$
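A goodness-of-fit statistic of this kind can be computed directly from tables of observed and predicted counts. The sketch below assumes the standard likelihood ratio form, 2 Σ n ln(n/e), and compares it with the conventional Pearson statistic; the function names and the counts are hypothetical.

```python
import math


def likelihood_ratio_statistic(observed, expected):
    """Likelihood ratio statistic: 2 * sum of n * ln(n / e) over types and bins.

    observed[i][j] and expected[i][j] are counts for type i in bin j.
    """
    total = 0.0
    for n_row, e_row in zip(observed, expected):
        for n, e in zip(n_row, e_row):
            if n > 0:  # a cell with n = 0 contributes nothing (n ln n -> 0)
                total += n * math.log(n / e)
    return 2.0 * total


def pearson_statistic(observed, expected):
    """Conventional chi-squared statistic: sum of (n - e)^2 / e."""
    return sum(
        (n - e) ** 2 / e
        for n_row, e_row in zip(observed, expected)
        for n, e in zip(n_row, e_row)
    )


# Hypothetical counts for two substructure types in two bins.
observed = [[18, 22], [32, 28]]
expected = [[20.0, 20.0], [30.0, 30.0]]
g2 = likelihood_ratio_statistic(observed, expected)
x2 = pearson_statistic(observed, expected)
```

When observed and predicted counts are close, as here, the two statistics are nearly equal, which illustrates their asymptotic equivalence.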

The statistic of Eq. (5) is asymptotically chi-squared with $I(B - Q)$ degrees of freedom, where $Q$ is the number of free parameters $\Theta = \{\Theta_1, \ldots, \Theta_Q\}$ in the model used to obtain predicted counts. 20 The probability that differences in observed and predicted counts are statistically significant may thus be obtained from standard tables. It has been shown that the LRTS is asymptotically equivalent to a conventional $\chi^2$ goodness-of-fit statistic. 20

20 Y. M. M. Bishop, S. E. Fienberg, and P. W. Holland, "Discrete Multivariate Analysis: Theory and Practice." MIT Press, Cambridge, Massachusetts, 1975.

The likelihood ratio statistic may be used more generally to test hypotheses represented by specific values of $\Theta$. The test statistic for the hypothesis $\Theta_i = \Theta_i^0$, $i = 1, 2, \ldots, P$, $P < Q$, may be written

$$\mathrm{LRTS} = 2\left[\max_{\Theta_1, \ldots, \Theta_Q} \ln L(\Theta_1, \ldots, \Theta_Q)\right] - 2\left[\max_{\Theta_{P+1}, \ldots, \Theta_Q} \ln L(\Theta_1^0, \ldots, \Theta_P^0, \Theta_{P+1}, \ldots, \Theta_Q)\right] \tag{6}$$
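The hypothesis test of Eq. (6) can be illustrated with the simplest possible case: two substructure types whose potential difference is a single parameter $\theta$ (so $Q = 1$), tested against the hypothesis $\theta = 0$ (so $P = 1$). This is a sketch under those assumptions, with $\beta = 1$ and hypothetical counts; the tail probability uses the closed form for a chi-squared variable with one degree of freedom.

```python
import math


def loglik_two_types(n1, n0, theta=None):
    """ln L for n1 observations of type 1 and n0 of type 0, delta_mu = theta.

    If theta is None the maximum likelihood estimate -ln(n1 / n0) is used.
    """
    if theta is None:
        theta = -math.log(n1 / n0)
    p1 = 1.0 / (1.0 + math.exp(theta))  # Eq. (3a) with two types
    return n1 * math.log(p1) + n0 * math.log(1.0 - p1)


def chi2_tail_one_df(x):
    """P(chi-squared with 1 degree of freedom > x)."""
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical counts: 60 sites with type 1, 40 with the reference type 0.
n1, n0 = 60, 40
lrts = 2.0 * loglik_two_types(n1, n0) - 2.0 * loglik_two_types(n1, n0, theta=0.0)
p_value = chi2_tail_one_df(lrts)
```

Here the first term is the log likelihood maximized over the free parameter and the second is the log likelihood with the parameter fixed at its hypothesized value; a small tail probability indicates that the hypothesized value is poorly supported by the counts.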

The statistic of Eq. (6) is known to be asymptotically chi-squared with $P$ degrees of freedom, one for each parameter fixed by the hypothesis. 19 A confidence interval for $\Theta_i$ may be defined by calculating the probability of $\Theta_i = \Theta_i^0$ for various values $\Theta_i^0$. Alternative potentials may be evaluated by devising a general model which reduces to the alternatives in question for specific values of $\Theta$. These may involve dependence on alternate environmental parameters $X$, or an alternate functional dependence on $X$.

In applying this model to structural data two caveats should be borne in mind. The first is that the assumptions of an additive free energy relationship and Boltzmann-like probability law are at best approximations. Derived energetic parameters describe potentials of mean force 21 with respect to the current database, and their accuracy with respect to specific predictions by molecular mechanics calculation and their consistency with other force field parameters remain to be addressed.

The second caveat is that interpretation of relative substructure frequencies in terms of energetic parameters is reasonable only when exchange may be assumed to have occurred freely, in the course of the folding and evolutionary processes governing protein conformation, and when exchange is constrained primarily by conformer stability. Although serine is always observed in the active site of serine proteases, for example, the reason is not that its interactions with nearby residues are necessary for folding, but rather that the serine residue is required for catalytic activity. This is probably the exception rather than the rule, however, and it seems reasonable to assume that the statistical model presented above is generally applicable to residue-environment interactions. Exchangeability, stability constraint, and approximate additivity have in any case been demonstrated for this class of substructure by mutagenesis experiments. 22-24

21 T. L. Hill, "Introduction to Statistical Thermodynamics." Addison-Wesley, Reading, Massachusetts, 1960.
22 W. A. Lim and R. T. Sauer, Nature (London) 339, 31 (1989).
23 J. A. Wells, Biochemistry 29, 8509 (1990).
24 D. Shortle, J. Biol. Chem. 264, 5313 (1989).
25 S. J. Gill and I. Wadsworth, Proc. Natl. Acad. Sci. U.S.A. 73, 2955 (1976).

Empirical Measures of Residue Hydrophobicity

The assumption that residue interactions with water are additive and independent has some theoretical basis in "iceberg" models for hydrophobic effects. 1,25 Empirical studies of hydrophobicity have thus focused on

solvent accessibility of amino acid residues and derivative quantities. Accessibility is formally defined as the estimated residue surface area accessible to water. The accessibility of the residues is normally estimated through the use of an algorithm which rolls a sphere with the diameter of a water molecule over the three-dimensional (3D) crystal structure. 26 Although the mapping is involved, accessibility can be described as a function of the pairwise distances at an atomic level. 27

Nearly all authors who have examined accessibility have reduced the complexity of the problem by treating the accessibility data as if they were an independent random sample across all the proteins in the database, and across all of the positions within each protein. They further assume that there are no interactions among the residue types. That is, these analyses fix residue type $R_i$ and examine differences in potential associated with changes in accessibility, $a_i$. The application of Eq. (3b) yields

$$f(a_i \mid R_i, \Theta) = \exp[-\beta\,\delta\mu(a_i \mid R_i = r, \Theta_r)]/Z \tag{7}$$
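A density of the form of Eq. (7) can be evaluated once a functional form for the potential is chosen. The sketch below assumes, purely for illustration, a potential that is linear in fractional accessibility, $\delta\mu(a \mid r) = \theta_r a$ on $0 \le a \le 1$, for which $Z$ has a closed form. The linear form, the function names, and the parameter value are assumptions and are not taken from the studies reviewed here.

```python
import math


def accessibility_density(a, theta_r, beta=1.0):
    """Eq. (7) with the assumed potential delta_mu(a | r) = theta_r * a, 0 <= a <= 1."""
    c = beta * theta_r
    if abs(c) < 1e-12:
        return 1.0  # no preference: uniform density on [0, 1]
    z = (1.0 - math.exp(-c)) / c  # Z = integral of exp(-c * a) over [0, 1]
    return math.exp(-c * a) / z


def integrate_midpoint(f, lo, hi, steps=10000):
    """Midpoint-rule integral, used only to check the normalisation."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))


# Hypothetical parameter for a nonpolar residue type: exposure is penalised.
theta_r = 2.0
area = integrate_midpoint(lambda a: accessibility_density(a, theta_r), 0.0, 1.0)
```

A positive $\theta_r$ makes buried states (small $a$) more probable than exposed ones, and the numerical integral confirms that the density is normalized.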

The assumptions above are equivalent to the following: (1) all residues of a given chemical type from all proteins are governed by the same free energy model; (2) the free energy contributions from each of these are additive; and (3) the free energy may be fully specified on some accessibility-related scale. Many authors using this approach have assumed that the full hydrophobic potential of a residue may be captured in a single parameter, its hydrophobic index, $H(R)$.

Janin 6 employs a binary description of accessibility. In this case [...] $\phi > \phi_c$ when $Y = 0$. The probabilities of these two errors are, respectively, $P(\phi < \phi_c \mid Y = 1)$ and $P(\phi > \phi_c \mid Y = 0)$.


Predictions based on hydrophobic profiles employ average hydrophobicities to predict that a sequence of residues is either buried or accessible. Letting $Y = 1$ if the residues from $i$ to $i + p$ are buried, a hydrophobic profile predicts $Y = 1$ if $\phi = (1/p) \sum_{j=i}^{i+p} w_j H(R_j) > \phi_c$, where $w_j$ is the weight for residue $j$. The most commonly employed profiles are unweighted, that is, $w_j = 1$ for $j = i$ to $i + p$. The profile is generally plotted for all overlapping segments of length $p$ and the critical value taken as that of the $\alpha$-most extreme of these. To a surprising extent the literature on hydrophobic profiles focuses on the choice of hydrophobic indices, as referenced by Rose and Dworkin. 31 To a lesser extent the length of the segment, $p$, has been considered. Evaluation of the predictive success of these methods has focused on a comparison of these profiles with profiles of accessibility for proteins of known structure. 31 Rao and Argos have described hydrophobic indices useful for the identification of transmembrane helices and described the relatively high predictive accuracy that may be obtained. 32

Lipman and Pastor 33 compare a histogram of tripeptide accessibilities with histograms of tripeptide hydrophobicities obtained using several hydrophobicity scales. They note that the accessibility histogram has an excess of highly buried tripeptide segments compared to shuffled data, whereas hydrophobicity histograms lack this feature. This excess is clear evidence that the last term in Eq. (11) is nonzero. Yet the obvious extension of hydrophobic profile analyses to consider the joint characteristics of neighboring residues has received scant attention.

Hydrophobic indices have also been used for the prediction of amphipathic helices. An amphipathic helix is one that is hydrophobic on one face and hydrophilic on the other. To achieve this configuration, a periodicity in the hydrophobicity of the residues in the sequence that forms the helix is required. This periodic pattern is captured by the following prediction function:

$$\phi(R_i, \ldots, R_{i+p}) = \left[\sum_{j=i}^{i+p} H(R_j)\sin(\omega j)\right]^2 + \left[\sum_{j=i}^{i+p} H(R_j)\cos(\omega j)\right]^2 \tag{12}$$

where $\omega$ is the frequency. The key literature has been reviewed by Cornette et al. 34 and Eisenberg et al. 35

31 G. D. Rose and J. E. Dworkin, in "Prediction of Protein Structure and the Principles of Protein Conformation" (G. D. Fasman, ed.), pp. 625-634. Plenum, New York, 1989.
32 J. K. M. Rao and P. Argos, Biochim. Biophys. Acta 869, 197 (1986).
33 D. J. Lipman and R. W. Pastor, Biopolymers 26, 17 (1987).
34 J. L. Cornette, K. B. Cease, H. Margalit, J. L. Spouge, J. A. Berzofsky, and C. DeLisi, J. Mol. Biol. 195, 659 (1987).
35 D. Eisenberg, M. Wesson, and W. Wilcox, in "Prediction of Protein Structure and the Principles of Protein Conformation" (G. D. Fasman, ed.), pp. 635-646. Plenum, New York, 1989.
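Both prediction functions discussed in this section, the unweighted hydrophobic profile and the periodicity function of Eq. (12), are simple sums over a sequence window and can be sketched as follows. The two-level index `TOY_INDEX` is a hypothetical placeholder, not a published hydrophobicity scale, and the function names and example sequences are likewise invented for illustration. The periodicity function is written here as the sum of squared sine and cosine terms.

```python
import math

# Hypothetical two-level hydrophobic index, for illustration only.
TOY_INDEX = {
    "L": 1.0, "I": 1.0, "V": 1.0, "F": 1.0, "A": 1.0,
    "K": -1.0, "E": -1.0, "D": -1.0, "S": -1.0, "Q": -1.0,
}


def hydrophobic_profile(sequence, window, index=TOY_INDEX):
    """Unweighted mean index (w_j = 1) over every overlapping segment."""
    values = [index[residue] for residue in sequence]
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


def periodicity_power(sequence, omega, index=TOY_INDEX):
    """Eq. (12): squared sine sum plus squared cosine sum at frequency omega.

    omega is in radians per residue; positions are counted from 0, which
    changes only the phase and leaves the result unchanged.
    """
    s = sum(index[r] * math.sin(omega * j) for j, r in enumerate(sequence))
    c = sum(index[r] * math.cos(omega * j) for j, r in enumerate(sequence))
    return s * s + c * c


profile = hydrophobic_profile("LLLKKK", window=3)
power_period_two = periodicity_power("LKLKLKLK", math.pi)
power_helical = periodicity_power("LKLKLKLK", math.radians(100.0))
```

The strictly alternating toy sequence has its maximum power at a period of two residues ($\omega = \pi$) and little power at 100 degrees per residue, the frequency corresponding to the 3.6 residues per turn of an α helix; an amphipathic helix would show the opposite pattern.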


Evaluations of alternative conformations by means of hydrophobic potentials have been attempted by simulations of folding using lattice models. 36-39 Residue potentials have generally been assumed to take a simple form, such as a unit potential difference for each contact to a water-containing lattice site. A remarkable result from these simulations is that specific, sequence-dependent folding reactions may be observed, in the computer, when only a simplistic hydrophobic potential is assumed. Protein folding is more complex than this, and one may question the relevance of the simulations on these grounds, but the congruence of these simulations with many known properties is perhaps some of the best evidence that hydrophobic potentials will be a dominant term in an empirical energy function capable of predicting approximate global conformation.

Comparisons of hydrophobic energy for native and "misfolded" proteins have also been used in attempts to evaluate the power of hydrophobic potentials to identify correct model structures. 14,35,40 A "misfolded" structure has a backbone conformation similar to that of a known protein, but some other sequence. Calculations for random sequence assignments may be viewed as enumeration of a "sequence ensemble" specific to a particular backbone conformation. Results from these calculations are reminiscent of lattice model simulations. Very few, if any, nonnative sequences have lower hydrophobic energy than the native conformer, even when very approximate, single-point-per-residue representations are employed. This again suggests that hydrophobic components will be important in any empirical force field useful for discriminating among the alternative tertiary structures accessible to a polypeptide species.
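The kind of simple lattice potential described above, a unit cost for each contact between a hydrophobic residue and a water-containing site, can be sketched in a few lines. This is a toy two-dimensional example under stated assumptions: the square lattice, the two-letter H/P alphabet, the function name, and the conformations are hypothetical and do not reproduce any of the cited simulations.

```python
def solvent_contact_energy(sequence, coords):
    """Count contacts between hydrophobic residues and empty lattice sites.

    sequence: string of 'H' (hydrophobic) and 'P' (polar) residues.
    coords: one (x, y) site on a square lattice per residue.
    Every unoccupied neighbour site is treated as water-containing, and
    each H-water contact adds one unit of energy.
    """
    occupied = set(coords)
    if len(occupied) != len(coords):
        raise ValueError("two residues occupy the same lattice site")
    energy = 0
    for residue, (x, y) in zip(sequence, coords):
        if residue != "H":
            continue
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if (x + dx, y + dy) not in occupied:
                energy += 1
    return energy


sequence = "HPPH"
extended = [(0, 0), (1, 0), (2, 0), (3, 0)]
compact = [(0, 0), (1, 0), (1, 1), (0, 1)]
energy_extended = solvent_contact_energy(sequence, extended)
energy_compact = solvent_contact_energy(sequence, compact)
```

For this four-residue chain the compact conformation brings the two hydrophobic residues into contact and lowers the energy from 6 to 4 units, the kind of sequence-dependent preference that drives folding in such simulations.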

36 D. G. Covell and R. L. Jernigan, Biochemistry 29, 3287 (1990).
37 H. Taketomi, F. Kano, and N. Go, Biopolymers 27, 527 (1988).
38 J. Skolnick, A. Kolinski, and R. Yaris, Proc. Natl. Acad. Sci. U.S.A. 85, 5057 (1988).
39 K. F. Lau and K. A. Dill, Proc. Natl. Acad. Sci. U.S.A. 87, 638 (1990).
40 J. Novotny, A. A. Rashin, and R. E. Bruccoleri, Proteins 4, 19 (1988).

Conclusion

Two questions arise from review of progress to date in parameterization of hydrophobic potentials. The first is how hydrophobic interaction per se can be separated from other forces at play in protein folding, such as electrostatics. Potentials parameterized on accessibility-related scales may not address this question, yet it will be at the heart of attempts to combine hydrophobic potentials with force fields from molecular mechanics. The statistical model described above would appear to provide some means to separate residue-residue and residue-solvent interactions into major components, by judicious choice of reference state, and its application to hydrophobic interactions may provide some interesting results.

The second question is how one should evaluate hydrophobic potentials. There are many reasons to believe that the hydrophobic effect is a dominant force in protein folding, 4 yet it remains to be shown that hydrophobic potentials can predict specific folding to a near-native conformational state, or distinguish among plausible alternatives. Predictions based on partition of local segments to a "buried" milieu may not be the answer, since this description may apply to transmembrane helices but not to globular protein folding. Recent advances in computer technology make it possible, however, to simulate globular protein folding and/or evolution of amino acid sequence, albeit with simplified representations. One may expect that this approach will continue to provide insight as to the terms in an empirical potential which are most important for prediction of native structure.

[3] Predicting Protein Secondary Structure Based on Amino Acid Sequence

By KEN NISHIKAWA and TAMOTSU NOGUCHI

Introduction

Prediction of protein secondary structure is a well-defined digital and linear problem. It is digital because the conformational state which each amino acid residue can adopt is assumed to be either an α helix, β strand, or a coil, so that correctly predicted residues are easily countable. It is linear because prediction is carried out to locate secondary-structure elements along the primary structure (traditionally, the pairing problem of β strands to form a β sheet has been ignored in secondary-structure prediction). Therefore, the goal is to seek correspondence between two one-dimensional digital series of primary and secondary structures. It is a simple and ideal system of prediction, and appears to be easily resolved. At the beginning of the history of this field, researchers regarded it so and in fact claimed that a simple prediction method was enough to give satisfactory results. But difficulties became apparent when the existing methods were applied and reevaluated against those proteins whose three-
