Introduction

A fundamental tenet of molecular biology is that the amino acid sequence of a protein determines its structure and therefore its function. Indeed, the spontaneous folding of proteins is the point where Nature makes its leap from the one-dimensional genetic message of DNA to the three-dimensional world we inhabit. For a long time molecular biologists have been seeking a Rosetta stone that would elucidate the translation between the one-dimensional language of sequence and the three-dimensional language of structure. Although the Rosetta stone itself has not turned up yet, perhaps traces of it have recently appeared, and from these it is clear where people are concentrating their search.

Recent progress has shown that whereas the connection between one sequence and one structure is too tenuous to interpret, conclusions may more confidently be derived from a table of aligned homologous sequences. Particularly noteworthy is the work of Benner and Gerloff(1), who have achieved remarkable results in the a priori structure prediction of the catalytic domain of protein kinase C, by a method that involves careful analysis of the patterns of residue variation in a family of related sequences. This is a spectacular achievement and it is our feeling that it will come to be recognized as a major breakthrough, but so far it is an isolated success.

More typical of the currents in the recent literature is the work of David Eisenberg and his colleagues Roland Lüthy and James Bowie(2,3). (Similar work has been carried out by Overington, Johnson, Sali and Blundell(4).) They have been developing ways to derive information in the typical and important case of the many-known-sequences-few-known-structures scenario. First they addressed the problem(2): Given that sequence x folds into structure X, can we find other sequences x1, x2, ... that also fold into structure X? But then (it would seem) they pulled up short and asked(3): ‘Wait a moment, folks, does the original sequence x really fold into structure X?’ This is not a trivial question. Brändén and Jones(5) discuss several examples of crystal structures containing gross errors in folding, and these among others are treated by Lüthy, Bowie and Eisenberg(3).

The former question: ‘Given an amino acid sequence and the corresponding protein structure, can one identify other sequences that share a common folding pattern?’ has been the subject of many investigations. It is well known that when a new amino acid sequence has significant similarity to the sequence of a protein of known structure, then homology, and similarity of the corresponding folds, can safely be inferred. In the course of evolution, however, structure changes more conservatively than sequence. Proteins with common evolutionary origin have similar folds, even though they may have diverged so far as to have no detectable sequence homology. A classic example occurs in rhodanese, the two domains of which presumably arose by gene duplication and divergence: the sequences have diverged beyond recognition as homologous, but the structure has been well preserved. One would therefore like a way to recognize similar structures among sequences that have diverged to what R. F. Doolittle has called the ‘twilight zone’, of sequence similarity too distant to prove homology. We would also point out that recognition of sequence similarity is not a Rosetta stone that links sequence with structure; instead it is a method for recognizing related sequences. (Comparably, it would be possible to recognize similarity between two panels of Egyptian hieroglyphics without being able to translate either into Greek.)

Characterization of Residue Environments in Protein Structures

What have Bowie, Lüthy and Eisenberg done? Their approach has been to try to classify the environments of each position in known protein structures - rather than any particular sequence of residues that occupies these environments - in a way that can be linked to a set of preferences of the 20 amino acids for these structural contexts.
If the essence of a known structure can be abstracted in such a way that the properties of positions in the structure can be related to general characteristics of the amino acids rather than to the particular sequence, then sequences that share these abstract properties might be expected to adopt similar structures. A particular advantage of this method is that it can be automated, with a new sequence being scored against every 3D profile in the library of known folds in essentially the same way as a new sequence is routinely screened against a library of known sequences.

Given a protein structure, Lüthy, Bowie and Eisenberg characterize the environment of each amino acid sidechain on the basis of three categories: (1) its main-chain hydrogen-bonding interactions, that is, its secondary structure; (2) the extent to which it is on the inside or the outside of the protein structure; and (3) the polar/nonpolar nature of its environment. The secondary structure may be one of three possibilities: helix, sheet and coil. A sidechain is

Fig. 1. Bowie, Lüthy and Eisenberg characterize the environments of residues in proteins by their degree of exposure to solvent, and the polar vs. non-polar nature of the protein atoms with which they are in contact. Six classes are shown here. A third dimension of the classification is secondary structure - the mainchain hydrogen-bonding interactions of the residues. There are three classes of secondary structure: helix, sheet and other. This gives a total of 6 × 3 = 18 classes. The statistical preference of certain amino acids for certain classes permits identification of folding patterns and detection of errors in structures.

considered buried if the accessible surface area is less than 40 Å², partially buried if the accessible surface area is between 40 and 114 Å², and exposed if the accessible surface area is greater than 114 Å². The fraction of sidechain area covered by polar atoms is measured. The authors define six classes on the basis of accessibility and polarity of surroundings (see Fig. 1); sidechains in each of these six classes may be in any of the three types of secondary structures, giving a total of 18 classes.

Assigning each sidechain to one of 18 categories means that it is possible to write a coded description of a protein structure as a message in an alphabet of 18 letters. Bowie, Lüthy and Eisenberg call this a ‘3D structure profile’. Not only is this reminiscent - in form - of writing the amino acid sequence as a message in an alphabet of 20 letters, but it means that software and algorithms developed for amino acid sequence data can be applied to ‘sequences’ of encoded structures. For example, one could try to align two distantly-related sequences by aligning their 3D structure profiles rather than their amino acid sequences.

How can one relate the encoded structure to the corpus of known sequences and structures? It is clear that some amino acids will be unhappy in certain kinds of sites; for example, an arginine would not be buried in

an entirely nonpolar environment. Other preferences are not so clear-cut, but Bowie, Lüthy and Eisenberg have derived a preference table from a statistical survey of a library of well-refined protein structures. They selected a set of 16 known structures, and encoded the environment of each residue in each structure. They also used sets of sequences homologous to the selected known structures, to increase the amount of data on the amino acids that occupy each of the sites.

Why are residues sensitive to these characteristics of their environment? Protein structures are stabilized by a number of factors, prominent among which are the burial of hydrophobic residues and the relatively dense packing of protein interiors, and the formation of internal hydrogen bonds by buried groups that, in entering the protein interior, must give up their hydrogen bonds to water. A native protein structure contains a self-consistent assembly of residues such that each sidechain contributes to providing a comfortable environment for its neighbors. It is not true, of course, that a unique set of sidechains is required to create this set of environments - if it were, proteins could not evolve - but some sets of residues are better than others in providing certain kinds of interactions, and statistically these are reflected in sets of preferences of certain amino acids for certain environments.

Use of 3D Profiles to Predict Folding Patterns

Suppose now that we are given a sequence and want to evaluate the likelihood that it takes up, say, the globin fold. From the 3D structure profile of the known sperm whale myoglobin structure we know the environment class of each position of the sequence. Consider a particular alignment of the unknown sequence with sperm whale myoglobin, and suppose that the residue in the unknown sequence that corresponds to the first residue of myoglobin is phenylalanine.
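Before following this example through, the encoding and the preference table described above can be made concrete with a minimal Python sketch. The accessibility cut-offs (40 and 114 Å²) are those quoted in the text. Everything else is an assumption made for illustration and is not specified in this article: the class labels, the way the six classes are divided by polarity, the polarity cut-offs, the pseudocount, and the log-odds form of the preference score.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SECONDARY = ("helix", "sheet", "coil")

# Accessibility cut-offs in square angstroms, as quoted in the text.
BURIED_MAX = 40.0
EXPOSED_MIN = 114.0

# Hypothetical polarity cut-offs and class split (3 buried, 2 partially
# buried, 1 exposed); the text says only that six classes are defined.
BURIED_POLAR_CUTS = (0.45, 0.58)
PARTIAL_POLAR_CUT = 0.67


def exposure_class(area, polar_fraction):
    """Assign one of six exposure/polarity labels (label names are made up)."""
    if area < BURIED_MAX:
        low, high = BURIED_POLAR_CUTS
        if polar_fraction < low:
            return "B1"
        return "B2" if polar_fraction < high else "B3"
    if area <= EXPOSED_MIN:
        return "P1" if polar_fraction < PARTIAL_POLAR_CUT else "P2"
    return "E"


def environment_class(area, polar_fraction, secondary):
    """Combine exposure/polarity with secondary structure: 6 x 3 = 18 classes."""
    if secondary not in SECONDARY:
        raise ValueError(f"unknown secondary structure: {secondary!r}")
    return (exposure_class(area, polar_fraction), secondary)


def preference_table(observations, pseudocount=1.0):
    """Build log-odds preferences from (amino_acid, environment) observations.

    Assumed scoring form: ln(P(aa | env) / P(aa)), smoothed with a pseudocount
    so that unobserved pairs receive a finite negative score.
    """
    pair_counts, env_counts, aa_counts = Counter(), Counter(), Counter()
    for aa, env in observations:
        pair_counts[(env, aa)] += 1
        env_counts[env] += 1
        aa_counts[aa] += 1
    total = sum(aa_counts.values())
    k = pseudocount * len(AMINO_ACIDS)
    table = {}
    for env in env_counts:
        for aa in AMINO_ACIDS:
            p_aa_given_env = (pair_counts[(env, aa)] + pseudocount) / (env_counts[env] + k)
            p_aa = (aa_counts[aa] + pseudocount) / (total + k)
            table[(env, aa)] = math.log(p_aa_given_env / p_aa)
    return table
```

A structure is then encoded by applying `environment_class` to each residue in turn, and the resulting list of classes plays the role of the 3D structure profile; a positive entry in the table means the amino acid is found in that environment more often than its overall frequency would suggest.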
The environment class in the 3D structure profile of the first residue of sperm whale myoglobin is: Exposed, no secondary structure. We can score the probability of finding phenylalanine in this structural environment class from the table of preferences of particular amino acids for this class. (The fact that the first residue of the sperm whale myoglobin sequence is actually valine is not used, and in fact that information is not directly accessible to the algorithm. Sperm whale myoglobin is represented only by the sequence of environment classes of its residues, and the preference table is averaged over proteins with many different folding patterns.) Extension of this calculation to all positions and to all possible alignments (not allowing gaps within regions of secondary structure) gives a score that measures how well the given unknown sequence fits the sperm whale myoglobin profile.

This evaluation of the unknown sequence in terms of its potential fit to a fold need not be restricted to myoglobin. Bowie, Lüthy and Eisenberg have constructed a library of protein folds, corresponding to

structures in the Protein Data Bank. This has the form of a set of 3D structure profiles. A new amino acid sequence can be compared with all members of this library in order to determine whether the protein with that sequence is likely to adopt any of the folds represented.

How well does the method work? Using the environment method, Bowie et al. constructed a 3D structure profile from sperm whale myoglobin which was able to distinguish myoglobins and (with lower scores) other globins, from unrelated proteins. The highest-scoring nonglobin sequence was 511th out of 544 in rank order in the database searched. Interestingly, sperm whale myoglobin was not the highest-scoring sequence, being bettered by 7 other myoglobins. This apparently anomalous result emphasises that the 3D profile does not retain detailed sequence information from the structure from which it was derived. Coding the environment involves a greater degree of abstraction than simple sequence matching, in which a single test sequence must give a better match to itself than to any other sequence.

The authors discuss additional examples, including cyclic AMP receptor protein, ribose binding protein, and actin. They show that they can detect relationships too tenuous to be picked up by simple sequence homology search. A good example is the detection of the similarity in fold between actin and the 70 kD bovine heat shock cognate protein HSC70.

Use of 3D Profiles to Assess Structures

We have pointed out that the 3D profile derived from a structure depends only very indirectly on the amino acid sequence. It is therefore meaningful to ask, not only whether other amino acid sequences are compatible with the given fold, but whether the score of the 3D profile for its own sequence is a measure of the compatibility of the sequence with the structure.
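The fold-recognition score described in the previous section, and the self-compatibility score raised here, can both be sketched with one small dynamic-programming routine. This is an illustrative sketch and not the authors' program: the function name `profile_score`, the linear gap penalty and its value are hypothetical. The only rule taken from the text is that gaps are not allowed within regions of secondary structure, which is approximated here by permitting gaps only at coil positions and at the two ends of the profile.

```python
import math


def profile_score(profile, sequence, table, gap=2.0):
    """Best global alignment score of `sequence` against a 3D structure profile.

    profile  : list of environment classes, each a (exposure, secondary) tuple
    sequence : string of one-letter amino acid codes
    table    : dict mapping (environment, amino_acid) to a preference score
    gap      : assumed linear penalty per skipped position or residue
    Returns -inf when no alignment satisfies the no-gap rule.
    """
    n, m = len(profile), len(sequence)
    # S[i][j]: best score for the first i profile positions and first j residues.
    S = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 0.0
    for j in range(1, m + 1):
        # Residues hanging off the start of the profile are penalised but allowed.
        S[0][j] = S[0][j - 1] - gap
    for i in range(1, n + 1):
        env = profile[i - 1]
        in_loop = env[1] == "coil"
        if in_loop:
            S[i][0] = S[i - 1][0] - gap  # skip a coil position
        for j in range(1, m + 1):
            # Place residue j in environment i.
            best = S[i - 1][j - 1] + table[(env, sequence[j - 1])]
            if in_loop:
                # Skip this coil position, or insert an extra residue after it.
                best = max(best, S[i - 1][j] - gap, S[i][j - 1] - gap)
            elif i == n:
                # Residues hanging off the end of the profile.
                best = max(best, S[i][j - 1] - gap)
            S[i][j] = best
    return S[n][m]
```

Scoring a structure against its own sequence is the same call with `sequence` set to the native sequence, and screening a new sequence against a library of folds amounts to calling the function once per profile and ranking the results.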
Naturally, if real sequences did not generally appear to be compatible with their own structures, one would be forced to conclude that a useful method for examining the relationship between sequence and structure had not been achieved. (Where would we be if the Rosetta stone had contained three different texts in three different languages?)

There is one case, however, when the question is serious and important. It does happen - fortunately only occasionally, and it is likely to occur less frequently from now on as a result of Lüthy, Bowie and Eisenberg’s work - that a structure determination contains qualitative errors in the folding pattern of the chain. A spectacular example was the Azotobacter vinelandii ferredoxin structure, in which a chain was traced through the mirror image of the electron density. But even in structures that have the right folding pattern in general, errors in tracing particular regions of the chain have occurred, usually when the available data are relatively poor. One example occurred within

Dr Eisenberg’s own laboratory, in the case of the small subunit of ribulose-1,5-bisphosphate carboxylase (RUBISCO), and it is to his credit that this example is presented with complete frankness.

An example of the kind of problem that can arise is the interchange of two β-strands; the trap for the crystallographer is that the electron density may reveal the strands themselves much more clearly than the loops connecting them. This can easily happen when - as is typical - the strands are largely buried in the interior of the protein, and well-ordered, but the loops between them are on the surface and subject to fluctuations in conformation. In this case the mainchain may be properly built into the regions of electron density corresponding to the strands, but they may be wrongly connected, which implies that they are not properly assigned in the sequence. However, the patterns of accessibility may be different from strand to strand - for example, a strand on the edge of a sheet may be more exposed - and the patterns of residue preference may thereby detect the sequence-structure mismatch.

Based on the scoring scheme for detecting similar folds discussed above, Lüthy, Bowie and Eisenberg demonstrate the general principle that ‘correct structures give 3D structure profiles that are compatible with their own sequences’. The data for this study are collected by taking the experimental structures in the Brookhaven data bank, plus some theoretical models, and measuring the score of each sequence to the 3D profile of the corresponding structure. Some of the theoretical models are wrong by design: they were constructed as test cases by copying the sequence of one protein onto the structure of another. They find three relatively clear-cut classes. Correct, well-refined structures have high scores. Structures that are entirely incorrect - by accident or by design - have low scores.
Structures that are largely correct but contain some errors are intermediate, but can be distinguished from the correct structures. In the case of a partially correct structure, a graph of the scores of the individual residues sometimes gives hints about where the errors may be. Needless to say, this provides very useful clues during the course of a structure determination! It requires that the structure determination has reached the stage where a complete or at least nearly-complete model has been created. It could also be used to test models, and, in principle, to search for correct models.

The fact that numerous incorrect structures were published, and made their way into the data bank, implies that no alternative method of assessing structures is equally reliable. There are a number of checks; some involving crystallographic data - notably the R-factor, which measures how well the structural model accounts for the measured X-ray data, but which is not a reliable test of structural correctness - and others involving stereochemical checks that depend only on the coordinates, including the quality of

hydrogen bonding, and the distribution of residues in a Ramachandran plot. Some of these checks require access to experimental data, which are not generally available.

Comparison with Other Methods

Granting the utility of the 3D structure profiles, one may nevertheless ask whether, for any of the specific tasks to which it has been applied, other methods perform as well or better. Consider the problem of identifying the family of proteins sharing a fold with a given protein or set of proteins; for example: extract all globins from the sequence data banks. The flexible pattern matching method(6,7) is a procedure in which a pattern, derived from a set of aligned globin sequences and using additional structural data, does a perfect job of separating globins from nonglobins. In a search of a data base containing 345 globin sequences (PIR Release 14), this method picked up all 345 globins as having higher scores than the first nonglobin. These results are not strictly comparable with those of Bowie, Lüthy and Eisenberg(2), since Barton and Sternberg set out to match all globins, while Bowie et al. based their search solely upon sperm whale myoglobin, and used a residue preference library based on proteins with many different folds. However, neither method picks up phycocyanin, a protein with a fold very similar to that of the globins, although it does not contain the conserved histidines that in globins ligate the haem group.

As for the question of assessing the correctness of structures, we have already noted that alternative criteria did not prevent the appearance and dissemination of errors that would have been picked up by the 3D profile check. It is not, of course, that the erroneous structures failed to arouse suspicion. But part of the problem is the natural reluctance of (some) scientists to criticize others’ work.
The 3D profile score approach therefore has another advantage: By providing an unambiguous, easily-measured, objective criterion for correctness, the degree of tact with which attention may be called to errors is greatly increased. Compare ‘The recent structure determined by Bloggs et al., looks a mite wonky to me ...’, with ‘The recent structure determined by Bloggs et al. does not fall within the accepted range of correct structures according to the criterion of Lüthy, Bowie and Eisenberg.’

Conclusion

In their efforts to abstract the properties of a protein

structure by coding a sequence of residue environments in a form independent of particular sequences, Bowie, Lüthy and Eisenberg have established a general framework for sequence-structure correlation, and applied it to interesting and important problems, including detection of similar folds, and assessing correctness of structures. The versatility and utility of their approach make it conducive to extensive testing and thereby to rapid enhancement.

And what are we to conclude about the title question? Apparently protein structure does determine amino acid sequence - at least among natural molecules (see Pastore and Lesk(8) for a discussion of this reservation) - but not uniquely. What Bowie, Lüthy and Eisenberg have shown is that the structure demands a sequence such that the sidechains are reasonably compatible with the environments into which they are asked to fit.

Acknowledgements

We thank Dr P. McLaughlin for helpful discussion. A.M.L. is grateful to the Kay Kendall Foundation for generous support, and to the University of Otago for the William Evans fellowship, during the tenure of which this work was performed.

References

1 Benner, S. A. and Gerloff, D. (1991). Patterns of divergence in homologous proteins as indicators of secondary and tertiary structure: A prediction of the structure of the catalytic domain of protein kinases. Adv. Enz. Reg. 31, 121-181.
2 Bowie, J. U., Lüthy, R. and Eisenberg, D. (1991). A method to identify protein sequences that fold into a known three-dimensional structure. Science 253, 164-170.
3 Lüthy, R., Bowie, J. U. and Eisenberg, D. (1992). Assessment of protein models with 3D profiles. Nature 356, 83-85.
4 Overington, J., Johnson, M. S., Sali, A. and Blundell, T. L. (1990). Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction. Proc. Roy. Soc. B (London) 241, 132-145.
5 Brändén, C.-I. and Jones, T. A. (1990). Between objectivity and subjectivity. Nature 343, 687-689.
6 Barton, G. J. (1990). Protein multiple sequence alignment and flexible pattern matching. Methods in Enzymology 183, 403-428.
7 Barton, G. J. and Sternberg, M. J. E. (1990). Flexible protein sequence patterns. A sensitive method to detect weak structural homologies. J. Mol. Biol. 212, 389-402.
8 Pastore, A. and Lesk, A. M. (1991). Brave new proteins: What evolution reveals about protein structure and what it cannot reveal. Curr. Op. in Biotech. 2, 592-598.

Arthur M. Lesk is at the Department of Haematology, University of Cambridge Clinical School, MRC Centre, Hills Road, Cambridge CB2 2QH, UK. D. Ross Boswell is at the Department of Pathology, Christchurch School of Medicine, Christchurch, New Zealand.

Does protein structure determine amino acid sequence?
