Structure prediction and modelling Mark B. Swindells Protein Engineering Research Institute, Osaka, Japan

Cracking the second fundamental code of molecular biology (how the tertiary structure of a protein is determined by its amino acid sequence) remains an elusive goal. However, the impetus to establish credible approximations, if not a definitive solution to this relationship, has never been greater. In the past year significant progress has been made through a series of novel approaches. This review describes the most important developments and outlines how they can be usefully employed by those whose specialization lies outside the field.

Current Opinion in Biotechnology 1992, 3:338-347

Introduction

Successful sequence determination represents a significant achievement in a research project. More importantly, this information can be used to search for putative relationships among the wealth of previously determined sequences and structures. Armed with such information, a more rational approach can be taken in the design of future experiments, particularly those involving mutagenesis protocols.

Developments in sequence analysis are well documented and the human genome project has reinforced the requirement for improvements in the efficiency and sensitivity of current comparison methods. It is also well known that the number of proteins whose structures have been determined at atomic resolution does not compare favorably with the number of sequences now available. Nevertheless, the past few years have witnessed developments in this area also. Although the number of structures that have been determined remains small, it has become apparent that many sequences can match an observed fold despite having no statistically significant sequence similarity using conventional alignment procedures. Most of the major advances of the past year make use of this information in order to propose the most likely fold for a test protein. I shall refer to this area as fold recognition.

Once a relationship between a test protein and a known fold has been established, the construction of a detailed three-dimensional model becomes possible. This uses the standard knowledge-based modelling procedures which are outlined in Fig. 1 and described below.

Sequence alignment

For researchers not intimately associated with sequence alignment techniques, there are a baffling number of methods now available. As a result, there is a tendency to stick with the familiar, rather than to try new approaches. In this section, I shall summarize the problems that different methods are trying to overcome and describe the most recent advances.

The detection of sequences with statistically significant similarities to a test protein is important, as it implies that the proteins have a similar fold and provides information about which residues might be important for maintaining structural integrity and functionality. When two proteins of at least 50 residues (determinations of statistical significance are not reliable for smaller proteins) have a high percentage residue identity (PRI), say more than 50%, the detection of a relationship and the alignment of the sequences is fairly trivial. In these cases the algorithms are not required to detect distant relationships and can concentrate on speed. Consequently, the methods that fall into this category use 'neat tricks' to speed up the search process, while maintaining sufficient sensitivity. Although one might expect the availability of more powerful computers to render this type of research redundant, the corresponding increases in the number of sequences available for comparison in fact make this area extremely important.

Background

Pairwise sequence alignment

When the researcher is only interested in sequences that are clearly homologous to one another, an algorithm designed for rapid alignment will provide

Abbreviations ECEPP/2--Empirical Conformational Energy Program for Peptides; LZS--leucine zipper sequence; PRI--percentage residue identity; SCR--structurally conserved region.


© Current Biology Ltd ISSN 0958-1669


[Figure 1: schematic of the six modelling stages. Recoverable labels: sequence of known structure (elements 1 to 4 joined by loops 1 to 3), sequence of test protein (with an insertion and a deletion), N terminus, and the mutation of alanine to phenylalanine.]

Fig. 1. Standard modelling procedure. Stage 1. Establish a relationship between the protein of interest and a known fold (e.g. a four-helix bundle) using either conventional sequence alignment or the more recent fold-recognition approaches. Stage 2. Align the sequence(s) of known structure with that of the protein to be modelled, taking care to limit insertions and deletions to the loop regions. Stage 3. Using the coordinates of the known structure, construct a model for the structurally conserved regions (SCRs). The lengths of these SCRs will vary with percentage residue identity. However, they should at the very least include regions from the elements of regular secondary structure and may in certain cases include conserved loops. Stage 4. Search a structural database for loops that possess the appropriate end point geometries, the correct number of intervening residues and, where appropriate, key residues such as glycine and proline. Attach the preferred conformation to the framework. Stage 5. Model side chains of the new protein using the sequence of the test protein, the orientations observed in the known structure and other rules governing side-chain placement. Take care to avoid steric clashes with the main chain and other side chains. Stage 6. Remove the remaining stereochemical violations and improve overall geometry using energy minimization. Finally, assess the quality of the modelled structure using procedures such as accessibility profiles and hydrogen bonding criteria.

the necessary capabilities. There are now numerous methods available and all major databases come bundled with adequate software for searching their entries. Two particularly popular packages are FASTA [1,2] and BLAST [3], which can detect fairly weak sequence similarities.

If the aim is to locate distant relationships, speed must give way to sensitivity. The major problem here is that the only way of determining whether an alignment is globally significant is by comparing the actual alignment score with that for randomized sequences of equivalent composition and length. When the actual score is no higher than that for the randomized alignment, a relationship cannot be established even though one might exist. Furthermore, even when distant relationships have been reliably detected, the accuracy of the corresponding alignment is not guaranteed at all positions. Parameters such as gap penalties exacerbate the problem and unfortunately hindsight is the only way of knowing whether a particular choice was prudent.

One way to refine such methods would be to alter the scoring matrix used to compare residue similarities. Despite a number of attempts in this area, however, the results have been disappointing and most researchers still use matrices based on the work of Dayhoff [4]. A more promising alternative is to delineate reliably aligned regions within a complete alignment. Such an approach has been implemented by Vingron and Argos [5•]. The procedure is elegant and, what is more, it simultaneously tackles the problems faced when choosing suitable gap penalties for an alignment. This is achieved by finding regions in a sequence alignment which remain unaffected by parametric alterations. The procedure uses conventional alignment methods (i.e. dynamic programming, a scoring matrix and gap penalties) but applies the algorithm twice, starting at each terminus of the protein chain in turn. This modification enables the display of a range of suboptimal alignments (regions whose scores are within certain predefined limits of the optimal alignment) that can be used to determine regions that are reliably aligned. The efficacy of this algorithm was assessed using various sequence families. Although the method performed well in all test cases, one particularly notable success was for the alignment of erythrocruorin with sperm whale myoglobin, where all structurally aligned residue pairs were correctly aligned. This result is impressive, given the low PRI of 22% between the two sequences.
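The randomization test described in this section, in which the actual alignment score is compared with scores for shuffled sequences of the same composition and length, can be sketched as follows. This is only a minimal illustration under stated assumptions: it uses a simple identity score (match, mismatch and a linear gap penalty) in place of a Dayhoff-style matrix, and the function names and parameter values are hypothetical rather than taken from any of the packages cited.

```python
import random

def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    Assumption: identity scoring stands in for a real substitution matrix.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def shuffled(seq, rng):
    """Return a random permutation of seq (same composition and length)."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)

def z_score(a, b, trials=200, seed=0):
    """Standard deviations by which the real score exceeds shuffled scores."""
    rng = random.Random(seed)
    actual = global_score(a, b)
    scores = [global_score(a, shuffled(b, rng)) for _ in range(trials)]
    mean = sum(scores) / trials
    sd = (sum((s - mean) ** 2 for s in scores) / (trials - 1)) ** 0.5
    if sd == 0:
        return 0.0  # degenerate case: every shuffle gave the same score
    return (actual - mean) / sd

if __name__ == "__main__":
    # Hypothetical toy sequences, used only to exercise the functions.
    print(z_score("MKVLAAGIVGLLLAQWERTYHDS", "MKVLSAGIVGLLIAQWDRTYHES"))
```

A score that is no higher than the shuffled background gives a z-score near zero, which corresponds to the case described above in which a relationship cannot be established even though one might exist.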

Multiple sequence alignment

If the methods used to compare two sequences were also applied to the alignment of many sequences, the computation would rapidly become unrealistic. This is because the matrix from which an alignment is derived has the same number of dimensions as the number of sequences to be aligned. Therefore, although a pairwise alignment procedure is only an N² problem, the alignment of four sequences would rapidly expand into an N⁴ task. As a result, multiple alignment methods tend to be based on the iteration of pairwise alignment. One popular method which employs this approach is that of Barton and Sternberg [6]. As one might expect, multiple alignment procedures are subject to the same deficiencies as the pairwise methods. In a second paper, Vingron and Argos [7••] describe methods to detect reliable regions in multiple sequence alignments. Their approach searches for consistent similarities by applying a simple matrix multiplication procedure to the dot matrices generated for each pairwise comparison. This method is particularly useful as it does not require the use of gap penalties. For more detailed reviews of the developments in sequence alignment techniques, the reader should refer to Argos et al. [8] and the recently published volume of Methods in Enzymology [9].
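To give a feel for the growth from an N² to an N⁴ problem, the short sketch below counts the cells in the dynamic programming matrix for a given set of sequence lengths. The function name and the example length of 100 residues are illustrative assumptions, and the count ignores the additional cost of evaluating each cell.

```python
def dp_cells(lengths):
    """Number of cells in a full dynamic programming matrix.

    One dimension per sequence, each of size (length + 1) to allow for
    the leading gap row or column.
    """
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells

if __name__ == "__main__":
    for k in (2, 3, 4):
        # Hypothetical case: k sequences of 100 residues each.
        print(k, dp_cells([100] * k))
```

Under these assumptions, two sequences of 100 residues need roughly ten thousand cells, whereas four such sequences need roughly one hundred million, which is why iterated pairwise alignment is preferred in practice.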

Prediction of low level structural information

If no relationship with a sequence of known structure is detected, one must resort to predicting low level structural information. Over the past twenty years innumerable groups have tried to predict secondary structure from sequence information. Perhaps the best known methods are those of Chou and Fasman [10], Garnier et al. [11] and, more recently, Gibrat et al. [12]. However, despite the wide range of approaches developed, none is able to predict secondary structure to an accuracy of more than 65%, which is insufficient to begin the construction of a three-dimensional model.

This situation can improve dramatically when a family of aligned sequences with sufficient diversity is available. This is because insertions and deletions, when available, can be used to distinguish loop regions from those of regular secondary structure, and the pattern of sequence variability at each aligned position can be used to refine the prediction of secondary structure type. This idea is not particularly new and has been implemented in a semi-automatic prediction method by Zvelebil et al. [13], but until now the approach has not been explored in depth. This situation could well change, however, as such an approach has been used to excellent effect by Benner and Gerloff [14••] in their prediction of the catalytic domain of the protein kinases. In fact, when the tertiary structure for this protein was finally determined by X-ray crystallography, 13 out of the 16 loop regions had been correctly identified, and a similarly impressive result was achieved for the regions of regular secondary structure. Over-prediction was minimal and there was only one significant error in the core of the molecule. Although the predictive accuracy remains around the 65% level, the opportunity for extending the prediction to a tertiary fold is much improved using this approach.

The availability of 79 sequences with sufficient diversity certainly helped Benner and Gerloff [14••] to parse the loops with remarkable accuracy, but unfortunately such abundance is rare. In addition, current automated methods cannot accurately align sequences that have less than 35% PRI without reliable anchor points (such as known active site residues) and manual intervention, despite the developments listed in the previous section. Given the low PRIs between many protein kinase sequences and the remarkable accuracy of the resulting alignment, one must conclude that specialist knowledge of the protein kinase family was used to guide the alignment process. Of course, this is only one example and it is now well known that meaningful assessments of new procedures can only be achieved using automated methods, tested over a large data set of proteins. Nevertheless, it remains an exciting development.

Given the limitations of conventional secondary structure prediction, there has been renewed interest in analysing the different regions occupied in a φ/ψ plot. Gibrat et al. [15•] have used the information theory approach, previously employed in secondary structure prediction, to predict the φ/ψ regions of residues. Results of this work were similar to those for secondary structure prediction, with an accuracy of 65% being reported again. The inevitable conclusion of this work is that no low level structural information can be assigned with 100% accuracy from local sequence information, and longer-range tertiary interactions must be considered. Another analysis of φ/ψ angles has concentrated on their application to sequence alignment [16•]. The idea in this case is to substitute the standard Dayhoff matrix with a structure-derived correlation matrix containing φ/ψ information. Niefind and Schomburg [16•] suggest that their matrix, which is derived solely from torsion angle preferences, may be more effective than the Dayhoff matrix, which mixes both genetic and structural information. However, given that the Dayhoff matrix has retained a competitive edge over other structurally derived matrices and has withstood rigorous testing for over a decade, it is unlikely that researchers will change allegiance.
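The use of insertions and deletions in an aligned family to separate loop regions from regular secondary structure, described earlier in this section, can be illustrated with a small sketch. It is not the procedure of any of the cited authors: the gap-fraction threshold, the function names and the toy alignment are all assumptions made for illustration.

```python
def gap_fraction_per_column(alignment):
    """Fraction of sequences with a gap ('-') at each alignment column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(alignment)
    return [sum(row[i] == "-" for row in alignment) / n_seqs
            for i in range(length)]

def candidate_loops(alignment, threshold=0.2):
    """Column ranges (start, end), end exclusive, with many gaps.

    Assumption: columns whose gap fraction reaches `threshold` are treated
    as candidate loop positions; the threshold value is arbitrary.
    """
    fractions = gap_fraction_per_column(alignment)
    regions, start = [], None
    for i, fraction in enumerate(fractions):
        if fraction >= threshold and start is None:
            start = i
        elif fraction < threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(fractions)))
    return regions

if __name__ == "__main__":
    # Hypothetical three-sequence alignment.
    toy = ["MKVLA--GHTWE",
           "MRVLAPSGHTWE",
           "MKILA--GHSW-"]
    print(candidate_loops(toy))
```

With only a handful of sequences such a criterion is crude, which is consistent with the point made above that a large and diverse family is needed before loops can be parsed reliably.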

Ab initio approaches to protein folding

Although it has been recognized for a long time that progress can only be achieved through the consideration of long range interactions, there has been little success in developing procedures to make use of such information. Generally, such approaches have concentrated on computationally intensive ab initio energy minimization and molecular dynamics simulations, but the results have been disappointing. As the aim of this review is to concentrate only on developments that may be of immediate use, I shall only provide a limited description of these areas.

Bovine pancreatic trypsin inhibitor is frequently used to test such procedures as it is small and contains three disulphide bridges that can be used to limit the search space. Simon et al. [17•] try to predict the structure of this protein by employing a build-up procedure which starts with tripeptides in a limited number of minimized conformations, chosen by considering their energies as calculated by the Empirical Conformational Energy Program for Peptides (ECEPP/2) potential. From this starting point the tripeptides are joined, minimized and reassessed in an iterative manner until a complete structure has been formed. Full advantage is taken of the natural limitation for overlapping segments to have complementary conformations, but although the method is elegant, the results are disappointing despite the inclusion of disulphide bond information. Another way of constructing such models is through lattice polymer simulations (for an example, see Chan and Dill [18]). This method has become popular as lattice simulations are computationally efficient. Previous work using this type of model has suggested that compactness of form may present an overwhelming driving force in protein folding. In a recent paper, however, Gregoret and Cohen [19•] compare the results of Chan and Dill [18] with those from a method that is unaffected by lattice constraints. Their conclusions are that although compactness of form is important, the magnitude of this effect is less than previously proposed. It would seem, therefore, that the projection of a complex set of protein coordinates onto a simplified lattice is currently flawed by the type of lattice used. Although one could alter the lattice used, one may then end up trying to answer a question which is more complicated than the folding problem itself.

Fold recognition

So what other methods are there to consider tertiary interactions? One way forward is to take advantage of the observation that three-dimensional structure is more conserved than sequence. In fact, there are now many examples of functionally similar and dissimilar proteins with the same fold, despite having PRIs of less than 20%. If the stereochemistry of the polypeptide chain imposes a limit on the number of folds available, the problem of prediction can be rephrased as: 'Can one identify the correct fold given a sequence?'.

Bowie et al. [20•] have addressed this question with encouraging results. Their method uses procedures more commonly found in the world of sequence alignment, but now, instead of matching a sequence against a sequence, it must match a sequence against a structure. Bowie et al. chose to describe a fold as a string of structural environments. Eighteen environments were defined, consisting of six different accessibility states for each of the secondary structure classifications: helix, strand and coil. Using a matrix derived from known structures and homologous sequences, the probability of finding different residue types in each environment was quantified. Dynamic programming was used to find the best match of a test sequence with a structural profile. The method was tested on numerous families of proteins and was able to identify a relationship between actin and the heat shock proteins. However, no relationship with the topologically similar hexokinases was identified.

Of course this method can only be used when the fold of the test protein has been observed in a known structure. Finkelstein and Reva [21•] suggest a solution to this problem by generating model folds. In a limited test, a lattice representation of an eight-stranded, antiparallel, β-barrel structure was constructed. Various sequences were then tested for suitability using a combination of molecular field theory and one-dimensional statistical mechanics. Using an iterative search procedure to find the optimal threading for each sequence, two different tests were performed. In the first, the aim was to detect the optimal alignment when given the correct loop connectivities between the elements of regular secondary structure. In the second test, a search for the correct loop connectivity was performed by threading the sequence through all 60 possibilities for this fold. The results for each approach were equally impressive, with the majority of non-native patterns being strictly rejected.

The approaches described here represent a significant development in the field of structure prediction as they enable the detection of structural similarities at much lower PRIs than was possible previously. However, this area is still relatively new and it is likely that the coming year will see the development of more sophisticated techniques that have the same general aim. To achieve their full potential, these methods must be able to detect similar folds within much weaker sequence similarities. For example, the phycocyanins have the same fold as the globins, and soybean trypsin inhibitor has the same β-barrel fold as interleukin-1, despite having a PRI of only 7%.
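The idea of describing a fold as a string of structural environments and matching a sequence against it by dynamic programming, as attributed to Bowie et al. in this section, can be sketched in a few lines. The sketch below is an assumption-laden toy and not their method: the six accessibility states are represented by placeholder integers, the scoring function is a hypothetical hydrophobic/polar rule rather than a table derived from known structures, and the gap penalty is arbitrary.

```python
from itertools import product

SECONDARY = ("helix", "strand", "coil")
ACCESS_STATES = tuple(range(6))  # placeholder labels for six accessibility states
ENVIRONMENTS = tuple(product(SECONDARY, ACCESS_STATES))  # 3 x 6 = 18 classes

HYDROPHOBIC = set("AVLIMFWC")  # assumed grouping, for illustration only

def toy_score(residue, environment):
    """Hypothetical score: +1 when residue type suits the environment.

    Assumption: states 0-2 are treated as buried and 3-5 as exposed.
    """
    _, access = environment
    buried = access <= 2
    return 1.0 if (residue in HYDROPHOBIC) == buried else -1.0

def align_to_profile(sequence, environments, score=toy_score, gap=-1.0):
    """Best global dynamic programming score of a sequence against a fold."""
    prev = [j * gap for j in range(len(environments) + 1)]
    for i in range(1, len(sequence) + 1):
        curr = [i * gap]
        for j in range(1, len(environments) + 1):
            pair = score(sequence[i - 1], environments[j - 1])
            curr.append(max(prev[j - 1] + pair, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    # Hypothetical three-position fold: buried helix, exposed helix, buried strand.
    fold = [("helix", 0), ("helix", 5), ("strand", 1)]
    print(align_to_profile("LKV", fold), align_to_profile("KLK", fold))
```

In a realistic setting the scoring function would be replaced by residue preferences estimated from known structures, and candidate folds would be ranked by comparing scores against a suitable background.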

Model building

Given that a relationship has been established with a previously observed fold and that an accurate alignment has been achieved, the next stage is to construct a full set of model coordinates in three dimensions. The procedures used to perform such tasks are now well known and were discussed in detail in my review last year [22]. Given that loop regions are generally more susceptible to conformational change than the regions of regular secondary structure, the first step is to define a set of structurally conserved regions (SCRs) onto which more appropriate loops can be attached. Inevitably, this is an over-simplification, as in practice the size of the SCRs will depend on the PRI between the two sequences. To take extreme examples of the modelling problem, if only one residue differed between the sequences, the SCRs would probably consist of all regions except for those sequentially within a few residues of the mutated position and perhaps those that are close by in three dimensions. Conversely, if a similarity has been detected between two proteins with low PRIs, it is likely that even the regular secondary structures will have significant structural shifts. In these cases the model will be even more difficult to construct and probably less reliable.

It is difficult to construct reliable models at low PRIs by hand as one must continually check that any alterations do not violate polypeptide chain stereochemistry. This is particularly difficult when attaching the loop regions to the SCRs and mutating the side chains to the residues observed in the new sequence. Because of these problems, various groups have developed sophisticated modelling packages that automatically check for violations and make appropriate alterations. One of the first programs to achieve this was Composer [23,24], which aims to provide a realistic model with the minimum manual intervention. Although the method works well for PRIs down to 40%, it lacks the necessary sophistication to allow SCRs to vary in a realistic manner, and attempts to circumvent this by applying principles such as distance geometry have not been as successful as was hoped initially.

Until recently, the researcher would have had to resort to manual modifications using one of the many graphics packages available, while drawing on expert knowledge as well as extreme patience. However, a new modelling package called 'What if' [25•] alleviates many of these problems by integrating fast routines for building loops, mutating side chains, regularizing geometry, searching databases and superposing structures (as well as innumerable other capabilities) with a graphics interface that enables frequent manual intervention. The structural superposition tool is particularly useful [26••] as it is able to detect structural similarities in coordinate sets without any need to define starting positions. Although this concept is not new, its speed and integration within this suite of programs makes it particularly useful. Vriend and Sander [26••] document a previously undetected similarity between ubiquitin and (2Fe-2S) ferredoxin structures as evidence of their method's suitability for the task.

Naturally, observations of such structural similarities reiterate the importance of algorithms involved in fold recognition. This package is perhaps the first to be designed with the protein modeller in mind and owes its success not only to the main author but also to many other researchers who generously provided their code. The 'What if' package is available free of charge to academic users and at a modest price to commercial organisations. Towards similar aims, Huysmans et al. [27••] describe a relational database for macromolecular structures, which is able to interface with sequence databases such as SWISS-PROT [28], as well as their own graphics package. Both 'What if' [25•] and the relational database [27••] have relative advantages, with the latter concentrating on providing comprehensive solutions to all types of modelling problems.

Assessing the accuracy of a model structure

Once a model has been constructed, assessment of its potential reliability is required. Since the initial reports by Hendlich et al. [29] that a potential of mean force, derived from proteins of known structure, could be used to distinguish native conformations, this field has seen further developments. One method originates from the group who devised profile-based fold recognition [30•]. This should come as no surprise, as the aim of both is to detect whether the test sequence is suitably modelled onto the fold. The major difference is that modelling errors are usually more subtle than a gross misrepresentation of fold. In a different approach, Holm and Sander [31•] compare full model coordinates, generated using a database algorithm and an initial set of Cα coordinates, with those derived from the original electron density. The authors suggest that in regions where the two models differ, it is frequently possible that the original model derived from electron density is in error. In particular, they focus on data sets that have unusually large numbers of peptide flips and chain breaks, or abnormally low numbers of backbone hydrogen bonds.

Whilst these results are encouraging, one must remember that although the methods can frequently detect where a model structure is wrong, they cannot guarantee that it is correct.

Hydrophobicity

The concept of hydrophobicity and its contribution to protein stability continues to interest many groups of researchers. One paper on this subject [32•] questions the suitability of approximating the hydrophobic core to that of a liquid hydrocarbon, such as cyclohexane. Such a question is relevant, as numerous scales, previously developed to describe the hydrophobicity of each amino acid type, have investigated the partition coefficients between water and organic solvents exhibited by each residue. By taking two residues, which were buried and apparently in contact within the bacteriophage gene V protein (the structure of which has been determined crystallographically), a comprehensive set of double mutations was performed. From estimations of protein stability, it was suggested that all mutations destabilized the structure, including a simple positional exchange of the original two residues. However, mutations involving polar residues were less destabilizing than predicted using a conventional cyclohexane model. In addition, the energy contributions from each mutation in the pair were found to be additive. The results suggest that current scales do not adequately describe the concept of hydrophobicity. Furthermore, it would appear that, at least in this example, the two residues do not interact despite their apparent proximity in the crystal structure. A suitable model for packing protein interiors must, therefore, take into account both polar interactions and site-dependent packing energies.

In a similar paper, Lim and Sauer [33•] investigate the effects of mutating three hydrophobic residues (putatively in contact with one another) within the five residue subset leucine, valine, isoleucine, methionine or phenylalanine. The results show that 70% of the isolated mutants were active but only two retained the stability and activity of the wild type. Thus, although the gross structure is tolerant towards core mutations which maintain hydrophobicity, the 'detailed' structure is perturbed and this almost always has an adverse effect on functionality. Taken together, these papers suggest that although the core and the fold are tolerant towards certain mutations that maintain hydrophobicity, the biological activity has often been optimized in Nature.

Helix stability

It is now well known that helices possess a substantial dipole. A vogue theory of the late 1970s was that the field resulting from the dipoles of the individual peptide groups could be approximated to the field produced by two charges of opposite sign, separated by a distance equivalent to the helix length. The net effect would be similar to that produced by a capacitor in a simple electronics device. Over the intervening years this concept has lost popularity and in their most recent paper [34•], the Warshel group provide evidence that charge stabilization at helix termini derives from local peptide dipoles, rather than the helix macrodipole. Their calculations agree with data from two experimental studies, and indicate that the first turn of the helix accounts for most of the observed effect. This paper reinforces the importance of modelling long-range electrostatic effects with care and highlights the inadequacies of the simple 'dielectric constant' model, used in most commercial modelling packages.
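The two-charge approximation of the helix dipole mentioned in this section can be made concrete with a short calculation. If n peptide dipoles of magnitude μ are assumed to lie along the helix axis, the total dipole is nμ; two opposite charges ±q separated by the helix length n·d have dipole q·n·d, so q = μ/d regardless of helix length. The sketch below evaluates this under assumed values that are not given in the text: a peptide dipole of about 3.5 debye and a rise of 1.5 Å per residue.

```python
DEBYE_IN_E_ANGSTROM = 0.2082  # 1 debye expressed in (elementary charge x angstrom)

def equivalent_end_charge(peptide_dipole_debye=3.5, rise_per_residue=1.5):
    """Magnitude, in units of e, of each charge in the two-charge model.

    Assumptions: every peptide dipole is perfectly aligned with the helix
    axis, and the default dipole and rise values are textbook estimates.
    """
    return peptide_dipole_debye * DEBYE_IN_E_ANGSTROM / rise_per_residue

if __name__ == "__main__":
    print(round(equivalent_end_charge(), 2))
```

Under these assumptions each terminus carries roughly half an elementary charge, which is the simple picture that the calculations discussed above call into question.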

Matching molecular surfaces

If one is lucky enough to have two molecules of known structure that are known to interact, the next step is to investigate the details of this interaction. Jiang and Kim [35•] describe a hierarchical method to dock two molecules, in which each molecular surface is summarized using a series of small cubes, with surface normals related to each atom. Using this simplification, shape complementarity can be investigated using a global search strategy. Matches that are devoid of steric clashes can then be assessed for further suitability using methods such as electrostatic complementarity. These two steps drastically reduce the number of solutions, leaving a small subset to be screened in detail. Tests suggest that the accommodation of limited conformational change, a capability which is vital if such methods are going to appeal to a wider audience, was successful. However, even with such simplifications it remains computationally intensive and despite the real intellectual challenge posed, the current state of the art is not sufficient to recommend routine application.
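The general idea of a grid-based search for shape complementarity, in which poses with steric clashes are discarded and the remainder are ranked by surface contact, can be shown with a deliberately simplified sketch. It is not the method of Jiang and Kim: it works in two dimensions, searches translations only (no rotations, surface normals or conformational change), and all names and the toy shapes are assumptions.

```python
def neighbours(cell):
    """The four edge-sharing neighbours of a grid cell."""
    x, y = cell
    return ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))

def score_pose(receptor, ligand, shift):
    """Contact count for a translated ligand, or None on a steric clash."""
    dx, dy = shift
    moved = {(x + dx, y + dy) for x, y in ligand}
    if moved & receptor:
        return None  # overlapping cells: reject the pose
    return sum(1 for cell in moved for n in neighbours(cell) if n in receptor)

def best_translation(receptor, ligand, reach=3):
    """Exhaustive search over translations within +/- reach grid units."""
    best = None
    for dx in range(-reach, reach + 1):
        for dy in range(-reach, reach + 1):
            contacts = score_pose(receptor, ligand, (dx, dy))
            if contacts is not None and (best is None or contacts > best[0]):
                best = (contacts, (dx, dy))
    return best

if __name__ == "__main__":
    # Hypothetical L-shaped receptor and single-cell ligand.
    receptor = {(0, 0), (1, 0), (0, 1)}
    ligand = {(0, 0)}
    print(best_translation(receptor, ligand))
```

Even this toy search grows quickly once rotations and a third dimension are added, which gives some feel for why the full problem remains computationally intensive.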

Applications of structure prediction and modelling techniques

This review has concentrated on developments in methodology for predicting protein structure. As the field is now showing signs of maturity, the number of papers describing the use of such techniques as a standard tool in an overall experimental strategy is also increasing. The papers I have chosen to discuss in this section highlight the variety of research areas that harness the power of a modelling component.

The A2 subunit of crustacyanin, an astaxanthin-binding protein from the lobster carapace, exhibits weak sequence similarity (25% residue identity) with the retinol binding superfamily [36]. This family of proteins is of particular interest as the spectral shifts observed on ligand-binding do not seem to result from a conformational change in the ligand. Construction of this model was complicated by the observation that the crustacyanin A2 sequence is most similar to those in the β-lactoglobulin family, while retaining a disulphide bonding pattern more indicative of the porphyrin-binding subfamily. Although this work will provide a useful working model for the study of spectral shift phenomena, there will inevitably be significant errors at such a low PRI. It is therefore exciting to see that the same group is conducting a full crystallographic analysis.

A similar degree of difficulty is experienced when modelling human aromatase cytochrome P-450 using the structure of bacterial cytochrome P-450cam [37]. Nevertheless, a model was built of the active-site region and two regions predicted to be close to the substrate-binding pocket were mutated. Both sets of mutants reduce activity and lend some support to the model.

Although short repeating sequence motifs are not frequent in globular proteins, one notable exception is the leucine zipper sequence (LZS), in which leucine side chains positioned in a regular manner along the protein chain interact in a zip-like manner. In an elegant example of modular design [38], a 35-residue LZS was taken from yeast transcription factor GCN4 and appended to the monomeric maltose-binding protein, MalE. The hybrid protein MalE-LZS was efficiently exported into the periplasmic space of Escherichia coli, where it was found as a dimer, presumably held together by the leucine zipper tail.

Interleukin-4, a member of the fast growing group of leukocyte communication proteins, has been modelled using a combination of circular dichroism spectroscopy, secondary structure prediction, disulphide bridge constraints and known topologies for four-helix bundles [39]. Interleukins are of major therapeutic importance and models such as these are of use to a wide range of research groups. In this paper, a good logical approach to tertiary structure prediction was employed and the proposals seem to be in line with preliminary NMR studies, which have also interpreted the protein as a four-helix bundle. However, the model and experimental topologies differ and the correct solution remains unclear, as there are currently insufficient distance constraints in the loop regions of the experimental structure. Although Pastore et al. [40] suggest various ways around the errors encountered during NMR-determination, involving the combined use of energy evaluations, φ/ψ plots, chirality and solvent accessibility, none of these methods can guarantee an accurate structure and the best solution seems to be to collect more data.

The study of entire protein families, when available, can be extremely beneficial when considering an overall strategy for protein engineering. In fact, one of the seminal references to homology modelling [41] details both sequence and structural variations within the serine proteases. It comes as no surprise therefore that other groups should perform similar analyses when appropriate. Recently, Siezen et al. [42] have reviewed subtilisin-related proteinases, for which there are now over 40 known sequences and four structures. In their paper, the structural framework for the catalytic domain is identified and variations in the variable regions are neatly summarized. This information is then used to suggest a strategy for introducing disulphide bonds in order to increase thermal stability.

No review would be complete without the mention of antibody modelling. In fact, this area is now so detailed that a whole review could easily be devoted to describing the specific methodologies employed. Although this year has not seen any further developments in modelling principles, the applications of such methods show no sign of decreasing. In one paper [43], antibodies raised against morphine were sequenced, modelled and then used to predict the binding site for the inhibitor in the hope that the antibody-combining site would be similar to the morphine receptor. Unfortunately, few such inferences could be drawn despite the efforts made by Kussie et al. [43].

Modelling strategies usually concentrate on predicting the conformation of the six hypervariable loops, as they vary more than the framework regions. Nevertheless, frameworks do vary and these alterations can have far-reaching effects. When mouse monoclonal antibody mAb425 was humanized by grafting the mouse complementarity-determining regions onto the human framework [44], reduced avidity led the authors to investigate the influence of the framework on the structure of the combining site. Using a model of the humanized antibody, six residues of particular interest were identified on the framework and mutants were subsequently constructed. The mutants showed a wide range of avidities for the antigen, despite complete conservation in the loop regions. Two mutations were found to be particularly important, one probably through its direct interaction with the antigen and the other by influencing the structure of a complementarity-determining region loop.
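Several of the models discussed in this section are judged against the percentage residue identity (PRI) between the target sequence and its template, with values near 25% flagged as unreliable. As a minimal sketch of how such a figure might be computed from a pairwise alignment, the Python function below counts identical residues over aligned, non-gap positions. The function name, the gap character and the choice of denominator (aligned pairs only, rather than alignment length or the shorter sequence) are assumptions made for illustration; the papers reviewed here may define PRI differently.

```python
def percent_residue_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Return the percentage of aligned residue pairs that are identical.

    Both inputs are rows of the same pairwise alignment, so they must have
    equal length. Columns containing a gap in either row are skipped
    (an assumed convention; other denominators are also in use).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0   # columns where both rows carry a residue
    identical = 0  # columns where those residues match

    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == gap or res_b == gap:
            continue
        compared += 1
        if res_a == res_b:
            identical += 1

    if compared == 0:
        raise ValueError("no aligned residue pairs to compare")
    return 100.0 * identical / compared


if __name__ == "__main__":
    # Toy alignment rows, not taken from any protein discussed in the text.
    print(percent_residue_identity("ACDEFGHIKL", "ACDQFGHVKL"))
    print(percent_residue_identity("AC-DE", "ACQDF"))
```

Because the denominator changes the result, any PRI quoted for a model is only comparable with values computed under the same convention and from the same alignment.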

Conclusion

The emergence of new ideas for predicting protein structure promises a great deal and in this review I have described a range of developments. The most important advances this year have clearly come in fold recognition. The preliminary results suggest that although we are still a long way from understanding protein folding from first principles, it may soon be possible to recognise three-dimensional structures solely from sequence. If realised, rational design will take another step towards its ultimate goal.

Acknowledgement

The author would like to thank Professor Janet Thornton for assistance in preparing this review.

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest

1. PEARSON WR, LIPMAN DJ: Improved Tools for Biological Sequence Comparison. Proc Natl Acad Sci USA 1988, 85:2444-2448.

2. PEARSON WR: Rapid and Sensitive Sequence Comparison with FASTP and FASTA. Methods Enzymol 1990, 183:63-98.

3. ALTSCHUL SF, GISH W, MILLER W, MYERS EW, LIPMAN DJ: Basic Local Alignment Search Tool. J Mol Biol 1990, 215:403-410.

4. DAYHOFF MO, SCHWARTZ RM, ORCUTT BC: Transfer RNA. In Atlas of Protein Sequence and Structure. Washington DC: National Biomedical Research Foundation; 1978:345-358.

5. •• VINGRON M, ARGOS P: Determination of Reliable Regions in Protein Sequence Alignments. Protein Eng 1990, 3:565-569.
This paper describes a method that automatically delineates reliably aligned regions within a sequence alignment. The procedure simultaneously tackles the problems faced when choosing suitable gap penalties for an alignment by detecting regions which remain unaffected by parametric alterations. The procedure applies the dynamic programming algorithm twice, starting at each terminus of the protein chain in turn. This enables the display of a range of regions whose scores are within predefined limits of the optimal alignment. These can subsequently be used to determine which regions are reliably aligned.

6. BARTON GJ, STERNBERG MJE: Evaluation and Improvements in the Automatic Alignment of Protein Sequences. Protein Eng 1987, 1:89-94.

7. •• VINGRON M, ARGOS P: Motif Recognition and Alignment for Many Sequences by Comparison of Dot-matrices. J Mol Biol 1991, 218:33-43.
A method is described for detecting reliably aligned regions in multiple sequence alignments. The procedure applies a simple matrix multiplication to the dot matrices generated for each pairwise comparison.

8. ARGOS P, VINGRON M, VOGT G: Protein Sequence Comparison: Methods and Significance. Protein Eng 1991, 4:375-383.

9. DOOLITTLE RF (ED): Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences. Methods Enzymol 1990, 183.

10. CHOU PY, FASMAN GD: Prediction of Secondary Structure of Proteins from Their Amino Acid Sequence. Adv Enzymol 1976, 47:45-147.

11. GARNIER J, OSGUTHORPE DJ, ROBSON B: Analysis of the Accuracy and Implication of Simple Methods for Predicting the Secondary Structure of Globular Proteins. J Mol Biol 1978, 120:97-120.

12. GIBRAT J-F, GARNIER J, ROBSON B: Further Developments of Protein Secondary Structure Prediction Using Information Theory. J Mol Biol 1987, 198:425-443.

13. ZVELEBIL MJ, BARTON GJ, TAYLOR WR, STERNBERG MJE: Prediction of Protein Secondary Structure and Active Sites using Alignment of Homologous Sequences. J Mol Biol 1987, 195:957-961.

14. •• BENNER SA, GERLOFF D: Patterns of Divergence in Homologous Proteins as Indicators of Secondary and Tertiary Structure: the Catalytic Domain of Protein Kinases. Adv Enz Reg 1991, 31:121-181.
The catalytic domain of the protein kinase family was predicted using information from a diverse set of aligned sequences. The procedure parsed the alignment for insertions and deletions and used them to infer the positions of the loops. The remaining regions were subsequently scanned for patterns indicative of helices and strands. When the tertiary structure for this type of protein was finally determined by X-ray crystallography, 13 out of the 16 loop regions had been correctly identified and a similarly impressive result was achieved for the regions of regular secondary structure.

15. GIBRAT J-F, ROBSON B, GARNIER J: Influence of the Local Amino Acid Sequence upon the Zones of the Torsional Angles φ and ψ Adopted by Residues in Proteins. Biochemistry 1991, 30:1578-1586.
An approach based on information theory is applied to the prediction of φ/ψ angles. The results reported are similar to those previously observed for secondary structure prediction, with an accuracy of 65% being reported.

16. NIEFIND K, SCHOMBURG D: Amino Acid Similarity Coefficients for Protein Modelling and Sequence Alignment from Main-chain Folding Angles. J Mol Biol 1991, 219:481-497.
Similarities between φ/ψ maps for each amino acid type are used to construct substitution matrices for use in sequence alignment and modelling applications.

17. SIMON I, GLASSER L, SCHERAGA HA: Calculation of Protein Conformation as an Assembly of Stable Overlapping Segments: Application to Bovine Pancreatic Trypsin Inhibitor. Proc Natl Acad Sci USA 1991, 88:3661-3665.
Predictions for bovine pancreatic trypsin inhibitor structure, using energy minimization, are described. The authors start with minimized tripeptides which are repeatedly 'grown' and energy minimized until a complete structure is formed. Although the results are not particularly successful, the idea is elegant.

18. CHAN HS, DILL KA: Compact Polymers. Macromolecules 1989, 22:4559-4573.

19. GREGORET LM, COHEN FE: Protein Folding. Effect of Packing Density on Chain Conformation. J Mol Biol 1991, 219:109-122.
An investigation into the claims by Chan and Dill [18] that compactness of form may be an overwhelming driving force during protein folding. The conclusions are that although a compact structure is required, the magnitude of this effect is not as large as previously claimed.

20. •• BOWIE JU, LÜTHY R, EISENBERG D: A Method to Identify Protein Sequences that Fold into a Known Three-dimensional Structure. Science 1991, 253:164-169.
The application of templates consisting of accessibility and secondary-structure information from a protein of known structure to the identification of other sequences which adopt a similar fold. This paper shows that the inclusion of structural information in a sensible manner enables the detection of proteins whose folds are similar despite having PRIs of less than 25%. However, the fold must have already been observed in Nature.

21. •• FINKELSTEIN AV, REVA BA: A Search for the Most Stable Fold of Protein Chains. Nature 1991, 497-499.
In this paper, a procedure for determining the complementarity of a sequence with a model fold is discussed. A lattice representation for an eight-stranded, antiparallel, β-barrel structure was constructed and in a limited test, sequences were tested for suitability, using a combination of molecular field theory and one-dimensional statistical mechanics.

22. SWINDELLS MB, THORNTON JM: Structure Prediction and Modelling. Curr Opin Biotechnol 1991, 2:512-519.

23. SUTCLIFFE MJ, HANEEF I, CARNEY D, BLUNDELL TL: Knowledge Based Modelling of Homologous Proteins, Part 1. Protein Eng 1987, 1:377-384.

24. SUTCLIFFE MJ, HAYES FRF, BLUNDELL TL: Knowledge Based Modelling of Homologous Proteins, Part 2. Protein Eng 1987, 1:385-392.

25. •• VRIEND G: What If: a Molecular Modelling and Drug Design Program. J Mol Graphics 1990, 8:52-56.
A sophisticated new program for modelling proteins that is designed to tackle problems frequently encountered when modelling macromolecules.

26. •• VRIEND G, SANDER C: Detection of Common Three-dimensional Substructures in Proteins. Proteins 1991, 11:52-58.
A rapid procedure that does not require any starting equivalences is described for aligning protein structures. The authors document a previously undetected similarity between ubiquitin and (2Fe-2S) ferredoxin structures as evidence of their method's suitability to the task.

27. •• HUYSMANS M, RICHELLE J, WODAK S: SESAM: a Relational Database for Structure and Sequence of Macromolecules. Proteins 1991, 11:59-76.
A relational database is described that contains a variety of information about proteins of known structure. This information is contained within numerous tables and accessed using Structure Query Language. This enables the user to formulate sophisticated queries in a relatively simple manner. Interface links to external databases and graphics packages are also enabled.

28. BAIROCH A, BOECKMANN B: The SWISS-PROT Protein Sequence Data Bank. Nucleic Acids Res 1991, 19:2247-2249.

29. HENDLICH M, LACKNER P, WEITCKUS S, FROSCHAUER R, GOTTSBACHER K, CASARI G, SIPPL M: Identification of Native Folds Amongst a Large Number of Incorrect Models. J Mol Biol 1990, 216:167-180.

30. LÜTHY R, BOWIE JU, EISENBERG D: Assessment of Protein Models with Three-dimensional Profiles. Nature 1992, 356:83-85.
The detection of errors in modelled structures is particularly important given the number of model structures now produced routinely. The authors demonstrate how structural profiles can be employed to detect errors, by showing that many previously incorrectly determined crystallographic structures would not have passed their criteria.

31. •• HOLM L, SANDER C: Database Algorithm for Generating Protein Backbone and Side-chain Co-ordinates from a Cα Trace. J Mol Biol 1991, 218:183-194.
A rapid procedure for automatically generating a complete set of atomic coordinates solely from Cα coordinates is described. The authors report that the procedure is well suited to tackling the problems frequently encountered during structure determination and modelling by homology.

32. SANDBERG WS, TERWILLIGER TC: Energetics of Repacking a Protein Interior. Proc Natl Acad Sci USA 1991, 88:1706-1710.
A comprehensive set of double mutations was performed on two residues which were buried and apparently in contact within the bacteriophage gene V protein. From estimations of protein stability, it was suggested that all mutations destabilized the structure, including a simple positional exchange of the original two residues. However, mutations involving polar residues were less destabilizing than predicted using a conventional cyclohexane model.

33. LIM WA, SAUER RT: The Role of Internal Packing Interactions in Determining the Structure and Stability of a Protein. J Mol Biol 1991, 219:359-376.
An experimental investigation of the contribution by hydrophobic core residues to structure, stability and ligand binding. Three buried interacting residues were combinatorially randomized to any of the five hydrophobic amino acids Leu, Val, Ile, Met or Phe. The results show that although 70% of isolated mutants were active, only two of the 78 maintained the wild-type level of stability and activity. Thus, although the gross structure is tolerant towards mutations in the core that maintain hydrophobicity, the detailed structure is perturbed and this almost always has a negative effect on functionality.

34. •• ÅQVIST J, LUECKE H, QUIOCHO FA, WARSHEL A: Dipoles Localised at the Helix Termini of Proteins Stabilize Charges. Proc Natl Acad Sci USA 1991, 88:2026-2030.
Further theoretical evidence that charge stabilization at helix termini derives from local peptide dipoles rather than the helix macro-dipole is reported. Long-range effects are shown to be diluted by the surrounding protein environment. Calculations using Warshel's method agree with data from two experimental studies and indicate that the first turn of the helix accounts for almost the entire observed effect.

35. JIANG F, KIM SH: Soft Docking: Matching of Molecular Surface Cubes. J Mol Biol 1991, 219:79-102.
A hierarchical method to dock two molecules is described that involves converting the surface of each molecule to small cubes. Shape complementarity at a wide level can then be performed swiftly using a full global search. Matches that are devoid of steric clashes can be assessed for favourable interactions such as electrostatic complementarity. These two steps drastically reduce the number of solutions, leaving only a small subset to be screened in detail.

36. •• KEEN JN, CACERES I, ELIOPOULOS EE, ZAGALSKY PF, FINDLAY JBC: Complete Sequence and Model for the A2 Subunit of the Carotenoid Pigment Complex, Crustacyanin. Eur J Biochem 1991, 197:407-417.
The model-building strategy applied in this paper is of particular interest as it reinforces the problems encountered when the percentage identity is in the 25% 'twilight zone'. A good logical approach is taken by the authors, which makes use of all available information.

37. GRAHAM-LORENCE S, KHALIL MW, LORENCE MC, MENDELSON CR, SIMPSON ER: Structure-function Relationships of Human Aromatase Cytochrome P-450 Using Molecular Modelling and Site-directed Mutagenesis. J Biol Chem 1991, 266:11939-11946.
Human aromatase cytochrome P-450 only has about 20% sequence identity with the bacterial cytochrome P-450cam, the structure of which is known. Nevertheless, a model was built for the active-site region of aromatase and two regions predicted to be close to the substrate-binding pocket were mutated. Both sets of mutants reduce activity and lend some support to the model.

38. •• BLONDEL A, BEDOUELLE H: Engineering the Quaternary Structure of an Exported Protein with a Leucine Zipper. Protein Eng 1991, 4:457-461.
An excellent example of modular design. A 35-residue leucine zipper sequence (taken from yeast transcription factor GCN4) was appended to the C-terminus of the monomeric maltose-binding protein. The hybrid protein was efficiently exported into the periplasmic space of E. coli and appeared as a dimer, presumably held together by the leucine zipper tail.

39. CURTIS BM, PRESNELL SR, SRINIVASAN S, SASSENFELD H, KLINKE R, JEFFERY E, COSMAN D, MARCH CJ, COHEN FE: Experimental and Theoretical Studies on the Three Dimensional Structure of Human Interleukin-4. Proteins 1991, 11:111-119.
A prediction of the three-dimensional structure of interleukin-4 based on circular dichroism spectroscopy, secondary structure prediction, disulphide bridge constraints, helix packing geometries, and known three-dimensional topologies of four-helix bundles. A good logical approach to tertiary structure prediction.

40. PASTORE A, ATKINSON RA, SAUDEK V, WILLIAMS RJP: Topological Mirror Images in Protein Structure Computation: an Underestimated Problem. Proteins 1991, 10:22-32.
This paper draws attention to a problem encountered during NMR-determination of protein structure. With insufficient distance constraints, it is possible to generate two solutions for a structure called topological mirror images. Although the chirality of the amino acids and secondary structures is usually correct in both solutions, the secondary structure elements are packed together differently. To differentiate between the two solutions the authors suggest a combination of energy evaluation, φ/ψ plots, chirality tests and solvent accessibility.

41. GREER J: Comparative Modelling of Mammalian Serine Proteinases. J Mol Biol 1981, 153:1027-1042.

42. SIEZEN RJ, DE VOS WM, LEUNISSEN JAM, DIJKSTRA BW: Homology Modelling and Protein Engineering Strategy of Subtilases, the Family of Subtilisin-like Serine Proteinases. Protein Eng 1991, 4:719-737.
Over 40 known sequences and four structures belonging to the subtilisin-related proteinase family are neatly summarized, with particular reference to the essential structural framework of the catalytic domain and observed variations in the variable regions. This information is used by the authors to suggest strategies for protein engineering, such as where to introduce a disulphide bond in order to increase thermal stability.

43. KUSSIE PL, ANCHIN LM, SUBRAMANIAN S, GLASEL JA, LINTHICUM DS: Analysis of the Binding Site Architecture of Monoclonal Antibodies to Morphine by Using Competitive Ligand Binding and Molecular Modelling. J Immunol 1991, 146:4248-4257.
Antibodies raised against morphine were sequenced and modelled using standard methods. Morphine was then docked onto the modelled antibody, using information derived from competitive ligand data, which identified the likely orientation of the ligand and putative binding residues on the antibody. This is a detailed paper describing a variety of experimental and modelling studies.

44. KETTLEBOROUGH CA, SALDANHA J, HEATH VJ, MORRISON CJ, BENDIG MM: The Importance of Framework Residues on Loop Conformation. Protein Eng 1991, 4:773-783.
Mouse monoclonal antibody mAb425 was humanized by grafting the mouse complementarity-determining regions onto the human framework. Reduced avidity led the authors to investigate the influence of framework residues on the structure of the combining site. Six framework residues of particular interest were identified. Mutations at these positions showed a wide range of avidities for the antigen, despite conservation of the loop complementarity-determining regions.

MB Swindells, Protein Engineering Research Institute, 6-2-3 Furuedai, Suita, Osaka 565, Japan.
