TIBS 16 - JANUARY 1991

REVIEWS

Protein modules

As the database of protein sequences grows it is becoming apparent that many proteins are constructed from relatively few modular units that appear many times. Determination of the three-dimensional structure of such modules by NMR has been possible due to their production in relatively large quantities by recombinant DNA techniques. The knowledge gained about the structure of individual modules can then be used to predict their properties in a variety of intact proteins.

The number of known protein amino acid sequences has increased dramatically in the past few years. With the large investment of resources in the Human Genome Project, this number is likely to grow even faster. To maximize the information that can be derived from this huge database, it will be necessary to obtain details about the structure and function of the proteins, as well as their sequences. Currently, the number of known protein sequences (over 17 000) is about 40 times greater than the number of known protein structures, and the task of completely characterizing the structure and function of tens of thousands of proteins is daunting.

Fortunately, some of the task may not be as complex as was first expected. It appears that, in the process of evolution, biology has found convenient structural units that are used over and over again, the same modules sometimes being used to perform different tasks. The existence of repeated sequences or modules in vertebrate proteins is now recognized [1,2]. These modules* often correspond to single exons that have the same phase at their intron-exon boundary. This facilitates divergent evolution from a primordial gene by exon shuffling and duplication [1,2]. One of the best known examples is the immunoglobulin superfamily [4]. The immunoglobulin structural module has now been observed not only in immunoglobulins, but also in a wide variety of cell-surface proteins, including growth factor receptors and cell adhesion molecules. This module has even been found recently in intracellular muscle-related proteins [5]. Different kinds of module are found in proteins associated with formation and dissolution of blood clots [6], complement (which is involved in the immune response) [7], the extracellular matrix [8] and various cell-surface receptors [4,9-11]. Some of the known modules and modular proteins are illustrated in Table I and Fig. 1. The proteins shown are predominantly extracellular and are involved in a wide variety of situations where protein-protein interactions are important. Not included in the figure, but also of considerable importance, are modules associated with protein-DNA interactions; the zinc finger module, for example, is the subject of numerous structural studies [12].

New modular proteins and different kinds of module are continually being identified. These can be detected by searching the sequence database for a consensus sequence; that is, a sequence that identifies a particular kind of module (the consensus sequence of the epidermal growth factor (EGF)-like module G is shown by the bold letters in Fig. 4). Recombinant DNA/expression methods or peptide synthesis techniques can then be used to produce milligram quantities of the module. Once the structure of a module, the consensus structure, is determined, for example by NMR, the structure of homologous modules can be predicted with some confidence [13]. The process of defining the structure and function of the entire protein can then be tackled by sequence comparisons of homologous modules, site-directed mutagenesis, biological assays and lower resolution structural techniques such as solution scattering and microscopy. This general strategy, which is summarized in Fig. 2, is outlined in this brief article. Recent examples of the investigation of the structure and function of some protein modules are described.
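The idea of a consensus sequence can be illustrated with a minimal sketch. The function below takes sequences that are already aligned to equal length and reports the residues conserved in every sequence (or in a chosen fraction of them), writing `x` elsewhere. The function name, the `threshold` parameter and the toy alignment are assumptions made for illustration; they are not real module sequences and this is not the database search procedure used by the authors.

```python
from collections import Counter


def consensus(aligned, threshold=1.0):
    """Return a consensus string for equal-length, pre-aligned sequences.

    A column contributes its most common residue when that residue occurs
    in at least `threshold` (a fraction between 0 and 1) of the sequences;
    otherwise the column is written as 'x'. Gap characters ('-') are never
    reported as conserved.
    """
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to equal length")

    out = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        conserved = residue != "-" and count / len(aligned) >= threshold
        out.append(residue if conserved else "x")
    return "".join(out)


# Hypothetical toy alignment, invented for this example only.
toy = ["CAGDCKE", "CTGNCRE", "CSGDCHE"]
print(consensus(toy))       # strict: residue must be identical in all rows
print(consensus(toy, 0.6))  # relaxed: residue present in at least 60% of rows
```

Lowering the threshold trades specificity for sensitivity: a stricter consensus finds fewer false modules but may miss divergent family members.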

Table I. Modules found in vertebrate proteins

Module   Comment                                                         Refs
Gla      contains γ-carboxyglutamate residues                            6,15
G        resembles epidermal growth factor                               23,24
K        named after a Danish pastry                                     6,19
C        appears often in complement proteins; also known as             7,26
         complement control protein (CCP) and short consensus
         repeat (SCR)
F3       first found in fibronectin                                      8,25
I        immunoglobulin superfamily domain                               4
L        appears in the LDL receptor and complement proteins             7,9
T        found in thrombospondin and the complement protein properdin    7
N        found in some growth factor receptors                           10
E        has homology to EF-hand calcium-binding domain                  16
LB       lectin module of some cell surface proteins                     11

*Terminology in this field can be confusing. 'Modules' have been variously called structural units, repeats, motifs and domains. The word domain, which has been defined as 'a subregion of the polypeptide chain that is autonomous in the sense that it possesses all the characteristics of a complete globular protein' [3], describes many of the observed features of the repeating sequences discussed here. However, like Patthy [2], we will use the word module when referring to the particular kind of domain that is clearly associated with exon shuffling and duplication.

M. Baron, D.G. Norman and I.D. Campbell are at the Department of Biochemistry, University of Oxford, South Parks Road, Oxford OX1 3QU, UK.

© 1991, Elsevier Science Publishers Ltd, (UK) 0376-5067/91/$02.00

[Figure 1: schematic diagrams of mosaic proteins, panels (a)-(q). Labels recoverable from the figure include tPA, C1r, C1s, C2, factor B, factor H, properdin, thrombospondin, fibronectin, twitchin, ELAM-1, N-CAM, LDL receptor, NGF receptor, IL-2 receptor and PDGF receptor.]

Figure 1. A representative selection of some of the known proteins that comprise mosaics of modules. Proteins (a)-(d) are associated with blood clotting/fibrinolysis [6]; proteins (e)-(h) are associated with complement [7]; (i) and (j) are in the extracellular matrix [8]; twitchin (k) is an intracellular protein associated with muscle [5]; (l) and (m) are cell adhesion molecules [4,11]; (n)-(q) are various cell-surface receptors [4,9,10]. No attempt has been made here to represent the size of the cytoplasmic domains of the membrane-spanning proteins. Individual modules are depicted by letters and different shapes (see Table I). No account is made for the significant variations in sequence patterns that exist within any particular module family. Not depicted here are other repeating sequences that appear in some of these molecules, for example in C1r and C1s. With the exception of twitchin [5], all the proteins shown are found in extracellular spaces and are usually glycosylated. Most of the modules are stabilized by disulphide bridges although the type III fibronectin module, F3, which appears in numerous proteins both inside and outside the cell, usually is not. The immunoglobulin module (I) also lacks disulphide bonds when it is intracellular.

The module structures

Until recently, the only way of deriving detailed three-dimensional protein structures was from diffraction studies of crystalline arrays. Diffraction methods have, of course, had an enormous impact on protein research and the technology has recently improved greatly. X-ray crystallography remains the most common method for structural determination of most proteins [14]. It is, however, necessary to crystallize
the protein and usually necessary to produce heavy atom derivatives to solve the phase problem [14]. The type of protein depicted in Fig. 1 has usually proved to be very difficult to crystallize, partly because such proteins are often glycosylated or membrane bound. The result is that relatively few module structures have been solved by diffraction methods, namely: immunoglobulin (I) [4], 'kringle' (K) [15], γ-carboxyglutamate (Gla) [15] and the calcium-binding 'EF-hand' (E) [16].

An alternative structural method was devised in the 1980s that involves the use of high-resolution NMR [17,18]. This technique can be used to produce relatively good quality three-dimensional structures from about 15 mg of protein dissolved in water. No crystals or heavy atom derivatives are required, so the structure can often be determined in a few months. The main limitation of the method is in the size of the protein that can be studied. The current limit is about 150 residues, although this number is being revised upwards as new methods involving 15N and 13C labelling are introduced. The way in which structures can be determined from high resolution NMR spectra has recently been described in TIBS [18].

Modules usually have between 40 and 100 amino acids and often have a stable structure when isolated from the native protein. NMR is thus ideally suited to study single modules and some module pairs. These can be produced in a variety of ways, including proteolytic cleavage from a larger native protein, recombinant DNA/expression methods and peptide synthesis. All three of these production methods have been used in NMR structural studies of modules; for example a kringle from plasminogen has been excised by proteolysis and its NMR structure determined [19]; examples of the other routes are given below.

NMR structural methods have already made a significant contribution to investigations of modules involved in protein-DNA interactions. The homeodomain fragment of a protein from Drosophila [20] has been shown to have the helix-turn-helix structure identified in diffraction studies of other DNA-binding proteins [12]. The structures of a zinc finger [18], produced by peptide synthesis, and a double zinc finger, produced by expression in E. coli, have been determined [21].

About three years ago, our laboratory set out to produce some of the modules depicted in Fig. 1 by gene expression, and to determine their solution structures by NMR [22]. So far the structures of three homologous EGF-like (G) modules have been determined, two that act as growth factors [23] and one that appears in the blood-clotting protein, factor IX (Ref. 24). In addition, we have solved the structure of a fibronectin type 1 module (F1) [25], and, very recently, one of the complement control protein modules (C) [26]. The G, F1 and C modules were all produced by inserting a synthetic gene encoding the module of interest in a yeast secretion vector involving the α-factor. The modules were synthesized and secreted into the medium and then characterized by protein sequencing and disulphide mapping. Each resonance in the high resolution 1H NMR spectra was assigned to a particular atom in the protein and families of structures were calculated that were consistent with the NMR-derived distance restraints.

The module that we have studied in most detail is the EGF-like module, marked G in Fig. 1. This module can act as a growth factor and various homologous modules, including EGF and transforming growth factor alpha (TGF-α), bind to the EGF receptor (see for example Ref. 23). This module also appears in a wide variety of other proteins (Fig. 1).

The first consideration about module G is whether all the sequences that have the G consensus sequence have the same consensus structure. No diffraction studies have yet been successful on this module but numerous NMR studies have been done. In addition to ourselves, NMR groups in Switzerland, USA, Japan, Canada and Holland have studied various EGFs and TGF-α and all the structures are in agreement to within the resolution of the technique, although only 26% of their residues are identical. The dominant feature of the structure is a β-sheet platform with three disulphide bonds radiating up from one face to various other loops and turns [23] (see Fig. 3). The structures are found to be relatively floppy in solution, which might explain their reluctance to crystallize. The finding that all molecules that bind to the EGF receptor have a similar structure is not surprising but we have also shown recently that a G module from factor IX has the same fold as growth factors [24]. These studies support the idea that a consensus structure does exist and that the structure of one module can be used to model other homologues. This notion has already been exploited with diffraction-based structures [13].

In addition to the G module, we have studied the type 1 fibronectin module [25] (F1) that appears numerous times in fibronectin as well as in other blood proteins such as tissue plasminogen activator (tPA). The F1 structure has some similarities to the G structure, with a major and a minor antiparallel β-sheet stacking on one another, enclosing a core of conserved hydrophobic residues. In this case there are two disulphide bridges, one between adjacent strands of one sheet, the other linking the major and minor β-sheets.

We have recently also determined the structure of module C (Ref. 26). This module, which is known as the complement control protein (CCP) module and the short consensus repeat (SCR) [7], has been recognized over 100 times in various extracellular proteins that are composed of modular mosaics. Human complement factor H is made up of 20 C modules; the 16th, chosen as typical, was synthesized in yeast and shown to have the same pattern of disulphide bonds as found in the intact protein. The NMR data show that the module forms a relatively compact structure with extensive loops and β-sheets (see Fig. 3).

[Figure 2: flow chart titled "Strategy for Studying Modular Proteins". Boxes recoverable from the figure: identify protein modules in database using a consensus sequence; align module sequences; identify specific module for study; refine sequence alignment; express or synthesise module; determine module structure by NMR; obtain consensus structure; express module pairs and determine structure; define functional patches by mutagenesis and functional assay; model other examples of module; interpret low angle scattering and microscopy of intact protein; combine information to obtain structure and functional relationships of intact protein.]

Figure 2. Outline of the strategy used to study protein modules.

Modules and functional patches

Let us now consider in more detail the function of some of the modules illustrated in Fig. 1. One of their roles seems to be recognizing and binding other proteins; for example in receptor-growth factor interactions and tightly controlled cascades of enzyme-catalysed proteolysis, such as in the blood clotting and complement pathways. Another possible function is a purely structural one; module 'spacers' may place the interacting module into the correct position to perform its function.

One of the best-studied modules is the one from the immunoglobulin family (I). It is well known [4] how this module can be adapted to bind a wide variety of ligands by changing the nature of variable polypeptide loops attached to a stable β-sheet core structure [4]. Evidence so far from antibody-protein interactions is that about 15-22 amino acids on different loops of protein antigens make contact with the antibody [27]. Another well known example is the RGD patch on a fibronectin module, which binds to cell-surface receptors [8]. Our work on various other modules is beginning to extend this idea of a stable core structure with different surface 'patches' of amino acids designed to interact with other proteins.

Consider, for example, the functional patches on the G module. The consensus sequence of G is superimposed on a bead diagram that represents features of the consensus structure (Fig. 4). A large number of G sequences are known and we compared those that bind to the EGF receptor (e.g. EGF and TGF-α) with those that do not (e.g. modules from factor IX). This led to a prediction that certain growth factor residues were likely to be involved at the receptor-growth factor interface [23]. These residues are from different loops on the EGF structure and form a patch (see Fig. 4). The above predictions are consistent with results on variants of the EGF structure, and can be tested by a programme of site-directed mutagenesis, since NMR analysis allows investigation of whether an amino acid change causes a local or a global change in the structure of the module. For example, changing Leu47 perturbs receptor binding but causes only a local structural change [23], whereas we have found other changes that perturb both receptor binding and structure.

In previous comparisons of the G sequences it had been noticed that, in addition to the usual consensus sequence residues, several G module sequences contain the sequence DxDQx...xxDxxxxxY at the amino-terminus. The third D often corresponds to a β-hydroxylated Asp or Asn, x denotes other amino acids and ... denotes a variable gap. Knowing the structure of EGF, we predicted [28] that these residues would form a Ca2+-binding site (see Fig. 4). The factor IX G module was subsequently expressed and shown to bind calcium [24]. We have recently been investigating the effect that single amino acid changes have on the calcium affinity of this module and are comparing these observations with amino acid substitutions that are known to cause haemophilia. The results, together with previous mutagenesis studies on the intact molecule, indicate that the Ca2+-binding patch on the first G module of factor IX is essential for clotting activity.

As expected, the structural work on the F1 and C modules is also beginning to yield insight into possible sites for protein-protein interactions. For example, tissue plasminogen activator (tPA) is a fibrinolysis protein that interacts with plasminogen and fibrin and has a G and an F1 module (Fig. 1). Consideration of the determined F1 and G structures, together with comparison of other F1 and G sequences, showed that the lower face of the major β-sheet (the face opposite to that which interacts with the minor sheet) has a hydrophobic patch that might be involved in interactions with other tPA modules [25].

Towards the structure and function of the intact protein

So far, we have been discussing the structure and function of the individual modules depicted in Fig. 1. They are, of course, only components of relatively large and complex proteins. Is it possible to build up a picture of the intact protein from the structures of the modules?

Representations of some of the known module structures are shown in Fig. 3. As can be observed, these consist mainly of β-strand and sheet. (In contrast, modules involved in protein-DNA interactions appear usually to have a large helical content.) Three classes of module can be tentatively identified: I and C, F1 and G, and K. The different classes may fit together to make mosaic proteins in different ways. The amino- and carboxy-termini of modules I and C (Fig. 3a and b) are clearly 'in-line' and at the ends of the module. This means that it is easy to link one module to the next, like beads on a string. The amino- and carboxy-termini of F1 and G (Fig. 3c and d) are also in line, but the ends tend to leave a partially formed β-sheet, suggesting that they may 'clip' together to complete the sheets. The kringle structure, K (Fig. 3e), is rather different in that the amino- and carboxy-termini are close together. One might therefore expect that several kringles connected together with a short linker would lead to a coiled structure.

[Figure 3: ribbon diagrams, panels (a)-(e), with a 1 nm scale bar.]

Figure 3. Schematic ribbon diagrams of the structures of some of the modules in Fig. 1. (a) The complement control protein module [26], C; (b) the immunoglobulin module [4], I; (c) the fibronectin type 1 module [25], F1; (d) the growth factor module [23], G; (e) the kringle module [19], K.

[Figure 4: bead diagrams, panels (a) and (b).]

Figure 4. Representations of the consensus sequence and consensus structure of the epidermal growth factor module (G). The consensus sequence, which is present in almost all G modules, is shown on both diagrams. (a) The structure of the 53 residues of human EGF; the residues predicted to form a patch at the receptor-growth factor interface [23] are emboldened. (b) The first G module from factor IX, with a shaded patch associated with a calcium-binding site [24].
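The calcium-binding sequence pattern quoted in the text for some G modules, DxDQx...xxDxxxxxY, can be written as a regular expression, which gives a rough idea of how such a pattern might be scanned for in a sequence. This is a minimal sketch under stated assumptions: `x` is taken to mean any residue letter, the third D is allowed to be D or N (the text says it often corresponds to a β-hydroxylated Asp or Asn), and the upper limit on the variable gap (`max_gap`) is an arbitrary choice, since the text does not specify one. The function names and the test sequences are hypothetical.

```python
import re


def calcium_motif(max_gap=10):
    """Compile a regex for D x D Q x ... x x [D/N] x x x x x Y.

    `max_gap` caps the length of the variable gap ('...'); the value is an
    assumption made for this sketch, not a figure from the article.
    """
    pattern = r"D[A-Z]DQ[A-Z][A-Z]{0,%d}?[A-Z]{2}[DN][A-Z]{5}Y" % max_gap
    return re.compile(pattern)


def find_motif(sequence, max_gap=10):
    """Return (start_index, matched_text) for the first match, or None."""
    match = calcium_motif(max_gap).search(sequence.upper())
    if match is None:
        return None
    return match.start(), match.group()


# Invented test sequences, not taken from any real protein.
print(find_motif("GGDADQAKKKAADAAAAAYGG"))  # motif present, Asp at third position
print(find_motif("GGDADQAKKKAANAAAAAYGG"))  # motif present, Asn at third position
print(find_motif("GGDADQAKKKAADAAAAAFGG"))  # final Tyr missing, no match
```

A match to such a pattern only flags a candidate site; as the text describes, calcium binding by the factor IX G module was confirmed experimentally after expression.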

Speculation about the way modules might fit together could be checked by the expression of module pairs and the determination of their structures. (A module pair is probably the largest size that can be handled by current NMR methods.) Common patterns for joining certain types of modules may be found. In addition, once the overall dimensions of the modules are known from the NMR studies (see Fig. 3), it should be possible to interpret the relatively low-resolution structural data obtained either from electron microscopy or some of the new scanning microscopes [29], in terms of a combination of module structures. There is also the possibility of analysing solution scattering from an intact modular protein in terms of known module structures [30].
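One simple way to see how low-resolution data could discriminate between module arrangements is to compare the radius of gyration (a quantity obtainable from solution scattering) of different bead models. The sketch below treats each module as an equal-mass point bead and compares an extended 'beads on a string' chain with a closed ring, used here only as a stand-in for a more compact arrangement. The bead spacing of 3 nm, the ring geometry and all function names are assumptions for illustration, and the model ignores the size and shape of the modules themselves.

```python
import math


def radius_of_gyration(points):
    """Root-mean-square distance of equal-mass points from their centroid."""
    n = len(points)
    centroid = [sum(p[axis] for p in points) / n for axis in range(3)]
    mean_sq = sum(
        sum((p[axis] - centroid[axis]) ** 2 for axis in range(3))
        for p in points
    ) / n
    return math.sqrt(mean_sq)


def linear_chain(n, spacing):
    """n beads on a straight line, `spacing` apart (fully extended chain)."""
    return [(i * spacing, 0.0, 0.0) for i in range(n)]


def ring(n, spacing):
    """n beads on a circle, with neighbouring beads `spacing` apart."""
    radius = spacing / (2.0 * math.sin(math.pi / n))
    return [
        (radius * math.cos(2.0 * math.pi * i / n),
         radius * math.sin(2.0 * math.pi * i / n),
         0.0)
        for i in range(n)
    ]


# Hypothetical example: 20 modules (as in factor H) with an assumed 3 nm spacing.
n_modules, spacing_nm = 20, 3.0
print("extended chain Rg (nm):", radius_of_gyration(linear_chain(n_modules, spacing_nm)))
print("closed ring Rg (nm):   ", radius_of_gyration(ring(n_modules, spacing_nm)))
```

The extended chain gives a markedly larger radius of gyration than the ring for the same number of beads, which is the kind of difference that scattering or microscopy of an intact protein could, in principle, detect.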

Conclusions

The combination of several powerful new technologies, protein engineering, peptide synthesis and NMR, has opened up exciting new possibilities for approaching structural problems. Stable protein modules can be produced in relatively large amounts and their solution structures determined relatively quickly. It seems likely that the structures of at least one of each kind of module in Fig. 1 will be determined soon. For example, NMR spectra of the low-density lipoprotein (LDL) receptor module (L) and an F1 module from tPA are currently being examined and vectors for several other kinds of module and module pairs shown in Fig. 1 have been constructed. The next major step will be in combining knowledge gained about individual modules and module pairs with information from functional and low resolution structural studies to build up a picture of the structure and function of the intact proteins.

Acknowledgements

This is a contribution from the Oxford Centre for Molecular Sciences, which is supported by SERC and MRC. We thank numerous colleagues in Oxford who have contributed to this work, including the MRC Immunochemistry Unit, and the labs of the Kingsmans and G. Brownlee. In particular we single out P. Handford and T. Day for their help. We also thank ICI Pharmaceuticals and British Biotechnology for their support.

References

1 Doolittle, R. F. (1989) Trends Biochem. Sci. 14, 244-245
2 Patthy, L. (1987) FEBS Lett. 214, 1-7
3 Schulz, G. E. and Schirmer, R. H. (1979) Principles of Protein Structure, Springer-Verlag
4 Williams, A. F. and Barclay, A. N. (1988) Annu. Rev. Immunol. 6, 381-405
5 Benian, G. M., Kiff, J. E., Neckelmann, N., Moerman, D. G. and Waterston, R. H. (1989) Nature 342, 45-50
6 Furie, B. and Furie, B. C. (1988) Cell 53, 505-517
7 Reid, K. B. M. and Day, A. J. (1989) Immunol. Today 10, 177-180
8 Yamada, K. M. (1989) Curr. Opin. Cell Biol. 1, 956-963
9 Soutar, A. K. and Knight, B. L. (1990) Br. Med. Bull. 46, 891-916
10 Mallett, S., Fossum, S. and Barclay, A. N. (1990) EMBO J. 9, 1063-1068
11 Bevilacqua, M. P., Stengelin, S., Gimbrone, M. A. and Seed, B. (1989) Science 243, 1160-1165
12 Struhl, K. (1989) Trends Biochem. Sci. 14, 137-140
13 Sali, A., Overington, J. P., Johnson, M. S. and Blundell, T. L. (1990) Trends Biochem. Sci. 15, 235-240
14 Eisenberg, D. and Hill, C. P. (1989) Trends Biochem. Sci. 14, 260-264
15 Soriano-Garcia, M., Park, C. H., Tulinsky, A., Ravichandran, K. G. and Skrzypczak-Jankun, E. (1989) Biochemistry 28, 6805-6810
16 Babu, Y. S., Bugg, C. E. and Cook, W. J. (1988) J. Mol. Biol. 204, 191-204
17 Wüthrich, K. (1989) Science 243, 45-50
18 Wright, P. E. (1989) Trends Biochem. Sci. 14, 255-260
19 Atkinson, R. A. and Williams, R. J. P. (1990) J. Mol. Biol. 212, 541-552
20 Qian, Y., Billeter, M., Otting, G., Müller, M., Gehring, W. and Wüthrich, K. (1989) Cell 59, 573-580
21 Härd, T., Kellenbach, E., Boelens, R., Maler, B. A., Dahlman, K., Freedman, L. P., Carlstedt-Duke, J., Yamamoto, K. R., Gustafsson, J. A. and Kaptein, R. (1990) Science 249, 157-160
22 Baron, M., Kingsman, A. J., Kingsman, S. M. and Campbell, I. D. (1990) in Protein Production by Biotechnology (Harris, T. J. R., ed.), pp. 49-60, Elsevier
23 Campbell, I. D., Baron, M., Cooke, R. M., Dudgeon, T. J., Fallon, A., Harvey, T. S. and Tappin, M. J. (1990) Biochem. Pharmacol. 40, 35-40
24 Handford, P. A., Baron, M., Mayhew, M., Willis, A., Beesley, T., Brownlee, G. G. and Campbell, I. D. (1990) EMBO J. 9, 475-480
25 Baron, M., Norman, D., Willis, A. C. and Campbell, I. D. (1990) Nature 345, 642-646
26 Barlow, P., Norman, D. G., Baron, M., Day, A., Sim, R. and Campbell, I. D. Biochemistry (in press)
27 Laver, W. G., Air, G. M., Webster, R. G. and Smith-Gill, S. J. (1990) Cell 61, 553-556
28 Cooke, R. M., Wilkinson, A. J., Baron, M., Pastore, A., Tappin, M. J., Campbell, I. D., Gregory, H. and Sheard, B. (1987) Nature 327, 339-341
29 Arscott, P. G. and Bloomfield, V. A. (1990) Trends Biotechnol. 8, 151-156
30 Ponting, C., Holland, S., Cederholm-Williams, S. A., Marshall, J. M., Brown, A. J. and Blake, C. C. F. Biochemistry (in press)
