J. Mol. Biol. (1990) 213, 327-336

Automatic Definition of Recurrent Local Structure Motifs in Proteins

Marianne J. Rooman, Joaquin Rodriguez and Shoshana J. Wodak

Unité de Conformation des Macromolécules Biologiques
Université Libre de Bruxelles, CP160, P2, Av. P. Héger
1050 Bruxelles, Belgium

(Received 7 August 1989; accepted 3 January 1990)

An automatic procedure for defining recurrent folding motifs in proteins of known structure is described. These motifs are formed by short polypeptide fragments of equal size containing between four and seven residues. The method applies a classical clustering algorithm that operates on distances between selected backbone atoms. In one application, we use it to cluster all protein fragments into only four structural classes. This classification is rough considering the observed diversity of local structures, but comparable in homogeneity to the four classes of secondary structure (α-helix, β-strand, turn and coil). Yet, it discriminates between extended and curved coil and distinguishes β-bulges from β-strands. In a second application, the clustering procedure is combined with assignment of backbone dihedral angles to allowed regions in the Ramachandran map. This produces an exhaustive repertoire of highly homogeneous families of structural motifs that contains all the β-hairpins, βα- and αβ-loops previously defined by manual procedures, and new structural families of which two examples, a βα-loop and an α-helix beginning, are analyzed in detail. The described automatic procedures should be useful in categorizing structure information in proteins, thereby increasing our ability to analyze relations between structure and sequence.
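To make the clustering idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the "selected backbone atoms" are Cα atoms, that fragments are compared by the root-mean-square difference of their intra-fragment distances, and it swaps in a naive average-linkage agglomerative scheme for the classical algorithm the paper cites. The function names (`fragments`, `distance_descriptor`, `rms_difference`, `cluster`) and the toy coordinates are hypothetical, introduced only for illustration.

```python
import math
from itertools import combinations


def fragments(ca_coords, size):
    """Yield every window of `size` consecutive (assumed) C-alpha positions."""
    for start in range(len(ca_coords) - size + 1):
        yield ca_coords[start:start + size]


def distance_descriptor(fragment):
    """All pairwise distances between the atoms of one fragment."""
    return [math.dist(a, b) for a, b in combinations(fragment, 2)]


def rms_difference(d1, d2):
    """Root-mean-square difference between two distance descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(d1, d2)) / len(d1))


def cluster(descriptors, n_clusters):
    """Naive average-linkage agglomerative clustering (a stand-in, demo only)."""
    clusters = [[i] for i in range(len(descriptors))]

    def linkage(c1, c2):
        # Mean descriptor distance over all cross-cluster pairs.
        total = sum(rms_difference(descriptors[i], descriptors[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the two closest clusters; a < b, so deleting b is safe.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters


# Toy traces, not real protein data: a straight chain and a helix-like coil.
extended = [(3.8 * i, 0.0, 0.0) for i in range(8)]
helical = [(2.3 * math.cos(math.radians(100 * i)),
            2.3 * math.sin(math.radians(100 * i)),
            1.5 * i) for i in range(8)]

# Four-residue fragments from both traces, pooled into one list.
pool = list(fragments(extended, 4)) + list(fragments(helical, 4))
descriptors = [distance_descriptor(f) for f in pool]
groups = cluster(descriptors, n_clusters=2)
print(groups)
```

On real data the coordinates would be read from structure files, and the first application described above would request four classes rather than the two used for this toy input. Because the descriptor is built from internal distances only, it needs no prior superposition of the fragments.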

© 1990 Academic Press Limited    0022-2836/90/100327-10 $03.00/0

1. Introduction

The most widely used definitions of local folding motifs in proteins are those of the well-known secondary structures α-helix (H), β-sheet (E), turn (T) and coil (C) (Pauling et al., 1951; Pauling & Corey, 1951; Venkatachalam, 1968; Chandrasekaran et al., 1973; Lewis et al., 1973; Smith & Pease, 1980; Rose et al., 1985; Flory, 1969). Their locations in proteins can be readily determined from visual inspection of the three-dimensional model. Precise objective criteria are required, however, to assign a particular residue in a protein to one of the classes of secondary structure in a reliable and uniform way. On the basis of such criteria, several procedures for secondary structure assignments have been devised. In the most widely used ones, assignments are made according to the pattern of H bonds made by the peptide units (Kabsch & Sander, 1983). Other related methods use criteria derived either directly from Cα co-ordinates (Levitt & Greer, 1977), or by comparing inter-Cα distances in the protein to those of idealized models of secondary structures (Richards & Kundrot, 1988).

Each of these procedures, when applied to a set of known protein structures, yields a self-consistent, though slightly different, set of secondary structure assignments. Largest discrepancies usually occur in defining the precise boundaries of secondary structure elements, owing to different built-in tolerance levels to local distortions from idealized geometries, such as those occurring at helix ends (Richardson & Richardson, 1988; Presta & Rose, 1988) and within β-sheets (Richardson et al., 1978). In addition, because of the strict requirement to form characteristic H bonds, it is not uncommon to see residues adopting very similar conformations assigned to different secondary structure classes. Hence, extended segments can be found in β-strands as well as in coil regions, while highly curved segments can occur within a helix, as well as in turns or in coil regions.

These observations suggest that different definitions of local folding motifs, based on direct measures of conformational resemblance, could be a useful alternative to secondary structures, in particular for investigating relations between structure and amino acid sequence. For that purpose, the analysis of specific folding motifs is also of great interest, particularly since it has long been recognized that proteins tend to display similar substructures and folding motifs (Rao & Rossmann, 1973;
Levitt & Chothia, 1976; Richardson, 1981). Analyses of a growing number of well-resolved protein crystal structures have led to the identification and classification of specific local motifs embodied in the different classes of β-hairpins (Sibanda & Thornton, 1985; Milner-White & Poet, 1986), αβ-loops (Edwards et al., 1987) and αα-loops (Efimov, 1986; Thornton, 1988). Such classification and identification of local structures have proven to be extremely useful in modeling protein structures (Chothia et al., 1986; Rees & de la Paz, 1986; Blundell et al., 1987). Moreover, the concept of recurrent local conformations has further been extended to a genuine "spare parts" approach (Jones & Thirup, 1986; Claessens et al., 1989), in which fragments from known protein structures are used to interpret electron density maps and to model structural changes in engineered proteins.

This paper describes an automatic procedure for defining recurrent structure motifs in proteins. In this procedure, a classical clustering algorithm (Jambu, 1976; Jambu & Lebeaux, 1978) is applied to short polypeptide fragments of uniform lengths containing four to seven residues. These fragments are taken from a database of 75 highly resolved (_
