J Mol Model (2015) 21: 191 DOI 10.1007/s00894-015-2742-x

ORIGINAL PAPER

Computational design of enzyme–ligand binding using a combined energy function and deterministic sequence optimization algorithm Ye Tian 1 & Xiaoqiang Huang 1 & Yushan Zhu 1

Received: 27 April 2015 / Accepted: 24 June 2015 / Published online: 11 July 2015 # Springer-Verlag Berlin Heidelberg 2015

Abstract Enzyme amino-acid sequences at ligand-binding interfaces are evolutionarily optimized for reactions, and the natural conformation of an enzyme–ligand complex must have a low free energy relative to alternative conformations in native-like or non-native sequences. Based on this assumption, a combined energy function was developed for enzyme design and then evaluated by recapitulating native enzyme sequences at ligand-binding interfaces for 10 enzyme–ligand complexes. In this energy function, the electrostatic interaction between polar or charged atoms at buried interfaces is described by an explicitly orientation-dependent hydrogenbonding potential and a pairwise-decomposable generalized Born model based on the general side chain in the protein design framework. The energy function is augmented with a pairwise surface-area based hydrophobic contribution for nonpolar atom burial. Using this function, on average, 78% of the amino acids at ligand-binding sites were predicted correctly in the minimum-energy sequences, whereas 84% were predicted correctly in the most-similar sequences, which were selected from the top 20 sequences for each enzyme–ligand complex. Hydrogen bonds at the enzyme–ligand binding interfaces in the 10 complexes were usually recovered with the correct geometries. The binding energies calculated using the combined energy function helped to discriminate the active sequences from a pool of alternative sequences that were Electronic supplementary material The online version of this article (doi:10.1007/s00894-015-2742-x) contains supplementary material, which is available to authorized users. * Yushan Zhu [email protected] 1

Department of Chemical Engineering, Tsinghua University, Beijing 100084, People’s Republic of China

generated by repeatedly solving a series of mixed-integer linear programming problems for sequence selection with increasing integer cuts. Keywords Computational enzyme design . Protein–ligand interaction . Protein design . Energy function . Global optimization

Introduction The computational design of enzymatic function is a central theme in chemical biology, and this approach provides significant opportunities to discover industrial biocatalysts for desired chemical reactions. Currently, enzymes and enzymebased cell biocatalysts play critical roles in the development of green processes for the pharmaceutical industry and the large-scale chemical synthesis industry [1]. The key merit of enzymes compared with chemical catalysts is their exceptional selectivity; however, the main disadvantage of enzyme use is the extreme difficulty involved in redesigning an enzyme’s active site to accommodate unnatural substrates, even when there are only subtle structural differences between the unnatural substrate and the natural primary substrate. Fortunately, recently developed de novo computational enzyme design methods [2, 3] can facilitate the identification of novel catalysts that convert unnatural substrates into target products with the aid of supercomputers and advances in molecular modeling. Pauling proposed that the conceptual requirement for enzyme design is the construction of an active site that stabilizes the transition state selectively relative to the ground state for a target reaction [4]. An active site is composed of two sets of amino acids: one set represents the catalytic residues, which provide functional groups that create geometric constraints on the transition state [5], and the other set represents the binding

191 Page 2 of 14

residues surrounding the transition state. For catalytic residues, several matching algorithms that anchor these residues onto a set of inert scaffolds have been developed, including Dezymer [6], Orbit [7], RosettaMatch [8], OptGraft [9], ScaffoldSelection [10], ProdaMatch [11], AutoMatch [12], and Saber [13]. RosettaMatch has been shown to facilitate the discovery of de novo enzyme catalysts for targeted reactions including the retro-aldol reaction [14], the Kemp elimination reaction [15], the Diels–Alder reaction [16], and the hydrolysis of esters [17]. Orbit has been used to create enzyme-like proteins for the hydrolysis of esters [18] as well as de novo enzymes for the Kemp elimination reaction [19]. The design of binding residues that ensure good packing and thus stability and tight binding with the transition state was the focus of the work reported in this paper. Successful models and algorithms for protein design identify amino acid sequences for a target tertiary structure; these can be used directly in sequence selection during enzyme design by considering the small transition-state molecule as an additional design site [20, 21]. In protein design, only the folding energy is optimized to guarantee a stable structure. However, both the folding energy and the binding energy must be taken into account simultaneously in enzyme design, and the folding energy is always partially sacrificed to enhance the binding affinity with the transition state in order to reduce the activation energy and promote the catalytic efficiency of the target reaction. For enzyme design using Orbit [7], the residues required to accommodate substrate binding are identified by minimizing the total energy of the enzyme–ligand system, using pairwise force fields for protein design [22] and deadend elimination based on combinatorial optimization algorithms [23]. RosettaDesign also uses the total energy of the enzyme–ligand system as the objective function for binding sequence optimization, but the interaction energies between the transition state and protein side chains are upweighted to favor the selection of sequences that create good protein–ligand interactions over sequences that only provide good protein–protein interactions [24]. The force field used in RosettaDesign is dependent on the environment. This restricts the use of deterministic algorithms for sequence selection, so stochastic optimization algorithms such as Monte Carlo are used in RosettaDesign [21, 25]. The environment-dependent force field is very accurate for protein–ligand binding, but the stochastic algorithm can only screen a limited number of rotamer combinations in a reasonable time. Friesner and his colleagues investigated the design of enzymes via the computational prediction of active-site sequences and found that many residues in the active sites of enzymes could be predicted by optimizing the enzyme–ligand binding affinity subject to constraints on the folding energy and the geometries of the catalytic residues [26, 27]. Boas and Harbury showed that a standard molecular mechanics potential energy function can be used to design protein–ligand binding by optimizing the

J Mol Model (2015) 21: 191

dissociation energy (given a limit on protein destabilization), and they claimed that a high-resolution rotamer library and an accurate solvation model are required to successfully redesign binding sites [28]. Although optimizing the protein–ligand binding energy under folding energy constraints can recapitulate many residues in the active sites of native enzymes, such an optimization framework is difficult to extend to the de novo design of enzymes for novel substrates because the constraint on the folding energy cannot be determined accurately before the completion of the design process. Computationally designed ligand-binding proteins always show large discrepancies from the design models, and the computational design of protein–ligand interactions is considered to be an unsolved problem [29]. Recently, Tinberg et al. described a general computational method for designing pre-organized and shape-complementary ligand-binding sites, and used this method to generate protein binders to the steroid digoxigenin [30]. The lack of accuracy achieved in the design of protein– ligand interactions contributes to the low catalytic efficiencies observed for computationally designed enzymes. Statistical data collected by Huang et al. [31] show that buried polar or charged amino acids at the protein–ligand interface always form perfect hydrogen bonds with polar or charged groups of the ligand or with each other. Therefore, an accurate electrostatic model of the interaction between polar and charged groups and their desolvation in the buried state is needed [32]. In the work reported in the present paper, a combined energy function was developed for the computational design of enzyme–ligand binding, where lone-pair directionality in the hydrogen bonding was taken into account and the generalized Born model was modified to be pairwise decomposable, based on the general side chain for protein design. More importantly, we used an in silico benchmark to test this energy function for the computational design of enzyme–ligand binding based on the recapitulation of the sequences and structures of active sites of native enzymes in a set of naturally occurring enzymatic scaffolds with bound ligands.

Materials and methods Given each enzyme–ligand complex, a set of amino acid residues at the binding interface was selected for sequence optimization, with the protein backbone and side chains at nonoptimized positions kept rigid. The design sites were divided into two categories: (1) primary sites that interact with the ligand via hydrogen bonding, salt bridges, van der Waals interactions, or hydrophobic interactions, and for which both the protein identities and their side-chain conformations were varied; (2) secondary sites that interact directly with the primary sites but are far from the ligand, and for which only the side-chain conformations were varied. The ligand was fixed,

J Mol Model (2015) 21: 191

even though it was considered an additional design site during the sequence optimization process, since only one ligand model with the same atomic coordinates as those of the crystal structure was considered. The identity of the amino acid at each primary design site was selected from among all of the natural amino acids, except for prolines; the side-chain conformations of those amino acids were instead selected from a backbone-independent rotamer library containing 11,810 rotamers [33]. Energy function for enzyme design A combined free-energy function was developed for enzyme– ligand interactions within the computational enzyme design protocol, with the reference state referring to the template of the scaffold in solvent and the isolated ligand in solvent. CHARMM 22 force-field [34] parameters and internal coordinate parameters were used in our energy function, in which all hydrogen atoms were explicitly considered. The ligand molecule was constructed from the model compounds for the CHARMM force field, and the partial atomic charges for atoms in groups for which there were no analogous groups in CHARMM were calculated by the high-quality atomic charge generation method AM1-BCC [35]. The free-energy function comprised five contributions: (i) the van der Waals interaction; (ii) the hydrogen-bonding interaction; (iii) the electrostatic interaction; (iv) the hydrophobic contribution; and (v) the entropic contribution. For the van der Waals interaction, two terms (the attractive and repulsive terms) were considered separately as  12  6 ! ri j ri j ri j E attr ¼ ei j −2 ≤ 1:1225 ð1Þ if di j di j di j   di j ri j if > 1:1225; ð2Þ E rep ¼ 10:0−11:225 ri j di j where dij is the distance between two atoms i and j, rij is the sum of the van der Waals radii of the two atoms, and eij is the square root of the product of the well depths of the two atoms. Eattr is the attractive portion of the Lennard–Jones 12–6 potential for the two atoms in a nonbonded interaction. Erep describes the repulsive energy between the two atoms, which reaches an upper bound of 10.0 kcal/mol when the two atoms overlap and ramps linearly down to 0.0 kcal/mol to connect with the 12–6 attractive potential at E=0.0 [21]. This term is less repulsive than the 12–6 potential and compensates well for the use of a rigid backbone and a discrete rotamer library. The atomic van der Waals radii and well depths are taken from the set of CHARMM 22 force-field parameters [34], except that the van der Waals radii are scaled by 0.90 for all atoms. The hydrogen bonds are represented by a 12–10 potential with a distance-dependent term and an angle-dependent term as

Page 3 of 14 191



E HB ¼ E0

R0 5 D

12



R0 −6 D

10 ! F ðθ; ϕÞ;

ð3Þ

where R0 is the equilibrium hydrogen-bond distance, set at 2.8 Å, and E0 is the well depth, set at 8.0 kcal/mol. D is the distance between the acceptor and donor atoms. This hydrogen-bond potential is based on the potential used in DREIDING [36], albeit modified with more accurate angledependent terms to describe the lone-pair directionality explicitly [37]. The angle term is dependent on the hybridization state of the acceptor as sp3 acceptor

F ¼ cos2 θcos2 ðϕ−109:5Þ θ > 90o ; ϕ > 60o

ð4Þ

sp2 acceptor

F ¼ cos2 θcos2 ðϕ−120:0Þ ; θ > 90o ; ϕ > 60o

ð5Þ

where θ is the donor–hydrogen–acceptor angle and ϕ is the hydrogen–acceptor–base angle (the base is the heavy atom attached to the acceptor). The rotamers for the amino acids serine, threonine, and tyrosine were expanded to include more allowed positions of the hydrogen atoms in their side-chain hydroxyl groups. Specifically, five additional rotamers were generated for each rotamer of serine or threonine by rotating its hydroxyl hydrogen atom by 60, 120, 180, 240, and 300°, but the resulting rotamer was eliminated if its hydroxyl hydrogen atom collided with other atoms in this rotamer. One additional rotamer was generated for each rotamer of tyrosine by displacing the side-chain hydroxyl hydrogen atom to the position symmetrical to its original position. Additionally, a damping term was introduced when the hydroxyl oxygen atom acted as the acceptor in a hydrogen bond, since the hydroxyl hydrogen atom could occupy a region that precludes the proper formation of this hydrogen bond. A hydroxyl hydrogen atom was defined as an interfering atom if the angle formed between the hydrogen atom from the donor, the hydrogen atom from the acceptor, and the acceptor itself was greater than 100°, and the potential of a hydrogen bond was multiplied by 0.1 if there was an interfering atom. The electrostatic energy of each rotameric state of an enzyme–ligand system is the sum of the desolvation energy of each side chain i (ΔGidesolv), the screened Coulombic interaction between each side chain i and the backbone (ΔGi/bb SC ), and the screened Coulombic interaction between each pair of side chains i and k (ΔGi/k SC), i.e., E elec ¼ E desolv þ E SC ¼

X i

1

0

XX C BX i=bb i=k ΔGidesolv þ B ΔGscreenedCoul þ ΔGscreenedCoul C A; @ i

i

ð6Þ

k k>i

where ESC represents the sum of the screened Coulombic interactions. To calculate the desolvation energy and the

191 Page 4 of 14

screened Coulombic interaction within the computational enzyme design framework, the generalized Born model specifically parameterized for proteins [38] within the CHARMM all-atom force field was modified to be pairwise decomposable based on the generic side chain proposed by Zhang et al. [39]. As shown in Fig. 1, the desolvation energy of a side chain i is defined as the difference between the electrostatic solvation energy of the side chain i in the context of the folded protein and that of the side chain i in the context of the local backbone. The local backbone is represented by GXG, including atoms CA(i − 1), C(i − 1), O(i − 1), N(i), HN(i), CA(i), HCA(i), CB(i), C(i), O(i), N(i + 1), HN(i + 1), and CA(i + 1). According to thermodynamic cycles and perturbation theory [40], the electrostatic desolvation energy of side chain i can be calculated as   solv ΔGidesolv ¼ ΔGsolv −ΔG i;gs i;GXG  X solv ΔGsolv ð7Þ þ i;k;gs −ΔGi;gs ; k k≠i

where ΔGsolv i,GXG is the electrostatic solvation energy of side chain i in the context of the local backbone represented by GXG, ΔGsolv i,gs is the electrostatic solvation energy of side chain i in the context of backbone and generic side chains solv (gs) at all design sites other than i, and ΔGi,k,gs is the electrostatic solvation energy of side chain i in the context of backbone, k, and generic side chains at all design sites other than i and k. The electrostatic solvation energy for side chain i in any

Fig. 1 Free-energy cycles used to calculate side-chain desolvation energies. In one-body perturbation, the protein solute is defined as the template, there is a true side chain at design site i, and there are pseudospheres at all other design sites. In two-body perturbation, a

J Mol Model (2015) 21: 191

of the above contexts was computed based on the analytical formula for atomic polarization proposed by Dominy and Brooks [38], but the dielectric constant of the protein (εp) was set to 24.0, consistent with the protein environment at the active site [32, 41]. The generic side chain was characterized by two parameters: the radius of the generic side-chain pseudosphere, set at 3.09 Å; and the distance from the CA atom to the center of the pseudosphere, set at 2.1 Å. The free-energy cycle used to calculate side chain to side chain screened Coulombic interactions is shown in Fig. 2, and the derived equation is i=k

ΔGscreenedCoul ¼ 332

  X X qq 1 1 X X qi qk i k −332 − ; εp d ik f GB εp εw i i k k

ð8Þ

where qi is the partial charge on the polar or charged atom i and dik is the distance between the two atoms i is the generalized Born function and q k. ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi f  2 GB   f GB ¼ d ik þ Ri Rk exp −d 2ik =4Ri Rk ; where Ri is the atomic Born radius. The side chain/backbone screened Coulombic interaction was calculated using Eq. 8 in a similar way. The partial charges on the polar or charged atoms of protein side chains or ligands were obtained from the PARSE parameter set for particular functional groups [42]. The atomic Born radii were calculated based on the electrostatic solvation energy of the side chain i in the context of backbone and generic side chains (gs), i.e., ΔG solv i,gs , and the calculated atomic

second true side chain k is added to replace one of the pseudospheres in one-body perturbation. εp and εw are the dielectric constants of the protein and solvent, respectively

J Mol Model (2015) 21: 191

Page 5 of 14 191

Fig. 2 Free-energy cycle used to calculate side chain to side chain screened Coulombic energies. In two-body perturbation, the protein solute is defined as the template, there are two true side chains at design sites i and k, and there are pseudospheres at all other design sites. εp and εw are the dielectric constants of the protein and solvent, respectively

k

k

(Coulombic) let side chain k feel potential made by i

i+1

i

i-1

εp

i+1

i

i-1

εp

i-1

εw

2-body (solv, folded) + (screen)

2-body (solv, folded)

k

(screened Coulombic)

k

let side chain k feel potential made by i i+1

i

Born radii were stored and used directly in Eq. 8 to calculate the screened Coulombic interaction between polar or charged atoms. The continuum solvation model was augmented with a term EHP proportional to surface area that accounted for the hydrophobic effect of the nonpolar atoms. The exposed and buried areas of a side chain i were calculated using the pairwise method developed by Zhang et al. [39] based on the general side chain, as shown in Fig. 1. The corresponding equations are Aexposed ¼ Aexposed þ i i;gs

X

exposed Aexposed i;k;gs −Ai;gs



ð9Þ

k k≠i

  X  exposed Aburied ¼ Aexposed Aexposed −Aexposed þ ; ð10Þ i;gs i i;GXG −Ai;gs i;k;gs k k≠i

where Aexposed i,GXG is the surface area of side chain i in the context of the local backbone, Aexposed is the surface area of side chain i,gs i in the context of backbone and generic side chains at all exposed is the surface area of design sites other than i, and Ai,k,gs side chain i in the context of backbone, k, and generic side chains at all design sites other than i and k. The solventaccessible surface area for side chain i in any of the above contexts was computed using the numerical surface calculation algorithm with a probe radius of 1.4 Å [43]. The exposed and buried areas were parameterized as follows: 26 cal/mol Å2 favoring nonpolar burial, 200 cal/mol Å2 penalizing nonpolar aromatic atom exposure but 150 cal/mol Å 2 penalizing

i-1

εw

i+1

i

nonpolar atom exposure of tyrosine, and 10 cal/mol Å2 penalizing aliphatic exposure but 20 cal/mol Å2 penalizing aliphatic exposure for methionine. The penalty terms were introduced exclusively for negative design. The side-chain entropy contribution term, ES, was included in the energy function to reflect the loss of side-chain flexibility upon folding, and the entropy values for the 20 natural amino acids were calculated by Creamer’s method [44], based on the assumption that a single conformation exists for a side chain in the folded state, and an ensemble of conformations in the unfolded state. The loss of entropy for methionine was increased to 8.0 kcal/mol to account for the freer rotation of the long, linear side chain and to avoid the overselection of methionine in wild-type proteins. The loss of entropy for lysine should have been augmented for the same reason as given for methionine, but this parameter had not been determined. In this work, lysine was only selected at sites for which the native amino acid was lysine. Optimization algorithm for sequence selection The final combined free-energy function was a linear combination of seven terms: (i) Eattr, the van der Waals attractive term; (ii) Erep, the van der Waals repulsive term; (iii) EHB, the hydrogen-bonding term; (iv) E desolv, the electrostatic desolvation term; (v) E SC , the electrostatic screened Coulombic term; (vi) EHP, the hydrophobic term; (vii) ES, the entropic term. The parameters in this free-energy function were refined based on the results of core redesign calculations on the proteins reported in our former work [45], but the

191 Page 6 of 14

J Mol Model (2015) 21: 191

candidate amino-acid types were extended to include all natural types. The optimal rotamer sequences at binding sites for a given enzyme–ligand complex were obtained by running the slightly improved global optimization algorithm developed by Huang et al. [46]. The main steps were as follows: (1) First, the intrinsic interactions were computed, including the seven energy terms for all rotamers at a particular design site along with the template and the general side chains at other sites. Rotamers with intrinsic energies of >20.0 kcal/mol above another rotamer of the same amino acid type or 50.0 kcal/mol above another rotamer of a different amino acid type at the same design site were then eliminated. (2) The energy matrix was calculated on the basis of only four energy terms (the van der Waals attractive term, the van der Waals repulsive term, the hydrogen-bonding term, and the electrostatic screened Coulombic term) for the rotamers left after step (1), and these rotamers   {ij} were ranked using the scoring function e i j ¼ E    

i j þ ∑ min E ðk s Þ þ E i j ; k s at each design site. k≠i

s

Up to 500 rotamers were retained at each design site based on their scores, with two constraints applied: (i) the minimum number and maximum number of rotamers for each amino acid type had to be complied with; (ii) the root-mean-square deviation (RMSD) between two rotamers of the same amino acid type had to be no less than 0.05 multiplied by the number of side-chain heavy atoms of that amino acid. (3) The energy matrix with all energy terms was calculated for the rotamers left after step (2), and Goldstein singles DEE (dead-end elimination) and split singles DEE were performed to remove the rotamers that could not appear in the global minimum-energy conformation sequences. (4) The global minimum-energy conformation sequence was obtained using a combined linear programming/ mixed integer linear programming (LP/MILP) algorithm according to the energy matrix calculated in step (3). Up to 20 sequences were obtained by adding integer cuts to the MILP formulation to ensure that the amino-acid types for two selected sequences differed at one or more design sites [47]. In step (4), the integer cut was achieved via the following equation: p X X

yi j ≤ p−1;

ð11Þ

i¼1 j∈S i j*

where yi j is a binary variable that indicates whether or not rotamer {ij} is selected. p is the number of design sites. S i j*

is the set of rotamers at design site i with the same identity as that of ij*, which is the selected rotamer in the preceding MILP problem. Energy matrix calculations and sequence optimizations were all performed in PRODA, the PROtein Design Algorithmic package written in ANSI C, which was run on a computer cluster with 256 cores.

Results and discussion The combined energy function for sequence selection was applied to native active-site recapitulation in the computational enzyme design framework, which is the most stringent test used to benchmark an energy function. Ten crystal structures for enzyme–transition state analog complexes, enzyme–substrate complexes, or enzyme–inhibitor complexes were taken from the Protein Data Bank according to the mode of catalysis and the structure of the ligand. It should be noted that the target of this research was to check that the developed energy function and optimization algorithm is able to identify native amino-acid sequences given the structures of protein–ligand complexes, so the crystal structures chosen for the benchmark test were not limited to enzyme–transition state analog complexes. The test set is shown in Table 1, and includes members of the hydrolase, lyase, isomerase, and transferase enzyme families. The oxidoreductase family was not considered because our energy model cannot appropriately handle interactions involving the cofactor metal ions required by oxidoreductases as yet. For each test complex, design schemes showing which of the residues were subjected to type and conformation selection, which of the residues were subjected to only conformational variation, and the fixed catalytic residues (if present) are shown in Table S1 of the BElectronic supplementary material^ (ESM). The chosen enzymes catalyze the transformations of primary biological substrates such as nucleotides, peptides, sugars, and antibiotics, and the primary reactions catalyzed by the enzymes shown in the test set are now briefly described. Glycinamide ribonucleotide (GAR) transformylase (1c2t) converts GAR into formyl GAR with folate as the cofactor. Human deoxycytidine kinase (1p60) phosphorylates natural deoxycytidine with the general base Glu53. Thymidine kinase (1vtk) catalyzes the transfer of γphosphate from ATP to the 5′-hydroxyl of deoxythymidine to form deoxythymidine-5′-monophosphate. The major function of influenza virus neuraminidase (1bji) is to catalyze the cleavage of sialic acid from a glycoprotein via the oxocarbonium intermediate. Soybean β-amylase (1byb) is a unique starch-hydrolyzing enzyme that catalyzes the successive liberation of β-maltose from the nonreducing ends of α1,4-linked outer starch chains. Endoglucanase Cel5A (1h2j) is a glycosidase that hydrolyzes the β-1,4-glycosidic linkages in cello-oligosaccharides via oxocarbenium-ion-like transition-

J Mol Model (2015) 21: 191

Page 7 of 14 191

Table 1 Crystal structures and active-site design results for ten enzyme–ligand complexes Scaffold Resolution No. (Å) pred. des. sitesa

No. corr. Mean no. pred. des. corr. pred. sites des. sitesc min.(most)b

RMSD of corr. pred. residues min. (most), Åd

1c2t

2.10

14

12(12)

10.45

0.65(0.65)

1p60

1.60

15

11(13)

10.70

0.37(0.35)

1vtk 1bji

2.75 2.00

17 15

12(13) 10(11)

11.20 10.10

0.73(0.72) 0.44(0.82)

1byb 1h2j

1.90 1.15

16 15

13(14) 12(14)

11.80 12.05

1.67(1.56) 1.46(1.52)

1ikg

1.90

19

16(17)

15.10

0.47(0.47)

1gm9 1qfe

1.80 2.10

15 16

12(13) 11(12)

11.25 10.30

0.55(0.54) 1.90(1.86)

2pwe Total

2.00

19 161

16(16) 125(135)

14.90

0.56(0.56) 0.99(1.00)

a

No. pred. des. sites is the number of design sites for which the identity of the amino acid was varied during sequence selection b

No. corr. pred. des. sites min (most) is the number of design sites in the minimum-energy sequence that were predicted correctly (or, in parentheses, the number of design sites in the most-similar sequence—which is selected from among the top 20 sequences ranked by enzyme–ligand total energy—that were predicted correctly)

c

Mean no. corr. pred. des. sites is the average number of design sites predicted correctly for the top 20 sequences

d RMSD of corr. pred. residues min. (most) is the root-mean-square deviation of the conformations of the correctly predicted residues in the minimum-energy sequence (or the most-similar sequence in parentheses) from the corresponding crystallographic conformations

state stages. Streptomyces R61 DD-peptidase (1ikg) catalyzes the crosslinking step between neighboring peptidoglycan strands in bacterial cell-wall biosynthesis. Penicillin G acylase (1gm9) catalyzes the hydrolysis of natural penicillins to 6aminopenicillanic acid, which is an important pharmaceutical intermediate for a range of semisynthetic penicillins. The type I enzyme of 3-dehydroquinate dehydratase (1qfe) catalyzes a cis-dehydration of 3-dehydroquinate via a covalent imine intermediate. Sucrose isomerase (2pwe) catalyzes the isomerization of sucrose into trehalulose and isomaltulose via the formation of a covalent glycosyl-enzyme intermediate. Recapitulation of active sites of native enzymes The results of the calculations performed to predict the activesite sequences of native enzymes using the combined energy function and the LP/MILP-based optimization algorithm are summarized in Table 1. These results relate to the top 20 active-site sequences ranked by total energy for each enzyme–ligand complex in a test set with 10 scaffolds. The minimum-energy sequences (each of which presents the

global minimum total energy) along with the most-similar sequences (each of which presents the highest sequence identity to the native sequence) among the top 20 sequences for the 10 scaffolds are displayed in Table 2. On average, 78% (125/ 161) of the active sites were correctly predicted in the minimum-energy sequences, and sequence identity to the native active site was observed for at least 67% (e.g., 10 out of 15 for scaffold 1bji) of these. For sequences with high similarity, 84% (135/161) of the active sites were predicted correctly on average. Although the prediction accuracies for the mostsimilar sequences were comparable with those obtained by Chakrabarti et al. [26, 27], our prediction accuracy for the minimum-energy sequences was much higher, and the number of design sites for each scaffold was much larger. For instance, only 4 out of 10 design sites were predicted correctly in the minimum-energy sequence for scaffold 1vtk by Chakrabarti et al. [26, 27], but 12 out of 17 design sites were predicted correctly in the present work. Using our sequence optimization algorithm, the top 20 sequences for each scaffold were generated starting from the minimum-energy sequence by adding integer cuts. Therefore, a higher prediction accuracy for a minimum-energy sequence can greatly improve the quality of the list. Consistent with the high prediction accuracies of the sequences, the geometric accuracies of the side chains with identities that matched those of the native residues were also high. The RMSDs of side chains predicted correctly in the minimum-energy sequences and the most-similar sequences are displayed in Table S2 of the ESM, and the average RMSDs for those sequences are given in Table 1. Specifically, the side-chain RMSDs of 117 out of 135 residues that were correctly predicted in the most-similar sequences were less than 1 Å, even when some of the neighboring residues were selected incorrectly. Table 1 shows that the average RMSDs of all 10 of the most-similar sequences were less than 2 Å, and 7 out of the 10 most-similar sequences had average RMSDs of less than 1 Å. According to the results shown in Table S2 of the ESM, a few bulky residues deviated significantly from their crystal structures in three scaffolds (Trp55 and Arg420 in scaffold 1byb, Tyr66 in scaffold 1h2j, and Arg82 in scaffold 1qfe), and the average RMSDs of the predicted most-similar sequences for 1byb, 1h2j, and 1qfe were

Computational design of enzyme-ligand binding using a combined energy function and deterministic sequence optimization algorithm.

Enzyme amino-acid sequences at ligand-binding interfaces are evolutionarily optimized for reactions, and the natural conformation of an enzyme-ligand ...
1MB Sizes 0 Downloads 7 Views