Proteins: Structure, Function and Bioinformatics DOI 10.1002/prot.24609

Research Article

De novo inference of protein function from coarse-grained dynamics Pratiti Bhadra,1 and Debnath Pal1,2 1 2

Institute Mathematics Initiative

Bioinformatics Centre and Supercomputer Education Research Centre Indian Institute of Science, Bangalore, India, 560012 Tel: +918022932901 Fax: +918023602648

Short title: Inference of protein function from dynamics Correspondence to: Debnath Pal Indian Institute of Science Bangalore 560012 E-mail: [email protected]

This article has been accepted for publication and undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process which may lead to differences between this version and the Version of Record. Please cite this article as an ‘Accepted Article’, doi: 10.1002/prot.24609 © 2014 Wiley Periodicals, Inc. Received: Feb 25, 2014; Revised: Apr 29, 2014; Accepted: May 13, 2014

PROTEINS: Structure, Function, and Bioinformatics

Page 2 of 40

ABSTRACT Inference of molecular function of proteins is the fundamental task in the quest for understanding cellular processes. The task is getting increasingly difficult with thousands of new proteins discovered each day. The difficulty arises primarily due to lack of high-throughput experimental technique for assessing protein molecular function, a lacunae that computational approaches are trying hard to fill. The latter too faces a major bottleneck in absence of clear evidence based on evolutionary information. Here we propose a de novo approach to annotate protein molecular function through structural-dynamics match for a pair of segments from two dissimilar proteins, that may share even 99% true recall. A blind prediction on a novel protein appears consistent with additional evidences retrieved therein. This is the first proof-of-principle of generalized use of structural dynamics for inferring protein molecular function leveraging our custom-made coarse-grained forcefield, useful to all. Keywords: function annotation; forcefield; molecular dynamics; algorithm; autocorrelation vector 2

John Wiley & Sons, Inc.

Page 3 of 40

PROTEINS: Structure, Function, and Bioinformatics

Introduction The knowledge of molecular function of a protein is of fundamental value. However, no high throughput technologies can currently assay protein molecular function, similar to technologies for protein structure determination. Existing automated protein function inference methods exploit individual or multiplexed information ranging from primary to quaternary structure, along with inferred evolutionary relationships, genomic contexts, experimental data for interaction networks, microarrays, and so on.1 However, despite ample evidence linking essential dynamics of a protein to its molecular function,2 there have been no attempts to conceptualize predictive approaches linking the collective-motion of a protein to its putative function. Interestingly, individual studies illustrating such relationship has been depicted as far back in the 90’s,3 especially for catalysis4 and ligand binding.5 However, there have been no attempts to extend these studies for general use, to benefit protein function annotation. Another difficulty of linking protein dynamics to function is the limitation in rapidly accessing the protein conformational population available from microseconds and upwards simulation timescale.6 Experimental methods like Nuclear Magnetic Resonance (NMR) or small angle X-ray scattering are able to provide dynamics related information at long timescale through ensemble average properties, but these lack the requisite spatial and temporal resolution desired from a single molecule study.7,8 Alternative computational approaches such as normal mode analysis (NMA) have been used, but they are inadequate in encapsulating the vast landscape of energies.9 All atom molecular dynamics (AAMD)10 offer realistic options, but limited computing capacity and cost makes microseconds and above time-scale simulations prohibitive.11 A modified approach, Accelerated Molecular Dynamics (aMD) could be leveraged, if faithful reproduction of structural motions were guaranteed despite use of modified

John Wiley & Sons, Inc.

3

PROTEINS: Structure, Function, and Bioinformatics

potentials.12,13 Coarse-grained (CG) models14-17 offer a substitute, especially suited for economical production of long time-scale simulations.18-21 But they are yet to be demonstrated to faithfully reproduce protein motion as in the physiological conditions. For example, CG-based protein simulations are known to show disrupted secondary structure; consequently, a nonradial CG protein potential function has been recently proposed.22 A number of coarse-grained forcefields are currently available23-27 that attempt to balance the aforementioned issues, but are minimally validated against experimental data for physiologically relevant motions. In this paper, we propose a conceptually new approach to infer protein molecular function from its coarse-grained structural dynamics information. A rigorously benchmarked (against information from experimental NMR data) coarse-grained molecular mechanics (CGMM) forcefield is first developed to run microsecond timescale molecular dynamics simulations. The normalized root-mean-square fluctuation (RMSFnorm) graphs calculated from the simulation trajectories are then compared across proteins to identify mobile segments which are matched for identical dynamics through the computed three dimensional autocorrelation vectors (ACV). In 87% of the 60 unique proteins in our test set, we were able to correctly correlate the dynamics of the selected pair of protein segments to the corresponding molecular function of the proteins. In 93.5% cases, where the functions did not match, the dynamics did not pass the matching criteria as well. The function prediction scheme bypasses use of evolutionary information and would therefore find wide use in de novo function prediction for novel sequence and structure families.

4

John Wiley & Sons, Inc.

Page 4 of 40

Page 5 of 40

PROTEINS: Structure, Function, and Bioinformatics

Materials and Methods Datasets The forcefield developed for this work was computed from the probability distribution of bonded/nonbonded distance, angle, torsion values calculated from unique subunits of monomeric and dimeric proteins of the EVA28 database. The database contains more than 2900 unique protein structure of which 1365 are monomers. The developed forcefield was tested on a nonredundant single domain protein monomer data set of 315 proteins determined using Nuclear Magnetic Resonance (NMR) methods (NMR dataset). This dataset was screened from 5617 NMR derived proteins in the Protein Data Bank (PDB)29 of length > 30, with 10 or more models for ensembles in the coordinate file. From these, one representative sequence was chosen by applying 30% sequence redundancy criteria. The selected subset was further pruned by structurally comparing30 the protein pairs and selecting one representative for groups that have Z-score > 2. This ensured that all structures in our test database have unique fold and function, and the sequence identity did not exceed the 30% threshold. Because of the unique fold and function of each protein in this data set, it was also used for random negative test of function from dynamics. We also compiled a test set of 60 protein subunits for our studies (test dataset) for which we knew from literature the exact protein segments responsible for molecular function. Moonlighting proteins and proteins with novel functional motifs31 were included in this set. Coarse-grained Molecular Mechanics (CGMM) Forcefield A Coarse-Grained Molecular Mechanics (CGMM) model for protein was defined on pseudo atoms centered at the Cα atom position of the backbone, each representing an amino acid (Fig.

John Wiley & Sons, Inc.

5

PROTEINS: Structure, Function, and Bioinformatics

Page 6 of 40

S1). The total energy Utotal was expressed as a sum of the bonded and nonbonded terms in the forcefield (FF): Utotal = Ubond + Uangle + Utorsion + Unonbond

(1)

where Ubond, Uangle, Utorsion and Unonbond represent the bond, angle, dihedral, and nonbonded potentials, respectively. All energy terms correspond to the virtual interactions between the pseudo atoms centered at the Cα atom position. The Boltzmann inversion method32 was used to calculate the Potential Mean Force (PMF; U) using the probability distribution (P):

U i = k BT ln( Pi / P0 )

(2)

where Ui is the potential energy corresponding to a property at value i, kB is the Boltzmann constant, Pi is the probability of occurrence of the property at value i, P0 is the reference-state probability, and T is temperature. 298K is used as standard temperature for our parameterization. The computed PMF for the desired property was thereafter arrived at by using regression and imposing suitable boundary conditions. The final functional form of the potential and parameters were chosen on the basis of least root-mean-square-error. Virtual bond: Three type of virtual bond distance (rij) are considered in our forcefield: the backbone trans and cis peptide bonds, and the disulfide bridge linking the cysteine (Cys) residues. The trans peptide virtual bond potential is modeled by combination of harmonic and polynomial functional term (Eq. (3)), but the cis peptide and disulfide bridge virtual bonds are described by harmonic potential function with suitable parameters (Eq. (4)).

(

2

) (

)

bond (rij ) = f ( rij ) *K * rij − | r0 | + 1 − f ( rij ) * ( a 2 *rij2 + a1 *rij + a 0 ) Utrans

6

John Wiley & Sons, Inc.

(3)

Page 7 of 40

PROTEINS: Structure, Function, and Bioinformatics

(

U bond rij − | r0 | cis / disufide ( rij ) = K *

)

2

(4)

In equation (3) and (4), U bond is the virtual bond potential energy (kcal/mole) at rij trans / cis / disufide ( rij ) virtual bond distance (Å) between two consecutive (ith and jth) Cα atom positions. r0 is the reference state where the energy is minimum in the computed Potential Mean Force (PMF); K, a2, a1, a0 are constants. f(rij) is a Boolean function that assumes a value 1 within the boundary condition. Figure S2A represents an example of probability distribution of the trans peptide virtual bond with a narrow distribution centered on 3.78 Å. The corresponding PMF yields a potential function depicted in Fig. S2B, with energy minimum at 3.78 Å. For obtaining close regression fits, the potential function is bounded into three separate regions: rij 3.85Å. Potential energy is described by harmonic function in the populated region 3.6 Å < rij < 3.85 Å and elsewhere by a polynomial function with different parameters. Similar fitting has been performed for cis and disulfide virtual bonds with energy minimum at 2.96 Å and 5.5 Å, respectively. Example constants for trans virtual bond are depicted in Table S1. Virtual angle: The virtual angle is defined by three consecutive Cα atom positions. The calculated virtual bond angles are grouped by the secondary structure of the residues identified by the DSSP program.33 A triplet of secondary structure labels are used for the grouping as angle definition involves three coordinates. Eight secondary structure labels output by DSSP, namely, α-helix (H), 310-helix (G), π-helix (I), β-sheet (E), β-bulge (B), bend (S), turn (T) and unstructured (L), have been used by us. The β-bulge (B) labels were merged under β-sheet (E). To use only non-redundant probability distributions, they were grouped under 86 classes. Kmeans clustering method has been used to group all virtual angles into classes. Pearson correlation measure between two probability distribution graphs of two different triplets was

John Wiley & Sons, Inc.

7

PROTEINS: Structure, Function, and Bioinformatics

Page 8 of 40

used as distance for K-means clustering. Triplets that was present more than 2000 times in the EVA database were chosen as the initial centroids of K-means cluster. We used K-means clustering function from MATLAB for computing the clusters/classes. Probability distribution of each class was used to calculate the corresponding PMF using Boltzmann inversion method. The data for angles were obtained from EVA database, wherein probability (Pi) was defined as the number of virtual angles in radian in the range ii,ii+0.01 radian [ii is 0 to 3.13 radian] divided by the total number of virtual angles present in calculated data from EVA database. Each PMF was fit using Fourier series function, polynomial functional and harmonic potential term (Eq. (5)) using regression method. 4

' 2 Uithangle _ group (θ ) = f(θ )*∑[kn *cos(n ωθ )+kn *sin(nωθ )] + (1− f (θ ))*[g(θ )* Ka (θ −θa ) + (1− g(θ )) n=0

5

4

3

(5)

2

*(a5*θ +a4*θ +a3*θ +a2*θ +a1*θ +a0 )] angle In equation (5) U ith _ group (θ ) is virtual angle potential energy (kcal/mol) at θ (radian) virtual angle

between three consecutive Cα atoms. ω is the multiplicity of Fourier series. Constant parameters are kn, kn’, Ka, a5, a4, a3, a2, a1. θa is a constant base value of angle, which is used for harmonic potential fitting. f(θ),g(θ) are Boolean functions to enforce the boundary conditions. Virtual angle is periodic, so in the regression method we use additional data outside 0 to 3.14 radian for proper fitting. An example probability distribution of triplet HHH is shown in Fig. S2C, with a sharp peak at 1.6 radian. Fourier series is used for best fitting to the computed PMF (Fig. S2D) in the most populated virtual angle range 1 to 2.3 radian; other regions beyond this range fit to polynomial function (θ < 1 radian) and harmonic potential (θ > 2.3 radian). Corresponding Boolean functions f(θ) is set to 1, if 1 radian < θ < 2.3 radian and otherwise it is 0. Similarly, g(θ)

8

John Wiley & Sons, Inc.

Page 9 of 40

PROTEINS: Structure, Function, and Bioinformatics

is set to1, if θ > 2.3 radian, and g(θ) is 0 if θ < 1 radian. Example parameters for HHH potential energy function are given in Table S1. Virtual dihedral: The virtual dihedral is defined by four consecutive Cα atom positions. Secondary structure definitions are used similar to virtual angle potential calculations. The, triplet of virtual angle are replaced by the quartet of secondary structure labels of corresponding residues. To use only non-redundant virtual dihedral probability distributions, they were grouped into 126 classes. The classes were calculated in a similar manner as virtual angles. Computed PMF was sectioned into three parts, and each was fit with Fourier series function (Eq. (6)) with different parameters.

U dihedral (τ ) =

4

∑ [k

n

cos( nωτ ) + k n' * sin( nωτ )]

(6)

n =1

In equation (6) U dihedral (τ ) is virtual dihedral potential energy (kcal/mol) at τ (radian) virtual dihedral between four consecutive Cα atoms. ω is multiplicity of Fourier series function. kn, kn’, are constant parameters. Since dihedral angle is periodic, best fit was performed using additional data outside -3.14 to 3.14 radian using regression method. Example probability distribution of HHHH (non-glycine) virtual dihedral shows a sharp peak at 0.88 radian (Fig. S2E). For Fourier series fitting, the distribution was divided into three parts where one part is around the peak, and other two before and after the peak region (Fig. S2F). Example constant parameters for HHHH potential are given in Table S1. For realistic performance of CGMM forcefield, Gly dihedrals are considered separately as they are least restricted as per the Ramachandran plot.34 Probability distribution of quartets, which included Gly at 2nd or 3rd position, was separately computed from EVA database. On basis of the

John Wiley & Sons, Inc.

9

PROTEINS: Structure, Function, and Bioinformatics

Page 10 of 40

probability distribution, the Gly-containing quartets were grouped into 207 non-redundant classes. Computed PMF of each class obtained from Boltzmann inversion of probability distribution was fit with Fourier series function (Eq. (6)) with appropriate parameters. Nonbonded: The nonbonded interaction between pseudo atoms located at Cα atom position is calculated for pairs separated by at least two virtual bonds. The 20 natural amino acids formed 210 unique pairs for each of which the PMF of nonbonded interaction was obtained from the probability distribution using Boltzmann inversion method. Cut-off for nonbonded interaction was set to 12 Å, and within that range it was divided into several interaction shells, depending on the presence of a discernible potential well. Nonbonded PMF was fit to Morse potential function (Eq. (7)), with distinct parameters for each interaction shell. ε *( nbr 0− nbrij ) 2

U nonbonded (nbrij ) = 6* e( nbr 0/2.8) *((1 − e 1

) + ε2 )

(7)

In the above equation, U nonbonded (nbrij ) is virtual nonbonded potential energy (kcal/mole) at nbrij virtual nonbonded distance (Å) between two (i and j) nonbonded pseudo atoms at Cα atom positions. nbr0 is the reference state where energy is minimum in PMF in that interaction shell. ε1 and ε2 are constant parameters. The interaction shells are necessary to enforce the stabilized secondary structure during simulation. An example probability distribution of alanine-alanine (Ala-Ala) nonbonded interaction is shown in Fig. S3A. It has two peaks at 5.5 Å and 6.2 Å. The PMF for short-range is computed from probability distribution within 7 Å nonbonded distance using Boltzmann inversion method. There are two peaks within short-range PMF; therefore, two functions have been fitted (Fig. S3B). These two functions differ in parameters only, both has same functional form. Computed PMF beyond 7 Å is almost flat, and minimum potential energy

10

John Wiley & Sons, Inc.

Page 11 of 40

PROTEINS: Structure, Function, and Bioinformatics

is considered at 10.7 Å. Example parameters for Ala-Ala nonbonded interaction is given in Table S1. Simulation parameter The potential functions were incorporated into GROMACS35 simulation package 4.5.5 using the tabulated potential method. The simulation medium was chosen as vacuum, and the steepestdescent method was used for energy minimization of the protein. Canonical or NVT ensemble was used, whereby the simulation temperature was kept at 298K using Berendsen temperature coupling. Simulated annealing was used in the equilibration step for 70 pico-second (ps) time interval. 1 µs duration molecular dynamics simulation was performed using 2 femto-second (fs) integration time-step. Leapfrog method was chosen as the integrator algorithm. Coordinate file was saved after each 100ps time in the 1 µs long simulation. The mass of each CG pseudo atom at Cα atom position was set to its corresponding accurate amino acid mass. The custom-built coarse-grained molecular mechanics (CGMM) forcefield returned the expected low mean variation of the energy fluctuation at 62.6(±19) KJ/mol for all proteins in the NMR dataset. Due to the symplectic integration scheme36 used in the simulation, the stability of the equilibrium energy and its reproducibility in a long simulation like ours reflects the suitability of the potential energy functions defined in the forcefield. Selecting “mobile” protein segments for dynamics-function correlation All segments of the protein molecule are not flexible and do not undergo fluctuations that may be suitable for interrogation of function. Therefore, for predictive purposes, interesting segments of the protein primary structure must be delineated, for which we used the Root-Mean-Square

John Wiley & Sons, Inc.

11

PROTEINS: Structure, Function, and Bioinformatics

Page 12 of 40

2

1 T fluctuation (RMSF) criterion: RMSFi = ∑ | ri (t ) − r i | , where r i is the average position of the T t =1 pseudo atom i in T molecular dynamics frames. This was

( RMSFi )norm =

further normalized:

( RMSFi − RMSF ) . Since the RMSFnorm values are real numbers, to aid quick σ ( RMSF )

search of RMSFnorm patterns, they were converted into symbols corresponding to a range of RMSF values as depicted in Fig. S4.

Symbol-based RMSF pattern matches between two

proteins were carried out using Smith-Waterman alignment and identity weight matrix, as implemented in MATLAB (www.mathworks.in/products/matlab/). Pair of matched segments was evaluated using a percentage-based criterion for classification as “mobile” and suitable for assessment of dynamics-function correlation. We counted the occurrence of symbol ‘L’ if >35% in the selected segment, or the combined percentage presence of the symbols ‘G’, ‘H’ and ‘I’ if >14% was deemed as a mobile segment. The symbols L, G, H and I are as defined in Figure S4. 85% of the 60 proteins in the test set satisfy the above condition (Table S2). Selecting protein segments with matching dynamics We define unweighted 3D autocorrelation vectors (ACV) to describe the protein dynamics pattern, based on distance between residues in each frame. The autocorrelation vector represents the 3D conformation in an n-dimensional space. The conformations generated during coarsegrained simulation were converted to vectors using the 3D autocorrelation method. The 3D ACV of a protein is defined using a distance step dx, where the ACV is of dimension n where n=dmax/dx, for dmax representing the distance between two farthest atoms of protein in a given conformation (frame). Each component (i) of the 3D ACV can be represented as:

12

John Wiley & Sons, Inc.

Page 13 of 40

PROTEINS: Structure, Function, and Bioinformatics

3D ACV(i) =

∑P P j ,k

j

(8)

k

where the Pj and Pk are the properties or weights associated with the atoms j and k, separated by a distance (i)dx and (i+1)dx. When the value of P is uniformly taken as 1, it is called unweighted ACV. An ACV can be weighted by some properties defined on atom (e.g., atomic charge, van der Waals radii). For our CG trajectory analysis, unweighted 3D ACVs were computed using pseudo atom at Cα atomic positions, with step size dx = 2 Å. To compare whether the dynamics of two protein segments are similar, we calculated correlation coefficient (CC) value between each ACV pair taken from two different protein segment’s ACV profile obtained from the 1 µs dynamics trajectory (Fig. S5). The computed ACVs from the starting, ending and the 100 ns interval frames of the two proteins were used to calculate the Pearson’s correlation coefficient. The 121 Pearson correlation coefficient (CC) values obtained from 11 pairs of ACVs was used to set a criteria whether two proteins have same dynamics for the selected protein segment pair. If at least 25% of the CC values are >0.95 and at least 50% are >0.9 then the selected segment pairs were deemed to have matching dynamics. As can be seen from Fig S5, all the ACV curves for a molecular trajectory may not match; however a group of them may match for particular time segments in the simulation. Therefore, for matching dynamics, the ACVs at each time need to be matched against all ACVs from the other protein. Wherever possible the segments chosen for calculation were also verified and documented for functional relevance through literature survey. Evaluation of results

John Wiley & Sons, Inc.

13

PROTEINS: Structure, Function, and Bioinformatics

Page 14 of 40

The overall performance of our method was estimated using the true positive rate (TPR) and True negative rate (TNR). The TPR was defined as TPR =

TP , where true positive, TP TP + FP

was defined as cases where both the protein segments being compared had matching dynamics and function, while false positive, FP, was defined as cases where the compared protein segments had matching function, but not matching dynamics. Similarly, TNR was defined as TNR =

TN , where true negative, TN was defined as cases where both the protein segments TN + FN

being compared did not have matching dynamics as well as function, while false negative, FN was defined as cases where the compared segments had matching dynamics but not matching function.

Results Forcefield suitability A suitable CG forcefield is expected to produce RMSF graphs of proteins from simulations that match closely with the ones obtained from the experimental data. For this purpose, we used RMSFnorm values computed from 315 unique-fold NMR ensemble structures and compared to RMSFnorm computed from dynamics trajectories from simulation on the same structures using our custom-built CGMM forcefield. We also cross-compared simulation results (using same parameters) from our forcefield with the widely used latest release of MARTINI2.2 forcefield27 (Fig. S6). Our in house program was used to compute RMSFnorm from the NMR ensemble data along with the utilities of GROMACS software package;35 the latter was applied to coarsegrained molecular dynamics (CG-MD) simulation trajectories from CGMM and MARTINI as well. RMSFnorm graphs, produced by CGMM matched closely with RMSFnorm graphs calculated 14

John Wiley & Sons, Inc.

Page 15 of 40

PROTEINS: Structure, Function, and Bioinformatics

from the NMR ensembles. The average correlation coefficient (CC) obtained for all 315 pairs of matches is 0.74(±0.24). The comparative performances on the same set returned a lower overall correlation of 0.54(±0.18) for MARTINI2.2 forcefield.26 We also compared the initial release of MARTINI forcefield (version 2.1), for which overall correlation was much lower at 0.34(±0.24) for the same dataset. A histogram of CC value difference of each protein between CGMM and MARTINI2.2 forcefields (CCCGMM/NMR - CCMARTINI2.2/NMR) can clearly discern the improved performance of CGMM (Fig. S7); the mean difference of the CC values between CGMM/NMR and MARTINI2.2/NMR is 0.2 (±0.2). 84.5% of proteins have better CC value with CGMM/NMR in comparison to MARTINI2.2/NMR (CCCGMM/NMR - CCMARTINI2.2/NMR >0). Around 27.7% and 1.7% of proteins of our dataset gave CC value for CGMM/NMR greater than 0.9 and less than 0, respectively. In case of MARTINI2.2/NMR, the corresponding percentages are 6.3% and 1.3%. The length of protein does not affect the performance of CGMM force field. The improved CGMM forcefield performance indicates its comparative suitability in computing the dynamics-function match. RMSFnorm graphs correlate to ligand binding affinity The RMSF graphs capture the mobility of the protein segments during the dynamics simulation as per the forcefield used. While every protein has a characteristic RMSF, it is pertinent to ask if the RMSF curve exhibits similar characteristics for related proteins that are interpretable for linking to a molecular function. This is the basic premise for use of RMSF in de novo inference of function from dynamics. To address this specific query, we took a diverse set of orthologous Acyl_CoA binding proteins (ACBPs), with available experimental binding affinity data (Table S3). The proteins arranged in the descending order of binding affinity occur from both prokaryotic and eukaryotic families. Importantly, although they diverge between 21.3% to 81.6%

John Wiley & Sons, Inc.

15

PROTEINS: Structure, Function, and Bioinformatics

Page 16 of 40

sequence identity (Table S4), they share a common fold, and a loop between first two α-helices (H1 and H2; residues 14-22) interacts with the ligand. The Acyl_CoAs, which has variable size based on side chain length, viz short (≤C14), medium and long (≥C20), show varied affinity for the same binding site; for example, M. perniciosa ACBP Kd values for myristoyl_CoA (C14) and arachidoyl-CoA (C20) are 57 nM and 0.65 nM, respectively, indicating tighter binding to longer chain ligand for the same receptor site. As can be seen from the RMSFnorm curves plotted for all the structures (Fig. 1), the curves are similar to each other despite difference in the primary structure. Interestingly, the Acyl_CoA binding site (marked as a box, Fig. 1) shows clustering of RMSFnorm curves for Bovine, Rat, Yeast and M. perniciosa that bind medium to long chain Acyl_CoA moieties. This is in contrast to P. falciparum and T. brucei, that are grouped separately along with C. elegans (no experimental information). Based on the grouping, it can be argued that medium-long chain binding ACBPs have lower RMSFnorm at the binding site suggesting less flexibility, which correlates with higher binding affinity for medium-long chain Acyl_CoA moieties. The short chain binding ACBPs have higher RMSFnorm that correlate with lower binding affinity for medium-long chain Acyl_CoA moieties. The observations suggest that any receptor site cradled by high flexibility protein segments will disfavor binding of larger ligands as the likelihood of making requisite contacts necessary for stable interaction decreases proportionately. Smaller ligands have higher likelihood of making necessary contacts with higher flexibility binding site. It can be inferred that C. elegans ACBP have high putative binding affinity towards short-chained Acyl_CoA. Overall the results highlight the ability of RMSFnorm data to sensitively classify ligand-binding affinity, and appear suitable for correlating dynamics with molecular activity (also see Table I, first entry). Matching dynamics-signature and function 16

John Wiley & Sons, Inc.

Page 17 of 40

PROTEINS: Structure, Function, and Bioinformatics

The suitability of RMSFnorm for correlating protein segments depicting matching dynamics makes it pertinent to further explore its generic use to examine correlations on a larger set of proteins. For analysis, we arbitrarily chose 60 protein subunits arranged into 31 pairs with literature evidenced identical function, some of them moonlighting proteins (Table I). We created another 31 pairs whose functions do not match (Table S5). Table I has 10%, 16%, 19% sequence pairs with >70%, 40-70%, 30-40% identity, respectively; the remaining 45% of the sequence pairs have ≤17% identity. The examples were chosen to broadly maintain a balanced distribution of sequence identity pairs greater than and less than 30% identity so as to not concur / concur with evolutionary information, that benefits function annotation. In Table S5, except four cases with 100% self-identical moonlighting proteins, all other sequence pairs have 0.95

>0.90

Match (%)

1ST7:A

13-23

76.03

97.52

42/87 (48)

1HBK:A

13-23

28.18

82.64

25/92 (27)

2NRM:A

58-64

59.50

79.34

64/156 (41)

2D8R:A

15-28

26.45

94.21

16/115 (14)

1QXF:A

25-38

87.60

100

13/80 (16)

13-23

Heme distal ligand binding

103M:A

62-68

Zinc finger region

1VD4:A

142-156

Protein disulfide oxidoreductase

Protein 2

1B4Q:A

20-27

1EGR:A

9-16

55.37

95.87

26/110 (24)

Potassium channel inhibitor

1CMR:A

20-28

2LO7:A

18-26

28.93

59.50

3/34 (9)

α-amylase inhibitor activity

2KER:A

13-21

4AIT:A

15-23

61.16

99.17

24/81 (30)

Heparin binding

2KMS:A

740-746

4AHE:A

62-68

0.83

33.06

22/164 (13)

Copper chaperone activity

1Z2G:A

21-26

2GGP:A

13-20

37.19

62.81

4/127 (3)

GTP binding

2FN4:A

36-44

4EPV:A

10-17

35.54

83.47

98/181 (54)

activity

John Wiley & Sons, Inc.

PROTEINS: Structure, Function, and Bioinformatics

Page 36 of 40

DNA binding

2OVG:A

16-35

2L0K:A

21-40

99.17

100

17/99 (17)

RNA binding(TAR), Protein

1K5K:A

49-57

1JFW:A

49-57

56.20

100

67/87 (77)

2KMO:A

6-11

2LEO:A

45-50

47.11

79.34

22/70 (31)

1MVZ:A

14-19

2KMO:A

6-11

25.00

60.33

9/80 (11)

Controlling nuclear cell division cycles

1BU2:A

31-62

2EUF:A

45-76

100

Cytokines

1TGJ:A

33-48

1ES7:A

21-36

95.8

100

44/124 (36)

Elongation cycle of protein

1XB2:B

66-81

2CP9:A

16-31

100

100

50/298 (17)

Antioxidant

2TIR:A

24-42

2HSY:A

23-41

100

100

34/109 (31)

Ligand binding

1T8K:A[C]

19-26

2KOS:A

32-41

91

Channel regulator activity

1OAW:A[C]

30-38

1AXH:A

27-35

49.59

80.99

9/59(15)

Ribonuclease inhibitor Activity

1B2U:D[C]

39-47

1NXJ:A

38-46

33.88

85.95

22/214 (10)

Oxidoreductase activity

2BBK:L[C]

48-56

2AH1:G

100-108

33.0

80.1

54/140 (39)

transduction Serine type endopeptidase inhibitor activity Serine type endopeptidase inhibitor activity 100

229/254 (90)

biosynthesis

John Wiley & Sons, Inc.

100

35/81 (43)

Page 37 of 40

PROTEINS: Structure, Function, and Bioinformatics

1UOU:A[D]

112-127

1AZY:A

113-128

94.2

100

180/488 (37)

Platelet derived endothelial cell growth 1UOU:A[D]

116-202

2J0F:A*

116-202

100

100

466/482 (97)

182-189

1A9U:A

177-184

100

100

172/396 (43)

Thymidin phosphorylase

factor (ECGF1) [substrate binding site] MAP kinase (Activation Loop Region) 1WZY:A[D] MAP kinase (ATP binding site)

1WZY:A[D]

31-39

1A9U:A

30-38

96.6

80.9

172/396 (43)

Repressor (Transcription factor

1WZY:A[D]

252-270

2OVG:A

16-35

0

0

14/369 (4)

Thrombin protease

1SHH:B[D]

174-185

6LPR:A

175-186

10.7

71.9

42/330 (13)

G -protein coupled receptor

1SHH:B[D]

190-219

4AMI:A

95-111

0

1.6

25/484 (5)

Biotin(acetyl-CoA-carboxylase)

1BIB:A[D]

87-93

2EJ9:A

9-15

95.87

82.64

73/329 (22)

1BIB:A[D]

22-41

2L0K:A

21-40

100

100

20/345 (5)

binding)

synthetase Repressor (DNA binding)

[A] Descriptions obtained from UNIPROT database. [B] Italic - “true positive” means both segments with similar dynamics and function match, underline - “false positive” means function matches, but the segment do not have similar dynamics. [C] Examples chosen from Manikandan et al 31

. [D] Moonlighting protein.

John Wiley & Sons, Inc.

PROTEINS: Structure, Function, and Bioinformatics

Figure 1 RMSFnorm curves from ACBPs of different organisms. RMSFnorm of ACBPs, which has high binding affinity towards short and long chained Acyl_CoA molecules, is represented by gray and black line, respectively. Black box around residue number 20 denotes the binding region of ACBPs. 71x29mm (600 x 600 DPI)

John Wiley & Sons, Inc.

Page 38 of 40

Page 39 of 40

PROTEINS: Structure, Function, and Bioinformatics

Figure 2 Dynamics-function matching protein segment pairs with different secondary structures depicted by best matching ACV pairs. (A). Functionally important superposed region of 1CMR (gray, secondary structure: EEEELEEEE) and 2LO7 (black, secondary structure: HHHHHHHHH), both are potassium channel inhibitor. (B) Functionally important superposed region of 1T8K (gray, secondary structure: GGGGTTTT) and 2KOS (black, secondary structure: HHHHHLLGGG), both selectively binding the vitamin pantetheine 4'-(dihydrogen phosphate). H=α-helix, G=310-helix, T=hydrogen bonded turn, L=loop/coil region. 72x28mm (600 x 600 DPI)

John Wiley & Sons, Inc.

PROTEINS: Structure, Function, and Bioinformatics

Figure 3 Hypothetical protein XcR50 (PDB: 1TTZ; deposited in the year 2004) is in gray, and Simian foamy virus protease reverse transcriptase (PDB: 2JYS) in black. The inset with gray shading in the left panel highlights the active site region of both the proteins. The right panel shows the putative catalytic residues Asp9 from 1TTZ aligned with Asp24 from 2JYS. The alignment was done using rotation matrix from the best matching ACV pairs. The Asp35, a potential base in the catalytic reaction is present within 4.0 Å of Asp9 of 1TTZ satisfying the requirement of proton abstraction to facilitate water mediated aspartic-type endopeptidase activity. 81x41mm (600 x 600 DPI)

John Wiley & Sons, Inc.

Page 40 of 40

De novo inference of protein function from coarse-grained dynamics.

Inference of molecular function of proteins is the fundamental task in the quest for understanding cellular processes. The task is getting increasingl...
1MB Sizes 2 Downloads 3 Views