FULL PAPER

WWW.C-CHEM.ORG

Bayesian Inference of Conformational State Populations from Computational Models and Sparse Experimental Observables Vincent A. Voelz* and Guangfeng Zhou We present a Bayesian inference approach to estimating conformational state populations from a combination of molecular modeling and sparse experimental data. Unlike alternative approaches, our method is designed for use with small molecules and emphasizes high-resolution structural models, using inferential structure determination with reference potentials, and Markov Chain Monte Carlo to sample the posterior distribution of conformational states. As an application of the method, we determine solution-state conformational populations of the 14-membered macrocycle cineromycin B, using a combination of previously published sparse Nuclear Magnetic

Resonance (NMR) observables and replica-exchange molecular dynamic/Quantum Mechanical (QM)-refined conformational ensembles. Our results agree better with experimental data compared to previous modeling efforts. Bayes factors are calculated to quantify the consistency of computational modeling with experiment, and the relative importance of reference C 2014 Wiley Periodipotentials and other model parameters. V cals, Inc.

Introduction

probabilistic model of conformational states that can be sampled using Monte Carlo methods, much like the inferential structure determination (ISD) approach of Habeck and coworkers.[15] and similar approaches implemented for flexible proteins.[16] Unlike ISD, our method uses proper reference potentials, and can be used to reweight a representative set of conformations whose populations have been estimated previously. We expect this method to be very useful for studying the solution-state conformational heterogeneity of molecules like natural products and peptidomimetics, using a combination of high-resolution structural modeling and sparse restraints from experimental observables. As an example, we perform highresolution replica-exchange molecular dynamic (REMD)/QM studies and subsequent posterior sampling of the conformational populations of the 14-membered macrolide cineromycin B, for which sparse NMR data have been previously published. Our results suggest a more coherent picture of the conformational properties of cineromycin B in solution, with the most populated conformations closely agreeing with experimental NMR observables, as well as the two crystal isoforms of analog albocycline (O-methyl cineromycin B).

For drug-like molecules such as natural products and peptidomimetics,[1,2] conformational preorganization in solution is often directly related to activity.[2–7] To understand the mechanism of action of such molecules, as well as rationally design their conformational properties, it is important to develop combined experimental and computational methods to accurately estimate conformational populations in solution. Many excellent methods have been developed to combine modeling efforts with ensemble-averaged NMR observables to predict populations of conformations in heterogeneous ensembles.[8–12] In methods such as DISCON[11] and NAMFIS,[8] a Monte Carlo search of relevant structures is performed and ranked using a molecular mechanics energy potential. The structures are clustered (in the case of DISCON, by the NMR observables themselves) and the distribution of populations most compatible with the experimental data is determined. In recent work by Blundell et al., multiple NMR data sets nuclear Overhauser effects (NOEs, J-coupling and residual dipolar couplings) were used to determine the distribution of conformational states.[2] Dihedral macrostates were enumerated geometrically, and the angular spread and population of each macrostate is adjusted to best reproduce the experimental data. When experimental observations are sparse, the burden rests much more heavily on accurate computational modeling. Toward that end, Aliev et al. have used a combined NMR/MD/ QM approach to extract a significant amount of accurate conformational information for many small molecules.[12–14] Here, we present a method that similarly uses a combined MD/ QM approach, based on a Bayesian framework for combining computational and experimental information, to determine conformational state populations in solution. Our main result is a posterior

DOI: 10.1002/jcc.23738

Methodology A Bayesian posterior model of conformational populations The question of how to combine experimental evidence with computational models to determine molecular structures has V. A. Voelz, G. Zhou Department of Chemistry, Temple University, Philadelphia, Pennsylvania E-mail: [email protected] or [email protected] Contract grant sponsor: National Science Foundation; Contract grant number: CNS-09-58854 C 2014 Wiley Periodicals, Inc. V

Journal of Computational Chemistry 2014, 35, 2215–2224

2215

FULL PAPER

WWW.C-CHEM.ORG

been elegantly addressed by Habeck et al., using a Bayesian inference approach referred to as ISD.[17,18] Consider a conformational state space X, consisting of states Xi where i51; :::; Ni . The experimental data D consists of a set of ensembleaveraged experimental observables rjexp , where j51; :::; Nj . For each conformational state Xi, we calculate predicted observables rj ðXi Þ that can be compared to the experimental values. By Bayes Theorem, we can express the probability of states in X given the data D as PðXjDÞ / PðDjXÞPðXÞ

(1)

where PðDjXÞ is a likelihood model, and P(X) is a prior distribution. For our likelihood model, we assume normally-distributed errors in the experimental measurements, parameterized by a standard deviation r Y 1 PðDjX; rÞ5 pffiffiffiffiffiffiffiffiffiffiexp ð2½rj ðXÞ2rjexp 2 =2r2 Þ 2pr2 j

(2)

Since r is undetermined, we treat it as a nuisance parameter, including it in the joint posterior of possible X and r: PðX; rjDÞ / PðDjX; rÞPðX; rÞ:

(3)

where the posterior PðXjDÞ is recovered by marginalization, ð PðXjDÞ5 PðX; rjDÞdr:

(4)

Since X and r are statistically independent, we let PðX; rÞ5PðXÞPðrÞ, where P(X) is a conformational prior probability, and PðrÞ51=r is a noninformative (Jeffreys) prior. Thus, the negative logarithm of the posterior probability is 2ln PðX; rjDÞ5ðNj 11Þln r1v2 ðXÞ=2r2 2ln PðXÞ1ðNj =2Þln 2p (5) P where v2 ðXÞ5 j ðrj ðXÞ2rjexp Þ2 is the sum of squared errors. It is useful to think of the 2ln PðXÞ term as an energy potential Ephys ðXÞ arising from the physics-based computational model, and the remaining terms as an energy potential Edata ðXjrÞ arising from the experimental data. Posterior sampling (described further below) can be used to determine the optimal values of r to balance these two terms.

2216

set of observables to be compared with experimental values D5ðr1exp ; r2exp ; :::Þ, such that  PðXjDÞ /

 PðrðXÞjDÞ PðXÞ Pref ðrðXÞÞ

(6)

The bracketed term is a potential of mean force (PMF) along the reaction coordinates defined by rðXÞ. The reference potential is needed to properly normalize the observed PMF. For example, consider a molecule where the distance between two atoms is measured to be in close proximity at distance r0. If the atoms in question were only two or three bonds apart, Pref will be large near r0 and the measurement does not confer much information about PðXjDÞ. In contrast, for atoms separated by many flexible bonds, Pref will be small near r0 and the measurement would confer much more information. To get an expression for PðXjDÞ that incorporates a reference potential, we first note that our error model only depends on the observables rðXÞ, so for PðDjX; rÞ we can write PðDjrðXÞ; rÞ. Starting with eq. (3), we apply Bayes’ Theorem once more to obtain PðXjDÞ /

PðrðXÞjD; rÞPðD; rÞ PðrðXÞjD; rÞ PðXÞ5 PðXÞ PðrðXÞ; rÞ PðrðXÞÞ

(7)

where we have used PðrðXÞ; rÞ5PðrðXÞÞPðrÞ and PðD; rÞ5 PðDÞPðrÞ, setting P(D) to unity. We can still represent PðrðXÞjD; rÞ using the same error model in eq. (2), but with a slightly different interpretation: PðrðXÞjD; rÞ is the probability of conformation X given the experimental measurements D and uncertainty r. The ratio PðrðXÞjD; rÞ=PðrðXÞÞ is the PMF along the coordinates rðXÞ, and the prior PðrðXÞÞ is the reference distribution. In practice, this prior can be very difficult to estimate, as it depends on the context of the problem, and would in theory account for correlation between observables rj. A conservative approach is prudent when choosing a suitable reference potential, so as to avoid any unnecessary bias. In this work we present below, we make the conservative assumption that the atomic distances rj are uncorrelated and obey a exponential distribution constrained by their average value, and that J-coupling constants obey a uniform reference potential (more details in Results). These are simple functional forms, but a more complex reference prior could be justified if we had more information.

Reference potentials

Bayes factors

As Hamelryck et al. has pointed out, there is a problem with the inference approach presented above: it assumes the information from the computational model P(X) and the information from experiment PðDjXÞ are independent.[19,20] This is obviously not true, because if we possessed a “perfect” potential Ephys ðXÞ, it should provide complete information about all experimental observables. To remedy this problem, Hamelryck and others advocate the use of a reference potential Pref ðrðXÞÞ, where r5ðr1 ; r2 ; :::Þ is the

Bayes factors are quantities we can use to compare and select optimal models.[21] Suppose we wish to compare two different probability models for P(X, D), call them models PðX; DjM1 Þ and PðX; DjM2 Þ, and select the model that is more compatible with experimental measurements. (Or similarly, we may be interested in selecting the best of two different experimental restraints that is compatible with computational models.) This problem of model selection can be accomplished by calculating the Bayes factor, KBF, defined as

Journal of Computational Chemistry 2014, 35, 2215–2224

WWW.CHEMISTRYVIEWS.COM

FULL PAPER

WWW.C-CHEM.ORG

cineromycin B. We first describe how to estimate the prior distribution of conformational states P(X) using a combined replica-exchange MD/QM approach. We then describe how to construct of a model PðDjXÞ from sparse experimental NMR data. Finally, we describe how to sample the combined posterior distribution using MCMC techniques. The results of our posterior sampling suggest improved estimates of conformational state populations compared to previous modeling efforts. Determination of conformational populations in a test system: cineromycin B

Figure 1. The molecular structure of cineromycin B (O-desmethyl albocycline).

ð KBF 5

PðDjM1 Þ 5ð PðDjM2 Þ

Pðr; XjM1 ÞPðDjr; X; M1 Þ (8) Pðr; XjM2 ÞPðDjr; X; M2 ÞÞ

The Bayes factor represents the strength of evidence in support of model M1 over M2, and is analogous to the likelihood ratio test used in classical hypothesis testing. By eq. (3), each integral in eq. (8) is simply the expectation value of the posterior probability Pðr; XjDÞ for a given model over all values of r; X. Thus, if we consider joint energy functions E1 52ln Pðr; XjD; M1 Þ and E2 52ln Pðr; XjD; M2 Þ, the Bayes Factor is given by ln KBF 5ln

hexp ð2E1 Þi1 : hexp ð2E2 Þi2

(9)

This is equivalent to a free energy difference, which can be estimated using many different methods.[22–24] In the results we present below, we use Markov Chain Monte Carlo (MCMC) to sample from the posterior distributions, and use the Multistate Bennett Acceptance Ratio (MBAR) method of Shirts and Chodera[25] to estimate gk 52ln hexp ð2Ek Þi for any number of models Mk, k51; ::; K. The MBAR estimate g^k is obtained by solving the self-consistent set of equations

g^k 52ln

  exp 2Ei ðxjn Þ XK   N exp g^k 2Ek ðxjn Þ n51 k51 k

Nj K X X j51

(10)

where xkn is the nth sample of ðX; rÞ from model Mk.

Results In the sections below, we elaborate upon methods to sample the posterior distribution of conformational states, by way of an example; namely, the 14-membered macrolide antibiotic

Cineromycin B (Fig. 1) is a 14-membered macrolide antibiotic that has gained attention for having activity against methicillin-resistant Staphylococcus aureus (MRSA). To better understand the solution structure of cinermycin B, Terekhova et al. conducted NMR spectroscopy experiments to measure NOEs and J-coupling constants, and performed structural modeling using the MM3 molecular mechanics potential.[26] The lowest-energy structure of cineromycin B was found to have a different conformation than the one suggested by the experimental data. This, and the observation of several mutually incompatible NOEs, suggested to the authors that the solution stucture of cineromycin B consists of a heterogeneous mixture of conformations made possible by rotations about single bonds in the macrocycle. Given the successful examples of rerefinement studies using more accurate computational models along with originally published NMR data,[27] we wanted to see if more accurate computational modeling might similarly alter previous conclusions about the solution-state conformational ensemble of cineromycin B. Estimation of the prior P(X) from molecular modeling Estimates of the prior P(X) for any molecule require adequate conformational sampling to discover all the relevant conformational states, and accurate estimates of the conformational free energy of each state. To accomplish this, we first used REMD to achieve thorough conformational sampling. Then, because Generalized Amber force field (GAFF)1Generalized Born/Surface Area (GBSA) is not sufficiently accurate for highresolution prediction of conformational populations, we further refined conformational minima using quantum mechanical calculations to get better estimates of conformational free energies. We have shown previously that this approach works well for high-resolution structure prediction of a peptoid nonamer.[28] REMD simulation Simulations were performed using MPI-enabled GROMACS 4.5.4 on a high-performance computing cluster.[29] Twelve temperature replicas were simulated, ranging from 300 to 450 K, with swaps between neighboring temperatures attempted every 10 ps, for 3.6 ls 3 12 replicas 5 43.2 ls of total trajectory data. The GAFF was used[30] along with a popular GBSA Journal of Computational Chemistry 2014, 35, 2215–2224

2217

FULL PAPER

WWW.C-CHEM.ORG

Table 1. Comparison of basis set and theory for selected conformational minima. Theory HF/6-31G(d) HF/6-311G(d,p) HF/6-3111G(2d,p) B3LYP/6-3111G(2d,p) HF/6-31G(d) 1 PCM HF/6-311G(d,p) 1 PCM HF/6-3111G(2d,p) 1 PCM B3LYP/6-3111G(2d,p) 1 PCM

State 38

State 45

State 80

State 92

State 79

0 0.318 0.139 0.023 0.267 0.778 0.574 0.405

2.280 0.810 2.177 1.855 1.384 1.760 1.595 1.340

0.366 0 0 0 0 0 0 0

0.670 0.715 0.660 0.855 0.757 1.069 1.006 1.054

6.442 6.782 5.663 5.354 6.581 7.008 6.785 5.488

All energies are given in kcal mol21.

solvation model,[31] and partial charges were provided by the AM1-BCC algorithm.[32] The MSMBuilder2 package was used to perform dihedral angle-based conformational clustering into Ni 5 100 unique conformational states, using a hybrid k-medoid algorithm.[33] This number of conformational states is greater than the estimated number of possible macrolide backbone rotamers, and sufficiently large as to represent the distribution of conformational minima.

ton distances, and rJ for the coupling constants, using noninformative priors Pðrd Þ  1=rd and PðrJ Þ  1=rJ , respectively. J coupling constants Model predictions of coupling constants JðhÞ from dihedral angles h were obtained from Karplus relations[35] chosen depending on the relevant stereochemistry (Table 2). In cases where the measured coupling constant can be assumed to result from motional averaging (e.g., a methyl group), the sum of squared deviations is computed as

QM optimization The cluster generators of all 100 states were then used as starting points for QM optimization at the B3LYP/6-3111G(2d,p)// HF/6-31G(d) level of theory using Gaussian09.[34] The polarizable continuum model (PCM) was additionally used to account for solvation effects To determine that this level of theory was sufficiently accurate, we explored both Hartree–Fock and density functional theories (B3LYP) theories using several basis sets, and with and without the use PCM (Table 1). Conformational state energies were found to be very similar with and without the use of PCM (Supporting Information Figure S1). The B3LYP energy EðXi Þ of each conformation Xi was used to estimate PðXi Þ according to the Boltzmann distribution PðXi Þ  exp ð2EðXi Þ=kB TÞ where kB is Boltzmann’s constant and T 5 300 K. This estimate is valid in the case that the conformational entropies of each state Xi are uniform. This is a reasonable assumption as clustering produces states with roughly equal conformational volumes. Operationally, the quantity used in our sampling algorithm is the (reduced) free energy, fi: fi 52ln PðXi Þ5EðXi Þ=kB T1ln Z

(11)

P where Z5 i exp ð2EðXi Þ=kB TÞ is required for normalization. Constructing PðDjXÞ from sparse experimental NMR data Our posterior sampling strategy is slightly modified from the description above, to take into account some additional considerations. Two types of NMR observables were published for cineromycin B: NOE effects between hydrogen pairs, as well as a set of vicinal J-coupling constants. Thus, we use two error models with separate nuisance parameters, rd for the interpro2218

Journal of Computational Chemistry 2014, 35, 2215–2224

X v2J ðXÞ5 wj ðJj ðXÞ2Jjexp Þ2

(12)

j

where wj is a weight parameter (for example, wj 51=3 for a hydrogens in a methyl group), where the effective number of ðJÞ P coupling constants is Nj 5 j wj . NOE distances Terekhova et al. published the observation of interproton NOE effects for H(3)/H(6), H(3)/H(9), H(6)/H(9), H(5)/H(7), H(6)/H(7), H(7)/H(9), H(9)/H(12), H(9)/H(11)0 , Me(12)/H(10)0 , Me(4)/H(6), Me(8)/H(7), Me(12)/H(13), and Me(12)/Me(13).† Since these report only the presence of interproton coupling observed in gNOESY experiments, there is an additional uncertainty in the experimental interproton distances. The NOE intensity I is usually translated into a interatomic distance d via the relationship I5cd26

(13)

where c is a scaling parameter. Since the exact distance is unknown, we set all experimental values of NOE distances to ˚ , and introduce c0 5c21=6 as an additional nuisance r exp 52:5 A parameter with noninformative prior Pðc0 Þ51=c0. For equivalent hydrogens, the squared deviations from the experimental distance were averaged uniformly using a weight parameter wj (for example, wj 51=3 for a hydrogens averaged in a methyl ðdÞ P group), where the effective number of distances is Nj 5 j wj .

† Note that this numbering scheme refers to Figure 1, and is different from that of Terekhova et al.

WWW.CHEMISTRYVIEWS.COM

FULL PAPER

WWW.C-CHEM.ORG

Table 2. Vicinal 3 JHH values for cineromycin B published by Terekhova et al.[26] used in posterior sampling (Note that the numbering scheme refers to Figure 1, and is different from that of Terekhova et al.). 3

JHH

3

J6;7 3 J11;12 3 JMe;Hð12Þ 3 J12;13 3 JMe;Hð13Þ The

Karplus

relations[35]

Value

Relation

5 < 2 (set to 1.5) 6 < 2 (set to 1.5) 7

a b b c c

used



a

in

the

posterior



sampling

are:









JðhÞ56:6cos 2 h12:6sin 2 hð0 < h < 90 Þ; 11:6cos 2 h12:6sin 2 hð90 < h < 180 Þ. 

b

JðhÞ514cos 2 h for h in 902180 ; 10cos 2 hð0 < h < 90 Þ.     c 2 2 JðhÞ511cos hð90 < h < 180 Þ; 8cos hð0 < h < 90 Þ.

These considerations result in the use of a modified sum of squared errors X v2d ðXÞ5 wj ðrj ðXÞ2c0 rjexp Þ2

(14)

j

It has been reported that modeling distances using a lognormal distribution helps improve accuracy.[36] To test this in our system, we performed additional posterior sampling using a modified sum of squared errors X v2d ðXÞ5 wj ðln ðrj ðXÞ=c0 rjexp ÞÞ2 :

(15)

j

constructed PðDjXÞ with same assumption, that each experimental observable is statistically independent. The particular choice of a joint exponential distribution is to avoid introducing unnecessary bias into the reference prior. It assumes that the distances are uncorrelated, using a maximum entropy distribution for each distance constrained only the average distance. For the J-coupling constant distributions, we assume uniform reference priors, a choice that reflects our ignorance about the true distribution of these values (other than the fact that they are bounded). We discuss alternative choices in the Discussion section. In our results below, we compare the performance of posterior sampling without the use of reference priors [as in eq. (3)] and with the use of reference priors [as in eq. (7)]. Markov Chain Monte Carlo sampling of the posterior The Metropolis-Hastings algorithm was used for MCMC sampling of the full posterior distribution over variables k5ðXi ; rd ; rJ ; c0 Þ, where i is an index running over all conformations. Iteratively, candidate moves k ! k0 are proposed, and either accepted or rejected by the Metropolis criterion, which accepts the move with probability min ½1; Pðk0 Þ=PðkÞ, where ðdÞ

ðdÞ

2ln PðXÞ5ðNj 11Þln rd 1v2d ðXÞ=2r2d 1ðNj =2Þln 2p ðJÞ

ðJÞ

1ðNj 11Þln rJ 1v2J ðXÞ=2r2J 1ðNj =2Þln 2p X ðdÞ 1 ðln bj 2rj =bj Þ1fi

(18)

j

Reference priors Since the interproton distances in cineromycin B are restrained to macrocycle conformations, we consider a reference distribution compiled from all 100 QM-optimized conformations configurations, regardless of their energies. As a conservative choice to model the reference prior of interproton distances rj, we use a joint exponential distribution Y Pref ðrðXÞÞ5 b21 j exp ð2rj =bj Þ

(16)

j

P where each bj 5ðNi 11Þ21 i ðrj Þi ; i51; :::; 100. These values for bj maximize the likelihood of Pðbj jfðrj Þi gÞ using a noninformative prior for bj. Plots of the reference prior for each distance rj are shown in Supporting Information Figure S2. The inclusion of the reference prior introduces an additional a term to the negative logarithm of the posterior probability, X ðln bj 1rj =bj Þ;

(17)

To efficiently implement the algorithm, we sample over a discrete set of values of rd, rJ, and c0 , and precompute the values of v2d ðXÞ and v2J ðXÞ. To enforce the noninformative priors for these quantities, we choose a log-spaced grid of values over their most probable range. This obviates the need for including their priors in the posterior, resulting in a new effecðdÞ tive energy function with the Nj 11 term in eq. (18) replaced ðdÞ ðJÞ ðJÞ by Nj and the Nj 11 term replaced by Nj . For the cineromycin B system, we chose 233 values of rd between 0.05 and 5.0; 303 values of rJ between 0.05 and 20.0; and 140 values of c0 between 0.5 and 2.0. Monte Carlo moves in these quantities are to adjacent values, while moves in Xi are random selections from the full collection of conformations. In the results below, 107 steps of Monte Carlo were performed, with an average acceptance ratio of around 0.75. In some cases, we sample multiple ensembles using free 0 energies fi 5kfi , for 0  k  1, and use the MBAR algorithm[25] to obtain estimates of conformational state populations Pk ðXÞ with (k 5 1) and without information from computational modeling (k 5 0, i.e., a uniform prior P(X)).

j

which has the net effect of penalizing shorter distances. This can be thought of as compensating for the effects of shortdistance restraints which may overly reward interproton distances that would be in close proximity regardless of restraints. We assume that the reference prior is the product of independent distributions for each distance, as we have

Conformational state populations of cineromycin B The results of two posterior sampling calculations are shown in Figure 2a. In the first, only experimental data influences the posterior distribution, which we achieve using a uniform prior for P(X). In the second calculation, we perform posterior sampling using both the experimental data and the QM Journal of Computational Chemistry 2014, 35, 2215–2224

2219

FULL PAPER

WWW.C-CHEM.ORG

Figure 2. Posterior distributions of conformational populations and nuisance parameters for cineromycin B. a) A comparison of conformational population estimates pi from posterior sampling where only experimental data is considered (exp, x-axis) versus results when both QM energies and experimental data are considered (QM1exp, y-axis). Error bars are uncertainties computed using the MBAR algorithm. Selected conformations are labeled with indices (0–99). In red are labeled conformations that agree well with the experimental restraints, but are disfavored by the QM studies. In green are labeled conformations that are less compatible with experimental restraints, but highly favored by QM studies. b) Posterior distributions of the uncertainty in NOE distance measurements, rd. c) Posterior distributions of the uncertainty in the J-coupling measurements, rJ. d) Posterior distributions of the scaling parameter c0 5c21=6 .

energies fi. Both calculations were performed using reference priors, and with normal (not log-normal) models for distance errors. Conformational populations pi and their uncertainties were calculated using MBAR.[25] The results we present below use the use of the PCM; similar calculations without PCM show very similar results (Supporting Information Figure S3). A comparison of conformational state populations estimated with and without QM information reveals that there are a group of conformational states (e.g., 87, 21, and 79) that are highly compatible with the experimental data, but that are predicted by QM studies to have high energies and negligible populations in solution, likely due to steric strain (Supporting Information Figure S4). At least for this system, the conformational landscape obtained from QM studies plays a crucial role in filtering out unphysical states. Indeed, Bayesian estimates of conformational populations pi are often very similar to the PðXi Þ prior probabilities obtained from QM calculations, indicating the information from QM plays a dominant role in our predictions (Supporting Information Figure S5), likely due to the sparse number of experimental constraints, and their large uncertainties. Figures 2b–2d shows the posterior distribution of nuisance parameters rd, rJ, and c0 5c21=6 , respectively. The maximumlikelihood values of these distributions can be considered the best estimate of the uncertainty in the experimental measure2220

Journal of Computational Chemistry 2014, 35, 2215–2224

ments (Table 3). Uncertainty in the experimental NOE distan˚ stems mainly ces (with respect to our initial guess of 2.5 A) from the lack of exact experimental interproton distances. Posterior sampling of rd is concentrated near 0.7, and sampling of the scaling factor c0 is concentrated near a value of 1.25, indicating that posterior estimates of the average interproton dis˚ tances for observed NOEs are 2:5ð1:25Þ53:060:7A. Interproton distances back-calculated from the most populated conformational states (group A, see below) show good agreement with these values (Fig. 3a). Back-calculated J-coupling constants agree less favorably with experimental values (Fig. 3b), as reflected by the posterior sampling estimates of rJ  5. We attribute this to the role of conformational averaging, which is not accounted for in the current posterior sampling scheme. J-coupling values 3 J11;12 < 2 and 3 J12;13 < 2 correspond to multiple hydrogens subject to conformational averaging, and are especially noninformative. Results for other

Table 3. Maximum-likelihood values of nuisance parameters rd ; rJ , and c0  from posterior sampling. Ensemble

rd

rJ

c21=6

exp QM1exp

0.739 0.669

4.391 5.145

1.379 1.249

WWW.CHEMISTRYVIEWS.COM

WWW.C-CHEM.ORG

FULL PAPER

Figure 3. Interproton distance a) and J-coupling constants b) back-calculated from conformational states 38, 39, 65, and 90 (Group A, red) shows agreement with experimental measured values (black). Shown in red are the distances for all equivalent proton distances, with the line width proportional to the restraint weight wj used in the posterior sampling (i.e., 1/3 for three equivalent protons). [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

conformational states (groups B through D, see below) are shown in Supporting Information Figure S6. We find that a number of conformational states (38, 39, 45, 59, 65, 80, 85, 90, and 92) are moderately compatible with the experimental data, and highly favored according to the QM studies (green labels, Fig. 2a). Several of these states overlap nearly identically, and we can cluster them into four main groups (A through D) based on their conformational similarity. To describe the differences between these groups we will refer to the dihedral angle C5AC6AC7AC8 as v1, and the dihedral angle C6AC7AC8AC9 as v2 (see Fig. 1 for numbering). Group A (38, 39, 65, and 90) is predicted to be the most populated at equilibrium (50.5% at 300 K), and has a conformation nearly identical to one of the two crystal isoforms of albocycline, a cineromycin B analog (Fig. 4) differing only by an O-methyl group. Group A also corresponds to the backbone structure that Terekhova et al. report to be the most consistent with the experimental measurements. The remaining groups (B through D) differ from Group A in the orientation of at least one of double-bond groups, for C5@C6 and C8@C9 (Fig. 5). Group B (45 and 59) is very similar   to Group A (v1 ; v2 52108 ; 16 ), but differentiated by a slightly

different orientation of the C8@C9 double-bond (v1 ; v2 5   2106 ; 85 ), and predicted to have 30.7% population at 300 K. It is likely that fast interconversion exists between groups A and B. Group D (92) differs from group A mostly in the orien  tation of double-bond C5@C6 (v1 ; v2 5104 ; 29 ), and is predicted to have 2.5% population in solution. Group C (80 and 85) differs in both C5@C6 and C8@C9, and corresponds closely to the other of the two crystal isoforms of albocycline. Our posterior sampling algorithm predicts this conformation to be populated 11.9% in solution. Interestingly, Terekhova et al. found that a molecular mechanics potential predicted this conformation to be the most populated, and noted that this result conflicted with the conformation that best matched the experimental data (Group A), and suggested that these conformations must interconvert in solution. Our results are a significant improvement over these findings, attesting to the importance of high-resolution computational studies in discriminating competing conformational models. We correctly and quantitatively rank the populations and find significant overlap with the crystal conformations of known analogs. We separately performed posterior sampling using energies obtained without the PCM model, and obtained similar results (Table 4). A full list of macrolide ring dihedral angles for representative structures in groups A, B, C, and D are listed in the Supporting Information (Table S1). Computational models of cineromycin B are quantitatively consistent with experiment

Figure 4. Group A (conformations 38, 39, 65, 90, shown superimposed, left) is predicted by our posterior sampling algorithm to have 50.5% population in solution at 300 K. These conformations correspond closely to one of the crystal isoforms of albocycline, the O-methylated analog of cineromycin B (right). Molecular structures were rendered using UCSF Chimera.[37]

The MBAR algorithm was used to calculate Bayes factors comparing our posterior model of state populations with and without the inclusion of information from QM modeling. Although the interpretation of the Bayes factor is subjective, Kass and Raftery have asserted that when used for model selection, only values of ln KBF > 1 should be considered as positive evidence in factor of a given model, with small values worth “no more than a bare mention.”[21] In all cases, we found that the magnitude of the Bayes factor favoring or disfavoring the inclusion of QM information Journal of Computational Chemistry 2014, 35, 2215–2224

2221

FULL PAPER

WWW.C-CHEM.ORG

Figure 5. Groups B, C, and D are conformational states predicted by our posterior sampling algorithm to have minor populations 30.7, 11.9, and 2.5%. Group C conformations (magenta and cyan) correspond closely to the other of the two crystal isoforms of albocycline, the O-methylated analog of cineromycin B (shown superimposed, black). The minor populations differ from the major population (group A) mainly in the dihedral angles v1 (C5AC6AC7AC8) and v2 (C6AC7AC8AC9), circled. Molecular structures were rendered using UCSF Chimera.[37]

was small, indicating that the P(X) obtained by modeling is quantitatively consistent with the PðDjXÞ constructed from experimental data (Table 5). Larger values would indicate a disfavorable mismatch between the the prior P(X) and PðDjXÞ. Interestingly, omitting the use of reference potentials slightly favors the inclusion of QM information, as opposed to the use of reference potentials, which slightly disfavors the use of QM information. This is most likely due to competition between our chosen reference potential, which penalizes short distances, and the experimental restraints, which favor short distances. For cineromycin B, posterior models using log-normal distance distributions are essentially indistinguishable from those using normal distributions (Table 6). Furthermore, for all sampling schemes tested, the agreement with experimental data

Table 4. Estimates of group populations with and without PCM. Group A B C D

2222

is similar, with ensemble-average deviations from experimentally measured distances near 0.6 A˚ in all cases. Ensembleaverage deviations for J-coupling constants increase from values near 4 to values near 5 when information from QM is introduced, reflecting the reduced importance of J-coupling constants as mentioned above. Reference potentials produce more correct posterior models The Bayes factors comparing models with reference potentials to those without reference potentials are quite large (Table 7). In all cases ln KBF  30, indicating we should reject models that do not incorporate reference potentials. This is to be expected, as the purpose of the reference potential is to correctly normalize the PMF of each experimental restraint. Table 5. Bayes factors K BF 5PðM1 5QM1expÞ=PðM2 5expÞ calculated using various posterior sampling schemes.

B3LYP 1 PCM

B3LYP (no PCM)

Pref

Distances

ln KBF

0.505 0.307 0.119 0.025

0.634 0.215 0.088 0.028

Yes Yes No No

Normal Log-normal Normal Log-normal

20.674 6 0.007 20.689 6 0.007 0.433 6 0.006 0.434 6 0.006

Journal of Computational Chemistry 2014, 35, 2215–2224

WWW.CHEMISTRYVIEWS.COM

WWW.C-CHEM.ORG

Table 6. Agreement of ensemble-average predictions with experimental data, for various sampling methods. ðdÞ

ðJÞ

Information used

Pref

Distances

˚ hjrj 2c0 rjexp jiðAÞ

hjrj 2rjexp ji

QM1exp QM1exp QM1exp QM1exp exp exp exp exp

Yes Yes No No Yes No Yes No

Linear Log-normal Linear Log-normal Linear Linear Log-normal Log-normal

0.609 0.609 0.599 0.600 0.602 0.604 0.605 0.602

5.069 5.067 4.946 4.948 4.026 4.163 4.004 4.166

Without reference potentials, each new experimental restraint artificially decreases the overall posterior probability.

Discussion We expect that the posterior sampling method described here will be very useful for estimating conformational populations in solution for molecules like natural products and peptidomimetics, where experimental data is sparse and high-resolution modeling approaches are used. A Bayesian approach is particularly useful for this situation, as posterior sampling can be used to estimate uncertainties and nuisance parameters not accounted for by sparse experimental data, and because of its flexibility in incorporating data from disparate sources. While similar to ISD, our method offers several improvements for molecules smaller than proteins: it correctly incorporates reference potentials, and can be used to reweight estimated populations for a set of representative conformations. This is especially useful for small molecules, where it is possible to use QM calculations to estimate the energies of all relevant conformational states. The Bayes factors computed above (Tables 5 and 7) argue favorably for the inclusion of QM information to complement experimental constraints. Bayes factors report the (relative) average posterior probability of the data given the model, and it should be kept in mind that is easy to find error parameters rd and rJ that “loosen” constraints to favor a given model. We find that the additional constraints from QM studies do not compete with experimental constraints; rather, they complement experimental constraints, such that when using proper reference potentials, their introduction does not significantly decrease the posterior probability or increase the error parameters, while simultaneously providing valuable structural detail (see Fig. 2a). For systems not amenable to QM studies, the set of representative states used for reweighting could also come from a set of conformational clusters derived from MD or other methTable 7. Bayes factors K BF 5PðM1 5refÞ=PðM2 5norefÞ calculated using various posterior sampling schemes. Information used QM1exp QM1exp exp exp

Distances

ln KBF

Normal Log-normal Normal Log-normal

30.035 6 0.001 30.027 6 0.001 31.183 6 0.001 31.182 6 0.001

FULL PAPER

ods. In that case, the prior P(X) would be calculated from the conformational free energy of each cluster, and the predicted observable values corresponding to each cluster would need to be correctly ensemble-averaged over the cluster (e.g., r 26 averaged for the NOE distances). Other advantages of our posterior sampling method include the ease of implementation and computational speed. We have implemented our method as an object-oriented library of python scripts. The simplicity of the Metropolis-Hastings algorithm and precomputation of key quantities means that posterior sampling runs can be completed in a few minutes. This code is openly distributed and can be easily modified. There are several caveats to our modeling approach that need to be mentioned. First, there are obviously more sophisticated reference potentials we could choose than those presented here. Such choices, however, need to be justified by the availability of data to construct such distributions. For instance, the interproton distance distributions for small molecules should be highly correlated. Attempts to construct such distributions would be thorny at best, because they would need to be accompanied by a model for PðDjXÞ that accounted for such correlations as well. Given these complications, we believe that our conservative choice of reference potential for interproton distances is highly justified. The reference potentials used for J-coupling values (assumed here to be uniform) could be improved, especially if more vicinal proton pairs were included. A maximum-entropy circular normal distribution,[38] constrained by an average value of cos h, could be used in future work. Next, the approach presented here could be further improved by more sophisticated back-calculation of experimental observables. For instance, it would be desirable to better account for motional averaging effects when known, especially for J-coupling data. In the cases where QM studies or chemical intuition can inform knowledge of kinetic barriers, improved models of averaging should lead to better agreement with experiment (i.e., smaller rJ values). Similarly, our algorithm could take better advantage of QM calculations to predict NMR observables, such as chemical shifts and J-coupling constants, following Aliev et al.[14] Finally, we note that our approach is not the most unbiased way to reweight the probabilities of conformational states given ensemble-averaged experimental observables. Pitera and Chodera have shown that the maximum entropy distribution constrained by the ensemble-average of an observable hfobs ðXÞi can be obtained by perturbing the energies of each state by a value afobs ðXÞ, where a is a suitable Lagrange multiplier.[39] Since our likelihood model PðDjXÞ assumes normally-distributed errors, there is necessarily some amount of bias that is introduced. Recently, Beauchamp et al. have developed the Bayesian Energy Landscape Tilting (BELT) algorithm, which uses the Lagrange multiplier approach, along with a Bayesian posterior sampling scheme similar to the one presented here, to reweight ensembles of configurations sampled from MD simulations. While BELT may be preferred for some problems, it is computationally expensive for large ensembles, requiring the determination of a set of Lagrange multipliers aj, one for each Journal of Computational Chemistry 2014, 35, 2215–2224

2223

FULL PAPER

WWW.C-CHEM.ORG

experimental observable. Unlike BELT, our method can be performed in minutes, and can additionally estimate the posterior distribution of nuisance parameters, which would add even further computational cost to the BELT algorithm.

Conclusions We have presented an improved Bayesian inference approach for estimating conformational populations in solution, suitable for small molecules such as natural products and peptidomimetics. This approach combines high-resolution computational models from REMD/QM with sparse information from experimental NMR observables, using MCMC to sample the posterior distribution of conformational state populations. We applied the method to determine the solution-state conformational populations of cineromycin B using previously published experimental NOE and J-coupling constant observables, and found improved results compared to previous modeling efforts. Through the use of Bayes factors, we showed quantitatively that our method produces models consistent with experimental data, and demonstrated the importance of using reference potentials. We call our algorithm BICePS (Bayesian Inference of Conformational PopulationS), and freely distribute the source code and examples at http://github.com/vvoelz/nmr-biceps.

Acknowledgment The authors thank Rodrigo Andrade and Charles DeBrosse for their expertise and advice. Keywords: Bayesian inference  structure determination  molecular dynamics  quantum chemistry  NMR spectroscopy

How to cite this article: V. A. Voelz, G. Zhou. J. Comput. Chem. 2014, 35, 2215–2224. DOI: 10.1002/jcc.23738

]

Additional Supporting Information may be found in the online version of this article.

[1] J.-M. Tian, S.-S. Ou-Yang, X. Zhang, Y.-T. Di, H.-L. Jiang, H.-L. Li, W.-X. Dai, K.-Y. Chen, M.-L. Liu, X.-J. Hao, Y.-H. Shen, C. Luo, W.-D. Zhang, RSC Adv. 2012, 2, 1126. [2] C. D. Blundell, M. J. Packer, A. Almond, Bioorg. Med. Chem. 2013, 21, 4976. [3] J. Chatterjee, F. Rechenmacher, H. Kessler, Angew. Chem. Int. Ed. 2012, 52, 254. [4] D. Lama, S. T. Quah, C. S. Verma, R. Lakshminarayanan, R. W. Beuerman, D. P. Lane, C. J. Brown, Sci. Rep. 2013, 3, 3451. [5] H. Sch€ onherr, T. Cernak, Angew. Chem. Int. Ed. Engl. 2013, 52, 12256. [6] M. A. Dechantsreiter, E. Planker, B. Math€a, E. Lohof, G. H€ olzemann, A. Jonczyk, S. L. Goodman, H. Kessler, J. Med. Chem. 1999, 42, 3033. [7] J. A. Robinson, ChemBioChem 2009, 10, 971. [8] D. O. Cicero, G. Barbato, R. Bazzo, J. Am. Chem. Soc. 1995, 117, 1027. [9] R. E. Taylor, J. Zajicek, J. Org. Chem. 1999, 64, 7224.

2224

Journal of Computational Chemistry 2014, 35, 2215–2224

[10] A. G€ orler, N. B. Ulyanov, T. L. James, J. Biomol. NMR 2000, 16, 147. [11] O. Atasoylu, G. Furst, C. Risatti, A. B. Smith, III, Org. Lett. 2010, 12, 1788. [12] A. E. Aliev, D. Courtier-Murias, S. Bhandal, S. Zhou, Chem. Commun. 2010, 46, 695. [13] A. E. Aliev, Z. A. Mia, H. S. Khaneja, F. D. King, J. Phys. Chem. A 2012, 116, 1093. [14] A. E. Aliev, Z. A. Mia, M. J. M. Busson, R. J. Fitzmaurice, S. Caddick, J. Org. Chem. 2012, 77, 6290. [15] W. Rieping, M. Nilges, M. Habeck, Bioinformatics 2008, 24, 1104. [16] C. K. Fisher, A. Huang, C. M. Stultz, J. Am. Chem. Soc. 2010, 132, 14919. [17] W. Rieping, M. Habeck, M. Nilges, Science 2005, 309, 303. [18] S. Olsson, W. Boomsma, J. Frellsen, S. Bottaro, T. Harder, J. FerkinghoffBorg, T. Hamelryck, J. Magn. Reson. 2011, 213, 182. [19] T. Hamelryck, M. Borg, M. Paluszewski, J. Paulsen, J. Frellsen, C. Andreetta, W. Boomsma, S. Bottaro, J. Ferkinghoff-Borg, PLoS One 2010, 5, e13714. [20] S. Olsson, J. Frellsen, W. Boomsma, K. V. Mardia, T. Hamelryck, PLoS One 2013, 8, e79439. [21] R. E. Kass, A. E. Raftery, J. Am. Stat. Assoc. 1995, 90, 773. [22] M. R. Shirts, D. L. Mobley, Methods Mol. Biol. 2013, 924, 271. [23] M. Habeck, Phys. Rev. Lett. 2012, 109, 100601. [24] C. Bennett, J. Comput. Phys. 1976, 22, 245. [25] M. R. Shirts, J. D. Chodera, J. Chem. Phys. 2008, 129, 124105. [26] L. P. Terekhova, O. A. Galatenko, V. V. Kulyaeva, N. D. Malkina, Y. V. Boikova, G. S. Katrukha, A. S. Shashkov, A. G. Gerbst, N. E. Nifantiev, Russ. Chem. Bull. 2007, 56, 815. [27] N. M. Henriksen, D. R. Davis, T. E. Cheatham, III, J. Biomol. NMR 2012, 53, 321. [28] G. L. Butterfoss, B. Yoo, J. N. Jaworski, I. Chorny, K. A. Dill, R. N. Zuckermann, R. Bonneau, K. Kirshenbaum, V. A. Voelz, Proc. Natl. Acad. Sci. USA 2012, 109, 14320. [29] S. Pronk, S. Pall, R. Schulz, P. Larsson, P. Bjelkmar, R. Apostolov, M. R. Shirts, J. C. Smith, P. M. Kasson, D. van der Spoel, B. Hess, E. Lindahl, Bioinformatics 2013, 29, 845. [30] J. Wang, R. M. Wolf, J. W. Caldwell, P. A. Kollman, D. A. Case, J. Comput. Chem. 2004, 25, 1157. [31] A. Onufriev, D. Bashford, D. A. Case, Proteins 2004, 55, 383. [32] A. Jakalian, D. B. Jack, C. I. Bayly, J. Comput. Chem. 2002, 23, 1623. [33] K. A. Beauchamp, G. R. Bowman, T. J. Lane, L. Maibaum, I. S. Haque, V. S. Pande, J. Chem. Theory Comput. 2011, 7, 3412. [34] M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, G. Scalmani, V. Barone, B. Mennucci, G. A. Petersson, H. Nakatsuji, M. Caricato, X. Li, H. P. Hratchian, A. F. Izmaylov, J. Bloino, G. Zheng, J. L. Sonnenberg, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, T. Vreven, J. A. Montgomery, Jr., J. E. Peralta, F. Ogliaro, M. Bearpark, J. J. Heyd, E. Brothers, K. N. Kudin, V. N. Staroverov, R. Kobayashi, J. Normand, K. Raghavachari, A. Rendell, J. C. Burant, S. S. Iyengar, J. Tomasi, M. Cossi, N. Rega, J. M. Millam, M. Klene, J. E. Knox, J. B. Cross, V. Bakken, C. Adamo, J. Jaramillo, R. Gomperts, R. E. Stratmann, O. Yazyev, A. J. Austin, R. Cammi, C. Pomelli, J. W. Ochterski, R. L. Martin, K. Morokuma, V. G. Zakrzewski, G. A. Voth, P. Salvador, J. J. Dannenberg, S. Dapprich, A. D. Daniels, O. Farkas, J. B. Foresman, J. V. Ortiz, J. Cioslowski, D. J. Fox, Gaussian09 Revision d.01, 2009. [35] A. Bothner-By, Adv. Magn. Reson. 1965, 1, 195. [36] W. Rieping, M. Habeck, M. Nilges, J. Am. Chem. Soc. 2005, 127, 16026. [37] E. F. Pettersen, T. D. Goddard, C. C. Huang, G. S. Couch, D. M. Greenblatt, E. C. Meng, T. E. Ferrin, J. Comput. Chem. 2004, 25, 1605. [38] R. Gatto, S. R. Jammalamadaka, Stat. Methodol. 2007, 4, 341. [39] J. W. Pitera, J. D. Chodera, J. Chem. Theory Comput. 2012, 8, 3445.

Received: 9 June 2014 Revised: 25 August 2014 Accepted: 31 August 2014 Published online on 24 September 2014

WWW.CHEMISTRYVIEWS.COM

Bayesian inference of conformational state populations from computational models and sparse experimental observables.

We present a Bayesian inference approach to estimating conformational state populations from a combination of molecular modeling and sparse experiment...
683KB Sizes 0 Downloads 6 Views