Constructing sequence-dependent protein models using coevolutionary information

Ryan R. Cheng,1 Mohit Raghunathan,1,2 Jeffrey K. Noel,1,2 and Jose N. Onuchic1,2*

1 Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005

2 Department of Physics & Astronomy, Rice University, Houston, Texas 77005

Received 20 May 2015; Accepted 27 July 2015 DOI: 10.1002/pro.2758 Published online 30 July 2015 proteinscience.org

Abstract: Recent developments in global statistical methodologies have advanced the analysis of large collections of protein sequences for coevolutionary information. Coevolution between amino acids in a protein arises from compensatory mutations that are needed to maintain the stability or function of a protein over the course of evolution. This gives rise to quantifiable correlations between amino acid sites within the multiple sequence alignment of a protein family. Here, we use the maximum entropy-based approach called mean field Direct Coupling Analysis (mfDCA) to infer a Potts model Hamiltonian governing the correlated mutations in a protein family. We use the inferred pairwise statistical couplings to generate the sequence-dependent heterogeneous interaction energies of a structure-based model (SBM) where only native contacts are considered. Considering the ribosomal S6 protein and its circular permutants as well as the SH3 protein, we demonstrate that these models quantitatively agree with experimental data on folding mechanisms. This work serves as a new framework for generating coevolutionary data-enriched models that can potentially be used to engineer key functional motions and novel interactions in protein systems.

Keywords: coarse-grained protein models; coevolutionary information; statistical inference; computational biophysics

Abbreviations: SBM, structure-based models; DCA, Direct Coupling Analysis; mfDCA, mean field Direct Coupling Analysis; MSA, multiple sequence alignment; WHAM, weighted histogram analysis method; FEP, free energy perturbation; TSE, transition state ensemble.

Additional Supporting Information may be found in the online version of this article.

R.R.C. and M.R. contributed equally to this work.

Grant sponsor: NSF INSPIRE; Grant numbers: MCB-1241332, MCB-1214457; Grant sponsor: Welch Foundation; Grant number: C-1792; Grant sponsor: Center for Theoretical Biological Physics; Grant number: PHY-1427654; Grant sponsor: Data Analysis and Visualization Cyberinfrastructure; Grant number: OCI-0959097.

*Correspondence to: J. N. Onuchic; Rice University, 6100 Main Street, MS-61, Houston, TX 77005-1827. E-mail: [email protected]

© 2015 The Protein Society. Published by Wiley-Blackwell.

Introduction

Early work applying the statistical mechanics of spin-glasses to proteins formulated the foundation for the theory of protein folding.1 Subsequent advances led to the development of energy landscape theory of protein folding2,3 and the modern view that proteins fold as the ensemble of accessible structures is funneled by the underlying energy landscape into a unique native structure. A consequence of this theory is the view that naturally selected proteins are minimally frustrated. This theoretical picture of protein folding has led to the development of idealized, minimally frustrated protein models called structure-based models (SBM)4–6 for studying protein folding and function.7 Consistent with earlier models,8,9 SBMs encode structural information into the Hamiltonian of a protein model by conceptualizing that the dominant interactions consist only of the interactions found in

PROTEIN SCIENCE 2016 VOL 25:111–122


the native structure. Furthermore, SBMs typically assume that all native contacts are stabilizing and equal in strength (homogeneous). While examples exist where non-native interactions may play a significant role in folding,10 SBMs are able to capture the highly cooperative nature of the folding transition and are especially successful for studying the folding of proteins where topological effects and entropic barriers dictate the folding mechanisms.6 However, these models cannot systematically capture the effect of stabilizing mutations, while destabilizing mutations can only be addressed through coarse approximations. One could mimic the effect of a destabilizing mutation to a particular residue by deleting the stabilizing contacts that it forms in the native state, but this approximation is unable to distinguish between varying degrees of destabilization. In principle, all-atom representations of proteins with explicit solvent interactions can capture the effects of mutational stabilization or destabilization (sequence-level effects) but are too computationally expensive for exploring the underlying energy landscape and sampling multiple folding and unfolding transitions. Coarse-grained optimized transferable potentials11–14 are one prominent approach for exploring sequence-dependent effects in proteins. Likewise, it has been shown that experimental mutational changes in stability15 or the native-basin fluctuations in an all-atom implicit solvent simulation16 can be used as constraints to obtain coarse-grained protein models with heterogeneous interaction energies. Here, we adopt a novel approach using only sequence and structural data, where we consider the feasibility of constructing a sequence-dependent coarse-grained protein model by using sequence data to supplement an SBM.
Recent developments in global statistical inference using maximum entropy modeling have led to a number of advances, particularly in the areas of protein structure prediction (see reviews17,18). Maximum entropy modeling has also been applied to a diverse range of topics such as drug resistance,19,20 evolutionary fitness,21 neural networks,22 self-driven particles,23 and bacterial signaling systems.24–28 The development of Direct Coupling Analysis (DCA)27,29,30 has advanced the study of coevolutionary data by inferring the underlying Potts model Hamiltonian that governs the correlated mutations in a protein family. In particular, the mean field approach (mfDCA)29 makes use of an analytical approximation to perform the inference procedure through a single computationally inexpensive step. It has been demonstrated that the inferred couplings from DCA are highly correlated with experimental mutational changes in protein stability31–33 and physical protein–protein interaction mechanisms,26 suggesting a direct relationship between the statistical couplings and the pairwise interaction energies of a realistic


protein model. Furthermore, the inferred coevolutionary information has allowed for the quantification of the degree to which evolved proteins are minimally frustrated,32 consistent with earlier theoretical estimates34,35 (see review36). Motivated by these findings, our goal is to construct an SBM where the statistical couplings from mfDCA are used to describe the strength of native contact interactions. Earlier work involved the enrichment of SBMs with coevolutionary data that encodes the functional conformations of a protein37 or complex.38 These studies identified the strongest co-evolving pairs of sites in a protein with a metric called Direct Information29 and incorporated them into the Hamiltonian of an SBM as homogeneous, stabilizing contacts. Here, we supplement an SBM that is coarse-grained at the Cα level (i.e., one bead per residue) and adopt the strength of our native contacts from the inferred Potts model couplings of mfDCA, i.e., Jij(Ai, Aj), which depend on positions i and j in a protein and the amino acids at those positions, Ai and Aj, respectively. We require that the inferred couplings for all native contacts sum to the total energetic stabilization of native contacts in a homogeneous SBM (i.e., Nε, where N is the number of native contacts and ε is the mean contact energy in the SBM). A natural way of incorporating these heterogeneous couplings into an SBM is to mix them linearly with the interaction energies of a homogeneous SBM using a mixing parameter χ. Our mixing condition interpolates between χ = 0 (fully homogeneous) and χ = 1 (fully heterogeneous), allowing χ to control the standard deviation of the energetic heterogeneity while enforcing a constant mean strength of native contacts.39 Additional details of our model are discussed in the Materials and Methods section.
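The linear mixing described above can be illustrated with a minimal sketch. It assumes one particular normalization: the couplings of the native contacts are rescaled by a single factor so that they sum to N times the mean contact energy, and are then mixed with the homogeneous value using the mixing parameter (`chi` in the code). The function name `mix_contact_energies`, the dictionary layout, and the toy numbers are hypothetical; the exact rescaling used in the paper is given in its Materials and Methods and may differ.

```python
from statistics import mean, pstdev


def mix_contact_energies(couplings, epsilon=1.0, chi=0.5):
    """Interpolate between homogeneous and coupling-weighted contact energies.

    couplings: dict mapping a native contact (i, j) to its inferred coupling
               J_ij(A_i, A_j) for the sequence of interest (assumed input).
    epsilon:   mean contact energy of the homogeneous model.
    chi:       mixing parameter, 0 = homogeneous, 1 = fully heterogeneous.
    """
    if not 0.0 <= chi <= 1.0:
        raise ValueError("chi must lie in [0, 1]")
    total = sum(couplings.values())
    if total <= 0.0:
        # Assumption of this sketch: a single positive scale factor exists.
        raise ValueError("couplings must have a positive sum to be rescaled")
    # Rescale so the heterogeneous energies sum to N * epsilon.
    scale = len(couplings) * epsilon / total
    return {
        pair: (1.0 - chi) * epsilon + chi * scale * j
        for pair, j in couplings.items()
    }


# Toy couplings for three native contacts (made-up numbers).
toy = {(1, 5): 0.2, (2, 9): 0.6, (3, 12): 0.4}
for chi in (0.0, 0.5, 1.0):
    energies = mix_contact_energies(toy, epsilon=1.0, chi=chi)
    # The mean stays at epsilon while the spread grows linearly with chi.
    print(chi, round(mean(energies.values()), 6), round(pstdev(energies.values()), 6))
```

Under this normalization the mean contact energy is independent of the mixing parameter, and the standard deviation of the contact energies is proportional to it, which matches the behavior described in the text. The sketch does not address how contacts with negative couplings should be treated.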
We focus on two well-studied protein systems: Ribosomal S6, for which energetic heterogeneity plays a significant role in its folding mechanism,15,40,41 and SH3, for which energetic heterogeneity is secondary to geometry in dictating the folding mechanism.6,42 We construct DCA-enriched SBMs and explore them using molecular dynamics simulations that sample many folding and unfolding transitions at the folding temperature, Tf. We compare our simulation results with experimental data characterizing the folding mechanisms, namely the so-called Φ-value analysis,43,44 which characterizes the transition state ensemble through mutational changes in stability. We find that increasing the weight of heterogeneous interactions (increasing χ) tends to improve the quantitative agreement of our models with experimental data on folding mechanisms. However, increasing χ also coincides with a loss of co-operativity as well as the disappearance of the free energy barrier separating unfolded and folded states, which is consistent with earlier work on SBMs with heterogeneous contacts39 and theory.45
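As a reminder of what the Φ-value (phi-value) analysis mentioned above measures, the sketch below uses the standard experimental definition: the ratio of a mutation's effect on the transition-state free energy to its effect on the folded-state free energy, both taken relative to the unfolded state. The function name `phi_value`, its argument names, and the cutoff `min_ddg` are hypothetical; how the authors evaluate Φ-values from their simulations is not shown in this section.

```python
def phi_value(ddg_ts_u, ddg_f_u, min_ddg=1e-6):
    """Phi = ddG(transition state - unfolded) / ddG(folded - unfolded).

    Both arguments are mutational changes in free energy in the same units.
    min_ddg is an illustrative cutoff that guards against dividing by a
    near-zero change in stability, where Phi is not meaningful.
    """
    if abs(ddg_f_u) < min_ddg:
        raise ValueError("change in folding stability is too small to define Phi")
    return ddg_ts_u / ddg_f_u


# A mutation that destabilizes the transition state by a quarter of its
# effect on the native state (made-up numbers).
print(phi_value(0.5, 2.0))
```

A value near 1 indicates that the mutated residue's interactions are as formed in the transition state as in the native state, and a value near 0 indicates that they are as unformed as in the unfolded state.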


The general feature of reduced co-operativity in Cα-based SBMs has previously been observed even for SBMs with homogeneous contact strengths.46 Co-operativity can potentially be recovered by incorporating, for example, barriers associated with the removal of water to bring hydrophobic residues together.47–49 For simplicity, we did not consider desolvation barriers and chose to focus on supplementing a traditional SBM. Hence, we were not able to explore models approaching χ = 1 and, as a matter of practicality, focus on models constructed in the vicinity of χ = 0.5. Despite its simplicity, the class of SBMs that we introduce serves as a potential framework for the engineering of proteins. By building a global statistical model from large collections of sequence data, one could identify mutations that strengthen or weaken desired interactions in a protein model.

Materials and Methods

Aligned sequences for protein families

We obtained the multiple sequence alignments (MSA) from Pfam50 (version 27) for the protein families that were studied: Ribosomal S6 (PF01250) and SH3 (PF00018). All residue inserts were removed from the data sets such that the aforementioned families have fixed lengths of L = 92 and L = 48, respectively.
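The insert-removal step can be sketched as follows. The sketch assumes the alignment follows the usual Pfam/HMMER convention, in which insert columns hold lowercase residues or `.` while match columns hold uppercase residues or `-`. The helper names `strip_inserts` and `match_columns` and the toy alignment are hypothetical; the authors' actual preprocessing is not shown here.

```python
def strip_inserts(aligned_seq):
    """Keep only match-state characters (uppercase residues and '-' gaps)."""
    return "".join(c for c in aligned_seq if c == "-" or c.isupper())


def match_columns(alignment):
    """Strip inserts from every sequence and check for a common length L."""
    cleaned = [strip_inserts(seq) for seq in alignment]
    lengths = {len(seq) for seq in cleaned}
    if len(lengths) > 1:
        raise ValueError(f"sequences have inconsistent lengths: {sorted(lengths)}")
    return cleaned


# Toy alignment with two insert columns (positions 3 and 4).
toy_msa = ["MK.aLV-G", "MR..LVAG", "MKt.LI-G"]
for seq in match_columns(toy_msa):
    print(seq)
```

After this step every sequence in a family has the same fixed length L, which is what allows positions i and j to be compared across the whole alignment.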

Direct coupling analysis (DCA)

We infer the underlying Potts model Hamiltonian that governs the correlated mutations in a particular protein family, with $\mathrm{sequence} = (A_1, A_2, \ldots, A_L)$:

$$H(\mathrm{sequence}) = -\sum_{1 \le i < j \le L} J_{ij}(A_i, A_j) - \sum_{i=1}^{L} h_i(A_i)$$
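To make the Potts model Hamiltonian concrete, the sketch below evaluates it for a single sequence. It assumes the standard DCA form, a pairwise coupling term over all position pairs i < j plus a single-site field term h_i(A_i). The field term is an assumption, since it is not legible in the text above. The function name `potts_energy`, the dictionary layout (0-based positions), and the toy parameters are hypothetical.

```python
from itertools import combinations


def potts_energy(sequence, couplings, fields):
    """H = -sum_{i<j} J_ij(A_i, A_j) - sum_i h_i(A_i).

    couplings: dict keyed by (i, j, A_i, A_j) with i < j, 0-based positions.
    fields:    dict keyed by (i, A_i).
    Entries missing from either dict are treated as zero.
    """
    energy = 0.0
    # Pairwise coupling term over all position pairs i < j.
    for i, j in combinations(range(len(sequence)), 2):
        energy -= couplings.get((i, j, sequence[i], sequence[j]), 0.0)
    # Single-site field term (assumed standard form).
    for i, residue in enumerate(sequence):
        energy -= fields.get((i, residue), 0.0)
    return energy


# Toy parameters for a three-residue sequence (made-up numbers).
toy_J = {(0, 1, "A", "L"): 0.5, (0, 2, "A", "K"): -0.25}
toy_h = {(0, "A"): 0.125, (2, "K"): 0.25}
print(potts_energy("ALK", toy_J, toy_h))
```

With this sign convention, larger positive couplings lower the energy, so favorable residue pairs correspond to large J values.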
