FULL PAPER

WWW.C-CHEM.ORG

SSThread: Template-Free Protein Structure Prediction by Threading Pairs of Contacting Secondary Structures Followed by Assembly of Overlapping Pairs

Kevin J. Maurice*

Acquiring the three-dimensional structure of a protein from its amino acid sequence alone, despite a great deal of work and significant progress on the subject, is still an unsolved problem. SSThread, a new template-free algorithm, is described here that consists of making several predictions of contacting pairs of α-helices and β-strands derived from a database of experimental structures using a knowledge-based potential, secondary structure prediction, and contact map prediction followed by assembly of overlapping pair predictions to create an ensemble of core structure predictions whose loops are then predicted. In a set of seven CASP10 targets SSThread outperformed the two leading methods for two targets each. The targets were all β-strand containing structures and most of them have a high relative contact order, which demonstrates the advantages of SSThread. The primary bottlenecks based on sets of 74 and 21 test cases are the pair prediction and loop prediction stages. © 2014 Wiley Periodicals, Inc.

Introduction

Obtaining the three-dimensional (3D) structure of a protein is important because it allows researchers to study aspects of the protein’s function such as the mechanism of an enzyme or how a transcription factor binds to specific DNA sequences. Protein structures can also be used in the design of therapeutic drugs. It was demonstrated by Christian Anfinsen in 1961 that the 3D structure of a protein is determined by its amino acid sequence.[1] Yet, it is still a largely unsolved problem to obtain the 3D structure of a protein using only the sequence. The challenge of obtaining the structure of a protein from its sequence is known as the protein structure prediction problem.

Experimental methods to obtain protein structures including X-ray crystallography and NMR exist but require a large amount of time, cost, and expertise. Homology modeling methods use the structures of homologous proteins as templates to build the structure. In order for this to be successful with current methods a homolog with a known structure that has at least 30% sequence identity to the protein is required, which excludes many proteins. Modeling proteins without a template, called template-free, ab initio or de novo prediction, is much more challenging but can be used when no template is available and is potentially able to predict the structure of a protein with a new fold.

Threading[2] is an alternative template based method. This method, instead of relying on sequence similarity between the protein and the template, uses a knowledge-based potential (KBP).[3] A KBP measures how well a set of structural features fit the protein sequence. Common structural features used include backbone torsion angles,[4] solvent accessibility,[2] and residue to residue contacts.[3] KBPs are statistically derived from sets of experimentally solved structures. In threading the protein sequence is aligned to the structural features of the template as measured by the KBP.

Two subproblems to the protein structure prediction problem are secondary structure prediction[5,6] and contact map prediction,[7,8] both of which can be formulated as simple classification problems. The best secondary structure prediction programs use pattern recognition algorithms to predict the three state secondary structure (helix, strand, and loop) of each residue using the protein sequence as well as the sequences of homologous proteins. The best secondary structure prediction programs are 80% accurate.[9] Contact map prediction programs predict which residues of the protein are in contact. In this case, a contact is defined as having α-carbons within a given distance or some variation of this. As with secondary structure prediction, the best methods utilize pattern recognition algorithms using the protein sequence and the sequences of homologous proteins. It is possible to generate the 3D structure of a protein from a correct contact map.[10] The best contact map prediction methods are not accurate enough to generate structures alone but are accurate enough to be useful in protein structure prediction.

Theoretically protein structures could be obtained by molecular dynamics simulations of protein folding. Unfortunately doing a quantum mechanics or even molecular mechanics simulation for a molecule as large as a typical protein over the timespan in which proteins fold is not feasible. Czaplewski et al.[11] attempted to work around this using a simplified

Journal of Computational Chemistry 2014, 35, 644–656

DOI: 10.1002/jcc.23543

K. J. Maurice, 15 Mitchell Ave, New Brunswick, NJ, 08901. E-mail: [email protected]
© 2014 Wiley Periodicals, Inc.

WWW.CHEMISTRYVIEWS.COM


force field which represents residues by two interaction sites, the peptide group and a united side chain. Shell et al.[12] accelerated the simulation using a technique called zipping and assembly which is based on a theory of what drives proteins to fold rapidly.

The leading template-free methods are fragment insertion algorithms. In these algorithms, the general idea consists of first predicting the structure of fragments of the protein from the protein sequence. Then a Monte-Carlo algorithm generates a structure prediction by randomly inserting fragment predictions into the structure while optimizing a scoring function. The process is repeated many times to generate an ensemble of predictions. The original fragment insertion method is Rosetta.[13] A newer method, QUARK,[14] has surpassed Rosetta as measured in the CASP9[15] (ninth critical assessment of techniques for protein structure prediction) experiment by using replica exchange Monte-Carlo.

Genetic algorithms, search algorithms based on biological evolution, have been applied to template-free protein structure prediction. Recently, Brasil et al.[16] developed an advanced algorithm named multiobjective evolutionary algorithm with many tables that had promising results on a set of small proteins. Faraggi et al.[17] significantly improved a genetic algorithm by including predictions of the real value of backbone torsion angles that were predicted by machine learning algorithms.

Another genre of template-free methods is secondary structure packing algorithms.[18–22] These methods build the core structure by assembling the α-helices and β-strands (secondary structure elements [SSEs]) that make up the core structure, followed by loop prediction. The most sophisticated of these algorithms thus far is BCL::Fold.[22,23] BCL::Fold, like Rosetta and QUARK, uses a Monte-Carlo search. Instead of inserting fragments, however, it makes random additions and modifications of the complete SSEs that make up the core of the structure. The loops are predicted at a later stage. BCL::Fold is nearly as successful as Rosetta, reporting a prediction with a lower root mean squared deviation (RMSD) in 29% of the cases in their benchmark set.

SSThread also generates core structures by packing SSEs. Instead of making random moves, SSThread is based on making predictions of contacting SSE pairs that are derived from experimental structures. Predictions are scored using a KBP, secondary structure prediction, and contact map prediction. Then overlapping pairs are assembled using a nonstochastic heuristic search algorithm to generate an ensemble of core structure predictions. Then the loops are predicted using cyclic coordinate descent (CCD).[24]

Methods

Databases

In developing and testing the algorithm, a few nonredundant databases of protein structures were used. The databases were taken from the ASTRAL compendium[25] which contains nonredundant sets of domains as defined in the SCOP database[26] with various criteria for redundancy. Version 1.75A of the SCOP database was used. The four globular nonmembrane SCOP classes were used: all-α, all-β, α/β, and α+β. Domains composed of more than one peptide chain or more than one segment of the same peptide chain were removed. One database which I will call SCOPFamily has one representative per family as defined in the SCOP database and contains 3730 cases. A database SCOPFold has one representative per fold and contains 965 cases. Also, SCOPFold100 is the 100 highest resolution X-ray crystal structures from SCOPFold.

Scoring

Structures in SSThread are scored using seven terms. A KBP is used which includes a backbone torsion angle term,[4] a solvent accessibility term,[2] and a residue to residue contact term.[3] For core residues the KBP scoring includes the sequences of homologs.[27,28] The sequences of homologs are not used for loop residues because loops are usually not conserved among homologs. A secondary structure prediction program is used as well as a contact map prediction program. A compactness term is used. An SSE length term is used. All seven terms make a significant contribution to the overall score.

To compare incomplete predictions, including predictions of different sizes and covering different residues, the terms are calibrated to a native Z-score. The average and variance of the score is calculated for each residue or residue pair and each term from experimental structures. The KBP terms are calibrated on an amino acid type specific basis. The scores, averages, and variances are additive among residues or residue pairs. An adjustment to the variance accounting for the correlation between the scores of residues and residue pairs is used and is described below. As the terms are all on the same scale the scores, averages, and variances are additive among terms. The correlations between terms are weak and sometimes positive and sometimes negative so can be ignored when adding the variance among terms. Adding the scores, averages, and variances for the KBP terms among the protein sequence and the sequences of homologs is more complex and is described below. Once the total score, average, and variance have been calculated the Z-score can be calculated as

$$Z = \frac{S_t - \mu_t}{\sigma_t} \tag{1}$$

where $Z$ is the Z-score, $S_t$ is the total score, $\mu_t$ is the total average, and $\sigma_t$ is the total standard deviation (the square root of the total variance). The Z-score of native structures, whether complete or incomplete, will by definition be distributed with an average of zero and a standard deviation of one.

When adding the variances among residues or residue pairs, the correlations between scores, such as the correlation between the scores of adjacent residues, must be accounted for. Rather than account for every type of correlation individually, the Z-scores for each term were calculated for complete structures in the SCOPFold100 database without accounting for correlations. The variance of those Z-scores is then calculated. These variances are multiplied by the variance of the term during prediction. In the SCOPFold100 database, the Z-scores calculated using the adjusted variance do not depend on the length of the protein.

The torsion angle term is calculated as

$$S(a_i, \phi_i, \psi_i) = K \log \frac{f(a_i, \phi_i, \psi_i)}{f(a_i)\, f(\phi_i, \psi_i)} \tag{2}$$

where $a$ is the amino acid type, the $\phi$ and $\psi$ torsion angles are in 10° intervals, $K$ is a constant, and $f(\cdot)$ is the frequency of the term calculated from the SCOPFamily database.

The solvent accessibility measure used here is half-sphere exposure (HSE).[29] The half-sphere centered at a residue’s α-carbon in the direction of its β-carbon with a radius of 13 Å is defined. The HSE is the number of α-carbons from other residues that are within the half-sphere. This measure fits four criteria for a good solvent accessibility measure for use in KBPs: the value should not change when changing the amino acid type, it should have a high amino acid type specificity, it should not require all-atom models, and it should be fast to calculate. The scores for solvent accessibility are calculated as

$$S(a_i, e_i) = K \log \frac{f(a_i, e_i)}{f(a_i)\, f(e_i)} \tag{3}$$

where $e$ is the HSE. This is the same as Jones et al.[2] except for a constant and the use of a different solvent accessibility measure. When complete backbone structures are not available the HSE comes from the structure in the database that the SSE is derived from. For each residue in the SCOPFamily database the average and variance of the sum of the torsion angle score and the solvent accessibility score specific to the amino acid type is calculated. Calibrating with the sum of the two terms will account for any correlation between the terms.

The contact term is orientation dependent[30,31] and dependent on the distance in sequence.[32] To completely determine the orientation between two residues requires six variables, three of translation and three of rotation. If calculating the scores in intervals there are too many intervals given the limited number of protein structures available to calculate accurate scores when using all six variables. Here the orientation variables are the distance between contacting residues’ α-carbons and the distance between their β-carbons. This accounts for the distance between residues and whether the side chains are pointed toward each other or away from each other. To account for the effects of chain connectivity a sequence distance variable is included.[32] The scores of the contact term are calculated as

$$S(a_i, a_j, d_\alpha, d_\beta, d_s) = K \log \frac{f(a_i, a_j, d_\alpha, d_\beta, d_s)}{f(a_i)\, f(a_j)\, f(d_\alpha, d_\beta, d_s)} \tag{4}$$

where $d_\alpha$ and $d_\beta$ are the distances between the α- and β-carbons, respectively, in 1 Å intervals and $d_s$ is the distance in sequence in intervals: 1, 2, 3, 4, 5, 6–7, 8–10, and more than 10 residues apart. A contact is defined as having a heavy atom from each residue that are less than 5 Å apart. During tree assembly and loop prediction, when all-atom structures are not available, whether two residues are considered in contact is based on the distance between the α-carbons, the distance between the β-carbons, and the amino acid types. The thresholds are chosen to estimate the criteria when all-atom structures are used. The thresholds are calculated using a Bayes test procedure which minimizes the sum of the two types of error, false positives and false negatives.

The contact term is calibrated on a per contact and amino acid type specific basis. So that the knowledge that some amino acid types are more or less frequently in contact is not lost during the calibration the following series of equations are used.

$$\begin{aligned}
h_{AA}\mu_A + h_{AA}\mu_A &= h_{AA}\mu_{AA} \\
h_{AC}\mu_A + h_{AC}\mu_C &= h_{AC}\mu_{AC} \\
h_{AD}\mu_A + h_{AD}\mu_D &= h_{AD}\mu_{AD} \\
&\;\;\vdots
\end{aligned} \tag{5}$$

for each pair of amino acid types. The $\mu_{AC}$ is the average score of contacts between alanine and cysteine residues and $h_{AC}$ is the number of contacts between alanine and cysteine residues. The 20 $\mu$ terms from the left side of the equations are calculated from all 210 equations using QR-factorization. When using the term, the average used for a contact between an alanine and a cysteine residue is $\mu_A$ plus $\mu_C$. The same procedure is used to calculate the contact standard deviations.

When applying this KBP on core residues the sequences of homologs to the protein are used.[27,28] Homologous sequences come from a database obtained from the CD-HIT[33,34] website that includes a nonredundant NR database at 90% sequence identity. The protein sequence is searched against the database using PSI-BLAST.[35] The settings for the search are a maximum of two iterations and a threshold expectation value for inclusion in the next round of $1 \times 10^{-5}$. The expectation threshold is more stringent than the default setting so that only accurate alignments are included in the next round. Only homologs with an expectation value of $1 \times 10^{-5}$ or lower are used. The sequences are weighted so that under-represented sequences have a higher weight than over-represented sequences. The weight on each sequence from Panjkovich et al.[28] is

$$u_i = \sum_{j=1}^{n} \frac{1}{r_j\, t_{i,j}} \tag{6}$$

where $r_j$ is the number of amino acid types present among the sequences at position $j$ and $t_{i,j}$ is the frequency of the amino acid type in sequence $i$ at position $j$ with respect to all residues in position $j$. For SSThread, the weights are calculated for each position of the protein using only the homologs that have a residue aligned at that position. For the contact term the weights are calculated for each pair of positions using only the homologs that have a residue aligned at both positions. The weight on each sequence is normalized according to the equation

$$\omega_i = \frac{u_i \sqrt{s}}{\sum_{j=1}^{n} u_j} \tag{7}$$

where

$$s = \frac{\sum_{i=1}^{n} r_i}{n} \tag{8}$$
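As a rough illustration of eqs. (6)–(8), the following minimal Python sketch computes the raw and normalized weights for a small gap-free alignment. It rests on several assumptions that the text does not settle: the sum in eq. (6) is taken over alignment columns, the sum in the denominator of eq. (7) is taken over sequences, the average in eq. (8) is taken over columns, and all sequences are aligned at every position. The function name `sequence_weights` is hypothetical and is not part of SSThread.

```python
import math

def sequence_weights(alignment):
    """Return (normalized weights, s) for equal-length, gap-free sequences.

    Assumed reading of eqs. (6)-(8): u_i sums 1 / (r_j * t_ij) over columns j,
    s is the mean number of amino acid types per column, and the normalized
    weight is u_i * sqrt(s) / sum of all u.
    """
    n_seq = len(alignment)
    n_pos = len(alignment[0])
    u = [0.0] * n_seq
    types_per_column = []
    for j in range(n_pos):
        column = [seq[j] for seq in alignment]
        counts = {}
        for aa in column:
            counts[aa] = counts.get(aa, 0) + 1
        r_j = len(counts)                  # amino acid types at column j
        types_per_column.append(r_j)
        for i, aa in enumerate(column):
            t_ij = counts[aa] / n_seq      # frequency of this type in column j
            u[i] += 1.0 / (r_j * t_ij)
    s = sum(types_per_column) / n_pos
    total = sum(u)
    weights = [u_i * math.sqrt(s) / total for u_i in u]
    return weights, s

# Toy alignment: the third sequence is the under-represented one.
weights, s = sequence_weights(["AA", "AA", "AC"])
```

In this toy case the third sequence carries the only cysteine in the second column and so receives the largest weight, which is the behavior the weighting scheme is meant to produce.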

The $s$ can be thought of as a measure of the effective number of sequences among the protein sequence and the sequences of homologs adjusted for redundancy. Adding up the scores, averages, and variances among homologs is calculated as

$$S_t = \sum_{i=1}^{n} \omega_i S_i \tag{9}$$

$$\mu_t = \sum_{i=1}^{n} \omega_i \mu_i \tag{10}$$

$$\sigma_t^2 = \sum_{i=1}^{n} \omega_i^2 \sigma_i^2 + 2 \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \omega_i \omega_j \sigma_i \sigma_j \sqrt{p_{i,j}} \tag{11}$$

where $p_{i,j}$ is the proportion of identical residues between sequences $i$ and $j$. The square root of $p_{i,j}$ is a good estimate of the correlation coefficient between the scores of sequences $i$ and $j$ and so is used as its substitute.

The secondary structure is predicted by SPINE X.[36] The score is calculated from the probabilities reported by SPINE X according to the equation

$$S(s_i) = K \log \frac{p(s_i)}{f(s_i)} \tag{12}$$

where $s$ is the secondary structure and $p(s_i)$ is the probability from SPINE X for secondary structure $i$. Using the probabilities in this way allows the algorithm to distinguish between predictions of various reliability. For example, if the prediction is confident that a residue is a helix only a helix needs to be considered, but if the prediction is uncertain whether the residue is a helix or a strand then both must be considered. The average and variance of the secondary structure term are calculated as

$$\mu = p(s_H) S(s_H) + p(s_E) S(s_E) + p(s_L) S(s_L) \tag{13}$$

$$\sigma^2 = p(s_H) \left(S(s_H) - \mu\right)^2 + p(s_E) \left(S(s_E) - \mu\right)^2 + p(s_L) \left(S(s_L) - \mu\right)^2 \tag{14}$$

where $H$ is for helix, $E$ is for strand, and $L$ is for loop.

Contact maps are predicted using NNcon.[37] It is set on ab initio mode, uses a contact cutoff distance of 8 Å between α-carbons and is not used on residue pairs that are less than six residues apart. The score for the contact map term for residues in contact is calculated as

$$S_c(m_c, d_s, L) = K \log \frac{p(m_c)}{b(d_s, L)} \tag{15}$$

where $p(m_c)$ is the probability of a contact from NNcon, and $b(d_s, L)$ is the background frequency of a contact given the distance in sequence ($d_s$) and the length of the protein ($L$). The score for residues that are not in contact are calculated as

$$S_{nc}(m_{nc}, d_s, L) = K \log \frac{1 - p(m_c)}{1 - b(d_s, L)} \tag{16}$$

The average and variance of the contact map term are calculated as

$$\mu = p(m_c) S_c + \left(1 - p(m_c)\right) S_{nc} \tag{17}$$

$$\sigma^2 = p(m_c) \left(S_c - \mu\right)^2 + \left(1 - p(m_c)\right) \left(S_{nc} - \mu\right)^2 \tag{18}$$

The solvent accessibility term, with its preference to bury hydrophobic residues, is insufficient to generate compact structures. Therefore a compactness term is used which is calculated as

$$S(d_c, L) = K \log \frac{f(d_c, L)}{v(d_c, L)} \tag{19}$$

where $d_c$ is the distance between the residue’s α-carbon and the center of mass in 5 Å intervals, $L$ is the length of the protein in 25 residue intervals, and $v(d_c, L)$ is the proportion of the volume of a sphere within the $d_c$ interval for a sphere of appropriate size for a protein of length $L$. The average and variance of the term is calculated from the SCOPFamily database. Calculating the compactness term on a per residue basis puts it on the same scale as the other terms unlike if using a score based on the radius of gyration. The compactness term is used only during loop prediction when complete backbone structures are available.

Due to the over-prediction of short SSEs an SSE length term is used. The term’s score is a constant multiplied by the length of the SSE prediction. The term is also weighted by the square root of the average $s$ so that it is on the same scale as the other terms. The constant is set so that the lengths of SSE predictions have a similar distribution to the lengths of SSEs from experimental structures. An average and variance of the length score is calculated from the SCOPFamily database. This term is not used during loop prediction when complete backbone structures are available.

When adding the different terms the only weight is on the SSE length term as mentioned earlier. Otherwise, all the terms are on the same scale and therefore do not require weighting. The relative contribution of the different terms are automatically adjusted according to the confidence in the prediction of secondary structure and contact maps which use reported probabilities and according to the $s$ for the KBP terms. To test the relative contribution of each term, pair prediction was carried out on 75 cases from the SCOPFold100 set that have a length less than 250 residues using the method described below except doing an exhaustive search with loop closure turned off to reduce runtime. The relative contribution of each term is the proportion of the variance in the total score that is contributed by the variance of the term’s score. The average relative contribution is 0.313 for the SSE length term, 0.282 for the contact term, 0.201 for the secondary structure prediction term, 0.081 for the solvent accessibility term, 0.070 for the torsion angle term, and 0.050 for the contact map term. There is a significant variation in the contributions from case to case. For example, the standard deviation in the contribution of the secondary structure prediction term was 0.104.

Pair database

A database of contacting pairs of SSEs is derived from the SCOPFamily database and is clustered by structure and subclustered by solvent accessibility. Secondary structure is defined using Define Secondary Structure of Proteins (DSSP).[38] The SSEs are helices or strands and must be at least three residues in length. Helices include 3₁₀-helices, α-helices, and π-helices. Pairs must have one SSE that is at least four residues in length. Pairs are considered in contact if they have at least two residue to residue contacts. Residues are considered to be in contact if they contain heavy atoms that are less than 5 Å apart.

The pairs were clustered using a novel structure similarity measure designed for comparing pair predictions called placement RMSD (pRMSD). The pRMSD between two pairs 1 and 2, each of which have two SSEs A and B, is calculated by the following procedure. The transformation that superimposes SSE 1A onto 2A is calculated and applied to 1A and 1B. The RMSD between 1A and 2A plus the RMSD between 1B and 2B is then calculated. The same process is repeated using SSE 1B to place SSE 1A. The larger of the two calculations is the pRMSD. This measure is the most relevant because during assembly a new SSE is placed onto the existing prediction in the same way. There is a significant difference between RMSD and pRMSD.
For example, a small rotation of one helix of a pair around its axis will have a small effect on the RMSD, but if you are using that helix to place another SSE you are essentially rotating the other SSE about the helix’s axis. If the SSE is far from the helix’s axis, the rotation will move the SSE a significant distance resulting in a high pRMSD. The pRMSD for pair predictions is 2–8 times larger than the RMSD.

SSE pairs were clustered in order of longer to shorter. For a pair to belong to a cluster the corresponding SSEs between the pair and the cluster’s representative must meet the following criteria: the secondary structure types must match, the shorter SSE is at least 80% of the length of the longer SSE, and the RMSD between the SSEs is 1 Å or less. The pair must also have a pRMSD less than 4 Å to the cluster’s representative. If the corresponding SSEs are different in length they are aligned such that the full length of the shorter SSE is aligned and the pRMSD is minimized. A pair can belong to more than one cluster. The sequential order of the SSEs of a pair can be reversed. If a pair does not belong to any existing cluster it becomes the representative of a new cluster. After clustering, each SSE of each cluster is subclustered by HSE using a threshold of the square root of the average squared difference of 5 α-carbons.
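The representative-based clustering described above can be sketched in a few lines of Python. This is only an illustration of the control flow (items are visited in a caller-supplied order, an item may join several clusters, and an item that joins none founds a new cluster). The function name `leader_cluster` and the toy membership test are hypothetical; in SSThread the membership test would combine the secondary structure, length, RMSD, and pRMSD criteria listed in the text.

```python
def leader_cluster(items, is_member):
    """Greedy clustering around representatives.

    `items` must already be in processing order (for SSE pairs, longer first).
    `is_member(item, representative)` is the caller's similarity test.
    Returns a list of (representative, members) tuples.
    """
    clusters = []
    for item in items:
        joined = False
        for representative, members in clusters:
            if is_member(item, representative):
                members.append(item)
                joined = True  # keep looking: multiple membership is allowed
        if not joined:
            clusters.append((item, [item]))
    return clusters

# Toy example: numbers stand in for SSE pairs, and "closer than 1.0"
# stands in for the structural criteria.
clusters = leader_cluster([10.0, 8.5, 9.2, 3.0],
                          lambda x, rep: abs(x - rep) < 1.0)
```

Here the value 9.2 ends up in the clusters of both 10.0 and 8.5, mirroring the statement that a pair can belong to more than one cluster.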


Loop closure

Loop closure algorithms are used in SSThread in two situations: during pair prediction and tree assembly to identify gaps that cannot be bridged by any reasonable structure, and during loop prediction. In both situations a form of CCD[24] is used. In CCD, an initial conformation of a loop that is attached to the n-end of the gap that does not properly bridge the c-end of the gap is altered to bridge the c-end of the gap. In each step the φ and ψ torsion angles of residues in the loop are adjusted to the angle that minimizes the distance between the residues at the c-end of the gap. This process is iterated over the residues of the loop until the loop properly bridges the gap. A variant, full cyclic coordinate descent (FCCD),[39] uses α-carbon only loop structures. In FCCD, the full rotation around the α-carbon is calculated instead of adjusting torsion angles. In practice, FCCD is orders of magnitude faster than CCD but does not result in complete backbone structures.

During pair prediction and tree assembly FCCD is used. An initial loop conformation is obtained by looking it up in a database. To construct the database, first all segments of protein structures for each length are taken from SCOPFold. The segments are then clustered by the α-carbon coordinates of the three residues before and the three residues after the segment. The clusters are then subclustered by the virtual dihedral and torsion angles of the segment’s α-carbons. During loop closure an initial loop conformation is identified by finding the closest cluster using the three α-carbon coordinates from each end of the gap and then choosing the conformation from the cluster that has the highest score. Conformations are scored according to the method used in the torsion angle term [see eq. (2)] but instead of using the φ and ψ torsion angles uses the virtual dihedral and torsion angles of α-carbon coordinates. The sequences of homologs are not used.
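The single-angle update at the heart of torsion-angle CCD has a closed form, sketched below in plain Python. This illustrates the generic CCD step only: for one rotation axis it finds the angle that minimizes the summed squared distance between the moving end atoms and their targets. It is not FCCD and is not claimed to be SSThread's implementation; the helper names (`ccd_angle`, `rotate`) are hypothetical and the axis is assumed to be a unit vector.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ccd_angle(origin, axis, moving, targets):
    """Rotation angle (radians) about the unit `axis` through `origin` that
    minimizes the summed squared distance from `moving` points to `targets`."""
    a = b = 0.0
    for m, f in zip(moving, targets):
        p = sub(m, origin)
        q = sub(f, origin)
        along = dot(p, axis)
        p_perp = tuple(p[k] - along * axis[k] for k in range(3))
        a += dot(q, p_perp)          # coefficient of cos(theta)
        b += dot(q, cross(axis, p))  # coefficient of sin(theta)
    return math.atan2(b, a)          # maximizes a*cos(theta) + b*sin(theta)

def rotate(point, origin, axis, theta):
    """Rodrigues rotation of `point` about the unit `axis` through `origin`."""
    p = sub(point, origin)
    c, s = math.cos(theta), math.sin(theta)
    k_cross_p = cross(axis, p)
    k_dot_p = dot(axis, p)
    return tuple(origin[k] + p[k] * c + k_cross_p[k] * s
                 + axis[k] * k_dot_p * (1.0 - c) for k in range(3))

# Toy case: one moving atom on the x-axis, target on the y-axis, rotation about z.
theta = ccd_angle((0, 0, 0), (0, 0, 1), [(1, 0, 0)], [(0, 1, 0)])
moved = rotate((1, 0, 0), (0, 0, 0), (0, 0, 1), theta)
```

A full CCD loop closure would apply this update to each φ and ψ axis in turn, rotating all downstream atoms, and repeat until the end atoms are close enough to their targets.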
The initial loop conformation is built from the coordinates of the three α-carbons before the gap, the dihedral and torsion angles from the database conformation, the dihedral angle from the three α-carbons after the gap, and ideal distances between adjacent α-carbons. Starting from the conformation from the database, loop closure is attempted using a FCCD greedy search algorithm. An iterative search is carried out in which in each iteration each residue’s α-carbon in the loop is considered as a possible axis of rotation and only the rotation from the axis that results in the highest scoring conformation is retained. Axes are thrown out if they are the same axis that was chosen in the previous iteration, the twist of the rotation is less than 0.5°, the dihedral and torsion angles of the α-carbons are not realistic given the amino acid type, or if there is a steric clash as measured by having two α-carbons that are less than 3 Å apart. The conformations for each axis are scored using the virtual dihedral and torsion angles. If the score is a tie the conformation with the lowest RMSD of the three residues after the gap is retained. The process iterates until the RMSD of the three residues after the gap goes below 0.2 Å, a maximum of 200 iterations has been reached, or an iteration is reached in which all possible axes of rotation are thrown out. If the RMSD of the three residues after the gap drops below 0.2 Å the gap passes, otherwise there is a violation. Loop closure was not run on gaps more than 11 residues in length because in native structures there is probably an SSE that has not yet been predicted within the gap. Also gaps longer than 11 residues are rarely rejected. A diagram of the SSThread algorithm is in Figure 1.

Pair prediction

The first step of SSThread is to predict the structure of SSE pairs. For each pair cluster in the database, the score for each SSE of the pair at each position in the protein is calculated. If there is more than one solvent accessibility cluster, the score for each cluster is calculated and the highest score is used. The SSE prediction must cover the region covered by all members of the cluster, must cover at least 80% of the representative’s residues and otherwise is terminated to maximize the raw score. If the Z-score of the SSE prediction is less than −3 it is rejected. Then for each pair of SSE predictions, one from each SSE of the cluster, that do not overlap the contact and contact map terms are calculated. If the Z-score for the pair is less than −2 it is rejected. If the gap between the SSEs is 11 residues long or less the gap is checked for loop closure and if it fails the pair is rejected.

Tree assembly

Overlapping pair predictions are then assembled. For example, say you have a pair prediction with SSEs A and B and another pair prediction with SSEs C and D in which C overlaps B. The transformation that superimposes SSE C onto SSE B is calculated. That transformation is then applied to SSE D. Now you have a three SSE prediction with SSEs A, B, and the transformed D. The assembly algorithm can be thought of as a graph theory algorithm. Graph theory deals with problems that can be represented by a graph that consists of a set of vertices (points) some of which are connected by edges (lines).
In this case, the vertices are SSE predictions and the edges indicate that the SSEs are part of a pair prediction. The first step to building a graph is to group identical SSE predictions from different pair predictions into a single vertex. The highest scoring pair predictions are used such that the number of vertices equals a parameter M multiplied by the square of the protein’s length. At first, edges come from the pair predictions. Then, edges for pairs based on overlapping SSEs are calculated. SSEs that overlap are checked to determine if they meet the following criteria: the SSEs must have the same secondary structure

type, the overlap must be at least four residues in length, and the RMSD of the overlapping region must be less than 1.5 Å. Given three vertices A–C in which A and B are paired during pair prediction and A and C do not overlap, if C overlaps B and meets the overlap criteria C can be superimposed onto B and thus a new pair prediction between A and C is created. An edge between A and C is then added to the graph. All edges meeting these criteria are added to the graph.

In graph theory, a tree is a graph in which each pair of vertices is connected by a single path. The next step in SSThread is to search for trees that are subgraphs of the main graph that correspond to predictions of the SSEs that make up the core of a protein structure. A diagram of the tree assembly algorithm is in Figure 2. Initially, each vertex is considered to be its own tree. Then an iterative search is conducted building trees with i vertices by joining two existing trees, starting with i equal to 2 and increasing. All pairs of tree sizes that add to i are considered. For example, when i is 4, a 3-vertex tree and a 1-vertex tree could be merged or two 2-vertex trees could be merged. For a pair of trees to be merged they cannot cover any of the same residues of the protein. Otherwise, a new tree is created for each edge connecting the two trees. The atom coordinates corresponding to the tree are then generated. The atom coordinates generated include the three backbone atoms, the β-carbon, the hydrogen attached to the backbone nitrogen, and the oxygen attached to the carbonyl carbon. If the hydrogen or β-carbon are not found in the database structure their coordinates are determined using the backbone coordinates and ideal internal coordinates. If there are any steric clashes as measured by having a pair of α-carbons within 3 Å the tree is rejected. If any β-strands do not belong to a β-sheet as defined by DSSP[38] the tree is rejected.
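The merge step described above (two trees may be joined only if they cover disjoint residues, and each graph edge connecting them yields one new tree) can be sketched as follows. The data layout is an assumption made for illustration: a tree is a dict of vertex and edge frozensets, `spans` maps a vertex name to an inclusive residue range, and the names `merge` and `covered_residues` are hypothetical. Coordinate generation, clash checks, and scoring are omitted.

```python
def covered_residues(tree, spans):
    """Set of residue numbers covered by the SSE predictions in a tree."""
    covered = set()
    for vertex in tree["vertices"]:
        start, end = spans[vertex]
        covered.update(range(start, end + 1))
    return covered

def merge(tree_a, tree_b, spans, graph_edges):
    """Return one merged tree per graph edge that connects the two trees.

    Trees that share any residue cannot be merged and yield an empty list.
    """
    if covered_residues(tree_a, spans) & covered_residues(tree_b, spans):
        return []
    merged = []
    for u in sorted(tree_a["vertices"]):
        for v in sorted(tree_b["vertices"]):
            edge = frozenset((u, v))
            if edge in graph_edges:
                merged.append({
                    "vertices": tree_a["vertices"] | tree_b["vertices"],
                    "edges": tree_a["edges"] | tree_b["edges"] | {edge},
                })
    return merged

def single(vertex):
    """A one-vertex tree, the starting point of the search."""
    return {"vertices": frozenset([vertex]), "edges": frozenset()}

# Toy graph: four SSE predictions; D overlaps B in sequence.
spans = {"A": (1, 10), "B": (15, 22), "C": (30, 38), "D": (18, 25)}
graph_edges = {frozenset("AB"), frozenset("BC"), frozenset("AC"), frozenset("AD")}

ab = merge(single("A"), single("B"), spans, graph_edges)
abc = merge(ab[0], single("C"), spans, graph_edges)
abd = merge(ab[0], single("D"), spans, graph_edges)
```

In the toy graph, merging the A–B tree with C gives two distinct trees (C attached through A or through B), while the attempt to add D is refused because D shares residues with B.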
Then the scores for the contact and contact map terms are calculated for the tree’s new contacts. If the Z-score of the tree is less than −1.5 the tree is rejected. If the Z-score does not place the tree in the top D scoring trees for the current iteration thus far, where D is a search parameter, it is rejected. Because the same tree could be generated in a different order, a check eliminating redundancy is carried out. Then new gaps that are 11 residues long or less are checked for loop closure, and if any fail the tree is rejected. An estimate of the number of helix and strand residues in the protein structure is calculated by adding the probabilities of helix and strand from SPINE X for all the residues of the protein. If the number of residues predicted in a tree is at least 60% of the estimate it is kept as a prediction. To be retained for further searching the number of residues predicted must be less than 125% of the estimate. The top D scoring trees are retained for further searching for each iteration. The search continues until an iteration is reached in which there are no trees that meet all the criteria for further searching.

Figure 1. Diagram of the SSThread algorithm.

Figure 2. Diagram of the tree assembly algorithm.

Journal of Computational Chemistry 2014, 35, 644–656

Clustering

The core predictions are then clustered to reduce the number of predictions necessary for further processing by eliminating redundancy. Predictions are clustered in descending order of Z-score. For a prediction to be considered a member of a cluster it must have the same number of SSEs as the cluster’s representative, the SSEs must match up, the matching SSEs must have the same secondary structure type, the overlap of matching SSEs must be at least 80% of the length of the longer SSE, and the overall RMSD must be less than a threshold that depends on the length of the protein. The threshold RMSD is proportional to the cube root of the protein’s length and is set so that a 150 residue long protein has a threshold RMSD of 1.5 Å. If a prediction does not belong to any existing cluster it becomes the representative of a new cluster. Only the clusters with the highest Z-score are processed further. The number processed further is a parameter N times the length of the protein. For selection purposes a bonus to the Z-score of 0.2 per SSE predicted is used to enrich for larger core predictions.

Loop prediction


The final step is to predict the loops and termini of each cluster’s representative structure. Initial conformations for the loops and for the residues missing from the termini are taken from the SCOPFold database using the aforementioned scoring terms, without the sequences of homologs. Several conformations for each loop or terminus are retained. The loops taken must have end coordinates that are similar to the ends before and after the gap. Loops are then closed by CCD[24] using the same algorithm as in the pair prediction and tree assembly stages, except that full backbone structures are used. The rotations about the φ and ψ torsion angles that minimize the RMSD of the residue after the gap are calculated for each residue, and the score used is the torsion angle term. Then loop and termini predictions that clash sterically with the core structure are eliminated. The contact and contact map scores for the loop and termini predictions against the core structure are then calculated. The highest scoring sets of loop and termini predictions corresponding to complete backbone structures are selected. From those sets any cases with a steric clash between loops or termini are removed. Then the scores for the full backbone predictions are calculated, including the solvent accessibility and compactness terms, which can be calculated now that full backbone structures are available. The highest scoring complete backbone structure for each core prediction is retained. In this process, some core structures will be rejected because no loop can be predicted for one of the gaps, either because every candidate fails loop closure or because no candidate is free of steric clash.
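Returning to the clustering step, the length-dependent RMSD threshold and the selection of clusters for further processing can be written out as a short Python sketch. This is an illustration under stated assumptions rather than the SSThread code: the function names and the representation of a cluster representative as a (Z-score, number of SSEs) tuple are hypothetical, and only the numerical rules given in the text (cube-root scaling anchored at 1.5 Å for 150 residues, N times the protein length, a bonus of 0.2 per SSE) are taken from the method description.

```python
def rmsd_threshold(length, ref_length=150, ref_rmsd=1.5):
    """Cluster RMSD cutoff in Å, proportional to the cube root of the length."""
    return ref_rmsd * (length / ref_length) ** (1.0 / 3.0)


def select_clusters(representatives, length, n_param, bonus_per_sse=0.2):
    """Keep the N * length best cluster representatives.

    representatives: list of (z_score, n_sses) tuples, one per cluster
    (a hypothetical representation used only for this sketch).
    Ranking uses the Z-score plus a bonus of 0.2 per predicted SSE.
    """
    ranked = sorted(
        representatives,
        key=lambda rep: rep[0] + bonus_per_sse * rep[1],
        reverse=True,
    )
    return ranked[: int(n_param * length)]
```

With this scaling a protein eight times longer than the 150-residue reference has twice the threshold, which shows how slowly the cutoff grows with length.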

Results

There are six questions that must be addressed when evaluating the SSThread algorithm. First, can native structures be built from contacting SSE pairs? Second, is the database of experimentally solved structures sufficient to provide the SSE pairs required to predict the structure of a protein with a new fold? Third, can enough accurate pair predictions for the protein be ranked among the highest scoring pair predictions? Fourth, given success at pair prediction, can an accurate core structure be built among the ensemble of core structures generated? Fifth, given an accurate core prediction, can the loops be accurately predicted? Sixth, can the accurate predictions be identified from among the ensemble of predictions?

To test how many native structures can and cannot be built from contacting SSE pairs, each domain in the SCOPFold database was processed to see if all the SSEs can be merged together by contact with other SSEs of the domain. Single helices without any contact to other SSEs are allowed as these could be predicted during loop prediction. In SCOPFold 807 of the 965 cases (84%) can be merged. There are three explanations of why a domain would fail. First are very large proteins containing clusters of small SSEs that are not in contact with the core structure. These SSEs could be considered part of a loop. Second are cases that are mostly disordered, which are not cases for this type of prediction anyway. Third are domains composed of two or more subdomains. Each subdomain could be built but they could not be merged together without considering coil residues. It could be debated whether these subdomains should be treated as separate domains.

To get a rough idea of whether the database of experimental structures is sufficient to predict the structure of a new fold, a pair database was created as described in the methods section but with the overlap length criterion reduced from 80 to 60%, the pRMSD threshold increased from 4 to 6 Å, and the SSE overlap RMSD threshold increased from 1 to 1.5 Å. The overlap length criterion is reduced because during the assembly step of SSThread only four overlapping residues are required. The pRMSD and overlap RMSD are set to a level sufficient for building accurate core structures. Then, for the 807 cases that can be merged, each SSE is initially in its own cluster. If a pair of SSEs is found in a pair cluster in the database that contains a member from a protein with a different SCOP fold, the clusters the two SSEs belong to are merged into one cluster. Once all the pairs have been processed either there is only one cluster, which is a success, or there is more than one cluster, which indicates failure. In 745 of the 807 cases (92%), there is success. There are many explanations why domains fail this test. Two notable explanations are cases with very long α-helices or long β-strands with unusual conformations.
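The database coverage test just described amounts to checking whether the SSE pairs found in the database connect all SSEs of a domain into a single cluster. A minimal union-find sketch of that check is given below. It is an illustration only: the function name and the input format (SSEs indexed from 0, pairs given as index tuples) are assumptions, and the search of the pair database itself is not shown.

```python
def all_sses_connected(n_sses, database_pairs):
    """Return True if the matched pairs link every SSE into one cluster.

    n_sses: number of SSEs in the domain, indexed 0..n_sses-1.
    database_pairs: iterable of (i, j) SSE index pairs for which a match
    from a different fold was found in the pair database (assumed input).
    """
    parent = list(range(n_sses))  # each SSE starts in its own cluster

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    clusters = n_sses
    for i, j in database_pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # merge the two clusters
            clusters -= 1
    return clusters == 1
```

A domain counts as a success when the function returns True, and as a failure when more than one cluster remains after all pairs have been processed.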


To test the algorithm, a pair database was generated for each test case in which all members of the same SCOP fold as the test case are removed before clustering. The test case was also removed from the loop prediction database. This simulates the situation when predicting the structure of a new fold. To test the pair prediction step, a set of 74 cases was selected that pass the previous test and represent a range of lengths and SCOP classes but were otherwise chosen at random. The recall and precision of pair prediction for the 74 cases are shown in Figure 3. Recall is the proportion of native SSE pairs that are accurately predicted. A prediction is considered accurate if the corresponding SSEs of the native SSE pair and the SSE pair prediction have the same secondary structure type, the overlap is at least 60% of the length of the longer SSE, and the pRMSD is below a threshold. The precision is the proportion of predictions with a pRMSD below a threshold. Full recall is not required because if an SSE is in contact with more than one other SSE only one of the pairs needs to be predicted. The number of pair predictions required to predict every SSE in the 74 cases divided by the total number of contacting pairs is 0.72. The low precision can potentially be overcome during later stages.

Pair prediction was also evaluated by running tree assembly on only the pair predictions that are the highest scoring with the M parameter set to 1 and have a pRMSD of 10 Å or less. Because only a fraction of the pair predictions were used, tree assembly could be carried out with a nearly exhaustive search. This is the most relevant way to evaluate pair prediction because it measures whether it is sufficient to generate accurate core structures. Three measures are used to evaluate predictions. RMSD[40] is the classic measure of structure similarity.
Global distance test total score (GDT_TS)[41] and template modeling score (TM-Score)[42] are measures developed to evaluate protein structure predictions and have become commonly used for this purpose. For testing core predictions, the cGDT_TS and cTM-Score are averaged over the number of core residues (helix or strand) in the experimental structure. This way the prediction is penalized for not predicting core residues but not for not predicting loop residues. The results are in Figure 4A. In 16 cases (21%) there are no predictions. Note that for a tree to be retained as a prediction it must cover at least 60% of the estimated number of core residues. In only five cases are all of the SSEs of the protein predicted. In about half of cases failure to predict an SSE of the protein can be explained by an error in secondary structure prediction.

Figure 3. Recall and precision of pair prediction for the 74 benchmark cases at different values of the M parameter and different threshold pRMSDs.

Figure 4. Results for the benchmark set testing for pair prediction (A), tree assembly (B), and loop prediction (C). The structure similarity of the best structure prediction among the ensemble of predictions is reported. The pair prediction graphs do not include 16 cases for which there was no prediction. The tree assembly graphs only include the top scoring clusters according to the N parameter used.

Tree assembly and clustering were run on a set of 21 cases, which are a subset of the 58 cases that had a prediction when testing the pair prediction stage and represent a range of lengths and SCOP classes. The search parameters were M = 1, D = 100,000, and N = 100. Clustering reduced the size of the ensemble by an average of 14.8-fold among the 21 benchmark cases. Because the tree assembly search is heuristic, the best predictions from testing the pair prediction stage may be lost when testing tree assembly. The best prediction might be part of a cluster that does not rank among the top scoring clusters. Also, the best prediction could be a member of a cluster and not the representative of that cluster. The results are in Figure 4B. The average increase in RMSD from testing the pair prediction step to testing the tree assembly step among the 21 cases is 1.7 Å. The cGDT_TS drops 11.5 and the cTM-Score drops 0.118 on average.

Loop prediction was run on the 21 benchmark cases. The accuracy of the best full backbone predictions for the 21 cases is in Figure 4C. For the seven most successful cases, plots of Z-score versus RMSD of the ensemble of predictions are in Figure 5 and the structures of the best prediction among the ensemble are in Figure 6. The average increase in RMSD from the best (lowest RMSD) core prediction to the best full backbone prediction among the 21 benchmark cases is 3.8 Å. The drop in GDT_TS is 14.1 and the drop in TM-Score is 0.121. The best full backbone prediction usually does not come from the best core prediction. Often the best core prediction is rejected during loop prediction. The average increase in RMSD to the best full backbone prediction from the core prediction it was derived from is only 1.6 Å. The drop in GDT_TS is 5.3 and the drop in TM-Score is 0.041.

Figure 5. Z-score versus RMSD of the ensemble of complete backbone predictions for a subset of the benchmark set. Cases are named by their SCOP identifier and can be looked up online at scop.berkeley.edu.

The contact order of a protein is the average distance in sequence of residues that are in contact in the 3D structure.[43] Relative contact order (RCO) is the contact order as a percentage of the protein’s length. The RCOs of SSThread predictions are nearly as high as those of the experimental structures (Fig. 7). In 5 of the 21 cases, the average RCO of the ensemble of predictions is actually higher than the RCO of the experimental structure.

SSThread is partially multithreaded to make use of multicore processors. It runs in polynomial time. In the case of one 100 residue long protein the total runtime was 364 thread-hours. On a computer using eight threads it took about 2 days.
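The definition of relative contact order given above can be made concrete with a short Python sketch. It is a hedged illustration, not the procedure used to produce Figure 7: the contact definition (α-carbons within a distance cutoff), the default cutoff of 8 Å, the minimum sequence separation, and the function name are all assumptions, since the contact criterion used for the RCO calculation is not stated here.

```python
import math


def relative_contact_order(ca_coords, cutoff=8.0, min_separation=1):
    """Relative contact order (%) from a list of (x, y, z) α-carbon coordinates.

    A contact is counted when two α-carbons lie within `cutoff` Å; both the
    cutoff and `min_separation` are illustrative assumptions.
    """
    n = len(ca_coords)
    total_separation = 0
    n_contacts = 0
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                total_separation += j - i  # distance in sequence
                n_contacts += 1
    if n_contacts == 0:
        return 0.0
    # Contact order is the mean sequence separation of contacting residues;
    # dividing by the length and scaling by 100 gives the RCO as a percentage.
    return 100.0 * total_separation / (n_contacts * n)
```

Structures whose contacts are mostly between residues far apart in sequence, such as β-sheets formed by nonadjacent strands, give high values under this definition.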

Identifying the accurate predictions from among the ensemble is difficult. Clustering of the predictions after loop prediction was carried out. The cluster size does not correlate with accuracy. The average correlation coefficient between Z-score and RMSD for the 21 cases is −0.27. The average correlation is 0.20 for GDT_TS and 0.20 for TM-Score.

To compare SSThread to the leading template-free methods, a set of cases from the CASP10 experiment was used. The CASP experiments compare methods using protein structures that have yet to be released and so are blind predictions. The CASP10 target set includes proteins that were classified as “free modeling” targets in which template structures were not available. Unfortunately, there are only seven appropriate targets and none of them are easy. The top five predictions for each target and each participant are available. The top five predictions for SSThread were selected using the Z-score. The fragment insertion algorithms Rosetta[13] and QUARK[14] participated. BCL::Fold[22] participated but only submitted predictions for two of the seven targets. SSThread was not ready to participate in CASP10. The version of the SCOP database used by SSThread (1.75A) was released before the CASP10 experiment took place, so SSThread does not use any information that was not available to the methods being compared. A breakdown of the CASP10 set is in Table 1. The results from the CASP10 set are in Table 2.

Only one prediction from all of the targets and methods being compared could be considered correct, target T0666-D1 by QUARK. Target T0666-D1 is a 195 residue long six-helix bundle. The QUARK prediction had an RMSD of 7.6 Å, which is low for a protein this large. SSThread made no predictions for target T0666-D1 due to errors in the secondary structure prediction. Other than in that case, the methods will be compared based on whether they have some native-like features. When comparing the best of the top five predictions, SSThread produces a prediction with an RMSD less than the Rosetta predictions for two targets (T0719-D6 and T0658-D1) and a prediction with a lower RMSD than QUARK for two targets (T0719-D6 and T0684-D2), a total of three targets. All three of these targets contain β-strands and the other four targets do not. Two of the cases (T0719-D6 and T0658-D1) have a high RCO. If the best SSThread prediction of the entire ensemble of predictions is compared to the top five for Rosetta and QUARK, SSThread produces a prediction with a lower RMSD in six cases, excluding only target T0666-D1 for which SSThread made no prediction. When using the GDT_TS or TM-Score measures instead of RMSD, SSThread compares slightly less well.

Figure 6. Complete backbone structure predictions for a subset of the benchmark set. The experimental structure is on the left and the best prediction is on the right. Equivalent residues between the experimental structure and the prediction are the same color. The structures were superimposed before being placed side by side.

Figure 7. RCO of the experimental structure versus the average RCO of the ensemble of complete backbone predictions from SSThread for the 21 benchmark cases. The line with the equation y = x is included.

Discussion

A major bottleneck of SSThread is the pair prediction step. Pair prediction was tested by running tree assembly on only the accurate pair predictions. Only 5 of the 74 test cases had accurate pair predictions that covered every SSE of the native structure, and in 16 cases there were not enough accurate pair predictions to cover at least 60% of the predicted number of core residues. In about half of cases, the failure to predict an SSE can be explained by an error in secondary structure prediction. Any improvement in secondary structure prediction algorithms will therefore improve the accuracy of SSThread.

Loop prediction was also a bottleneck. The average increase in RMSD from the best (lowest RMSD) core prediction to the best full backbone prediction after loop prediction was 3.8 Å. Most of that increase is because the best full backbone prediction does not come from the best core prediction, which is often rejected during loop prediction.

Another problem is identifying the accurate predictions from among the ensemble of predictions. Fragment insertion algorithms select predictions by clustering the ensemble and choosing the clusters with the most members.[44] This does not work with SSThread. A possible explanation is that SSThread is a nonstochastic algorithm whereas the fragment insertion algorithms are stochastic. The Z-score of SSThread predictions correlates weakly with the accuracy. The next steps for SSThread predictions are to predict side chain conformations followed by refinement. Perhaps at that stage the correct predictions could be identified using an all-atom potential.

SSThread has successful predictions for both α-helix and β-strand containing proteins. However, the β-strands tend to be predicted more accurately than α-helices. Part of the reason is that contact map prediction methods are more accurate at predicting contacts between two β-strands than contacts involving α-helices. Also, for a pair of β-strands that are hydrogen bonded to form a sheet, knowing which residues are hydrogen bonded determines the orientation between the β-strands. Helices can pack in many orientations.

Table 1. Statistics of the CASP10 dataset.

Target     Length  Class  Nα     Nβ   RCO
T0735-D2   88      All-α  7      0    11.7
T0737-D1   117     All-α  6      0    11.0
T0740-D1   155     All-α  7      0    8.6
T0719-D6   163     All-β  7[a]   6    17.4
T0658-D1   166     All-β  1      10   23.1
T0684-D2   168     α+β    10     4    9.2
T0666-D1   195     All-α  6      0    8.8

The class was assigned by the author as defined in the SCOP database. The number of α-helices (Nα) and β-strands (Nβ) that are at least three residues in length are listed. [a] Six of the seven helices are only three residues long.


Table 2. Results of the CASP10 dataset for four methods using the best of the top five predictions as well as the best of the full ensemble of predictions for SSThread (SSThread_all).

RMSD (Å)
Target     SSThread_all  SSThread_top5  Rosetta_top5  QUARK_top5  BCL::Fold_top5
T0735-D2   7.8           11.4           9.8           9.0
T0737-D1   9.5           13.3           12.2          10.8
T0740-D1   13.5          22.1           16.9          20.0
T0719-D6   14.2          15.4           18.6          16.6
T0658-D1   11.2          14.2           19.5          13.7
T0684-D2   13.2          15.5           14.7          17.7        18.3[a]
T0666-D1   NP            NP             13.9          7.6         10.7

GDT_TS
Target     SSThread_all  SSThread_top5  Rosetta_top5  QUARK_top5  BCL::Fold_top5
T0735-D2   32.3          17.9           30.9          30.1
T0737-D1   21.5          20.3           29.4          23.7
T0740-D1   15.9          11.2           19.0          20.8
T0719-D6   14.4          10.1           12.5          12.5
T0658-D1   20.3          16.7           11.1          16.4
T0684-D2   19.9          12.5           14.4          15.7        11.1[a]
T0666-D1   NP            NP             17.3          27.7        20.8

TM-Score
Target     SSThread_all  SSThread_top5  Rosetta_top5  QUARK_top5  BCL::Fold_top5
T0735-D2   0.366         0.220          0.350         0.351
T0737-D1   0.283         0.273          0.350         0.317
T0740-D1   0.234         0.177          0.265         0.270
T0719-D6   0.235         0.195          0.220         0.230
T0658-D1   0.324         0.284          0.189         0.294
T0684-D2   0.290         0.217          0.267         0.235       0.210[a]
T0666-D1   NP            NP             0.295         0.412       0.270

In one case SSThread made no predictions (NP). [a] Only one prediction from BCL::Fold for target T0684-D2 was submitted.


Fragment insertion algorithms have difficulty generating predictions with a high RCO because they are based on fragment predictions, which reflect local structure preferences.[45] SSE packing algorithms like SSThread are based on contacting SSEs that may be far apart in sequence and so are easily capable of generating structures with a high RCO (Fig. 7).

Although the quality of the SSThread predictions in the CASP10 set was poor, when comparing the best of the top five predictions SSThread outperformed Rosetta on two targets and QUARK on two targets (a total of three targets) out of seven when using RMSD as a measure. The three targets were the only β-strand containing proteins, and two of the targets have a high RCO; these are properties for which SSThread is more successful. If the best SSThread prediction from among the full ensemble of predictions is compared to the top five predictions for Rosetta and QUARK, SSThread has a prediction with a lower RMSD than both methods in all six targets for which SSThread makes a prediction. This indicates that if the best prediction could be identified from among the ensemble, SSThread could potentially outperform Rosetta and QUARK.

The most similar algorithm to SSThread is BCL::Fold.[22] The BCL::Fold algorithm is composed of two stages: assembly and refinement. During both stages BCL::Fold uses a Monte-Carlo search in which many different types of random moves are made. During the assembly step there are 74 different types of moves. One move is to add an SSE to the prediction according to a small number of “preferred” orientations. Another is to add a β-strand to the end of an existing β-sheet. The other move types alter or remove previously placed SSEs. For example, one type of move is to rotate an SSE by 0–45° about the x-axis. Another is to translate a β-strand by 2 to 4 Å along the z-axis.
In the refinement step, there are 35 move types including smaller rotations and translations as well as introducing bend to SSEs and twist to β-sheets. Once the core structure has been predicted the loops are predicted by CCD.

BCL::Fold uses idealized conformations of α-helices and β-strands during the assembly stage. For α-helices this is probably acceptable because α-helices only have slight bends and kinks. The conformations of β-strands, however, contain a wide variety of twist and curl. A β-strand from a β-barrel like retinol-binding protein has a very different conformation than a β-strand from a β-sandwich like immunoglobulin. In order for BCL::Fold to predict the structure of a nonidealized SSE it would have to be adjusted during the refinement stage, which comes after all of the SSEs of the core structure have been predicted. During the assembly step, SSThread takes the conformation of α-helices and β-strands from a database of experimental structures and thus they can take on any conformation found in known protein structures. Using SSE pairs from experimental structures does have drawbacks, however. There must be sufficient SSE pairs in the database of experimental structures to cover the SSE pairs of a protein with a new fold. According to a rough estimate this is true for about 92% of proteins. Also, accounting for variation in SSE conformations increases the search space.


The scoring function in BCL::Fold[23] is similar to that of SSThread, but there are key differences. A major difference is that SSThread reports a native Z-score while BCL::Fold reports an estimate of the free energy, which is equivalent to the raw score in SSThread. SSThread uses the native Z-score to compare incomplete predictions, including predictions of different sizes and covering different residues. BCL::Fold does not use the sequences of homologous proteins in its scoring and does not use contact map predictions. Instead of testing for loop closure, as in SSThread, BCL::Fold scores gaps using the Euclidean distance and the distance in sequence of the gap.

Keywords: ab initio protein structure prediction · knowledge-based potential · contact map prediction · secondary structure prediction · loop closure

How to cite this article: K. J. Maurice, J. Comput. Chem. 2014, 35, 644–656. DOI: 10.1002/jcc.23543

[1] C. B. Anfinsen, E. Haber, M. Sela, F. H. White, Proc. Natl. Acad. Sci. USA 1961, 47, 1309.
[2] D. T. Jones, W. R. Taylor, J. M. Thornton, Nature 1992, 358, 86.
[3] M. Hendlich, P. Lackner, S. Weitckus, H. Floeckner, R. Froschauer, K. Gottsbacher, G. Casari, M. J. Sippl, J. Mol. Biol. 1990, 216, 167.
[4] M. J. Rooman, J. P. Kocher, S. J. Wodak, J. Mol. Biol. 1991, 221, 961.
[5] K. Nagano, J. Mol. Biol. 1973, 75, 401.
[6] B. Rost, C. Sander, J. Mol. Biol. 1993, 232, 584.
[7] U. Göbel, C. Sander, R. Schneider, A. Valencia, Proteins 1994, 18, 309.
[8] P. Fariselli, R. Casadio, Protein Eng. 1999, 12, 15.
[9] H. Zhang, T. Zhang, K. Chen, K. D. Kedarisetti, M. J. Mizianty, Q. Bao, W. Stach, L. Kurgan, Brief Bioinform. 2011, 12, 672.
[10] A. Aszódi, M. J. Gradwell, W. R. Taylor, J. Mol. Biol. 1995, 251, 308.
[11] C. Czaplewski, S. Kalinowski, A. Liwo, H. A. Scheraga, J. Chem. Theory Comput. 2009, 5, 627.
[12] M. S. Shell, S. B. Ozkan, V. Voelz, G. A. Wu, K. A. Dill, Biophys. J. 2009, 96, 917.
[13] K. T. Simons, C. Kooperberg, E. Huang, D. Baker, J. Mol. Biol. 1997, 268, 209.
[14] D. Xu, Y. Zhang, Proteins 2012, 80, 1715.
[15] L. Kinch, S. Yong Shi, Q. Cong, H. Cheng, Y. Liao, N. V. Grishin, Proteins 2011, 79 (Suppl 10), 59.
[16] C. R. Brasil, A. C. Delbem, F. L. da Silva, J. Comput. Chem. 2013, 34, 1719.
[17] E. Faraggi, Y. Yang, S. Zhang, Y. Zhou, Structure 2009, 17, 1515.
[18] F. E. Cohen, T. J. Richmond, F. M. Richards, J. Mol. Biol. 1979, 132, 275.
[19] B. Fain, M. Levitt, J. Mol. Biol. 2001, 305, 191.
[20] G. A. Wu, E. A. Coutsias, K. A. Dill, Structure 2008, 16, 1257.
[21] N. Max, C. Hu, O. Kreylos, S. Crivelli, Proteins 2010, 78, 559.
[22] M. Karakaş, N. Woetzel, R. Staritzbichler, N. Alexander, B. E. Weiner, J. Meiler, PLoS One 2012, 7, e49240.
[23] N. Woetzel, M. Karakaş, R. Staritzbichler, R. Müller, B. E. Weiner, J. Meiler, PLoS One 2012, 7, e49242.
[24] A. A. Canutescu, R. L. Dunbrack, Jr., Protein Sci. 2003, 12, 963.
[25] S. E. Brenner, P. Koehl, M. Levitt, Nucleic Acids Res. 2000, 28, 254.
[26] A. G. Murzin, S. E. Brenner, T. Hubbard, C. Chothia, J. Mol. Biol. 1995, 247, 536.
[27] Q. Dong, X. Wang, L. Lin, BMC Bioinform. 2006, 7, 324.
[28] A. Panjkovich, F. Melo, M. A. Marti-Renom, Genome Biol. 2008, 9, R68.
[29] T. Hamelryck, Proteins 2005, 59, 38.
[30] N. V. Buchete, J. E. Straub, D. Thirumalai, Protein Sci. 2004, 13, 862.
[31] S. Miyazawa, R. L. Jernigan, J. Chem. Phys. 2005, 122, 024901.
[32] J. P. Kocher, M. J. Rooman, S. J. Wodak, J. Mol. Biol. 1994, 235, 1598.
[33] W. Li, L. Jaroszewski, A. Godzik, Bioinformatics 2001, 17, 282.
[34] Y. Huang, B. Niu, Y. Gao, L. Fu, W. Li, Bioinformatics 2010, 26, 680.
[35] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman, Nucleic Acids Res. 1997, 25, 3389.
[36] E. Faraggi, T. Zhang, Y. Yang, L. Kurgan, Y. Zhou, J. Comput. Chem. 2012, 33, 259.
[37] A. N. Tegge, Z. Wang, J. Eickholt, J. Cheng, Nucleic Acids Res. 2009, 37, W515.
[38] W. Kabsch, C. Sander, Biopolymers 1983, 22, 2577.
[39] W. Boomsma, T. Hamelryck, BMC Bioinform. 2005, 6, 159.
[40] W. Kabsch, Acta Cryst. 1976, 32A, 922.
[41] A. Zemla, Nucleic Acids Res. 2003, 31, 3370.
[42] Y. Zhang, J. Skolnick, Proteins 2004, 57, 702.
[43] K. W. Plaxco, K. T. Simons, D. Baker, J. Mol. Biol. 1998, 277, 985.
[44] D. Shortle, K. T. Simons, D. Baker, Proc. Natl. Acad. Sci. USA 1998, 95, 11158.
[45] R. Bonneau, I. Ruczinski, J. Tsai, D. Baker, Protein Sci. 2002, 11, 1937.

Received: 2 September 2013 Revised: 15 November 2013 Accepted: 5 January 2014 Published online on 6 February 2014
