Molecular BioSystems


Accepted Manuscript

This article can be cited before page numbers have been issued, to do this please use: B. Kumari, R. Kumar and M. Kumar, Mol. BioSyst., 2014, DOI: 10.1039/C4MB00425F.

This is an Accepted Manuscript, which has been through the Royal Society of Chemistry peer review process and has been accepted for publication. Accepted Manuscripts are published online shortly after acceptance, before technical editing, formatting and proof reading. Using this free service, authors can make their results available to the community, in citable form, before we publish the edited article. We will replace this Accepted Manuscript with the edited and formatted Advance Article as soon as it is available. You can find more information about Accepted Manuscripts in the Information for Authors. Please note that technical editing may introduce minor changes to the text and/or graphics, which may alter content. The journal’s standard Terms & Conditions and the Ethical guidelines still apply. In no event shall the Royal Society of Chemistry be held responsible for any errors or omissions in this Accepted Manuscript or any consequences arising from the use of any information it contains.

www.rsc.org/molecularbiosystems


DOI: 10.1039/C4MB00425F

Low complexity and disordered regions of proteins have different structural and amino acid preferences

Bandana Kumari1, Ravindra Kumar1 and Manish Kumar1*

1 Department of Biophysics, University of Delhi South Campus, New Delhi, India

*Corresponding author: Manish Kumar, Department of Biophysics, University of Delhi South Campus, New Delhi, India. Email: [email protected] Phone No.: +91-11-24114111

Published on 21 November 2014. Downloaded by University of California - San Francisco on 27/11/2014 08:41:49.


Abstract
Low complexity regions (LCRs), or non-random regions of a few amino acids, are abundantly present in proteins. LCRs are traditionally considered floppy structures with high solvent accessibility, so little attention has been paid to them in structural studies. However, LCRs have been found to contain information relevant to protein structure and to various important functions. Here we report a study of the structural trends, solvent accessibility and amino acid preferences of LCRs. The results show that LCRs might attain any type of secondary structure; helices are seen frequently, whereas sheets occur rarely. We also found that LCRs are not always exposed on the surface. Trans-membrane helices made an insignificant contribution to the overall helix content. LCRs that have secondary structure differ in their enrichment and depletion of amino acids from LCRs without secondary structure and from disordered protein sequences. However, LCRs of NMR structures showed compositional and functional similarity to disordered regions of proteins. We also noted that in about three-quarters of LCRs the residues did not all belong to a single structural class but formed an ensemble of more than one secondary structure, which indicates that LCRs are found at places where structural transitions occur. Overall, the analysis suggests that the whole protein sequence, rather than the local amino acid composition of the LCR alone, has the greater influence on structure and sequence enrichment.

Keywords: low complexity region; protein secondary structure; surface accessibility; trans-membrane helix; root mean square deviation; GO term enrichment

List of Abbreviations
LCR: Low Complexity Region
PDB: Protein Data Bank
NA: Nucleic Acid
NMR: Nuclear Magnetic Resonance
DSSP: Dictionary of Protein Secondary Structure
RMSD: Root Mean Square Deviation
GO: Gene Ontology

Introduction
Low complexity regions (LCRs) in a protein sequence are regions of biased composition1, characterized by homo-polymeric repeats of a single amino acid, by hetero-polymeric short repeats of amino acid residues, or by aperiodic mosaics of a few amino acids2. Every amino acid has the potential to occur in LCRs, which are present in every domain of life, viz. Eukaryotes, Archaea and Bacteria3. Statistical analysis has shown that approximately one-quarter of the amino acids in protein sequences are present in LCRs and that more than one-half of known proteins have at least one LCR4, 5. LCRs have been associated with many diverse and important functions, such as antigen processing and diversification3, 6-9, genetic recombination9, 10, protein-protein interactions11, etc. Several studies have established that LCRs are also involved in human neurodegenerative diseases12, 13.

Despite their well-documented importance and abundance, the compositional and structural properties of LCRs are poorly understood, and their structural status as ordered or disordered is ambiguous. LCRs are often considered part of disordered protein segments, which most likely do not form any secondary structure14-16 but exist as solvent-exposed, disordered coil4, 17, 18. It has also been reported that proteins with perfect repeats are natively disordered and can acquire tertiary structure upon ligand binding19. For example, Lobanov et al.20 and Lobanov and Galzitskaya21 have reported some frequently occurring low complexity regions, particularly GGGGG, PPPPP, TTTPTT, GGGGSGG, KKKKK, etc., as part of disordered pattern libraries. However, a few studies have also reported that LCRs can form well-ordered structures (often helical) in the protein sequence5, 22, 23. A recent study by Lobanov et al.24 tried to associate homorepeats (of ≥6 residues) and disordered patterns present in non-homologous proteins of different proteomes with a common function.

The last major study on the nature of the secondary structure of LCRs was done by Saqi23 in the mid-nineties on a small dataset of 202 protein structures. It concluded that LCRs were overwhelmingly helical in nature. In light of the information now available, a fresh look at the secondary structural states of LCRs in contemporary protein structure data may offer interesting insights into the structural aspects of LCRs. The present study is an attempt to explore whether LCRs of proteins always behave like intrinsically unstructured regions, existing as solvent-exposed, disordered coils, or can also form well-ordered structural regions of proteins. Here we have analyzed the secondary structure content and surface accessibility of a non-redundant dataset of Protein Data Bank (PDB) proteins and found that, contrary to popular belief, LCRs might have secondary structures and might not always exist as highly accessible disordered regions. Another interesting observation was that proteins whose structures were determined by X-ray crystallography were found to possess ordered LCRs, while those whose structures were determined by NMR possessed disordered LCRs. We also compared the enrichment/depletion profile of disorder-promoting amino acids for LCRs in ordered and disordered regions of proteins and observed that they have different enrichment and depletion patterns of amino acids.

Materials and methods
Dataset construction
We downloaded the September 2014 release of the PDB25, which had 103,015 protein structures consisting of 270,485 protein chains. It is well established that binding to nucleic acid (NA) promotes disorder-to-order transitions in some NA-binding proteins26, 27. Hence a protein that is unstructured in its native state can undergo a disorder-to-order transition upon binding to NA and adopt a folded conformation14, 28, 29. Moreover, NA-binding regions of proteins are rich in positively charged residues, which may be recognized as LCRs. We therefore removed protein-NA hybrid structures from our dataset, as they might bias our observations. The sequence of each protein chain was extracted from the ATOM records of the PDB. To create a clean and unbiased protein sequence database, we removed the structures containing characters other than the standard amino acid symbols (B, X and Z) and were left with 240,770 protein chains. We used CD-HIT30 with a clustering threshold of 40% and a throwaway sequence length of 49 amino acids. This was done to filter out redundancy, so that homology bias among the protein sequences, and structural information covering only a small region of a protein, could be avoided. This strategy also helped us to winnow the structures of the same proteins solved multiple times in the PDB31-33. Finally, we had a non-redundant dataset of 16,077 protein chains. We also extracted the amino acid sequence of each protein chain from its PDB SEQRES record.
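The sequence clean-up described under Dataset construction can be illustrated with a minimal Python sketch. It is not the authors' code: the names `STANDARD_AA`, `is_clean`, `filter_chains` and `throwaway_len` are hypothetical, and the sketch assumes that a throwaway length of 49 means chains of 49 residues or fewer are discarded. The clustering at 40% identity is done by CD-HIT itself and is not reproduced here.

```python
# Hypothetical helper names; a sketch of the pre-clustering filter only.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard one-letter codes


def is_clean(seq):
    """Return True if seq is non-empty and uses only standard residues."""
    return bool(seq) and set(seq.upper()) <= STANDARD_AA


def filter_chains(chains, throwaway_len=49):
    """Keep chains made of standard residues that are longer than throwaway_len.

    chains: dict mapping a chain identifier to its amino acid sequence.
    Assumption: chains of throwaway_len residues or fewer are dropped.
    """
    return {
        chain_id: seq
        for chain_id, seq in chains.items()
        if is_clean(seq) and len(seq) > throwaway_len
    }
```

A chain containing B, X or Z fails `is_clean` and is dropped regardless of its length, which mirrors the order of the filtering steps in the text.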


Determination of low complexity region
We determined LCRs in a protein sequence by using SEG2, 34. In SEG, the search for low complexity segments is controlled by three numeric parameters4, namely trigger window length (L), trigger complexity (K1) and extension complexity (K2). During LCR identification, SEG collects all possible subsequences of length L having local sequence complexity ≤K1. All overlapping subsequences having sequence complexity ≤K1 are merged in both directions for as long as the complexity of the contig created by the overlapping subsequences remains ≤K2. In the present work, we used the default settings of all three parameters (L=12, K1=2.2, K2=2.5). We found a total of 9,308 LCRs in 6,414 protein chains obtained from ATOM records, among which 286 LCRs had a chain break. We excluded these 286 LCRs to make sure that all LCRs have continuous structure and sequence. To avoid identifying His tags (HHHHHH) as low complexity regions22, we also removed all homo-hexa-His LCRs that were present at the C- or N-terminus of the sequence and/or were described as ‘expression tag’ in the SEQADV record of the PDB. In this way, we obtained a final dataset of 8,632 LCRs (henceforth termed “ATOM_LCRbase”), consisting of 123,792 amino acids and obtained from 5,955 protein chains. The length of the LCRs in ATOM_LCRbase ranged from 4 to 87 amino acids.

In general, protein structures determined by X-ray crystallography are considered ordered. Proteins whose structures are solved by NMR are likely to be more flexible35-37, and hence NMR can be used in the prediction of intrinsic disorder38. As LCRs are known to interfere with the crystallization process, it is appropriate to study the NMR- and X-ray-derived structures separately and to assess the influence of the structural environment on LCRs. Hence we created two subsets of ATOM_LCRbase comprising LCRs derived from X-ray and NMR structures, referred to as X-RAY_ATOM_LCRbase and NMR_ATOM_LCRbase respectively. They contained 7,319 LCRs obtained from 4,958 protein chains and 970 LCRs obtained from 798 protein chains, respectively.
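The window-based search controlled by L, K1 and K2 can be illustrated with a simplified Python sketch. This is not the SEG program and is not expected to reproduce its output: it assumes that complexity is the Shannon entropy (in bits) of the residue composition of a window, it treats a run of consecutive windows with complexity ≤K2 as one contig, and it reports a contig only if at least one of its windows reaches the trigger threshold K1. The function names `complexity` and `low_complexity_segments` are hypothetical, and SEG's final refinement of segment boundaries is omitted.

```python
import math
from collections import Counter


def complexity(window):
    """Shannon entropy (bits per residue) of the composition of window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def low_complexity_segments(seq, L=12, K1=2.2, K2=2.5):
    """Return candidate LCRs as (start, end) pairs, 0-based, end exclusive.

    Simplified sketch: runs of consecutive windows with complexity <= K2
    are merged, and a run is kept only if some window in it has
    complexity <= K1 (the trigger).
    """
    if len(seq) < L:
        return []
    scores = [complexity(seq[i:i + L]) for i in range(len(seq) - L + 1)]
    segments = []
    i = 0
    while i < len(scores):
        if scores[i] > K2:
            i += 1
            continue
        j = i
        triggered = False
        # Walk through the run of windows that satisfy the extension threshold.
        while j < len(scores) and scores[j] <= K2:
            triggered = triggered or scores[j] <= K1
            j += 1
        if triggered:
            # The last window in the run starts at j - 1 and spans L residues.
            segments.append((i, j - 1 + L))
        i = j
    return segments
```

For instance, a window of twelve distinct residues has an entropy of log2(12), about 3.58 bits, which is above both thresholds, whereas a homo-polymeric window has an entropy of zero and always triggers.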
An inherent problem of any sequence dataset derived from protein structures is the presence of partial sequences, because the PDB also contains structures that correspond to only a section of the whole protein. Regions of proteins that are invisible in electron density maps are likely to be disordered, which might have prevented them from crystallizing into well-ordered structures that can diffract X-rays coherently. In the present study, LCRs that had a chain break and residues not listed in ATOM records were excluded from ATOM_LCRbase, which means we did not

we constructed another LCR dataset from protein sequences obtained from SEQRES records only, named SEQRES_LCRbase, using the part of each sequence that is present in the PDB SEQRES record but absent from the ATOM record. Owing to low confidence in identifying disorder at the terminal regions of a protein, disordered regions of fewer than 6 residues at the C- and N-termini were not considered in our analysis39. LCRs containing His tags and those comprising fewer than 4 residues were also excluded, as was done while making ATOM_LCRbase. The final SEQRES_LCRbase dataset had 1,515 LCRs collected from 1,290 protein chains. A schematic diagram describing the different stages of creation of all the datasets is presented in Figure 1.

Secondary structure assignment of amino acids
The secondary structure of each amino acid was assigned using the Dictionary of Protein Secondary Structure (DSSP)40. DSSP takes the atomic coordinates of a protein as input and, using the hydrogen-bonding pattern, assigns one of the following secondary structure types to each amino acid: alpha helix (H), 3/10 helix (G), pi helix (I), residue in isolated beta-bridge (B), extended strand participating in a beta ladder (E), hydrogen-bonded turn (T) and bend (S). In this work we merged H, G and I into a single structure class, helix (H); B and E into sheet (S); and T, S and structural elements not belonging to any DSSP-assigned secondary structure type into coil (C).

Solvent accessibility assignment of amino acids
The solvent accessibility (SA) of an amino acid residue in a protein structure is a measure of its location in the protein’s inner core or on its surface. We calculated the SA of each amino acid residue in all PDB proteins using DSSP40. All SA values were divided into quartiles to categorize an amino acid as buried, semi-exposed or exposed. Amino acids having SA values falling under the first quartile (0 −
