www.proteomics-journal.com

Page 1

Proteomics

Leveraging the complementary nature of RNA-Seq and shotgun proteomics data

Xiaojing Wang1¶, Qi Liu1¶, Bing Zhang1,2,3*

1 Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN 37232

2 Vanderbilt-Ingram Cancer Center, Vanderbilt University School of Medicine, Nashville, TN 37232

3 Department of Cancer Biology, Vanderbilt University School of Medicine, Nashville, TN 37232

* Correspondence should be addressed to Bing Zhang. Tel: 615-936-0090 Fax: 615-322-0502 Email: [email protected]

¶ Authors contributed equally

Keywords: Proteogenomics, RNA-Seq, Proteomics, Post-transcriptional regulation, Data integration

Total number of words: 4915

Received: 02-May-2014; Revised: 22-Aug-2014; Accepted: 25-Sep-2014

This article has been accepted for publication and undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process, which may lead to differences between this version and the Version of Record. Please cite this article as doi: 10.1002/pmic.201400184.

This article is protected by copyright. All rights reserved.


Abstract

RNA sequencing (RNA-Seq) and mass spectrometry-based shotgun proteomics are powerful high-throughput technologies for identifying and quantifying RNA transcripts and proteins, respectively. With the increasing affordability of these technologies, many projects have started to apply both to the same samples to achieve a more comprehensive understanding of biological systems. A major analytical challenge for such integrative projects is how to effectively leverage the complementary nature of RNA-Seq and shotgun proteomics data. RNA-Seq provides comprehensive information on mRNA abundance, alternative splicing, nucleotide variation and structural alteration. Sample-specific protein databases derived from RNA-Seq data can better approximate the real protein pools in cell and tissue samples and thus improve protein identification. Meanwhile, proteomics data provide essential confirmation of the validity and functional relevance of novel findings from RNA-Seq data. At the quantitative level, mRNA and protein levels are only modestly correlated, suggesting strong involvement of post-transcriptional regulation in controlling gene expression. Here we review recent studies at the interface of RNA-Seq and proteomics data. We discuss goals, accomplishments and challenges in RNA-Seq-based proteogenomics. We also examine the current status and future potential of parallel transcriptome and proteome quantification in revealing post-transcriptional regulatory mechanisms.

1. Introduction

Gene expression is the most fundamental process through which genotype gives rise to phenotype. Microarray and, more recently, RNA sequencing (RNA-Seq) technologies have been the primary tools for genome-wide gene expression studies during the past decade.


Because mRNAs and proteins are immediately adjacent in the central dogma of molecular biology, mRNA profiles are usually treated as surrogates for protein profiles. Recent advances in high-throughput liquid chromatography-tandem mass spectrometry (LC-MS/MS)-based shotgun proteomics have enabled the identification of more than 10,000 proteins from individual biological samples [1], and genome-scale proteomics is becoming increasingly practical and affordable. To gain a more comprehensive understanding of gene expression, many research groups have started to apply both RNA-Seq and proteomics technologies to the same samples in parallel. Effective integration of the data generated by these complementary technologies is critical to maximizing the potential of these projects.

On one hand, data generated at the two molecular levels can complement each other. Database search is the primary method for peptide and protein identification in shotgun proteomics. However, the protein reference databases typically used for these searches are neither complete nor sample-specific. Because RNA-Seq provides unbiased and comprehensive transcript inventories for individual samples, information derived from RNA-Seq, including transcript abundance, sequence variation, and splicing variation, can be used to construct sample-specific protein sequence databases that improve protein identification in shotgun proteomics. Meanwhile, proteomics data can help validate novel isoforms and novel coding DNA sequences (CDSs) predicted by RNA-Seq, providing concrete evidence to improve genome annotation. Proteomics data can also help elucidate the functional relevance of the large number of sequence variations detected by RNA-Seq, including single nucleotide variants (SNVs), RNA edits, and small insertions and deletions (INDELs).

On the other hand, despite the close relationship between RNA and protein, it is increasingly clear that protein abundance cannot be directly inferred from the corresponding mRNA abundance [2]. Studies comparing either steady-state mRNA and protein abundance or their alterations in response to perturbations have found that mRNA levels explain less than half of the variability in protein levels [3-8], suggesting that protein abundance is controlled not only by regulating mRNA abundance, but also by mechanisms that affect other steps of protein metabolism, such as translation and degradation. Thus, a complete understanding of gene expression regulation and its relationship to cellular function and human disease cannot be achieved without integrating transcriptomics and proteomics.

In this review, we summarize recent studies at the interface of RNA-Seq and proteomics data. We start with the goals, accomplishments and challenges in RNA-Seq-based proteogenomics; then we examine the current status and future potential of parallel transcriptome and proteome quantification in revealing post-transcriptional gene regulatory mechanisms.

2. RNA-Seq-based proteogenomics

Although the term proteogenomics has been increasingly used in a broad sense to describe the integrative analysis of proteomics and genomics data, it originally referred to the use of proteomics data to improve genome annotation. For this purpose, MS/MS data are searched against a six-frame translated genome database, or against databases generated from gene structure prediction and exon-exon junction models, rather than standard reference protein databases, as detailed in a recent review [9]. Despite partial success, these databases contain a large number of putative proteins, leading to low sensitivity and a high false positive rate in peptide identification.

Because RNA-Seq allows genome-wide analysis of transcription at single-nucleotide resolution, it has been widely used since its emergence to facilitate proteomics studies of non-model organisms that lack a fully sequenced and well-annotated genome [10-14]. In human and other model organisms, it has been shown that sample-specific protein sequence databases derived from RNA-Seq data can better approximate the real protein pools in cell and tissue samples and thus improve protein identification [15, 16]. Unlike traditional proteogenomics, which uses proteomics to validate reference genome-based protein predictions, RNA-Seq-based proteogenomics uses proteomics to validate individual transcriptome-based protein predictions. This strategy is very appealing because RNA-Seq data offer multiple types of information, including mRNA abundance, SNVs, alternative splicing, and novel coding regions (Fig. 1).

2.1 mRNA abundance

Given sufficient sequencing depth, RNA-Seq is able to provide a nearly complete set of all expressed transcripts for individual biological samples. It is reasonable to assume that undetected transcripts are untranscribed or extremely low in abundance. Fig. 2a depicts the distribution of transcript abundance based on data generated from ten colorectal cancer cell lines [17]. Transcript abundance for all transcripts (green) or the subset of all coding transcripts (grey) spreads over a wide range of FPKM (Fragments Per Kilobase of Exon Per Million Fragments Mapped) values; however, the abundance distribution for the subset of transcripts with detectable proteins in the proteomics study is clearly narrower and shifted toward higher FPKM values (red). Similar patterns have been observed in other studies [1, 15, 18, 19], suggesting that many transcripts either do not yield functional proteins or yield proteins whose abundance falls below the detection limit of proteomics. Accordingly, it has been suggested that an abundance-based filter can be used to remove these irrelevant entries from a sample-specific protein sequence database [15].
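The abundance-based filter described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the toy read counts and sequences, and the FPKM cutoff of 1.0 are hypothetical and are not taken from the cited studies.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of exon per Million fragments mapped."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


def filter_database(proteins, abundance, min_fpkm=1.0):
    """Keep only protein entries whose transcript meets the FPKM cutoff.

    proteins:  dict transcript_id -> protein sequence
    abundance: dict transcript_id -> FPKM (missing ids count as 0)
    min_fpkm:  hypothetical cutoff; a real study would have to tune it
    """
    return {tid: seq for tid, seq in proteins.items()
            if abundance.get(tid, 0.0) >= min_fpkm}


# Toy data: (fragments, transcript length in bp); all values are made up.
counts = {"tx_high": (500, 2_000), "tx_low": (2, 4_000)}
total = 10_000_000
abundance = {tid: fpkm(n, length, total) for tid, (n, length) in counts.items()}

# Toy protein entries; "tx_absent" has no RNA-Seq evidence at all.
proteins = {"tx_high": "MKTAYIAK", "tx_low": "MSLLTEVE", "tx_absent": "MAAAK"}
filtered = filter_database(proteins, abundance)
```

In this toy case only `tx_high` survives the filter. The cutoff trades database size against the risk of discarding proteins whose transcripts are rare, so any fixed value should be read as a placeholder.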
Although the presence of a transcript does not guarantee the presence of the corresponding protein because of post-transcriptional regulation, the abundance-based filter reduces database size and increases search sensitivity. Using this strategy, our previous study in two colorectal cancer cell lines [15] showed a 5% increase in the number of identifiable spectra, which is in turn expected to improve spectral counting-based protein quantification. Recently, Omasits et al. used RNA-Seq to generate condition-specific endpoint estimates of actively transcribed protein-coding genes, which were subsequently used to guide directed shotgun proteomics experiments [16]. Applying the strategy to Bartonella henselae grown with or without IPTG (isopropyl β-D-1-thiogalactopyranoside), the authors identified a total of 1,250 proteins with an estimated false discovery rate below 1%, representing 85% of all distinct annotated proteins and ~90% of the expressed protein-coding genes under individual conditions. In addition to thresholding, mRNA abundance estimated from RNA-Seq may also be used directly as prior information on protein presence to improve protein identification. As demonstrated in a recent study, this approach led to an 8% increase in protein identification [20].

2.2 Transcriptome-level SNVs

Transcriptome-level SNVs reflect both DNA variations and RNA editing events. These SNVs are of particular interest in the analysis of disease samples in which DNA variations or RNA editing may play a pathogenic role. Regardless of their mechanism of origin, mRNA SNVs are directly upstream of protein production and closely relevant to proteomics. Meanwhile, verifying these SNVs at the protein level is critical in evaluating their functional relevance. Publicly available single nucleotide polymorphism (SNP), mutation, and RNA-editing databases can help interpret the SNVs identified in transcriptomics and proteomics studies. Our studies in colorectal cancer cell lines showed that incorporating SNVs identified by RNA-Seq into customized protein sequence databases enables effective identification of variant peptides [15, 17].
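To make the idea of a customized variant database concrete, the sketch below applies a single amino acid substitution to a protein sequence and returns the tryptic peptide that carries it. It is a simplified illustration, not the pipeline used in the cited studies: the function names and the toy sequence are hypothetical, and the digestion rule (cleave after K or R, but not before P) ignores missed cleavages.

```python
import re


def tryptic_peptides(protein):
    """In silico trypsin digest: cleave after K or R, but not before P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def variant_peptide(protein, position, ref, alt):
    """Return the tryptic peptide carrying a single amino acid variant.

    position is 1-based; ref and alt are one-letter amino acid codes.
    """
    if protein[position - 1] != ref:
        raise ValueError("reference residue does not match the sequence")
    mutated = protein[:position - 1] + alt + protein[position:]
    start = 0
    for peptide in tryptic_peptides(mutated):
        end = start + len(peptide)
        if start < position <= end:  # this peptide covers the variant site
            return peptide
        start = end
    raise ValueError("position lies outside the protein")


# Toy sequence invented for illustration; it is not a real protein.
toy = "MSEYKLAVVGAGGVGKSALTR"
print(variant_peptide(toy, 12, "G", "V"))
```

Because the digest is run on the mutated sequence, a substitution that creates or removes a K or R cleavage site changes the peptide boundaries automatically. Only the variant-bearing peptides need to be appended to the reference database for the search.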
Among the hundreds of variant peptides identified, some occurred in well-known cancer genes such as KRAS and TP53. Fig. 2b shows the proteomic identification of a KRAS(Gly12Val) mutation in the colon cancer cell line SW480 based on a customized database derived from matched RNA-Seq data. Using a similar approach, Sheynkman et al. identified 695 variant peptides that mapped to 504 SNV sites in the Jurkat cell line [21]. To demonstrate the advantage of the customized SNV database derived from the Jurkat cell RNA-Seq data, they also performed a search against a database containing all known non-synonymous SNVs from dbSNP and showed that more than 70% of the variant peptides identified in this way are not supported by the RNA-Seq data and are likely to be false positives. Interestingly, the authors also found that a significant fraction of heterozygous alleles are expressed at the protein level, suggesting the feasibility of quantifying allele-specific protein expression at a large scale. Another study focused on SNVs located on chromosome 20 and identified 20 SNV sites in three liver cancer cell lines [22]. More recently, we applied the approach to human tumor specimens and identified 796 single amino acid variants (SAAVs) in 86 colorectal tumor samples, including 64 somatic variants [23]. We also found that somatic variants have a significantly stronger negative impact on protein abundance than germline variants [23].

It is well accepted that DNA sequencing is better suited than RNA-Seq for calling somatic mutations. However, studies have shown that only about 40% of the somatic mutations detected by exome sequencing can be confirmed at the RNA level [23-25], raising the question of whether the remaining mutations are detectable at the protein level. A study including matched genomic, transcriptomic, and proteomic data may help clarify this and allow us to understand how genotype allele frequencies relate to allele-specific expression at the RNA and protein levels. This type of tri-level integration can also separate SNVs into DNA variations and RNA edits [26-28].
As an example, an integrative personal omics profiling (iPOP) study of a single individual generated customized databases based on missense SNVs and RNA edits from DNA sequencing and RNA-Seq, respectively, and used them for proteomics searches [19]. The study identified 48 SNV-containing peptides and 51 RNA edit-containing peptides. According to the study, a large fraction of personal variants are expressed as transcripts, and a subset is subsequently translated into proteins.

2.3 Alternative splicing isoforms

Almost 90% of multi-exon genes undergo alternative splicing under certain circumstances [29], which introduces high proteome diversity. Alternative splicing isoforms are usually tissue- or condition-specific and of low abundance. The dysregulation of splicing mechanisms produces aberrant products that have been linked to human diseases such as cancer [30]. RNA-Seq data have been widely used to confirm known exon-exon junctions and to identify novel ones. However, because about 70% of the human genome undergoes transcription [31], most of the novel junctions are possibly by-products of adjacent transcription, which may be quickly degraded and not translated. Therefore, RNA-Seq data alone are not adequate to confirm the presence of alternative isoforms. Moreover, some novel junctions might be false positives introduced by the short reads of RNA-Seq. For these reasons, as shown in Fig. 2c, most novel junctions have lower read coverage than known ones.

Novel junctions identified by RNA-Seq can be validated by proteomics data. Although public splicing databases have been used to generate putative splicing peptides for proteomics searches, sample-specific databases with splicing peptides predicted by RNA-Seq are directly supported by transcription evidence, which greatly reduces the search space. In an early study, Ning et al. used RNA-Seq data to derive a six-frame translated novel junction sequence database for proteomics searches, with a focus on the identification of novel alternative splicing forms [18].
Surprisingly, only a small number of peptides supporting novel alternative splicing were identified. Instead of sample-specific databases, Woo et al. constructed a compact database that contains all useful information expressed in RNA-Seq reads from a large number of Caenorhabditis elegans data sets; however, their proteogenomics study also identified only a dozen novel alternative splicing events [32]. The low confirmation ratio at the protein level can be partially explained by the low abundance of the novel transcripts (Fig. 2c). Moreover, most of the proteins identified in a typical proteomics study have low sequence coverage [33], making it difficult to verify novel junctions. More recently, Sheynkman et al. proposed a new approach that extends both sides of a junction predicted by RNA-Seq by a few base pairs to alleviate this problem and identified 57 splice junction peptides in Jurkat cells [33]. Moreover, Wu et al. reported 72 novel splicing peptides in mouse liver [28]. Despite limited success, these studies demonstrate the feasibility of using RNA-Seq data to facilitate the detection of junction peptides.

It is worth mentioning that RNA-Seq data can also identify fusion genes. Although observed in several cancer types, fusion events are rare and difficult to identify and validate at the proteomics level. Sun et al. constructed a fusion peptide database, CanProFu, from 6259 reported or predicted fusion gene pairs. Searching proteomics data sets from 40 non-small cell lung cancer samples and 39 normal lung samples using CanProFu, they identified 19 unique fusion peptides [34]. Because CanProFu is based on all reported and predicted gene fusion events, it contains about 7 million fusion peptides, which may lead to high false discovery rates and low sensitivity. Sample-specific fusion peptide databases derived from RNA-Seq data may provide a more attractive approach for fusion peptide discovery.
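Splice junction peptides and fusion peptides both arise from joining two nucleotide segments and translating across the join. The sketch below shows one simple way to generate such candidates; it is an assumption-laden illustration rather than the procedure of any cited study. The function names and the `flank` default are hypothetical, sequences are assumed to be uppercase A/C/G/T, and frames containing a stop codon are simply discarded, which a real pipeline would handle more carefully.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG codon ordering.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nt):
    """Translate a nucleotide string; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, len(nt) - 2, 3))


def junction_peptides(upstream, downstream, flank=30):
    """Translate the sequence around a junction in three forward frames.

    upstream/downstream: exon sequences on each side of the junction
    flank: nucleotides kept on each side (hypothetical default)
    Returns {frame: peptide} for frames that contain no stop codon.
    """
    joined = upstream[-flank:] + downstream[:flank]
    peptides = {}
    for frame in range(3):
        peptide = translate(joined[frame:])
        if "*" not in peptide:  # simplification: drop frames with a stop
            peptides[frame] = peptide
    return peptides


# Toy exon ends invented for illustration.
candidates = junction_peptides("GCTGCTGCT", "GATGATGAT")
```

Restricting translation to a short window around each RNA-Seq-supported junction keeps the number of candidate peptides proportional to the number of observed junctions, in contrast to enumerating every possible exon pairing.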


2.4 Novel coding regions

RNA-Seq studies have also revealed a large number of unannotated transcripts. Validating these transcripts using proteomics could lead to the identification of novel coding genes. Like novel junctions, novel transcripts discovered from RNA-Seq data are much less abundant than known transcripts (Fig. 2d). Using the database constructed from multiple RNA-Seq data sets mentioned above, Woo et al. identified more than 1,000 novel exons/genes in C. elegans [32]. Another study by Wu et al. identified 43 novel regions in mouse liver [28]. Rather than using reference genome information, Evans et al. performed a de novo assembly of RNA-Seq reads to predict transcripts and derive protein databases for proteomics searches [35]. More than 99% of the proteins identified by traditional analysis were detected using this approach, and 87% of the variant peptides could also be detected. Their method is extremely promising for non-model organisms or for mammalian samples that may contain viruses or other microbes. Critical requirements of the de novo assembly-based method are sufficient sequencing depth and appropriate assembly tools. One limitation of this approach is the six-frame translation of all detected transcripts, which gives rise to a large number of biologically irrelevant protein sequences and therefore decreases the sensitivity of peptide identification. To address this limitation, a strategy named Proteomic reference from Heterogeneous RNA Omitting the Genome (PHROG) has been developed [36]. In this approach, RNA-Seq data from various sources are combined to improve sequence coverage, and the most plausible translation frame is predicted based on known protein sequences from multiple organisms.
In addition to the discovery of novel exons and gene regions, several proteomics studies have sought protein-coding evidence for regions annotated as non-coding, including pseudogenes, non-coding RNAs, upstream open reading frames (uORFs) in untranslated regions (UTRs), or altered translation start sites, by searching six-frame translated genome sequences [37-39]. In a recent study, Kim et al. searched 16 million unmatched spectra from an in-depth proteomic profiling of 30 human tissue samples against a database containing the translated human reference genome, RefSeq transcript sequences, non-coding RNAs and pseudogenes. They found 193 novel protein-coding regions, 140 of which were from pseudogenes, 29 from uORFs, 15 from alternative ORFs in coding regions or ORFs in 3'-UTRs, and 9 from non-coding RNAs [40]. In another study using RNA-Seq and proteomics data from two human cell lines, it was reported that long non-coding RNAs are rarely translated [19].

The sections above describe the different types of information provided by RNA-Seq. Putting all of this information together allows the construction of comprehensive customized databases that enable RNA-Seq-based proteogenomics. Most of the studies mentioned above use matched RNA-Seq and proteomics data from the same sample. It is worth mentioning that a consensus database may be constructed from multiple related RNA-Seq data sets to capture protein features shared by a cohort of samples. For example, using a consensus protein database constructed from RNA-Seq data for 64 colorectal tumor specimens in The Cancer Genome Atlas (TCGA) cohort, we were able to identify hundreds of variant peptides and dozens of novel junction peptides in three independent colorectal tumor samples [41]. This approach could be applied to RNA-Seq data sets generated by large consortium projects such as TCGA and the Cancer Cell Line Encyclopedia (CCLE) [42], and the resulting consensus databases would be of interest to the broad proteomics community.

3. Integrative analysis of mRNA and protein abundance

The central dogma of molecular biology describes a flow of genetic information from DNA to RNA to protein, with proteins acting as the direct mediators of cellular functions. Because of the limited depth of quantitative proteomics, mRNA abundance has long been used as a proxy for protein abundance and activity, under the assumption that transcriptional regulation is the main determinant of protein abundance. However, protein abundance is controlled by tightly coupled, multi-layer regulation, including transcription, mRNA decay, translation, and protein degradation. Recent advances in proteomics technologies provide an unprecedented opportunity for the integrative analysis of mRNA and protein profiles, which has started to generate novel insights into the regulation of protein expression [43].

3.1 Correlation of mRNA and protein abundances in steady state

The correlation coefficients between mRNA and protein abundance are often modest and vary widely across organisms, with most Spearman's rank correlation (RS) values ranging between 0.40 and 0.70 (reviewed in [44]). This modest correlation demonstrates that mRNA abundances are not perfect proxies for protein abundances and that regulatory processes acting after mRNA synthesis also play important roles in protein production.

By integrating mRNA and protein profiles at a large scale, multiple independent studies [45-53] have identified biological features affecting protein production and have quantified their contributions to explaining the variation in protein abundance (Table 1). Linear or nonlinear regression models have been built to relate variations in protein abundance to the combined effects of mRNA abundance and biological features associated with different stages of regulation, e.g., translation initiation, elongation, termination and protein turnover. An early study in the bacterium Desulfovibrio vulgaris [47] investigated three classes of translation-related sequence features and found that codon usage and amino acid composition, both related to translation elongation, are major contributors. All sequence features together accounted for 15.2%-26.2% of the total variation in mRNA-protein correlation.
Several studies in baker's yeast Saccharomyces cerevisiae [45, 46, 48, 49] explored not only sequence features but also biological factors such as ribosomal density, ribosomal occupancy, and mRNA and protein stability. Tuller et al. [45] analyzed the tRNA adaptation index and evolutionary rate, and Brockmann et al. [48] included the codon adaptation index, a saturation effect, and factors related to translational activity, such as ribosomal density and ribosomal occupancy. Wu et al. [46] investigated codon usage, amino acid composition, secondary structure of the 5' UTR, mRNA and protein stability, ribosomal density, and ribosomal occupancy. Gunawardana et al. [49] used LASSO, a shrinkage and selection method for linear regression, to select dominant features from a set of 37. As a result, ribosomal density, ribosomal occupancy, tRNA adaptation index and codon bias were chosen, and a linear model combining the contributions of mRNA and these features explained 86% of the variation in protein expression. Vogel et al. [50] assessed the relative importance of ~200 features in a human cell line. They used a Multivariate Adaptive Regression Splines (MARS) model to approximate the nonlinear relationship between protein abundance and the combined contributions of mRNA abundance and sequence features, which explained two-thirds of the variation in protein abundance. Notably, features related to translation regulation, such as characteristics of the coding sequence and the 3' UTR, explained as much variation in protein abundance as mRNA abundance did. Recently, Guimaraes et al. [51] analyzed >100 features and derived a predictive model composed of 16 features that explained 66% of the variation in protein abundance in E. coli. Although feature contributions differ across organisms and data sets, factors related to translational elongation, such as characteristics of the coding region (e.g., codon bias, amino acid composition, and tRNA adaptation index), have been found to affect protein production in bacteria, yeast and human, which suggests that they might reflect mechanisms shared by different organisms. Additionally, using matched mRNA and protein measurements from mammalian studies, Calvo et al. [54] and Vogel et al. [50] both reported a correlation between the presence of uORFs and reduced downstream protein abundance.
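The two summary statistics used throughout this section, Spearman's rank correlation and the fraction of variance explained, can be computed as in the following sketch. It covers only the simplest case, mRNA abundance as the single predictor; the multi-feature LASSO and MARS models of the cited studies are not reproduced here. All function names and the toy abundances are hypothetical.

```python
from math import sqrt


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def variance_explained(x, y):
    """R^2 of a one-predictor least-squares fit of y on x."""
    return pearson(x, y) ** 2


# Toy log2 abundances for five genes; the numbers are made up.
log_mrna = [1.0, 2.0, 3.0, 4.0, 5.0]
log_protein = [2.0, 1.0, 4.0, 3.0, 5.0]
rs = spearman(log_mrna, log_protein)
r2 = variance_explained(log_mrna, log_protein)
```

For this toy data the rank correlation is 0.8 while the variance explained is 0.64, which illustrates why a correlation in the 0.40 to 0.70 range corresponds to mRNA explaining less than half of the variation in protein abundance.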


The remaining unexplained variation in protein abundance might be accounted for in part by unknown biological features and by experimental and technological limitations, including data quality and data set size, computational estimation of protein and mRNA abundance, and quantification of sequence features. Advances in proteomics and transcriptomics technologies that allow researchers to measure mRNA and protein abundance more accurately and at a larger scale may increase the observed mRNA-protein correlation. Different proteomic and transcriptomic quantification platforms, and different computational methods for estimating abundance, have been reported to affect mRNA-protein correlation [55]. Additionally, the unexplained variation may depend on the methods used to quantify biological features, such as the tRNA adaptation index, the secondary structure of the 5' UTR and 3' UTR, ribosomal density or ribosomal occupancy. Addressing these limitations in the future could lead to a better understanding of protein-level regulation and improved models for predicting protein abundance.

3.2 Correlation between mRNA and protein alterations

In addition to studies of steady-state mRNA and protein abundance, there is increasing research activity on alterations of mRNA and protein abundance due to changing conditions or genetic variations, and on dynamic changes in response to perturbations [5, 56-77]. Comparative analyses of proteomic and transcriptomic alterations provide a great opportunity to 1) investigate whether changes in transcript abundance mediate protein alterations; 2) identify genes and functional categories with discordant mRNA-protein alterations, suggesting potential protein-level regulation; and 3) discover the underlying regulators responsible for the protein-level regulation.
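The second goal listed above, flagging genes with discordant mRNA and protein alterations, can be illustrated with a minimal classifier on log2 fold changes. The threshold, the category labels, and the toy values are assumptions made for this sketch; they are not taken from any cited study.

```python
def classify_alteration(mrna_log2fc, protein_log2fc, threshold=1.0):
    """Label one gene by the direction of its mRNA and protein changes.

    threshold: hypothetical |log2 fold change| cutoff for calling a change.
    """
    def direction(value):
        if value >= threshold:
            return 1
        if value <= -threshold:
            return -1
        return 0

    m, p = direction(mrna_log2fc), direction(protein_log2fc)
    if m == 0 and p == 0:
        return "unchanged"
    if m == p:
        return "concordant"
    if m == 0:
        return "protein only"
    if p == 0:
        return "mRNA only"
    return "opposite"


# Toy log2 fold changes (mRNA, protein); gene names and values are made up.
changes = {
    "geneA": (2.1, 1.8),
    "geneB": (0.1, -1.5),
    "geneC": (1.7, 0.2),
    "geneD": (-1.2, 1.3),
    "geneE": (0.3, -0.2),
}
labels = {g: classify_alteration(m, p) for g, (m, p) in changes.items()}
discordant = [g for g, label in labels.items()
              if label in ("protein only", "mRNA only", "opposite")]
```

Genes in the discordant list are candidates for protein-level regulation, although with real data measurement noise near the threshold would also have to be ruled out before drawing that conclusion.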
The correlations between mRNA and protein alterations are modest, with most Spearman or Pearson correlation values between 0.40 and 0.60 [5, 56, 57, 60, 69, 76], highlighting the importance and prevalence of protein-level regulation in the adaptation of biological systems to changing conditions. Interestingly, it has been reported that the degree of concordance between mRNA and protein alterations differs dramatically between up-regulated and down-regulated transcripts. Several studies reported that transcript induction correlates well with protein increase, whereas transcript reduction is not closely related to the corresponding protein changes. For example, Lee et al. [58] compared the maximal changes of transcriptome and proteome profiles during NaCl acclimation and found that nearly 80% (R² = 0.77) of the variance in protein levels was explained by transcript induction, whereas only 9% (R² = 0.09) was explained by transcript decrease. This pattern was not observed in several other studies, and in certain cases the opposite trend was reported. Vogel et al. [62] revealed numerous discordant changes between mRNA/protein pairs independent of their direction of regulation. Fournier et al. [61] found that most down-regulated proteins correlated with mRNA under-expression, while up-regulated proteins were not associated with mRNA overexpression. Furthermore, dynamic protein and mRNA profiles show a delay between mRNA changes and protein adjustments [58, 61, 64, 66, 77], which further complicates the coordination between the multiple layers of regulation.

Discordant alterations of mRNA-protein pairs suggest the involvement of protein-level regulation and provide novel insights into functional relationships among proteins. Functionally related proteins may undergo the same type of protein-level regulation, and the consequences of the regulation vary considerably across functional categories. Juschke et al. [60] quantified mRNA and protein abundance changes in a brain tumor model at a genome scale and found that protein-level regulation significantly enhances the co-regulation of protein-complex members beyond transcriptional co-regulation. Lee et al. [58] found that stress response proteins, translation factors, and ribosomal proteins are subject to a noise reduction process that buffers protein changes against mRNA alterations. Gan et al. [57] defined five major regulatory mechanisms, termed "transcript only", "transcript degradation", "translation repression", "translation de-repression", and "protein degradation", according to protein expression changes relative to mRNA changes, and identified different functional groups enriched in each regulatory mechanism.

Just like transcriptional regulation, translational and protein-degradation regulation may involve cis- or trans-regulatory elements. Transcriptome and proteome profiling of genetically diverse populations provides an efficient approach for identifying putative cis-regulatory elements that affect mRNA abundance (eQTL) or protein abundance (pQTL). Studies in yeast [73, 74], mouse [75] and human [71] have revealed a low overlap between pQTLs and eQTLs, suggesting that most pQTLs may regulate gene expression at the post-transcriptional or translational level. Mutations in uORFs are known to increase protein expression levels [54, 78]; however, Waern et al. [79] found that removing uORFs might decrease translational efficiency, possibly because of the loss of RNA binding sites or ribosome entry sites in uORFs that may attract ribosomes to mRNAs. Studies using comparative genomics [80] and/or functional genomics approaches may facilitate a better understanding of the regulatory roles of uORFs.

To identify trans-regulatory elements, one strategy is to link changes in putative regulators, such as miRNAs and RNA-binding proteins, with downstream targets that show discordant mRNA and protein alterations. Vogel et al. [62] connected RNA-binding proteins with different mRNA and protein expression patterns, suggesting their potential roles in the regulation of translation and protein degradation.
By integrative analysis of mRNA, miRNA and protein profiles in nine colorectal cancer cell lines, a previous study from our group dissected the relative contributions of mRNA decay and translational repression in miRNA-mediated gene expression regulation [81]. Our study was based on the assumption that a negative miRNA-mRNA correlation suggests mRNA decay, a negative correlation between miRNA and the protein-to-mRNA ratio (miRNA-ratio correlation) indicates translational repression, and a negative miRNA-protein correlation may represent the combined results of mRNA decay and translational repression. As a result, we predicted 580 miRNA-target interactions, involving 60 miRNAs and 423 genes. A detailed examination of these interactions revealed that translational repression plays a role as important as that of mRNA decay in miRNA-mediated gene expression regulation. It was even the predominant mechanism for certain miRNAs such as miR-138. Indeed, we found that reduced expression of miR-138 leads to up-regulated protein abundance of most targets without affecting mRNA abundance. Recently, Nassa et al. [82] identified differentially expressed proteins in ERβ+ and ERβ- cells, a large portion of which showed no corresponding changes in mRNA abundance. Investigating miRNA profiles in these two cell types, they found that most of these proteins were targets of ERβ-regulated miRNAs, indicating a central role of miRNAs in mediating breast cancer cell proteome changes induced by unliganded ERβ.

4. Conclusion and perspectives

RNA-Seq and shotgun proteomics are powerful technologies for identifying and quantifying RNA transcripts and proteins, respectively, at the genome scale. These technologies generate data that are complementary to each other and, when integrated appropriately, hold great potential for accelerating biological discoveries (Fig. 3). The traditional proteogenomics approach suffers from an unreasonably large search space. Although RNA-Seq-based proteogenomics is still in its infancy, its potential has been clearly demonstrated (Fig. 2). For newly sequenced genomes, integrated RNA-Seq and proteomics profiling can also help correct genome assembly errors [83]. Importantly, RNA-Seq-based proteogenomics is expected to play a critical role in the evolving field of personalized proteomics [41]. For example, in cancer studies, this strategy can help identify abnormal proteins that may serve as highly specific biomarkers or therapeutic targets. Emerging technologies, such as ribosome profiling (RIBO-Seq), may further strengthen proteogenomics studies. RIBO-Seq detects only actively translated mRNAs [84], making it theoretically an even better option for proteogenomics [85].

Comparative analyses of mRNA and protein abundances in steady state, and of expression alterations under different conditions, have demonstrated that transcription is only half the story [86]. In steady state, hundreds of biological features associated with protein-level regulation, including translation and protein degradation, have been assessed, and models combining mRNA concentration with important features have been built to quantify their relative contributions to protein expression. Meanwhile, discordant mRNA and protein changes have revealed genes subject to changes in protein-level regulation, and potential regulators responsible for these changes have been identified.

Despite recent progress in the integrative analysis of transcriptome and proteome profiling data, robust computational methods for data integration are still lacking. In particular, methods to model gene-specific differential translation are urgently needed. The proteome is the combined outcome of a multi-step regulatory process; it is therefore hard to decode the relative contribution of each regulatory process from transcriptomics and proteomics data alone. Recent advances in methods that analyze translational efficiency, mRNA and protein turnover, and protein synthesis and degradation rates [8, 77, 84] make it possible to quantify the entire cascade of gene expression regulation, creating multi-layered expression maps at the system level and providing a great opportunity to study the coupling and coordination between regulatory processes.
Additionally, a variety of methods have been developed to decipher RNA-encoded signals that affect post-transcriptional regulation, translation and protein degradation. For example, HITS-CLIP (CLIP-Seq) and related methods such as PAR-CLIP, iCLIP and CLASH have been widely used to characterize miRNA-RNA and protein-RNA interactions [87-90]. Recently, RNAcompete was used to characterize binding motifs of 207 RNA binding proteins. To take full advantage of these complex and interrelated data for a better understanding of gene expression regulation at both the global and the gene-specific level, appropriate experimental design and advanced computational methods for data integration will continue to play key roles.

Acknowledgements

This work was supported by NIH (http://www.nih.gov/) grants U24 CA159988, R01 CA126218, and contract 13XS029 from Leidos Biomedical Research, Inc.

Conflict of interest

The authors have declared no conflict of interest.

References

[1] Nagaraj, N., Wisniewski, J. R., Geiger, T., Cox, J., et al., Deep proteome and transcriptome mapping of a human cancer cell line. Mol Syst Biol 2011, 7, 548.
[2] Vogel, C., Marcotte, E. M., Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nat Rev Genet 2012, 13, 227-232.
[3] Lundberg, E., Fagerberg, L., Klevebring, D., Matic, I., et al., Defining the transcriptome and proteome in three functionally different human cell lines. Mol Syst Biol 2010, 6, 450.
[4] Nagaraj, N., Wisniewski, J. R., Geiger, T., Cox, J., et al., Deep proteome and transcriptome mapping of a human cancer cell line. Mol Syst Biol 2011, 7, 548.
[5] de Godoy, L. M., Olsen, J. V., Cox, J., Nielsen, M. L., et al., Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast. Nature 2008, 455, 1251-1254.

[6] Gry, M., Rimini, R., Stromberg, S., Asplund, A., et al., Correlations between RNA and protein expression profiles in 23 human cell lines. BMC Genomics 2009, 10, 365.
[7] Vogel, C., Abreu Rde, S., Ko, D., Le, S. Y., et al., Sequence signatures and mRNA concentration can explain two-thirds of protein abundance variation in a human cell line. Mol Syst Biol 2010, 6, 400.
[8] Schwanhausser, B., Busse, D., Li, N., Dittmar, G., et al., Global quantification of mammalian gene expression control. Nature 2011, 473, 337-342.
[9] Wang, X., Zhang, B., Integrating genomic, transcriptomic, and interactome data to improve peptide and protein identification in shotgun proteomics. J Proteome Res 2014, 13, 2715-2723.
[10] Lopez-Casado, G., Covey, P. A., Bedinger, P. A., Mueller, L. A., et al., Enabling proteomic studies with RNA-Seq: The proteome of tomato pollen as a test case. Proteomics 2012, 12, 761-774.
[11] Armengaud, J., Microbiology and proteomics, getting the best of both worlds! Environ Microbiol 2013, 15, 12-23.
[12] Song, J., Sun, R., Li, D., Tan, F., et al., An improvement of shotgun proteomics analysis by adding next-generation sequencing transcriptome data in orange. PLoS One 2012, 7, e39494.
[13] Mohien, C. U., Colquhoun, D. R., Mathias, D. K., Gibbons, J. G., et al., A bioinformatics approach for integrated transcriptomic and proteomic comparative analyses of model and non-sequenced anopheline vectors of human malaria parasites. Mol Cell Proteomics 2013, 12, 120-131.
[14] Armengaud, J., Trapp, J., Pible, O., Geffard, O., et al., Non-model organisms, a species endangered by proteogenomics. J Proteomics 2014.
[15] Wang, X., Slebos, R. J., Wang, D., Halvey, P. J., et al., Protein identification using customized protein sequence databases derived from RNA-Seq data. J Proteome Res 2012, 11, 1009-1017.
[16] Omasits, U., Quebatte, M., Stekhoven, D. J., Fortes, C., et al., Directed shotgun proteomics guided by saturated RNA-seq identifies a complete expressed prokaryotic proteome. Genome Res 2013, 23, 1916-1927.
[17] Halvey, P. J., Wang, X., Wang, J., Bhat, A. A., et al., Proteogenomic analysis reveals unanticipated adaptations of colorectal tumor cells to deficiencies in DNA mismatch repair. Cancer Res 2014, 74, 387-397.
[18] Ning, K., Nesvizhskii, A. I., The utility of mass spectrometry-based proteomic data for validation of novel alternative splice forms reconstructed from RNA-Seq data: a preliminary assessment. BMC Bioinformatics 2010, 11 Suppl 11, S14.
[19] Banfai, B., Jia, H., Khatun, J., Wood, E., et al., Long noncoding RNAs are rarely translated in two human cell lines. Genome Res 2012, 22, 1646-1657.
[20] Shanmugam, A. K., Yocum, A. K., Nesvizhskii, A. I., Utility of RNA-seq and GPMDB Protein Observation Frequency for Improving the Sensitivity of Protein Identification by Tandem MS. J Proteome Res 2014.
[21] Sheynkman, G. M., Shortreed, M. R., Frey, B. L., Scalf, M., Smith, L. M., Large-scale mass spectrometric detection of variant peptides resulting from nonsynonymous nucleotide differences. J Proteome Res 2014, 13, 228-240.
[22] Wang, Q., Wen, B., Wang, T., Xu, Z., et al., Omics evidence: single nucleotide variants transmissions on chromosome 20 in liver cancer cell lines. J Proteome Res 2014, 13, 200-211.

[23] Zhang, B., Wang, J., Wang, X., Zhu, J., et al., Proteogenomic characterization of human colon and rectal cancer. Nature 2014.
[24] Shah, S. P., Roth, A., Goya, R., Oloumi, A., et al., The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature 2012, 486, 395-399.
[25] Morin, R. D., Mendez-Lago, M., Mungall, A. J., Goya, R., et al., Frequent mutation of histone-modifying genes in non-Hodgkin lymphoma. Nature 2011, 476, 298-303.
[26] Chen, R., Mias, G. I., Li-Pook-Than, J., Jiang, L., et al., Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 2012, 148, 1293-1307.
[27] Low, T. Y., van Heesch, S., van den Toorn, H., Giansanti, P., et al., Quantitative and qualitative proteome characteristics extracted from in-depth integrated genomics and proteomics analysis. Cell Rep 2013, 5, 1469-1478.
[28] Wu, P., Zhang, H., Lin, W., Hao, Y., et al., Discovery of Novel Genes and Gene Isoforms by Integrating Transcriptomic and Proteomic Profiling from Mouse Liver. J Proteome Res 2014.
[29] Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., et al., Alternative isoform regulation in human tissue transcriptomes. Nature 2008, 456, 470-476.
[30] Oltean, S., Bates, D. O., Hallmarks of alternative splicing in cancer. Oncogene 2013.
[31] Djebali, S., Davis, C. A., Merkel, A., Dobin, A., et al., Landscape of transcription in human cells. Nature 2012, 489, 101-108.
[32] Woo, S., Cha, S. W., Merrihew, G., He, Y., et al., Proteogenomic database construction driven from large scale RNA-seq data. J Proteome Res 2014, 13, 21-28.
[33] Sheynkman, G. M., Shortreed, M. R., Frey, B. L., Smith, L. M., Discovery and mass spectrometric analysis of novel splice-junction peptides using RNA-Seq. Mol Cell Proteomics 2013, 12, 2341-2353.
[34] Sun, H., Xing, X., Li, J., Zhou, F., et al., Identification of gene fusions from human lung cancer mass spectrometry data. BMC Genomics 2013, 14 Suppl 8, S5.
[35] Evans, V. C., Barker, G., Heesom, K. J., Fan, J., et al., De novo derivation of proteomes from transcriptomes for transcript and protein identification. Nat Methods 2012, 9, 1207-1211.
[36] Wuhr, M., Freeman, R. M., Jr., Presler, M., Horb, M. E., et al., Deep Proteomics of the Xenopus laevis Egg using an mRNA-Derived Reference Database. Curr Biol 2014, 24, 1467-1475.
[37] Nagarajha Selvan, L. D., Kaviyil, J. E., Nirujogi, R. S., Muthusamy, B., et al., Proteogenomic analysis of pathogenic yeast Cryptococcus neoformans using high resolution mass spectrometry. Clin Proteomics 2014, 11, 5.
[38] Goetze, S., Qeli, E., Mosimann, C., Staes, A., et al., Identification and functional characterization of N-terminally acetylated proteins in Drosophila melanogaster. PLoS Biol 2009, 7, e1000236.
[39] Rison, S. C., Mattow, J., Jungblut, P. R., Stoker, N. G., Experimental determination of translational starts using peptide mass mapping and tandem mass spectrometry within the proteome of Mycobacterium tuberculosis. Microbiology 2007, 153, 521-528.
[40] Kim, M. S., Pinto, S. M., Getnet, D., Nirujogi, R. S., et al., A draft map of the human proteome. Nature 2014, 509, 575-581.
[41] Wang, X., Zhang, B., customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search. Bioinformatics 2013, 29, 3235-3237.
[42] Barretina, J., Caponigro, G., Stransky, N., Venkatesan, K., et al., The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 2012, 483, 603-607.

[43] Vogel, C., Marcotte, E. M., Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nat Rev Genet 2012, 13, 227-232.
[44] de Sousa Abreu, R., Penalva, L. O., Marcotte, E. M., Vogel, C., Global signatures of protein and mRNA expression levels. Molecular bioSystems 2009, 5, 1512-1526.
[45] Tuller, T., Kupiec, M., Ruppin, E., Determinants of protein abundance and translation efficiency in S. cerevisiae. PLoS computational biology 2007, 3, e248.
[46] Wu, G., Nie, L., Zhang, W., Integrative analyses of posttranscriptional regulation in the yeast Saccharomyces cerevisiae using transcriptomic and proteomic data. Current microbiology 2008, 57, 18-22.
[47] Nie, L., Wu, G., Zhang, W., Correlation of mRNA expression and protein abundance affected by multiple sequence features related to translational efficiency in Desulfovibrio vulgaris: a quantitative analysis. Genetics 2006, 174, 2229-2243.
[48] Brockmann, R., Beyer, A., Heinisch, J. J., Wilhelm, T., Posttranscriptional expression regulation: what determines translation rates? PLoS computational biology 2007, 3, e57.
[49] Gunawardana, Y., Niranjan, M., Bridging the gap between transcriptome and proteome measurements identifies post-translationally regulated genes. Bioinformatics 2013, 29, 3060-3066.
[50] Vogel, C., Abreu Rde, S., Ko, D., Le, S. Y., et al., Sequence signatures and mRNA concentration can explain two-thirds of protein abundance variation in a human cell line. Mol Syst Biol 2010, 6, 400.
[51] Guimaraes, J. C., Rocha, M., Arkin, A. P., Transcript level and sequence determinants of protein abundance and noise in Escherichia coli. Nucleic acids research 2014.
[52] Zur, H., Tuller, T., Strong association between mRNA folding strength and protein abundance in S. cerevisiae. EMBO reports 2012, 13, 272-277.
[53] Tuller, T., Waldman, Y. Y., Kupiec, M., Ruppin, E., Translation efficiency is determined by both codon bias and folding energy. Proceedings of the National Academy of Sciences of the United States of America 2010, 107, 3645-3650.
[54] Calvo, S. E., Pagliarini, D. J., Mootha, V. K., Upstream open reading frames cause widespread reduction of protein expression and are polymorphic among humans. Proceedings of the National Academy of Sciences of the United States of America 2009, 106, 7507-7512.
[55] Ning, K., Fermin, D., Nesvizhskii, A. I., Comparative analysis of different label-free mass spectrometry based protein abundance estimates and their correlation with RNA-Seq gene expression data. J Proteome Res 2012, 11, 2261-2271.
[56] Lan, P., Li, W., Schmidt, W., Complementary proteome and transcriptome profiling in phosphate-deficient Arabidopsis roots reveals multiple levels of gene regulation. Mol Cell Proteomics 2012, 11, 1156-1166.
[57] Gan, H., Cai, T., Lin, X., Wu, Y., et al., Integrative proteomic and transcriptomic analyses reveal multiple post-transcriptional regulatory mechanisms of mouse spermatogenesis. Mol Cell Proteomics 2013, 12, 1144-1157.
[58] Lee, M. V., Topper, S. E., Hubler, S. L., Hose, J., et al., A dynamic model of proteome changes reveals new roles for transcript alteration in yeast. Mol Syst Biol 2011, 7, 514.
[59] Khositseth, S., Pisitkun, T., Slentz, D. H., Wang, G., et al., Quantitative protein and mRNA profiling shows selective post-transcriptional control of protein expression by vasopressin in kidney cells. Mol Cell Proteomics 2011, 10, M110 004036.
[60] Juschke, C., Dohnal, I., Pichler, P., Harzer, H., et al., Transcriptome and proteome quantification of a tumor model provides novel insights into post-transcriptional gene regulation. Genome biology 2013, 14, r133.

[61] Fournier, M. L., Paulson, A., Pavelka, N., Mosley, A. L., et al., Delayed correlation of mRNA and protein expression in rapamycin-treated cells and a role for Ggc1 in cellular sensitivity to rapamycin. Mol Cell Proteomics 2010, 9, 271-284.
[62] Vogel, C., Silva, G. M., Marcotte, E. M., Protein expression regulation under oxidative stress. Mol Cell Proteomics 2011, 10, M111 009217.
[63] Lackner, D. H., Schmidt, M. W., Wu, S., Wolf, D. A., Bahler, J., Regulation of transcriptome, translation, and proteome in response to environmental stress in fission yeast. Genome biology 2012, 13, R25.
[64] Waldbauer, J. R., Rodrigue, S., Coleman, M. L., Chisholm, S. W., Transcriptome and proteome dynamics of a light-dark synchronized bacterial cell cycle. PLoS One 2012, 7, e43432.
[65] Berghoff, B. A., Konzer, A., Mank, N. N., Looso, M., et al., Integrative "omics" approach discovers dynamic and regulatory features of bacterial stress responses. PLoS genetics 2013, 9, e1003576.
[66] Robles, M. S., Cox, J., Mann, M., In-vivo quantitative proteomics reveals a key contribution of post-transcriptional mechanisms to the circadian regulation of liver metabolism. PLoS genetics 2014, 10, e1004047.
[67] Pan, Z., Zeng, Y., An, J., Ye, J., et al., An integrative analysis of transcriptome and proteome provides new insights into carotenoid biosynthesis and regulation in sweet orange fruits. J Proteomics 2012, 75, 2670-2684.
[68] Irmler, M., Hartl, D., Schmidt, T., Schuchhardt, J., et al., An approach to handling and interpretation of ambiguous data in transcriptome and proteome comparisons. Proteomics 2008, 8, 1165-1169.
[69] O'Brien, R. N., Shen, Z., Tachikawa, K., Lee, P. A., Briggs, S. P., Quantitative proteome analysis of pluripotent cells by iTRAQ mass tagging reveals post-transcriptional regulation of proteins required for ES cell self-renewal. Mol Cell Proteomics 2010, 9, 2238-2251.
[70] Gouw, J. W., Pinkse, M. W., Vos, H. R., Moshkin, Y., et al., In vivo stable isotope labeling of fruit flies reveals post-transcriptional regulation in the maternal-to-zygotic transition. Mol Cell Proteomics 2009, 8, 1566-1578.
[71] Wu, L., Candille, S. I., Choi, Y., Xie, D., et al., Variation and genetic control of protein abundance in humans. Nature 2013, 499, 79-82.
[72] Fu, J., Keurentjes, J. J., Bouwmeester, H., America, T., et al., System-wide molecular evidence for phenotypic buffering in Arabidopsis. Nature genetics 2009, 41, 166-167.
[73] Foss, E. J., Radulovic, D., Shaffer, S. A., Goodlett, D. R., et al., Genetic variation shapes protein networks mainly through non-transcriptional mechanisms. PLoS Biol 2011, 9, e1001144.
[74] Foss, E. J., Radulovic, D., Shaffer, S. A., Ruderfer, D. M., et al., Genetic basis of proteome variation in yeast. Nature genetics 2007, 39, 1369-1375.
[75] Ghazalpour, A., Bennett, B., Petyuk, V. A., Orozco, L., et al., Comparative analysis of proteome and transcriptome variation in mouse. PLoS genetics 2011, 7, e1001393.
[76] Lundberg, E., Fagerberg, L., Klevebring, D., Matic, I., et al., Defining the transcriptome and proteome in three functionally different human cell lines. Mol Syst Biol 2010, 6, 450.
[77] Kristensen, A. R., Gsponer, J., Foster, L. J., Protein synthesis rate is the predominant regulator of protein expression during differentiation. Mol Syst Biol 2013, 9, 689.
[78] Zhang, Z., Dietrich, F. S., Identification and characterization of upstream open reading frames (uORF) in the 5' untranslated regions (UTR) of genes in Saccharomyces cerevisiae. Curr Genet 2005, 48, 77-87.

[79] Waern, K., Snyder, M., Extensive transcript diversity and novel upstream open reading frame regulation in yeast. G3 (Bethesda) 2013, 3, 343-352.
[80] Hsu, M. K., Chen, F. C., Selective constraint on the upstream open reading frames that overlap with coding sequences in animals. PLoS One 2012, 7, e48413.
[81] Liu, Q., Halvey, P. J., Shyr, Y., Slebos, R. J., et al., Integrative omics analysis reveals the importance and scope of translational repression in microRNA-mediated regulation. Mol Cell Proteomics 2013, 12, 1900-1911.
[82] Nassa, G., Tarallo, R., Giurato, G., De Filippo, M. R., et al., Post-transcriptional regulation of human breast cancer cell proteome by unliganded Estrogen Receptor beta via microRNAs. Mol Cell Proteomics 2014.
[83] Schellenberg, J. J., Verbeke, T. J., McQueen, P., Krokhin, O. V., et al., Enhanced whole genome sequence and annotation of Clostridium stercorarium DSM8532T using RNA-seq transcriptomics and high-throughput proteomics. BMC Genomics 2014, 15, 567.
[84] Ingolia, N. T., Ribosome profiling: new views of translation, from single codons to genome scale. Nat Rev Genet 2014, 15, 205-213.
[85] Menschaert, G., Van Criekinge, W., Notelaers, T., Koch, A., et al., Deep proteome coverage based on ribosome profiling aids MS-based protein and peptide discovery and provides evidence of alternative translation products and near-cognate translation initiation events. Mol Cell Proteomics 2013.
[86] Plotkin, J. B., Transcriptional regulation is only half the story. Mol Syst Biol 2010, 6, 406.
[87] Konig, J., Zarnack, K., Luscombe, N. M., Ule, J., Protein-RNA interactions: new genomic technologies and perspectives. Nat Rev Genet 2011, 13, 77-83.
[88] Milek, M., Wyler, E., Landthaler, M., Transcriptome-wide analysis of protein-RNA interactions using high-throughput sequencing. Seminars in cell & developmental biology 2012, 23, 206-212.
[89] Darnell, R. B., HITS-CLIP: panoramic views of protein-RNA regulation in living cells. Wiley interdisciplinary reviews. RNA 2010, 1, 266-286.
[90] Riley, K. J., Steitz, J. A., The "Observer Effect" in genome-wide surveys of protein-RNA interactions. Molecular cell 2013, 49, 601-604.

Figure 1. Comparison between traditional proteogenomics and RNA-Seq-based proteogenomics. Traditional proteogenomics relies on protein databases derived from the six-frame translation of the whole genome and/or databases generated from gene structure prediction and exon-exon junction models. Tandem mass spectrometry (MS/MS) data are used to validate these predicted proteins. The large number of putative proteins in these databases leads to low sensitivity and a high false-positive rate in peptide identification. In RNA-Seq-based proteogenomics, RNA-Seq data are used to identify sample-specific transcripts, sequence variations, and alternative splicing isoforms, which are subsequently validated by proteomics data. Two assembly methods can be used to generate a protein database from raw sequence reads: genome-guided and de novo. The genome-guided approach is relatively more straightforward. The de novo approach does not rely on an existing reference genome, but de novo assembly of short reads is challenging.
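The six-frame translation on which the traditional strategy relies can be illustrated with a minimal Python sketch. This is a simplified illustration and not a proteogenomics pipeline: it assumes the standard genetic code, uses hypothetical function names, and translates straight through stop codons (shown as "*") instead of splitting the output into open reading frames.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; a trailing partial codon is dropped and
    codons containing unknown bases are reported as 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate a DNA sequence in three forward and three reverse frames."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
for name in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(name, frames[name])
```

Every stretch of genome yields six candidate protein sequences here, which is why a whole-genome six-frame database becomes so large. Keeping only the "+" frames gives a three-frame translation of the kind applied to assembled transcripts in the RNA-Seq-based strategy, assuming the transcript strand is known.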

[Figure 1 diagram. Traditional Proteogenomics: genome, six-frame translation (+1, +2, +3, -1, -2, -3) and gene model prediction, protein DB, MS/MS data. RNA-Seq-based Proteogenomics: transcriptome, RNA-Seq reads, genome-guided or de novo assembly (variations, novel splicing, exon combination, novel exon region), three-frame translation from assembled transcripts, protein DB, MS/MS data. Goals: enable complete protein/gene identification; improve genome annotation.]

Figure 2. Four types of RNA-Seq-derived information: mRNA abundance, single nucleotide variations (SNVs), novel junctions, and novel transcripts. a. Transcript abundance for all transcripts (green) or the subset of all coding transcripts (grey) spreads over a wide range of FPKM (Fragments Per Kilobase of exon per Million fragments mapped) values; however, the transcript abundance distribution for the subset of transcripts with detectable proteins in the proteomics study is clearly narrower and shifted toward higher FPKM values (red). b. A customized database derived from RNA-Seq data allowed proteomic identification of the KRAS(Gly12Val) alteration in the colon cancer cell line SW480. The upper and lower panels visualize the RNA-Seq and proteomics coverage for the KRAS gene, respectively. The gene structure of KRAS is shown in the middle panel, with black and grey boxes representing coding and UTR regions, respectively. The y-axes in the RNA-Seq and proteomics plots represent the numbers of sequence reads and spectral counts, respectively. The green and red colors represent data for cell lines SW480 and RKO, respectively. The short blue bar in the proteomics plot indicates the KRAS(Gly12Val) alteration identified in SW480. c. Novel junctions have lower read coverage than known ones. d. Novel transcripts discovered from RNA-Seq data are much less abundant than known transcripts. Panels a, c, and d were generated from a data set with 10 colorectal cancer cell lines [17], and each curve represents data from one cell line. Panel b was generated from a data set with two colorectal cancer cell lines, RKO and SW480 [15].

[Figure 2 panels: a, mRNA abundance; b, Variations; c, Novel junctions; d, Novel transcripts.]
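As a worked example of the FPKM unit used in panel a, the short Python sketch below computes FPKM from a fragment count. The function name and the example numbers are hypothetical, and real quantification tools apply further corrections (for example, an effective transcript length) that are ignored here.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of exon per Million fragments mapped.

    fragments: fragments assigned to the transcript
    transcript_length_bp: summed exon length of the transcript in base pairs
    total_mapped_fragments: all mapped fragments in the sample
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical transcript: 500 fragments, 2,000 bp of exon, 20 million mapped fragments.
example_value = fpkm(500, 2000, 20_000_000)
print(example_value)  # 12.5
```

Because the count is scaled by both transcript length and library size, values from transcripts of different lengths and from samples of different sequencing depth can be placed on one axis, as in panel a.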

Figure 3. Summary of the complementary nature of RNA-Seq and shotgun proteomics data and various applications. CDSs: coding DNA sequences; SNVs: single nucleotide variants; INDELs: small insertions and deletions; SAAVs: single amino acid variations.

[Figure 3 diagram labels. DNA: annotation of functional DNA elements; improving genome annotation. RNA (RNA-Seq): novel splicing, novel exons; SNVs, RNA edits, INDELs; mRNA abundance. Protein (proteomics): novel isoforms, novel coding regions (novel exons, noncoding RNAs, uORFs, pseudogenes, etc.); SAAVs, aberrant peptides; protein abundance. Identification: novel identifications; variant peptides, allele-specific expression. Quantification: expression regulation. Applications: understanding condition-specific isoform usage; understanding proteomic consequences of sequence variations; revealing post-transcriptional regulatory mechanisms.]

Table 1. The effects of biological features on mRNA-protein correlation

| Organism | N* | Method | Major features | Conclusions | Ref. |
|---|---|---|---|---|---|
| S. cerevisiae | - | Linear | tAI, evolutionary rate | Correlation increases from 0.69 to 0.76 | [45] |
| S. cerevisiae | 3175 | Linear | PHD, codon usage, amino acid composition, ribosomal occupancy, ribosomal density, MFE50, mRNA stability | Contribute to 33.15% of the total variation of mRNA-protein correlation | [46] |
| Desulfovibrio vulgaris | 349-395 | Linear | Codon usage, amino acid composition, stop codon context and the Shine-Dalgarno sequence | Contribute to 15.2-26.2% of the total variation of mRNA-protein correlation | [47] |
| S. cerevisiae | 4152 | Nonlinear | Ribosomal occupancy, ribosomal density, CAI, saturation effects | Correlation increases from 0.63 to 0.7 | [48] |
| S. cerevisiae | 1895 | Linear | Ribosomal occupancy, ribosome density, tAI, codon bias | Explain 86% of protein expression variation | [49] |
| H. sapiens | 512 | Nonlinear | Coding sequence (coding length, amino acid usage, codon bias, etc.) and characteristics of the 3’ UTR and 5’ UTR | Explain 67% of protein expression variation | [50] |
| E. coli | >800 | Linear | CAI, codon bias, amino acid composition, 16S:SD | Explain 66% of protein expression variation | [51] |

*N: sample size; PHD: protein half-life descriptor; MFE50: minimum free energy for the 5’ UTR of length 50 nt; CAI: codon adaptation index; tAI: tRNA adaptation index
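Several rows of Table 1 report how much protein variation is explained once sequence features are added to mRNA abundance. A minimal Python sketch of that comparison, using ordinary least squares with one or two predictors, is given below. The data are invented, the single binary "feature" is only a stand-in for a real sequence feature, and the function names are hypothetical; the cited studies used larger feature sets and, in some cases, nonlinear models.

```python
def _centered(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def r2_one_predictor(x, y):
    """R^2 of a simple linear regression of y on x (squared Pearson r)."""
    xc, yc = _centered(x), _centered(y)
    return _dot(xc, yc) ** 2 / (_dot(xc, xc) * _dot(yc, yc))


def r2_two_predictors(x1, x2, y):
    """R^2 of an ordinary least-squares fit of y on x1 and x2 with intercept,
    obtained by solving the 2x2 normal equations on centered data."""
    a, b, yc = _centered(x1), _centered(x2), _centered(y)
    s11, s22, s12 = _dot(a, a), _dot(b, b), _dot(a, b)
    s1y, s2y, syy = _dot(a, yc), _dot(b, yc), _dot(yc, yc)
    det = s11 * s22 - s12 ** 2  # zero only if the predictors are collinear
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return (b1 * s1y + b2 * s2y) / syy


# Invented log-scale values for six genes.
mrna = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feature = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]   # hypothetical sequence feature
protein = [3.1, 1.9, 5.2, 3.8, 7.0, 6.1]

r2_mrna = r2_one_predictor(mrna, protein)
r2_full = r2_two_predictors(mrna, feature, protein)
print(f"mRNA only: R^2 = {r2_mrna:.2f}; mRNA + feature: R^2 = {r2_full:.2f}")
```

Adding a predictor to a least-squares model can never lower R^2 on the data used for fitting, so the size of the increase, rather than its sign, is what indicates how informative a feature is.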
