HHS Public Access Author manuscript Author Manuscript

IEEE EMBS Int Conf Biomed Health Inform. Author manuscript; available in PMC 2017 January 26. Published in final edited form as:

IEEE EMBS Int Conf Biomed Health Inform. 2016 February ; 2016: 78–81. doi:10.1109/BHI. 2016.7455839.

The Selection of Quantification Pipelines for Illumina RNA-seq Data Using a Subsampling Approach Po-Yen Wu [Student Member, IEEE] and May D. Wang [Senior Member, IEEE]

Abstract Author Manuscript

RNA sequencing, or (RNA-seq for short,, is a widely applied technology that for extractings gene and transcript expression from biological samples. Given numerous quantification pipelines for RNA-seq data, one fundamental challenge is to determine identify a pipeline that can produce the most accurate estimate the most accurate gene and/or transcript expression. Exploring all available pipelines requires tremendous extensive computational resources, so. Therefore, we propose to use a subsampling approach that can improve speed up the pipeline evaluation and selection the efficiency process of pipeline performance evaluation for a given RNA-seq dataset. We applied our approach to one simulated and two real RNA-seq datasets and found that expression estimates derived from subsampled data are close surrogates for those derived from original data. In addition, the ranking of quantification pipelines based on the subsampled data was highly correlated concordant with that based on the original data. Therefore, we conclude that subsampling is a valid approach to facilitating efficient quantification pipeline selection using RNA-seq data.

Author Manuscript

I. INTRODUCTION

Author Manuscript

Next-generation sequencing (NGS) is an established technology that determines DNA or RNA sequences in biological samples. It has become an indispensable medium for most genomic, transcriptomic, and epigenomic studies because of its attractive merits such as base-pair resolution, high throughput, and novel gene and transcript discovery [1]. Raw data generated by NGS are short sequence reads, or reads for short, that contain readings of DNA or cDNA fragments. RNA sequencing, or RNA-seq for short, RNA-seq is aa major branch application of NGS that studies transcriptomes. It is capable of capturing comprehensive transcriptomic information such as gene expression, transcript expression, alternative splicing, and gene fusion [1]. To capture the highly dynamic transcriptome, one popular application of RNA-seq approach is to study gene and transcript expression among various biological conditions or samples collected at various time points [1]. The A typical bioinformatics pipeline for RNA-seq expression analysis includes sequence mapping, followed by expression quantification. Some quantification algorithms require transcriptomic mapping information while others require genomic mapping information. Thus, in our study, the term “quantification pipeline” represents a pipeline from the sequence mapping step to the expression quantification step.

*

Corresponding Author, Contact information for the corresponding author: [email protected], Phone: 404-385-2954, Fax: 404-894-4243, Address: Suite 4106, UA Whitaker Building, 313 Ferst Drive, Atlanta, GA 30332, USA.

Wu and Wang

Page 2

Author Manuscript

Accurate expression estimates are the key to many downstream applications such as differentially expressed gene detection. From the bioinformatics perspective, one fundamental question for RNA-seq users is: that which quantification pipeline(s) should they should use to estimate gene and transcript expression that is closer to the truth?. To address this question, given dozens of publicly many available quantification pipelines, systematic evaluation is necessary, and pipeline recommendation has to be explicit. Multiple researchers have conducted comparative studies for several widely used quantification pipelines [2, 3]. However, NGS is a fast-evolving technology, and characteristics of sequencing samples data characteristics are protocol- and library-dependent [4, 5]. Therefore, the selection of quantification pipeline(s) should not be solely based on existing knowledge or evidence, but rather customized for each RNA-seq dataset.

Author Manuscript

To achieve this objective, a comprehensive investigation is necessary for each new RNA-seq dataset, but such effort may be impossible simply due to a large data volume and corresponding computational demand. Thus, we propose a subsampling approach to speed up the search process for a quantification pipeline(s) that can produce the most accurate expression estimates for a given RNA-seq dataset. The speedup mainly results from a reduction in the sequencing depth (i.e., the total number of reads available for downstream analysis). Several studies have investigated the impact of the sequencing depth on RNA-seq applications and have concluded that a higher sequencing depth will lead to more accurate results [6, 7]. Therefore, our approach may result in lead to less accurate expression estimates that may in turn influence the selection of a quantification pipeline(s). for an RNAseq dataset. However, because expression estimates for highly expressioned genes are less sensitive to most biases noise, we hypothesize that by focusing on evaluating these highly expressioned genes, our approach should still be valid.

Author Manuscript

In Section II, we will introduce the overall workflow of this study and the details of each component in the workflow. In Section III, we will demonstrate results and discuss our findings. Finally, in Section IV, we will conclude this study and summarize future work.

II. Methods A. Experimental Design

Author Manuscript

The workflow of the entire study is shown in Figure 1. (A). To validate and demonstrate the feasibility of our subsampling approach, W we incorporate three RNA-seq datasets, including one simulated and two real, in our study. The “Quantification Pipelines” module quantifies both subsampled [Figure 1. (A), path (1)] and original [Figure 1. (A), path (2)] RNA-seq data. Input information for the “Evaluation” module includes estimated expression generated by generated by the “Qquantification pipelines Pipelines" module and true expression derived from the simulated dataset. Both estimated expression and true expression contain the expression profiles for both subsampled data and original data. The detailed workflow of the “Quantification Pipelines” module is shown in Figure 1. (B). We use Top Hat [8] and Bowtie [9] to map RNA-seq data to the reference genome and transcriptome, respectively. Since Top Hat builds on an algorithm similar to Bowtie, we effectively minimize the impact of sequence mapping-related induced variations on expression estimates. It is necessary to include both the genomic mapping tool (i.e., Top IEEE EMBS Int Conf Biomed Health Inform. Author manuscript; available in PMC 2017 January 26.

Wu and Wang

Page 3

Author Manuscript

Hat) and the transcriptomic mapping tool (i.e., Bowtie) because some quantification tools (e.g., Cufflinks [10], HTSeq [11], and MISO [12]) require genomic mapping information, while others (e.g., RSEM [2], eXpress [13], and MMSEQ [14]) require transcriptomic mapping information. B. Datasets We downloaded the hg19 reference genome and the RefSeq genome annotation from the UCSC Genome Browser and the UCSC Table Browser, respectively. We generated the simulated dataset using the rlsim-simNGS pipeline [15], which mimics many documented biases presented in standard Illumina RNA-seq protocols. The simulated dataset contains three technical replicates, each of which has 50 million read pairs with the read length of 101 base pairs (bp).

Author Manuscript

The two real datasets are the SEQC (i.e., Sequencing Quality Control) dataset [16] and the NCBI Sequence Reads Archive SRP043090 dataset [17]. Both were generated by the Illumina HiSeq 2000 sequencing system. The SEQC dataset sequenced two well-established reference RNA samples, UHRR (Universal Human Reference RNA) and HBRR (Human Brain Reference RNA), each of which contains four technical replicates with around 104 million read pairs each and the read length of 100 bp. The SRP043090 dataset sequenced samples in two biological conditions, each of which contains three biological replicates with around 60 million read pairs each and the read length of 100 bp. C. Subsampling Approach

Author Manuscript

For Illumina sequencing, because randomness is inherent in sequencing library preparation (e.g., random priming) and the sequencing process (e.g., random binding to a flow cell), RNA-seq data reads stored in an unprocessed FASTQ file should be are randomly listed in terms of their originating transcripts. Thus, uniformly subsampling the FASTQ file should still retain such the randomness in subsampled data. Many studies have investigated the impact of the sequencing depth on various RNA-seq applications. However, no consensus has been reached regarding an adequate sequencing depth for each type of RNA-seq applications. Therefore Thus, in this preliminary study, we subsample RNA-seq data by a factor of five (i.e., sample one fifth of reads from the original data) that will retain at least ten million read pairs in each subsampled data for all our RNA-seq datasets. D. Quantification Pipelines

Author Manuscript

We apply Cufflinks, HTSeq, and MISO to quantify gene and transcript expression using genomic mapping information produced by Top Hat. The Cuff diff 2 program in t The Cufflinks package generates gene and transcript expression estimates in terms of both read counts and FPKM-normalized (i.e., fragments per kilobase of exon per million fragments mapped) expression. The htseq-count program in the HTSeq package generates read counts for only genes, while MISO provides read counts for both genes and transcripts. In contrast, we apply RSEM, eXpress, and MMSEQ to quantify gene and transcript expression using transcriptomic mapping information produced by Bowtie. These three quantification tools generate gene and transcript expression estimates in terms of both read counts and FPKMnormalized expression. RSEM and eXpress also provides TPM-normalized (i.e., transcripts

IEEE EMBS Int Conf Biomed Health Inform. Author manuscript; available in PMC 2017 January 26.

Wu and Wang

Page 4

Author Manuscript

per million) expression estimates estimates. For read counts estimated by all the six tools, we apply the same RLE (relative log expression) normalization method [18], one of the most robust normalization methods based on [19], to minimize potential variations induced by the normalization process. All programs were executed with default parameters. E. Evaluation Metrics

Author Manuscript

We design two evaluation metrics to assess the validity of our approach. The first metric is based on gene or transcript detection, and the second metric evaluates the deviation between expression estimates. To investigate the impact of expression levels on the two metrics, we stratify all genes into four equally sized bins: high expression (75 – 100 percentiles), medium-high expression (50 – 75 percentiles), medium-low expression (25 – 50 percentiles), and low expression (0 – 25 percentiles). For calculation that involves simulated data, we stratify genes based on their true expression. For other cases, we stratify genes based on the sum of RLE-normalized expression across all samples and pipelines. On the basis of our experimental design, we evaluate each metrics under the following two scenarios: “True Vs Est,” which compares true expression with to estimated expression, and “Orig Vs Sub,” which compares expression derived from the original data with to expression derived from the subsampled data.

Author Manuscript

For the first metric, gene or transcript detection, we define that the expression levels greater than zero is are detected. For the True Vs Est scenario, we treat true expression as the ground truth, and then so that we can calculate the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). As an example, the FP in this scenario is defined as a gene undetected in the true but detected in the estimated expression. For the Orig Vs Sub scenario, we treat the expression derived from the original data as the ground truth, and then compute the number of TP, TN, FP, and FN accordingly. As an example, the FP in this scenario is defined as a gene undetected in the original but detected in the subsampled data. We use the total number of TP, TN, FP, and FN to calculate sensitivity, precision, and accuracy for gene detection. The aforementioned evaluation process is directly applied applicable to the transcript case. For the second metric, deviation between expression estimates, since measures of absolute deviation are scale-dependent, to compensate scales among genes, we first normalize gene expression to the same scale, remove extreme deviations, and finally compute deviation metrics. For the True Vs Est scenario, we use Equation (1) to normalize estimated expression relative to true expression:

Author Manuscript

(1)

where E_(normalized,g) is the normalized expression for a gene g, E_(estimated,g) is the estimated expression for the gene g, and E_(true,g) is the true expression for the gene g. Once we compute the normalized expression for all genes, we filter out the upper and lower 0.05% extremes (i.e., retain 99.9% of the data) so that the deviation metrics will not be affected by extreme cases. After normalization, the deviation for each gene is defined as E_(normalized,g) minus one. The deviation metrics we adopt are root-mean-square IEEE EMBS Int Conf Biomed Health Inform. Author manuscript; available in PMC 2017 January 26.

Wu and Wang

Page 5

Author Manuscript

deviation (RMSD) and median absolute deviation (MAD). For the Orig Vs Sub scenario, we use Equation (2) to normalize expression based on subsampled data relative to expression based on original data: (2)

where E_(subsampled,g) is the expression estimate for the a gene g using the subsampled data, E_(original,g) is the expression estimate for the gene g using the original data, and K is a user-defined subsampling factor (K=5 in our study). After normalization, the rest of the evaluation procedure process remains is the same as that in the True Vs Est scenario. The aforementioned evaluation process is directly applicablelied to the transcript case.

Author Manuscript

III. Results and Discussion In our study, we proposed to use the subsampling approach to efficiently select a quantification pipeline(s) for each RNA-seq dataset. The success of this approach depends on two premises: (1) expression estimates derived from the subsampled data need to be close to those derived from the original data, and (2) quantification pipelines selected using the subsampled data need to be similar to those using the original data. In this section, we will investigate our results and demonstrate evidence in support of the two premises. A. Gene or Transcript Detection

Author Manuscript

Following the definition method we described in Section II.E, we calculated sensitivity, precision, and accuracy for gene or transcript detection under the two scenarios, True Vs Est and Orig Vs Sub. Focusing on high expression genes/transcripts (i.e., 75 – 100 percentiles), Figure 2 demonstrates the distribution of these measures for the True Vs Est scenario using the simulated dataset, and Figure 3 shows results for the Orig Vs Sub scenario using all the three datasets. As demonstrated in Figure 2, sensitivity, precision, and accuracy calculated using the subsampled data (black boxes)using the original data (blue triangles) were very close to those calculated using the original data (blue triangles) using the subsampled data (black boxes) with less than 0.1% difference for all statistics and pipelines. In addition, the ranking of quantification pipelines (i.e., eXpress > MMSEQ > RSEM > MISO > Cufflinks > HTSeq, where “A > B” indicates A outperformeds B) using the original subsampled data was the same as that using the subsampled original data. These observations supported the two premises we described in the beginning of this section.

Author Manuscript

Figure 3 aims to demonstrate that expression estimates derived from the subsampled data were very close to those derived from the original data (i.e., the postulated ground truth) measured by the gene or transcript detection accuracy. For gene detection, all quantification pipelines achieved nearly 100% accuracy for all the three datasets. For transcript detection, the accuracy still achieved 97% or above for the two real datasets with the “eXpress” pipeline consistently outperformed others. These observations supported the first premise we described in the beginning of this section.

IEEE EMBS Int Conf Biomed Health Inform. Author manuscript; available in PMC 2017 January 26.

Wu and Wang

Page 6

B. Deviation Analysis

Author Manuscript

We calculated RMSD and MAD under the two scenarios. Focusing on high expression genes/transcripts (i.e., 75 – 100 percentiles), Figure 4 shows the distribution of RMSD and MAD for the True Vs Est scenario using the simulated dataset, and Figure 5 demonstrates results for the Orig Vs Sub scenario using all the three datasets.

Author Manuscript

As shown in Figure 4, RMSD and MAD calculated using the subsampled data (black boxes)using the original data (blue triangles) were very close to those calculated using the original data (blue triangles) using the subsampled data (black boxes) with the transcriptlevel analysis tended to overestimate deviations using the subsampled data. In addition, the ranking of quantification pipelines using the original subsampled data was the same as that using the subsampled original data for gene-level evaluation, though the RMSD-based ranking slightly differed from the MAD-based ranking. For transcript-level evaluation, the two rankings were not exactly the same because of prevalent overestimated deviations. However, the two rankings were still highly concordant. These observations supported the two premises we described in the beginning of this section. Finally, Figure 5 aims to demonstrate that expression estimates derived from the subsampled data were very close to those derived from the original data (i.e., the postulated ground truth) measured by the MAD. For most quantification pipelines and all the datasets, the MAD was under 0.03 (or 3%) and 0.06 (or 6%) for gene-level and transcript-level deviation analysis, respectively. This observation supported the first premise we described in the beginning of this section.

IV. Conclusions Author Manuscript Author Manuscript

With the need to identify the most suitable quantification pipeline(s) for each RNA-seq dataset, we proposed the subsampling approach that can help researchers achieve this goal more efficiently. Through our preliminary investigation, we found that statistics based on high expression genes in the subsampled data were close to those based on the original data for the two metrics we implemented. In addition, the rankings of pipeline performance using the subsampled data were was concordant with those that using the original data. Currently, our study has presents some limitations that lead to closely relate to the our future work as follows: (1) the two metrics we implemented required reference sampless, either groundtruth expression from the simulated dataset or expression estimates based on the original data. Thus, we will need to implement reference-free metrics such as the coefficient of variation and between-replicate fold-changes; (2) we executed all programs with default parameters, so we will need to investigate the effect of parameter selection on the rankings of quantification pipelines; and (3) we chose to subsample data by a factor of five so that we can retain an adequate sequencing depth for downstream analysis. This was an arbitrary choice, so Thus, we will evaluate whether different subsampling factors will lead to similar conclusions or not. Through our preliminary investigation, we found that the subsampled data can indeed speed up the entire process while retaining close statistics we adopted when evaluating results. In addition, the rankings of pipeline performance using the subsampled data were concordant

IEEE EMBS Int Conf Biomed Health Inform. Author manuscript; available in PMC 2017 January 26.

Wu and Wang

Page 7

Author Manuscript

with those using the original data. Currently, we included only the two metrics we introduced in Section II. E, but to further prove the validity of this approach, we will implement more evaluation metrics and come up with a scoring system that will ultimately facilitate the selection of quantification pipelines. Lastly, we will evaluate whether different subsampling factors will lead to similar conclusions or not.

References

Author Manuscript Author Manuscript Author Manuscript

1. Trapnell C, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols. 2012 Mar.7:562–578. [PubMed: 22383036] 2. Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. Bmc Bioinformatics. 2011 Aug 4.12 3. Nicolae M, et al. Estimation of alternative splicing isoform frequencies from RNA-Seq data. Algorithms Mol Biol. 2011; 6:9. [PubMed: 21504602] 4. Teer JK, et al. Systematic comparison of three genomic enrichment methods for massively parallel DNA sequencing. Genome Research. 2010 Oct.20:1420–1431. [PubMed: 20810667] 5. Roberts A, et al. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology. 2011; 12 6. Tarazona S, et al. Differential expression in RNA-seq: A matter of depth. Genome Research. 2011 Dec.21:2213–2223. [PubMed: 21903743] 7. Liu Y, et al. Evaluating the impact of sequencing depth on transcriptome profiling in human adipose. PLoS One. 2013; 8:e66883. [PubMed: 23826166] 8. Trapnell C, et al. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009 May 1.25:1105–1111. [PubMed: 19289445] 9. Langmead B, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009; 10 10. Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. 2010 May. 28:511-U174. 11. Anders S, et al. HTSeq-a Python framework to work with high-throughput sequencing data. Bioinformatics. 2014 Sep 25. 12. Katz Y, et al. Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat Methods. 2010 Dec.7:1009–1015. [PubMed: 21057496] 13. Roberts A, Pachter L. Streaming fragment assignment for real-time analysis of sequencing experiments. Nature Methods. 2013 Jan.10:71-U99. [PubMed: 23160280] 14. Turro E, et al. Haplotype and isoform specific expression estimation using multi-mapping RNAseq reads. Genome Biology. 2011; 12 15. Sipos B, et al. Realistic simulations reveal extensive sample-specificity of RNA-seq biases. 2013 arXiv:1308.3172. 16. Su ZQ, et al. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nature Biotechnology. 2014 Sep.32:903– 914. 17. Vanharanta S, et al. Loss of the multifunctional RNA-binding protein RBM47 as a source of selectable metastatic traits in breast cancer. Elife. 2014 Jun 4.3 18. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010; 11:R106. [PubMed: 20979621]

IEEE EMBS Int Conf Biomed Health Inform. Author manuscript; available in PMC 2017 January 26.

Wu and Wang

Page 8

Author Manuscript Author Manuscript

Figure 1.

(A) The workflow of this study. (B) The detailed workflow of the “Quantification Pipelines” module in Panel (A).

Author Manuscript Author Manuscript IEEE EMBS Int Conf Biomed Health Inform. Author manuscript; available in PMC 2017 January 26.

Wu and Wang

Page 9

Author Manuscript Author Manuscript Author Manuscript

Figure 2.

Sensitivity, precision, and accuracy of gene or transcript detection using the simulated dataset under the True Vs Est scenario. Blue triangles and black boxes represent metrics statistics calculated based onusing the original and subsampled data, respectively.

Author Manuscript IEEE EMBS Int Conf Biomed Health Inform. Author manuscript; available in PMC 2017 January 26.

Wu and Wang

Page 10

Author Manuscript Author Manuscript Author Manuscript

Figure 3.

Accuracy of gene or transcript detection using both the simulated and real datasets under the Orig Vs Sub scenario.

Author Manuscript IEEE EMBS Int Conf Biomed Health Inform. Author manuscript; available in PMC 2017 January 26.

Wu and Wang

Page 11

Author Manuscript Author Manuscript Author Manuscript

Figure 4.

Gene- and transcript-level deviation measured by RMSD and MAD using the simulated dataset under the True Vs Est scenario. Blue triangles and black boxes represent metrics calculated metrics statistics calculated based on using the original and subsampled data, respectively. For x-axis tick labels, Count_[Quantifier] and RLE_[Quantifier] represent indicate metrics statistics calculated based on using read counts estimated by each quantifier and RLE-normalized expression using the based on read counts generated estimated by each quantifier, respectively.

Author Manuscript IEEE EMBS Int Conf Biomed Health Inform. Author manuscript; available in PMC 2017 January 26.

Wu and Wang

Page 12

Author Manuscript Author Manuscript Author Manuscript Author Manuscript

Figure 5.

Gene- and transcript-level deviation measured by MAD using both the simulated and real datasets under the Orig Vs Sub scenario. For x-axis tick labels, [FPKM/TPM]_[Quantifier] indicates FPKM-/TPM-normalized expression generated estimated by each quantifier.

IEEE EMBS Int Conf Biomed Health Inform. Author manuscript; available in PMC 2017 January 26.

The Selection of Quantification Pipelines for Illumina RNA-seq Data Using a Subsampling Approach.

RNA sequencing, or (RNA-seq for short,, is a widely applied technology that for extractings gene and transcript expression from biological samples. Gi...
899KB Sizes 0 Downloads 7 Views