Stat. Appl. Genet. Mol. Biol. 2015; 14(3): 227–241

Patrick McNutt*, Ian Gut, Kyle Hubbard and Phil Beske

A novel method to prioritize RNAseq data for post-hoc analysis based on absolute changes in transcript abundance Abstract: The use of fold-change (FC) to prioritize differentially expressed genes (DEGs) for post-hoc characterization is a common technique in the analysis of RNA sequencing datasets. However, the use of FC can overlook certain population of DEGs, such as high copy number transcripts which undergo metabolically expensive changes in expression yet fail to exceed the ratiometric FC cut-off, thereby missing potential important biological information. Here we evaluate an alternative approach to prioritizing RNAseq data based on absolute changes in normalized transcript counts (ΔT) between control and treatment conditions. In five pairwise comparisons with a wide range of effect sizes, rank-ordering of DEGs based on the magnitude of ΔT produced a power curve-like distribution, in which 4.7–5.0% of transcripts were responsible for 36–50% of the cumulative change. Thus, differential gene expression is characterized by the high production-cost expression of a small number of genes (large ΔT genes), while the differential expression of the majority of genes involves a much smaller metabolic investment by the cell. To determine whether the large ΔT datasets are representative of coordinated changes in the transcriptional program, we evaluated large ΔT genes for enrichment of gene ontologies (GOs) and predicted protein interactions. In comparison to randomly selected DEGs, the large ΔT transcripts were significantly enriched for both GOs and predicted protein interactions. Furthermore, enrichments were were consistent with the biological context of each comparison yet distinct from those produced using equal-sized populations of large FC genes, indicating that the large ΔT genes represent an orthagonal transcriptional response. Finally, the composition of the large ΔT gene sets were unique to each pairwise comparison, indicating that they represent coherent and context-specific responses to biological conditions rather than the non-specific upregulation of a family of genes. These findings suggest that the large ΔT genes are not a product of random or stochastic phenomenon, but rather represent biologically meaningful changes in the transcriptional program. They furthermore imply that high abundance transcripts are associated with particularly cellular states, and as cells change in response to internal or external conditions, the relative distribution of the abundant transcripts changes accordingly. Thus, prioritization of DEGs based on the concept of metabolic cost is a simple yet powerful method to identify biologically important transcriptional changes and provide novel insights into cellular behaviors. Keywords: bioinformatics; botulinum neurotoxin; differential gene expression; excitotoxicity; fold-change; functional annotation; gene ontologies; RNA sequencing; neurogenesis; neurotoxicity. DOI 10.1515/sagmb-2014-0018

Introduction Although RNAseq analysis is exceedingly powerful, the discrete nature, sensitivity and reproducibility of the data have implications for sample analysis that have not yet been thoroughly explored. Following *Corresponding author: Patrick McNutt, US Army Medical Research Institute of Chemical Defense, 3100 Ricketts Point Road, Gunpowder, MD 21010, USA, e-mail: [email protected] Ian Gut: National Biodefense Analysis and Countermeasures Center, 110 Thomas Johnson Drive, Frederick, MD 21702, USA Kyle Hubbard and Phil Beske: US Army Medical Research Institute of Chemical Defense, 3100 Ricketts Point Road, Gunpowder, MD 21010, USA

Brought to you by | University of Manitoba Authenticated Download Date | 6/10/15 2:42 PM

228      P. McNutt et al.: A novel method to prioritize differential gene expression for post-hoc analysis e­ numeration of reads, various methods are available to normalize counts, infer probability distributions and identify differentially expressed genes (DEGs) (Robinson and Smyth, 2007; Marioni et al., 2008; Sultan et al., 2008). Typical post-hoc analytical pipelines then prioritize DEGs based on fold-change (FC) and analyze gene subsets for functional enrichment or novel expression patterns (Huang da et al., 2009a,b). The use of FC has proven an effective method for prioritizing DEGs while retaining functional importance; however, there are likely to be other methods of filtering RNAseq data that may reveal additional or novel information about cellular behavior (Marioni et al., 2008; Huang da et al., 2009a,b; Bullard et al., 2010). Here, we evaluate whether prioritizing RNAseq data based on the magnitude of differences in normalized gene counts (ΔT) provides biologically relevant information about cell behaviors under various conditions. RNAseq is a discrete analytical method, such that the relative contribution of measurement error to total error is inversely proportional to read count (Nagalakshmi et al., 2008; Krewski et al., 2010). Consequently, significant changes in expression of abundant transcripts are likely to be representative of biological differences, even if the associated FC is small. As an example, consider the statistically significant regulation of gene A from 2 to 22 copies (ΔT = 20, Log2FC = 3.5), gene B from 1000 to 1300 copies (ΔT = 300, Log2FC = 0.38) and gene C from 40 to 10 copies (ΔT = –30, Log2FC = –2). Most standard analytical pipelines would prioritize genes A and C for further analysis based on the large FC, despite the significantly larger metabolic investment required to upregulate expression of gene B. Intuitively one might assume that changes in production costs in response to altered conditions should reflect purposeful changes in cellular priorities. However, transcriptional regulation has been shown to have a significant stochastic component, so it is unclear whether large changes in transcript abundance should be attributed to network “noise” or be considered to represent context-dependent, programmed changes in gene expression (Salari et al., 2012). The above example illustrates how traditional methods to prioritize DEGs for secondary analysis may overlook metabolically expensive (and therefore possibly important) changes in transcription. Hypothesizing that a population of high-cost, low FC transcriptional changes might represent a previously unappreciated yet biological important set of transcriptional responses, we first evaluated the distributions of ΔT values calculated from normalized counts in five pairwise comparisons. These comparisons encompassed a wide spectrum of biological responses characterized by diverse effect sizes. They include RNAseq data captured at three time points during neurogenesis; the neuronal responses to treatment with two different neurospecific toxins; and a publically available RNAseq dataset comparing the fetal livers of wild-type and Klf1 knockout mice. In all pairwise comparisons, rank-ordering of ΔT values produced a Pareto-like distribution, such that approximately 5% of differentially expressed transcripts (termed the large ΔT transcripts) were responsible for 35–50% of the total change in transcript abundances. Functional annotation of the large ΔT transcripts demonstrated significantly enrichment for gene ontologies (GOs) and protein interactions in comparison to randomly selected DEGs. These enrichments were consistent with the biological context, yet orthagonal to those produced from equal-sized populations of large FC genes, indicating that the large FC and large ΔT genes represent functionally distinct aspects of transcriptional responses. Collectively, these data suggest that prioritizing large changes in transcript abundances offers a robust and context-specific reflection of cellular behavior under different conditions. The relatively simple method of refining expression profiling data based on ΔT has the potential to supplement existing methods to prioritize DEGs for functional analysis and produce novel insights into the roles of abundant transcripts during biological responses to varying conditions.

Materials and methods Reagents Botulinum neurotoxin (BoNT) serotype A (/A; 2.5 × 108 LD50/mg) was obtained from Metabiologics (Madison, WI, USA) at 1 mg/mL in Ca2+/Mg2+-free phosphate buffered saline, pH 7.4 (PBS), and stored at –30°C.

Brought to you by | University of Manitoba Authenticated Download Date | 6/10/15 2:42 PM

P. McNutt et al.: A novel method to prioritize differential gene expression for post-hoc analysis      229

­ onosodium glutamate and all electrophysiology reagents were obtained from Sigma-Aldrich (St. Louis, MO, M USA) and prepared as previously described (Gut et al., 2013). Stock solutions were prepared in Ca2+/Mg2+-free phosphate buffered saline, pH 7.4 (PBS), and stored at 4°C.

Tissue culture Mouse embryonic stem cells (ESCs) were obtained from ATCC (Manassas, VA, USA). ESC culture, production of ESC-derived neurons (ESNs) and ESN culture were conducted as previously described (Hubbard et  al., 2012). For longitudinal comparisons, differentiated neurons were harvested at days in vitro (DIV) 0 (neural progenitor cells), DIV 1 (stage I/II neurons) and DIV 14 (stage III/IV neurons) (Hubbard et al., 2012). For BoNT/ A-treated comparisons, DIV 12 ESNs were exposed to 67 pM BoNT/A for 24 h, washed and returned to the incubator. RNA was harvested at DIV 15 from intoxicated and age-matched control neurons. For glutamate-treated cultures, DIV 15 ESNs were exposed to 1 μM glutamate for 5 min, washed and returned to the incubator. RNA was harvested at DIV 16 from treated and age-matched control neurons.

Sequencing and differential expression analysis RNA was isolated from ESNs grown in 6 cm dishes using the RNeasy mini kit (Qiagen, Germantown, MD, USA) per manufacturer’s instructions. Samples were prepared and sequenced by Expression Analysis (Durham, NC, USA) with an Illumina HiSeq 2000 (San Diego, CA, USA) (Durham, NC, USA). Trimmed reads were aligned to the mus musculus mm9 assembly (http://genome.ucsc.edu), and read counts were generated using the Lasergene Genomic Core Suite (DNASTAR Inc., Madison, WI, USA). The FASTQ files generated for each biological replicate are available at the Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra/), under the accession number SRP032442. FASTQ files for the Klf1 knockout experiment (accession number GSE33979) were downloaded from the NCBI Gene Expression Omnibus (GEO) database and processed as above (Tallack et al., 2012). The R Bioconductor programs DESeq and edgeR were used to test for differential expression in each comparison (Anders and Huber, 2010; Robinson et al., 2010). Normalized counts were exported in edgeR using the cpm() command. For both programs, DEGs were retained for further analysis if (a) they exhibited a false discovery rate (FDR)  

A novel method to prioritize RNAseq data for post-hoc analysis based on absolute changes in transcript abundance.

The use of fold-change (FC) to prioritize differentially expressed genes (DEGs) for post-hoc characterization is a common technique in the analysis of...
3MB Sizes 0 Downloads 10 Views