correspondence


© 2014 Nature America, Inc. All rights reserved.

Interactive visualization and analysis of large-scale sequencing datasets using ZENBU

To the Editor: The advance of sequencing technology has spurred an ever-growing body of sequence tag–based data from protocols such as RNA-seq, chromatin immunoprecipitation (ChIP)-seq, DNaseI-hypersensitive site sequencing (DHS-seq) and cap analysis of gene expression (CAGE). Large-scale consortia are applying standardized versions of these protocols across broad collections of samples1,2 to elucidate genomic function. Beyond this, the affordability of these systems and the availability of sequencing services have made these technologies accessible to smaller laboratories focusing on individual biological systems. Data generation is only the beginning, however, and a substantial bottleneck for many labs is going from sequence data to biological insight, especially when the volume of data overwhelms standard paradigms for data visualization.

Here, we present a web-based system, ZENBU, which addresses this problem by extending the functionality of the genome browser (Supplementary Fig. 1). For users with limited bioinformatics skills, ZENBU provides a suite of predefined views and data-processing scripts optimized for RNA-seq (Fig. 1), CAGE, short-RNA and ChIP-seq experiments (Supplementary Figs. 2–4). These are available simply upon uploading data as BAM (binary version of the sequence alignment/map)3 files. The system can also generate an optimized set of views for each of the above data types. ZENBU provides a rich selection of data-manipulation capabilities, including quality filtering, signal thresholding, signal normalization, peak finding, annotation, collation of signal under peaks or transcript models and visualization of expression differences across multiple experiments.

We designed ZENBU with large-scale transcriptome projects in mind. In particular, the FANTOM5 project (unpublished data) required a system that would allow rapid incremental data loading, visualization, interpretation and downloading of thousands of deep CAGE, RNA-seq and small-RNA data sets as they were produced. We reviewed available systems4, including genome browsers, such as the University of Tokyo Genome Browser (UTGB)5, GBrowse6, the University of California, Santa Cruz (UCSC) genome browser7, Ensembl8 and the Integrative Genomics Viewer (IGV)9, and data-management tools, such as BioMart10, and found none with the full set of functions we sought (see Supplementary Table 1 for a comparison). We therefore developed ZENBU, a stable, fast, efficient and secure system that is flexible enough to allow customization of data filters, views and analyses.

A key feature of ZENBU is that it allows the user to combine multiple experiments on demand (up to thousands of experiments) within any single track and interpret the data through linked views (Fig. 1). Upon combining multiple experiments, the genome browser view shows the combined genomic distribution of tags from these experiments within a given genomic interval (Fig. 1d); conversely, the linked expression view shows the relative abundance of tags observed in this region across these experiments (shown as a histogram; Fig. 1j). The genome browser and expression views are 'linked', meaning that as the user interacts with one view the other is updated in real time. This facilitates interactive exploration of the data: selecting features or regions within a browser track displays data for only that region in the expression view and, symmetrically, hiding specific experiments in the expression view updates the data displayed in the browser track.

Data processing and complex views are achieved by a flexible on-demand scripting system based on data transformation and analysis modules. A selection of predefined combinations of simple operations is provided to perform generic tasks such as data normalization (Fig. 1b–d,h,i), data filtering (Supplementary Fig. 2), data clustering (Fig. 1c and Supplementary Figs. 3 and 4) and collation (Fig. 1h and Supplementary Fig. 3). With an understanding of these atomic operations, more advanced users can combine them into complex processing scripts. These customized scripts can also be saved and shared, allowing efficient reuse of optimized analyses and views.

To demonstrate some of ZENBU's functionality, we show multiple views of the same underlying data, ENCODE RNA-seq2 experiments loaded in BAM format (Fig. 1). One of the more powerful views in this figure is the dynamic projection of RNA-seq reads onto GENCODE11 transcript models to calculate RPKM (reads per kilobase of transcript per million reads) values (Fig. 1h). Transcripts are then colored by support, and the RPKM expression table for these models can be downloaded in a variety of formats. A variation of the same script projects CAGE expression signal onto the proximal promoter regions (±500 bp of the 5ʹ end) of the same models (Supplementary Fig. 2). Furthermore, an implementation of the parametric clustering approach paraclu12 allows ZENBU to identify peaks of CAGE signal (Fig. 1c) and extract their corresponding expression values per experiment. Paraclu imposes minimal prior assumptions and is flexible enough for use with small-RNA and ChIP-seq data by varying the clustering parameters (Supplementary Figs. 3 and 4). ZENBU also allows investigators to fine-tune these settings and rapidly inspect the results at specific loci before genome-wide computation. A more detailed case study recapitulating an integrated ChIP-seq and RNA-seq analysis13 is included in Supplementary Note 1 to demonstrate the power of the complex scripting possible in ZENBU.

To enable fast and reliable genome-wide downloading of data and to speed up visualization and processing, we implemented a track caching system, which
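The RPKM values computed by this projection follow the standard definition: reads overlapping a transcript, scaled by transcript length in kilobases and library depth in millions of mapped reads. A minimal sketch (the inputs are illustrative; ZENBU derives the read counts server-side by projecting alignments onto the transcript models):

```python
# Standard RPKM: reads per kilobase of transcript per million mapped reads.
# Inputs below are invented for illustration only.

def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / kilobases / millions

# A 2 kb transcript covered by 400 reads in a 10-million-read library:
value = rpkm(400, 2_000, 10_000_000)  # -> 20.0
```

Because both scale factors are per-library constants, RPKM values remain comparable across transcripts within one experiment and, after depth normalization, across experiments.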

[Figure 1: ZENBU browser view of HG19 chr19:36,377,615–36,399,522 (21.9 kb), spanning NFKBID, HCST and TYROBP. Tracks (a–i): Entrez Gene and Gencode v10; ENCODE Carninci RIKEN CAGE (Gm12878, HeLa, Nhek) CTSS histogram; Paraclu-clustered CAGE CTSS; ENCODE Gingeras CSHL LongRnaSeq (Gm12878, HeLa, Nhek) read-abundance histogram; split-mapped read–based intron support, splice-donor and splice-acceptor usage frequencies; Gencode v10 collated RPKM-normalized expression; sample-wide heatmap. Panel (j): linked expression view for chr19:36,393,243–36,395,276 (2.034 kb).]

Figure 1 Overview of the ZENBU genome browser interface showcasing several on-demand processing tracks of selected ENCODE project experiments from the Gm12878, HeLa and Nhek cell lines. (a) Entrez Gene boundaries and Gencode v10 transcripts. (b,c) ENCODE CAGE reads obtained from 29 BAM alignment files produced by the Carninci laboratory: (b) q20 quality-filtered and normalized CAGE reads measuring transcription start site (TSS) usage; (c) Paraclu-based clustering of CAGE data identifying TSS clusters. (d–i) Multiple dynamic renderings of the same ENCODE long RNA-seq data from 38 BAM alignment files produced by the Gingeras laboratory: (d) q20 quality filtering, normalization and histogram binning displaying exonic signal abundance; (e–g) split-mapped read processing showing intron, splice-donor and splice-acceptor support; (h) exonic overlap processing against Gencode v10 with RPKM normalization showing color-coded transcript abundance support; (i) compact, color-coded heatmap visualization of exonic signal with each library laid out vertically. (j) ZENBU's track-linked expression-view panel, which shows the relative abundance of signal observed in a mouse-over-selected region across the multiple experiments dynamically merged within the corresponding linked track. (This figure is available on the ZENBU website at http://fantom.gsc.riken.jp/zenbu/severin_et_al_fig1.html.)

uses a new binary file format (ZDX; see Supplementary Fig. 5). ZDX uses genome pre-segmentation and variable-size data blocks to allow fast, parallel, indexed data access and data loading similar to relational databases. The data in each ZDX block are stored in binary form as compressed XML of a versatile five-dimensional data abstraction. Signal processing is performed independently for each genome segment and stored into the track-cache ZDX files with a parallel-computing MapReduce approach12,13. Building a ZDX track cache is carried out in parallel by autonomous agents with a work-claim design similar to the eHive system14, enabling efficient and well-balanced usage of memory, CPU (central processing unit) and disk access. This also facilitates scalability and the federation of both data sources and processing power.
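The segment-parallel cache build can be pictured as a pool of agents claiming fixed-width genome segments from a shared to-do set. In the sketch below, the 100 kb segment width, the in-memory lock-based claiming and all names are assumptions standing in for ZDX's actual layout and the database-backed work-claim mechanism:

```python
# Sketch of genome pre-segmentation plus work-claim parallelism: each agent
# atomically claims one unbuilt segment, processes it and records the
# result, with no central scheduler. A threading.Lock stands in for the
# real database-backed claim step; the segment width is hypothetical.
import threading

SEGMENT_WIDTH = 100_000  # bp per segment (assumed value)

def segments_for_interval(chrom, start, end, width=SEGMENT_WIDTH):
    """(chrom, index) keys for the segments covering [start, end)."""
    return [(chrom, i) for i in range(start // width, (end - 1) // width + 1)]

pending = set(segments_for_interval("chr19", 36_000_000, 36_600_000))
built, lock = [], threading.Lock()

def agent():
    while True:
        with lock:                   # atomic claim of one work unit
            if not pending:
                return
            segment = pending.pop()
        result = ("built", segment)  # stand-in for real signal processing
        with lock:
            built.append(result)

workers = [threading.Thread(target=agent) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
# all six 100 kb segments of the interval are now in the cache
```

Because claims are atomic and segments are independent, agents can be added or removed freely, which is what allows the processing power behind a track cache to be scaled or federated.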

To demonstrate ZENBU's performance and scalability, we have loaded the entire recent ENCODE data release2 and the FANTOM3 and FANTOM4 (refs. 15,16) data sets, totaling nearly 5 Tb of raw read mappings and annotations, at http://fantom.gsc.riken.jp/zenbu/. To manage such large collections of experimental and annotation data, it is critical to annotate them with metadata and to provide an easy-to-use search system. File formats such as SAM (sequence alignment/map) and BAM provide internal metadata descriptions that are automatically stored in the system upon data upload and can be immediately searched using a faceted search system inspired by modMine17. In addition, ZENBU users can edit the metadata of their uploaded data sources, views, tracks and scripts.

ZENBU also provides a common platform for investigators to securely share and publish scientific discoveries. Data made publicly available can be freely used, and OpenID-based authentication grants users access to their private data, scripts and views, and to those shared by collaborations of which they are a member. Any user can create a secure collaboration group and invite others to join by adding their OpenID, without the need for a system administrator. This enables flexible access control and simplifies collaborative data sharing.

In summary, we believe ZENBU advances the state of the art of genome browsers and analysis systems for large data sets by providing a rich interactive visualization experience via native embedded processing. This is in contrast to other systems that provide static visualization of pre-computed data files (for example, UCSC7 and IGV9) or queue-based processing systems that must still pre-calculate results via wrappers around external programs (for example, Galaxy18).
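The faceted metadata search mentioned above reduces to two operations: counting the values of each metadata key across experiments, and conjunctively filtering on the values the user selects. A minimal sketch (the keys and records below are invented for illustration):

```python
# Minimal faceted search over experiment metadata: each facet is a metadata
# key; facet_counts() shows how many experiments carry each value, and
# select() intersects the chosen facet values. Records are illustrative.

experiments = [
    {"id": "e1", "assay": "CAGE",    "cell_line": "Gm12878"},
    {"id": "e2", "assay": "RNA-seq", "cell_line": "Gm12878"},
    {"id": "e3", "assay": "RNA-seq", "cell_line": "HeLa"},
]

def facet_counts(items, key):
    """Tally how many experiments carry each value of one facet."""
    counts = {}
    for item in items:
        counts[item[key]] = counts.get(item[key], 0) + 1
    return counts

def select(items, **chosen):
    """Keep experiments matching every chosen facet value."""
    return [item for item in items
            if all(item.get(k) == v for k, v in chosen.items())]

facet_counts(experiments, "assay")  # -> {"CAGE": 1, "RNA-seq": 2}
hits = select(experiments, assay="RNA-seq", cell_line="Gm12878")
```

Each refinement narrows the experiment list while the counts show how many results every remaining facet value would yield, which is what makes the interface navigable over thousands of samples.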
ZENBU has proven its value in the FANTOM5 consortium not only by facilitating the visualization of thousands of experiments, but also by allowing individual users to interactively reconfigure tracks to focus attention on specific experimental samples of interest. We believe future large-scale genomic data-management and visualization systems will need to empower users with seamless aggregation of their own data with the ever-growing body of publicly available data (ENCODE2, FANTOM15,16, TCGA19 and Roadmap Epigenomics1) and embrace the idea that, as the volume of available data grows, data analysis, interpretation and visualization become intertwined concepts. ZENBU is written in C++ for the server web services and in JavaScript for client-side visualization (Supplementary Fig. 6).

volume 32 NUMBER 3 MARCH 2014 nature biotechnology

ZENBU is freely available as a web service at http://fantom.gsc.riken.jp/zenbu/. ZENBU can also be installed locally from the open-source code (Supplementary Data) or via preconfigured virtual machines that we provide. Wiki-based documentation, comprising a detailed manual and a set of case studies, is also available on the website (Supplementary Note 1 and Supplementary Figs. 2–4 and 7–13).


Note: Any Supplementary Information and Source Data files are available in the online version of the paper (doi:10.1038/nbt.2840).

ACKNOWLEDGMENTS
We would like to acknowledge C. Plessy, P. Carninci and the FANTOM5 consortium members for critical feedback during development of the system. The work was funded by a research grant for the RIKEN Omics Science Center from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) to Y.H. and by a grant for Innovative Cell Biology by Innovative Technology (Cell Innovation Program) from MEXT, Japan, to Y.H. This study was also supported by research grants from MEXT through the RIKEN Preventive Medicine and Diagnosis Innovation Program to Y.H. and the RIKEN Center for Life Science Technologies, Division of Genomic Technologies, to P. Carninci.

AUTHOR CONTRIBUTIONS
J.S. designed and developed the software; J.S., N.B., C.O.D. and A.R.R.F. contributed to the design of the interface and data views; C.O.D., A.R.R.F. and Y.H. supervised the project; J.S., M.L., J.H. and N.B. contributed to the loading and curation of the data; J.S., J.H. and N.B. contributed to the source code; J.S., N.B., H.K. and A.R.R.F. wrote the manuscript.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Jessica Severin1,2, Marina Lizio1,2, Jayson Harshbarger1,2, Hideya Kawaji1–3, Carsten O Daub1,2, Yoshihide Hayashizaki2,3, The FANTOM Consortium, Nicolas Bertin1,2,4 & Alistair R R Forrest1,2

1RIKEN Center for Life Science Technologies (Division of Genomic Technologies), Suehiro-cho, Tsurumi-ku, Yokohama, Japan. 2RIKEN Omics Science Center (OSC), Yokohama, Japan. 3RIKEN Preventive Medicine and Diagnosis Innovation Program, Wako, Japan. 4Present address: Cancer Science Institute of Singapore, National University of Singapore, Singapore.
e-mail: [email protected] or [email protected]

1. Chadwick, L.H. Epigenomics 4, 317–324 (2012).
2. The ENCODE Project Consortium. Nature 489, 57–74 (2012).
3. Li, H. et al. Bioinformatics 25, 2078–2079 (2009).
4. Nielsen, C.B., Cantor, M., Dubchak, I., Gordon, D. & Wang, T. Nat. Methods 7, S5–S15 (2010).
5. Saito, T.L. et al. Bioinformatics 25, 1856–1861 (2009).
6. Stein, L.D. et al. Genome Res. 12, 1599–1610 (2002).
7. Kuhn, R.M., Haussler, D. & Kent, W.J. Brief. Bioinform. 14, 144–161 (2013).
8. Hubbard, T. et al. Nucleic Acids Res. 30, 38–41 (2002).
9. Robinson, J.T. et al. Nat. Biotechnol. 29, 24–26 (2011).
10. Zhang, J. et al. Database (Oxford) 2011, bar038 (2011).
11. Derrien, T. et al. Genome Res. 22, 1775–1789 (2012).
12. Frith, M.C. et al. Genome Res. 18, 1–12 (2008).
13. Wei, G. et al. Immunity 35, 299–311 (2011).
14. Severin, J. et al. BMC Bioinformatics 11, 240 (2010).
15. Carninci, P. et al. Science 309, 1559–1563 (2005).
16. Suzuki, H. et al. Nat. Genet. 41, 553–562 (2009).
17. Contrino, S. et al. Nucleic Acids Res. 40, D1082–D1088 (2012).
18. Giardine, B. et al. Genome Res. 15, 1451–1455 (2005).
19. The Cancer Genome Atlas Research Network et al. Nat. Genet. 45, 1113–1120 (2013).

