CMGRN: a web server for constructing multilevel gene regulatory networks using ChIP-seq and gene expression data.

Bioinformatics Advance Access published January 2, 2014

Gene regulation

CMGRN: a web server for constructing multi-level gene regulatory networks using ChIP-seq and gene expression data Daogang Guan1, Jiaofang Shao1, Youping Deng2, Panwen Wang3, Zhongying Zhao1, Yan Liang1, Junwen Wang3,4, Bin Yan1* 1

2

Department of Biology, Hong Kong Baptist University, Hong Kong, China. Department of Internal Medicine and 3 Biochemistry, Rush University Medical Center, Chicago, IL, USA. Department of Biochemistry and HKU-SIRI, 4 and Centre for Genomic Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong, China. Associate Editor: Prof. Gunnar Ratsch

1

INTRODUCTION

The combinational performance of transcriptional and epigenetic factors is central regulatory mechanisms to control gene expression networks. A fundamental challenge in post-genome era is to dissect such complex regulation associations underlying biological functions. ChIP-seq technology allows high fidelity mapping of different regulators, such as transcription factors (TFs) or epigenetic modifications to genomic locations, thus providing a basis for profiling transcriptional or epigenetic regulatory relationships. Accurate binding features of these factors on the genomic sequences can also be employed to identify regulatory modules and reconstruct gene regulatory networks (GRNs). Various algorithms and computational methods have been proposed to investigate gene regulation promoted by cooperation of TFs and epigenetic alterations. Some online tools or servers were designed to build GRNs derived from ChIP-seq data, for instance, Cscan (Zambelli, et al., 2012) and ChIP-Array (Qin, et al., 2011). Although the existing web-based software can predict potential target genes of the multiple regulators, they are not able to discover hierarchical organizations formed by cross-interactions between the regulators and genes simultaneously. Currently, there is a lack of effective web resources to generate the topology of networks controlled by the interacting factors at transcriptional, post-transcriptional and epigenetic layers. *To

whom correspondence should be addressed.

To provide an easy-to-use bioinformatics tool to interpret these ChIP-seq high-throughput data, we devised an integrative web server, Constructing Multi-level Gene Regulatory Networks (CMGRN) to unveil interrelationships between TFs or epigenetic modifications and to robustly construct hierarchical regulatory network structures in biological complex systems. It is systemwide and enables biologists to analyze the mixture information of ChIP-seq and gene expression levels, using standard data formatted with minimal need of bioinformatics skills.

2 METHODS 2.1 Overview The schematic procedure and layout of CMGRN are illustrated in Figure 1. Our server provides a web-based framework with two main tasks: 1) Inference of causal relationships between TFs or epigenetic modifications by employing Bayesian network-based methodology. One input is required: gene counts of TFs or epigenetic modifications based on ChIP-seq BED data mapped to genomic regions. 2) Construction of GRNs coordinated by the multi-level factors. Two types of input are required: regulatory signal of TFs, epigenetic modifications or microRNAs and gene expression data. The regulatory signal represents binding/modification activities of TFs or epigenetic factors identified by calling peaks of ChIP-seq, or binding targets of microRNAs. Through the integrated analysis, CMGRN would establish hierarchical regulatory network organizations at multiple levels, which quantify the interactions among regulators and genes by employing Bayesian hierarchical model and Gibbs sampling. Thus, CMGRN can generate two outputs following the two procedures, and also provides an option for network reconstruction depending on what types of input data (Figure 1). The first one will obtain a causal network linking TFs or epigenetic factors if only ChIP-seq gene counts are provided. The second one will construct a final network exploring interactions of both regulatorregulator and regulator-gene if all three types of input are satisfied. The output tables and figures will be displayed via a user-friendly web interface and be easily downloaded. Usage of our service is simple and does not need user’s knowledge on specific algorithm. The detailed information for the web tool implementation is explained in supplementary information and help page of the website.

2.2 Algorithms and statistical methods 2.2.1 Inference of causal relationships between regulators We employed a Bayesian network-based model to infer causal interactions among regulators (TFs or epigenetic modifications), and examine how these regulators influence or associate with each other. The basic algorithm of Bayesian network structure searching for possibility of dependence between two nodes was described by Hecherman et al (Heckerman, et al., 1995). There are two challenges to overcome, high posterior probability and the compelled edges in the network. In particular, compelled edge identification is critical because these edges indicate causal relationships when certain assumptions hold. A series of optimization is needed for such “cause-effect” network structures with high posterior probability. In this study, we adopted BD metric (Bayesian metric with Dirichlet prior) strategy proposed by

© The Author (2014). Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

1

Downloaded from http://bioinformatics.oxfordjournals.org/ at Belgorod State University on February 3, 2014

ABSTRACT Summary: ChIP-seq technology provides an accurate characterization of transcription or epigenetic factors binding on genomic sequences. With integration of such ChIP-based and other high-throughput information, it would be dedicated to dissecting cross-interactions among multi-level regulators, genes and biological functions. Here, we devised an integrative web server CMGRN, to unravel hierarchical interactive networks at different regulatory levels. The newly developed method used the Bayesian network modeling to infer causal interrelationships among transcription factors or epigenetic modifications by using ChIP-seq data. Moreover, it employed Bayesian hierarchical model with Gibbs sampling to incorporate binding signals of these regulators and gene expression profile together for reconstructing gene regulatory networks. The example applications indicate that CMGRN provides an effective web-based framework that is able to integrate heterogeneous high-throughput data, and to reveal hierarchical “regulome” and the associated gene expression programs. Availability: http://bioinfo.icts.hkbu.edu.hk/cmgrn; http://www.byanbioinfo/cmgrn Contact: [email protected] Supplementary information: supplementary data are available at bioinformatics online

Heckerman et al, 1995 to find the high likelihood of posterior probability. First, we set a Bayesian network of regulators as: B D, ,

"

2 A B C" DE " F"A G A A

Figure 1. An overview of data integrated analysis and regulatory network construction. The triangle nodes represent TFs (pink), epigenetic regulators (red) and cofactors (blue), whereas the V-type nodes represent microRNAs (yellow). Target genes were indicated by using the circle nodes. D , … , , , D is an acyclic directed graph (DAG); represents nodes (regulators) of the DAG, is set of edges of DAG, contains the set of conditional probability distributions that correspond to D. To find likelihood structures, the probability can be calculated by following formula:

log , | log | , log | where is the (possibly) null parent of . For edge " → , the strategy is to associate with a weight $ , " % log $ | " % & log | , is a node set without edges. A weight is defined for edge to " as ω$ , " % ()*+$ , " % & ()*+ |- . To search for the best network structure, we utilized maximal value for ∑ , . The above algorithm uses greedy strategy, so the next step is to identify compelled edges as potential network structures. Chickering developed mathematical algorithm for identifying compelled edges between two nodes in order to determine their causal relationship (Chickering, 1995). Yu, et al., applied a similar methodology to predict causal relationships between histone modifications and gene expression in human T cells (Yu, et al., 2008). The ChIP-seq reads in BED format were mapped to genomic regions and were then transferred to gene counts or signals of TFs or epigenetic modifications that could be detected. CMGRN used the gene count of ChIP-seq as input for inferring the causal relationships between regulators. First, the input gene counts will be discretized by using k-measns++ algorithm. The k-means method is a widely used clustering technique that seeks to minimize the average and squared distance between points in the same cluster. Although its simplicity and speed are very appealing in practice, it oﬀers no accuracy guarantees for discretization. One reason is that k-means usually chooses initial k centers in an arbitrary way. If the centers are not properly taken, the resulting clusters will be also arbitrary and thus leading to generation of unstable “cause-effect” relationships. In this study, we adopted a kmeans++ method which proposes a speciﬁc approach to choose centers like “ 1 23*4536*" (Arthur and Vassilvitskii, 2007). The main step of kmeans++ is: 1) take one center 8 , chosen uniformly at random from χ; 2) take a new center 8 , choosing x ∈ χ with probability ∑

> ?∈@

TFmiR: a web server for constructing and analyzing disease-specific transcription factor and miRNA co-regulatory networks.

PTHGRN: unraveling post-translational hierarchical gene regulatory networks using PPI, ChIP-seq and gene expression data.

WoPPER: Web server for Position Related data analysis of gene Expression in Prokaryotes.

CMIP: a software package capable of reconstructing genome-wide regulatory networks using gene expression data.

GEPIA: a web server for cancer and normal gene expression profiling and interactive analyses.

Urothelial cancer gene regulatory networks inferred from large-scale RNAseq, Bead and Oligo gene expression data.

Data Integration for Microarrays: Enhanced Inference for Gene Regulatory Networks.

Inferring interaction type in gene regulatory networks using co-expression data.

Analytical Strategy to Prioritize Alzheimer's Disease Candidate Genes in Gene Regulatory Networks Using Public Expression Data.

Inferring nonlinear gene regulatory networks from gene expression data based on distance correlation.

Reverse Engineering of Genome-wide Gene Regulatory Networks from Gene Expression Data.

Properties of sparse penalties on inferring gene regulatory networks from time-course gene expression data.

Genomic data assimilation using a higher moment filtering technique for restoration of gene regulatory networks.

A Sparse Reconstruction Approach for Identifying Gene Regulatory Networks Using Steady-State Experiment Data.

Galahad: a web server for drug effect analysis from gene expression.

SEGEL: A Web Server for Visualization of Smoking Effects on Human Lung Gene Expression.

Foundations for modeling the dynamics of gene regulatory networks: a multilevel-perspective review.

Gene Regulatory Network Analysis for Triple-Negative Breast Neoplasms by Using Gene Expression Data.

RAMONA: a Web application for gene set analysis on multilevel omics data.

Mergeomics: a web server for identifying pathological pathways, networks, and key regulators via multidimensional data integration.

Recursive random forest algorithm for constructing multilayered hierarchical gene regulatory networks that govern biological pathways.

Inferring active regulatory networks from gene expression data using a combination of prior knowledge and enrichment analysis.

Fast Bayesian inference for gene regulatory networks using ScanBMA.

Genonets server-a web server for the construction, analysis and visualization of genotype networks.