Journal of Bioinformatics and Computational Biology Vol. 12, No. 6 (2014) 1442006 (16 pages) # .c The Authors DOI: 10.1142/S0219720014420062

Supervised learning method for predicting chromatin boundary associated insulator elements

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by 69.70.201.226 on 01/24/15. For personal use only.

Paweł Bednarz* and Bartek Wilczyński† Institute of Informatics, Warsaw University, Banacha 2 Warsaw 02-089, Poland *[email protected][email protected] Received 14 July 2014 Revised 12 September 2014 Accepted 12 September 2014 Published 11 November 2014 In eukaryotic cells, the DNA material is densely packed inside the nucleus in the form of a DNAprotein complex structure called chromatin. Since the actual conformation of the chromatin ¯ber de¯nes the possible regulatory interactions between genes and their regulatory elements, it is very important to understand the mechanisms governing folding of chromatin. In this paper, we show that supervised methods for predicting chromatin boundary elements are much more e®ective than the currently popular unsupervised methods. Using boundary locations from published Hi-C experiments and modEncode tracks as features, we can tell the insulator elements from randomly selected background sequences with great accuracy. In addition to accurate predictions of the training boundary elements, our classi¯ers make new predictions. Many of them correspond to the locations of known insulator elements. The key features used for predicting boundary elements do not depend on the prediction method. Because of its miniscule size, chromatin state cannot be measured directly, we need to rely on indirect measurements, such as ChIP-Seq and ¯ll in the gaps with computational models. Our results show that currently, at least in the model organisms, where we have many measurements including ChIP-Seq and Hi-C, we can make accurate predictions of insulator positions. Keywords: Chromatin boundary; Bayesian network; random forest; insulator element.

1. Introduction One of the de¯ning features of a eukaryotic cell is the compartmentalization of the genomic DNA into the nucleus. This structural feature of eukaryotic cells has many consequences for its functioning. In particular, the enormous compaction of the DNA material necessitates a whole range of regulatory mechanisms acting at the level of chromatin.1 Many proteins are involved in regulating the dynamic state of the chromosomes    making it possible for the cells to maintain a stable epigenetic state †Corresponding

author. 1442006-1

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by 69.70.201.226 on 01/24/15. For personal use only.

P. Bednarz & B. Wilczy n  ski

of the chromosomes. At any given time-point certain parts of the chromosomes remain closed and inaccessible for the transcription machinery, while other genes can remain open and actively transcribed.2 This ability of cells to maintain certain parts of chromosomes in di®erent states is absolutely essential for robust development of multicellular organisms, which need to utilize di®erent gene batteries in di®erent tissues.3 Even though the miniscule size of chromatin makes it challenging to make observations of the details of the chromatin state it has been an object of scienti¯c interest for over a hundred years.4 The modularity of chromatin has been observed in many di®erent experimental assays, beginning with the early stainings of the karyotype,5 through multiple assays using light6 and electron microscopy7 to ¯nally the latest assays using next-generation sequencing in population of cells8 and single cell Hi-C experiments.9 All these studies show evidently, that each eukaryotic chromosome is divided into multiple distinct chromosomal domains and that these domains remain largely consistent in their boundaries between experimental conditions and even between species.10 These, so-called topologicaly associated domains (TADs), are generally thought to be mostly de¯ned through large-scale chromosomal interactions and independent of the cell type in an organism. As they usually contain many genes, they are considered a large-scale sca®old upon which the ¯ne-scale regulation of epigenetic state of di®erent genes occurs. It is an important question, what is the molecular mechanism of stabilization of the TAD boundary location. While such mechanisms are currently not fully elucidated, we have at least statistical data correlating the locations of the TAD boundaries and presence of certain proteins (see Fig. 1(a)). In case of TAD boundaries in Drosophila melanogaster genome, the presence of the CP190 protein has been statistically associated with TAD boundary locations.8 Importantly, this protein does not bind speci¯c DNA sequences and is considered to be a co-factor recruited by di®erent complexes of DNA-binding factors. While the association between the positioning of the CP190 protein with boundary location is clear, it remains an open question to what extent it is a de¯ning factor in the TAD organization. Given the large size of such domains (10 5 –10 7 nucleotides) it is unlikely to be the only factor. On the other hand, many studies of tissue-speci¯c gene regulation have shown that transcriptional regulation in metazoan genomes involves usually arrays of tissue-speci¯c regulatory elements    so called enhancers    that are mediating 11 selective activation of gene promoters. As enhancers can be positioned quite far from their target promoters, they require chromosomal interactions to perform their function. This leads to another level of chromosomal interactions, mediated by relatively short range loops between enhancers and promoters (see Fig. 1(b)). Recent studies12,13 have uncovered that although these interactions might sometimes span hundreds of thousands of basepairs, they are still relatively short compared to the sizes of topological domains. Importantly, these interactions require tight control, as the activity of enhancers is highly tissue speci¯c and from the point of view of gene regulation it is highly undesirable to allow any spurious promoter–enhancer 1442006-2

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by 69.70.201.226 on 01/24/15. For personal use only.

Supervised learning method for predicting chromatin boundary associated insulator elements

Fig. 1. Outline of the approach. The chromatin domain structure (panel A, top) can be experimentally measured with Hi-C technology. The boundary locations emerging from the Hi-C data are statistically enriched with binding of insulator proteins such as CTCF (panel A, bottom). The insulator functional DNA elements (panel B) are recruiting speci¯c insulator factors facilitating DNA looping necessary for mediation of the desired enhancer–promoter interactions and blocking spurious interactions. In this example, we see the well-known example of the fushi tarazu gene that is insulated from the in°uence of the surrounding regulatory elements of the sex combs reduced gene. Our supervised learning approach (panel C) begins with a subset of topological boundary elements established from a genome-wide Hi-C experiment and constructs a training set for a classi¯er. Then we assess the quality of the classi¯er, measure the relative feature importance and make new predictions in the Drosophila melanogaster genome.

interactions. Given the vast number of interactions between enhancers and promoters, understanding transcriptional regulation in development might prove to be impossible without understanding the function of regulatory domains.14 The role of policing tissue-speci¯c regulatory interactions is usually assigned to the so-called insulator elements.1 These short DNA fragments harbor binding sites for specialized insulator proteins such as the CTCC-binding factor (CTCF) or 1442006-3

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by 69.70.201.226 on 01/24/15. For personal use only.

P. Bednarz & B. Wilczy n  ski

boundary element associated factor (BEAF32) and allow them to form stable loops that stabilize local chromatin structure. It has been shown that there exists a hierarchy of such elements,15 with di®erent proteins responsible for formation of chromatin loops at di®erent scales. Given this dichotomy between functional insulator elements, supposedly acting mostly at the relatively short range and the higher level of interactions leading to TAD formation and \emerging" TAD boundaries, we set out to ¯nd out if it is possible to bridge the gap between these two paradigms with a predictive computational model. As we need to have both the data on local chromatin state and di®erent levels of chromatin interactions we have chosen Drosophila melanogaster genome for our study. For this well studied model organism with moderate genome size there exist both many data on chromatin associated proteins from the modEncode project16 and relatively high-resolution maps of chromosomal interactions from the recent study performed with the Hi-C technology.8 Our approach (see Fig. 1(c)) is based on the main idea to use the boundary locations established by the Hi-C protocol and using them as positive examples in a supervised cross-validated machine learning scheme. While our approach is not the ¯rst to make an attempt at computational prediction of chromatin boundaries, the previous approaches were either aimed at the unsupervised detection of TADs from ChIP-Seq data17 or supervised approaches to ¯nding insulator elements from sequence motifs.18 Our study aims to use the synergy between the utilization of the richer ChIP-Seq data and the more powerful supervised approach to see if a uni¯ed model can solve both problems: TAD delineation and insulator element prediction. In the following sections, we will show the results of applying Bayesian network19 and random forest20 classi¯ers in the task of predicting chromatin boundaries in the Drosophila genome and the analysis of the relative importance of di®erent features used together with some exemplary predictions made by the model. 2. Results 2.1. Predicting boundary elements from modEncode tracks In order to test whether it is possible to predict domain boundary locations, we ¯rst needed to compile the training set. We chose the recent whole-genome study of chromosomal interactions in Drosophila melanogaster assayed with the Hi-C technique.8 This gave us more than a thousand examples of topological boundary elements in a genome that is well studied and relatively small (less than 200 million bases) to facilitate computational approaches. For the training features we have selected over thirty ChIP-Chip and ChIP-Seq tracks from the modEncode16 project that were assayed during the embryonic development making them relevant for the prediction of the boundary elements. As the negative set, random windows of certain size were chosen from the genome. In order to generate actual measurements, the average signal in a window of a given size was computed for each of the tracks. We have used both supervised and unsupervised methods on this very dataset to 1442006-4

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by 69.70.201.226 on 01/24/15. For personal use only.

Supervised learning method for predicting chromatin boundary associated insulator elements

compare their relative performance. The details of the comparison are presented in the following paragraphs and in the methods section. As the simplest, baseline method, we have chosen the k-means clustering. This approach represents the simplest of the unsupervised approaches: We assume that there is a ¯xed number of domain types and use all observed windows of a given size to ¯nd the clustering of observations into k classes. Then we use the assigned clusters as labels for genomic locations and assign domain boundaries to locations where the label changes. As we can see in Fig. 2, this approach, depending on the choice of the parameter k, gives the sensitivity (true-positive rate, tpr) of 0.2 to 0.3 at the falsepositive rate (fpr) of approximately 0.1. While this is far from perfect, it should be noted that it is statistically signi¯cantly better than random. In the same Fig. 2, we can see the performance of the ChromHMM method applied to the same data. As expected, the method based on hidden Markov models outperforms the k-means method giving us up to 0.4 tpr with only slightly higher fpr (0.15). In comparison with the unsupervised methods, the cdBEST method18 that was trained on several known Drosophila boundary elements gives a clear improvement with fpr almost the same as ChromHMM and tpr greater than 0.5. However, this is still quite far from the performance needed in practical applications. It should be noted here, that the cdBEST method is not using chromatin data to make predictions. When we compare the performance of the BNFinder21 on the same data as the unsupervised approaches, we can see that it can very clearly outperform all the above-mentioned methods. In particular, at the threshold of the fpr of 0.3, BNFinder classi¯er reaches the sensitivity of more than 0.8. Overall, by varying the posterior

Fig. 2. Comparison of the unsupervised and supervised classi¯ers. Comparison of performance of unsupervised (k-means, chromHMM), motif-based (cdBEST) and supervised methods (BNFinder) on the ROC plot clearly indicates that using BNFinder on the modEncode data gives better results than either using unsupervised methods or using supervised method on sequence features trained from a few examples. 1442006-5

P. Bednarz & B. Wilczy n  ski

probability threshold, we can measure the performance of BNFinder classi¯er to be more than 0.89 (area under receiver operating characteristics curve  AUC). This can be interpreted as just above 10% of wrongly ordered pairs between positive and negative examples. While this is not a perfect score, it shows promise that the supervised approach on the ChIP-Seq data can be successfully used for predicting boundary locations.

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by 69.70.201.226 on 01/24/15. For personal use only.

2.2. Predicting boundary elements from modEncode tracks and sequence features Given the results from the previous section, we wanted to verify, whether the improvements from using supervised methods on ChIP-Seq data and on sequence features can be combined to obtain an even better performing classi¯er. As the BNFinder software is too computationally expensive for use with hundereds of features, we have decided to use random forest classi¯ers as they have been shown previously to perform well in such scenarios. As a simple sequence feature, we have chosen 4-mers (words of length 4) occurence in each of the sequences. This has been recently used to predict enhancers22 and gives comparable results to using sequence motifs23 while being much simpler to compute. As we can see in Fig. 3, the random forest based classi¯er using only words of length 4 gives signi¯cantly better performance than the cdBEST method, however, not as good as the one using ChIP-Seq data. Overall, the classi¯er combining sequence and ChIP-Seq data is the best performer (AUC ¼ 0:92), however, it is practically indistinguishable from the one using

Fig. 3. Comparison of Random Forest classi¯ers using di®erent feature sets. Comparison of the performance of random forest classi¯ers trained on the same training set of boundary locations from Hi-C data, but with di®erent classi¯cation features. The performance of the joint classi¯er is comparable with the one based only on modEncode data, both clearly superior to the sequence features alone or the unsupervised methods. 1442006-6

Supervised learning method for predicting chromatin boundary associated insulator elements

just the ChIP-Seq data. This would support the hypothesis that the sequence, while important, does not contain all the information needed to de¯ne a boundary element. It seems to indicate that there is some context information encoded in either cofactor recruitment or epigenetic modi¯cations that is required before the sequence speci¯c factor can ¯nd their binding sites. However, it seems that all the information encoded in the sequence can be retrieved from the modEncode ChIP-Seq data. It should be noted that this result does not signi¯cantly depend on the length of the k-mer, as we have tested also k ¼ 2; 3; 4; 5; 6 (see Supplementary Table 1).

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by 69.70.201.226 on 01/24/15. For personal use only.

2.3. Comparing relative feature importance As we can see that using di®erent feature sets can lead to di®erent results, we have set out to compare the relative importance of particular features utilized in classi¯ers on di®erent cross-validation folds of data. These data are presented in Fig. 4, giving the relative importance on the Y -axis in the form of box-plots summarizing results from multiple runs in di®erent cross-validation experiments. First of all, it is reassuring to see that the feature importance of di®erent modEncode datasets is very similar between Bayesian network classi¯er and the random forest (panels A and B). The most important seem to be the CP190 and BEAF32 with CTCF and CtBP following closely. These are quite consistent with previously reported correlations between insulator proteins and boundary locations,8,24 however, it is important to say that neither one of the tracks is enough to make these predictions and that the ¯nal prediction score depends (in particular in the case of random forests) on multiple features. We have also analyzed the role of sequence features used in the classi¯ers. First thing to note is that no single k-mer has importance comparable with that of the top ChIP-Seq features. This leads to the fact that in the case of combined classi¯er, the top 20 features is practically the same as in the case of ChIP-Seq data alone (Supplementary Figure 1), however when we used the Boruta package25 that allows for correcting for the expected information content in each feature, many of the sequence features appear to be more signi¯cant than it seems from their direct contribution to prediction (Supplementary Figure 2). Importantly, all of the expected insulator binding words (CTCC for CTCF, GAGA for GAF and GATC for BEAF32) are found to be signi¯cant by Boruta (Supplementary Table 2). 2.4. Validation of the approach While it is reassuring that our approach gives very good results in a cross-validation experiment on a balanced training set, in order to make sure that the approach is generalizable to di®erent datasets we have decided to test it on a di®erent training set. As there are no other genome-wide Hi-C experiments for Drosophila, we have decided to use the dataset published by Filion et al.2 based on HMM segmentation of data from multiple DamID experiments. This dataset is quite realistic, as it contains more than 8,000 domains and seems to be slightly more noisy than the original. Since 1442006-7

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by 69.70.201.226 on 01/24/15. For personal use only.

P. Bednarz & B. Wilczy n  ski

Fig. 4. Relative feature importance in di®erent boundary element classi¯ers. Comparison of the relative feature importance in di®erent classi¯ers: BNFinder (a), random forest based on modEncode data (b) and random forest based on sequence data (c).

1442006-8

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by 69.70.201.226 on 01/24/15. For personal use only.

Supervised learning method for predicting chromatin boundary associated insulator elements

Fig. 5. Examples of new predictions made by the boundary element classi¯ers. The ¯gure presents several loci exemplifying predictions made with the presented approach. The AbdB locus (panel A), exemplifying the overlap of our predictions with the locations of known Fab insulator elements; The ftz locus with two predictions of known insulators mediating the looping (panel B, see also Fig. 1(b)). A novel prediction of an insulator element present between the tinman and bagpipe genes (panel C). The Dad locus exemplifying how di®erent protein occupancy pro¯les lead to predictions of insulator elements (panel D).

1442006-9

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by 69.70.201.226 on 01/24/15. For personal use only.

P. Bednarz & B. Wilczy n  ski

the authors of the Hi-C study we used previously report signi¯cant coincidence between their 1,000 boundaries and the DamID derived ones, we have decided to remove the overlapping boundaries from the training set leaving us with more than 7,000 examples. As it turns out the classi¯ers (both using BNs and RFs) trained on the same features but with completely nonoverlapping set of positive examples give still highly nonrandom results (AUC  0:74). More importantly the most important feature being CP190 is found in this set as well as in the original data leading us to conclusion that it is a bona ¯de predictive feature of the biological phenomenon and not just an artifact of the particular dataset. Lastly, we have analyzed the set of approximately 3,500 predictions given by our approach in the whole Drosophila genome to test whether we can see some new predictions that would be useful to biologists. In Fig. 5(a), we can see the Abd-B gene locus with locations of four known insulators.18 Even though these were not the part of any of the training sets used in our study, all of the supervised approaches can ¯nd them perfectly among their predictions. It should be noted that these boundary elements were part of the training set for the cdBEST tool and that this tool can predict them as well. In Fig. 5(b), we are showing the ftz locus mentioned already in the introduction, where again all our classi¯ers are making correct predictions of the two known insulator elements,26 even though they were not present in the training dataset. It should be noted that the prediction tracks are shown from the level of 0:5 showing only the signi¯cant predictions, so that even a low signal that is visible in the random forest classi¯er tracks is in fact signi¯cant. In Fig. 5(c), we can see the new, biologically relevant prediction of a boundary element between the genes tinman and bagpipe.27 Again, even though this particular location is absent from the training set, our classi¯er makes a proper call as we know that the temporal occupancy of enhancers on both sides of this putative insulator element is di®erent and matching the di®erent expression patterns of the genes tinman and bagpipe, respectively.28 Finally, as exempli¯ed by the Dad locus in Fig. 5(d), our classi¯ers are quite °exible in providing high con¯dence predictions even if the underlying ChIP-Seq signals are visibly quite di®erent. This proves, that the machine learning approach is indeed necessary and able to learn more complex patterns found in the data.

3. Materials and Methods 3.1. Data material In this study, we used 34 tracks from the modEncode library.16 Since Hi-C data used further in the analysis were measured between 16th and 18th hour of embryo development, we chose the datasets from the stage of development closest to this period. Table 1 shows the chosen tracks. Data was binned using bins of size 4,000 bp and averaged within the bins. This length was chosen based on the comparison of 1442006-10

Supervised learning method for predicting chromatin boundary associated insulator elements Table 1. Collection of modEncode16 datasets used for classi¯er training.

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by 69.70.201.226 on 01/24/15. For personal use only.

Name of the factor BEAF-32 cbp CP190 CTCF-C CTCF-N GAF H3K27ac H3K27me3 H3K36me3 H3K4me1 H3K4me3 H3K9ac H3K9me3 HDAC11 2629 HDAC11 2630 HDAC3 2633 HDAC3 2634 HDAC4a 2627 HDAC4a 2628 HDAC6 2631 HDAC6 2632 HP1a HP1b 3015 HP1b 3290 HP1c HP1 3391 HP1 3392 HP2 mod(mdg4) pol2 Su(Hw) 27 Su(Hw) 901 total-RNA 111 total-RNA 112

Hour of development

Technology used

modEncode id

0–12 0–12 0–12 0–12 0–12 16–24 16–20 16–20 14–16 16–20 16–20 16–20 16–20 0–12 0–12 0–12 0–12 0–12 0–12 0–12 0–12 14–16 14–16 14–16 14–16 16–24 16–24 14–16 8–16 16–20 0–12 0–12 16–18 16–18

ChIP-Chip ChIP-Chip ChIP-Chip ChIP-Chip ChIP-Chip ChIP-Seq ChIP-Seq ChIP-Seq ChIP-Chip ChIP-Seq ChIP-Seq ChIP-Seq ChIP-Seq ChIP-Seq ChIP-Seq ChIP-Seq ChIP-Seq ChIP-Seq ChIP-Seq ChIP-Seq ChIP-Seq ChIP-Chip ChIP-Chip ChIP-Chip ChIP-Chip ChIP-Seq ChIP-Seq ChIP-Chip ChIP-Chip ChIP-Chip ChIP-Chip ChIP-Chip RNA-tiling-array RNA-tiling-array

21 607 22 769 770 3238 838 814 931 781 793 825 805 2629 2630 2633 2634 2627 2628 2631 2632 3024 3015 3290 3021 3391 3392 3025 2602 888 27 901 111 112

performance of two classi¯ers using Bayesian networks and random forests in 10-fold cross-validation tests. We tested bin sizes from 300 bp to 10,000 bp and both of the classi¯ers unequivocally showed the best performance with the bin size equal to 4,000 bp (see Supplementary Tables 3 and 4). 3.2. Construction of training and test sets For the construction of the training set we used the partition of Drosophila genome into regulatory domains from the work by Sexton et al.,8 which was obtained by comparing the expected number of interactions between every two regions on a genome and the results of Hi-C experiment. We centered bins of size 4,000 bp around the boundaries between the domains and for such bins computed vectors of average signal of modEncode tracks. This gave us 1,175 positive examples of boundaries, 1442006-11

P. Bednarz & B. Wilczy n  ski

which we complemented with the same number of negative examples randomly drawn from the genome. We made sure that the bins corresponding to negative and positive examples do not overlap by more than half of their size. For the construction of a test set we divided the whole genome uniformly into bins of size 4,000 bp.

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by 69.70.201.226 on 01/24/15. For personal use only.

3.3. K-means clustering To perform k-means clustering, we used the algorithm implemented in the Biopython package (see Chapman et al.29 and de Hoon et al.30). After averaging signals inside every bin, we treated vectors of 34 values (each one corresponding to a respective modEncode track) as coordinates of points in 34-dimensional space and performed a clustering with enforced clusters number of 4, 8, and 34. After assigning every bin to one cluster we obtained the partitioning of the genome into distinct classes. In order to eliminate too short domains, we left only those ones, that were not shorter than 8,000 bp, assigning shorter ones to a domain preceding it. By scanning the genome bin by bin we could then identify places, where the class was changing and those places we returned from the method as a map of putative domain boundaries. 3.4. Clustering with hidden Markov models For unsupervised clustering with hidden Markov models we used software called ChromHMM (see Ernst et al.17). To reduce computational complexity of the problem authors decided to ¯rst binarize input data. Since the binarization method used by ChromHMM was designed to deal with ChIP-Seq data and some of datasets from modEncode were only measured with ChIP-Chip technology, we decided to binarize the data ourselves. For more details on this procedure, see supplemental materials. After the binarization step we learned hidden Markov models and using Viterbi algorithm obtained the assignment of every bin to one of the classes (again we tried 4, 8, and 34 di®erent classes). Removing single-bin domains (identically to what was done in case of k-means clustering), led us to the map of domain boundaries. 3.5. Training Bayesian network classi¯ers We used BNFinder19 to train classi¯er based on Bayesian networks on training set derived from Hi-C data and modEncode tracks. We ¯rst performed 10-fold crossvalidation on the training set which gave the mean area under the ROC curve of 0.9062, reassuring that what we see is not a result of over¯tting. Then we used a classi¯er trained on the whole training set on the dataset prepared from the whole genome of Drosophila. Since the problem of learning Bayesian network for such a huge datasets is computationally challenging, we had to limit our procedure of training the classi¯er. First, we imposed the structure of Bayesian network where all the modEncode features are potential parents to the introduced variable BOUNDARY (which was the only added variable) and those are the only edges that we allowed for in the network (see Supplementary Figure 3). We also restricted our 1442006-12

Supervised learning method for predicting chromatin boundary associated insulator elements

searching procedure to consider only 10 potential parents of a node. The exact invocation of a BNFinder tool is available in supplemental materials.

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by 69.70.201.226 on 01/24/15. For personal use only.

3.6. Training random forest classi¯ers Using package scikit-learn31 we also trained random forest classi¯er on three di®erent training sets. The ¯rst one was identical to the one used in training Bayesian network-based classi¯er, the second one used k-mer content of the same 1,175 positive and 1,175 negative examples as features. Finally, the last one combined both training sets. In 10-fold cross-validation we obtained area under the ROC curve of 0.9238 again making sure that our classi¯er is able to generalize to new pffiffiffiffiffi test sets. During training we used 100 trees in the forest and we considered d 35e ¼ 5 features in every split. For the ¯nal predictions on the whole genome (see Fig. 5) we used the procedure of moving window of size 4,000 bp with 100 bp-long step and averaged the values over every 100 bp bin to smooth the signal. 3.7. Comparison to supervised methods and cdBEST We used obtained classi¯ers to ¯nd the boundaries across the genome of Drosophila. To compare both supervised and unsupervised methods, we checked how far are their predictions from those of Sexton et al.8 For unsupervised methods, we treated a prediction as correct when this distance was not larger than 4,000 bp. In the same manner, we computed the quality of boundaries from Srinivasan et al.18 For supervised methods, we simply drew ROC curve using the vector of probabilities given by the classi¯er and the vector of zeros and ones, where we had one if there was a boundary from Sexton et al.8 present in the bin. The second method of measuring the performance is more restrictive: Instead of having a boundary anywhere within a 8,000 bp-large window centered around the prediction, we demanded a boundary to be exactly within a bin of size 4,000 bp for which we predicted it. 3.8. Preparation of training set for classi¯er based on k-mers For comparison of the classi¯cation based on modEncode features, we developed a classi¯er based on sequential features. We used k-mer approach, which uses all the possible k-mers as features, count their occurrences in positive and negative dataset and try to infer, which of them are predictive for one of the groups. We tested parameter k from 2 to 6 and we observed the best performance for 4-mers. We also mixed 4-mer features with modEncode data which resulted in classi¯er slightly better than when we used only modEncode data.

4. Discussion and Conclusion In this paper, we have shown a new approach for genome-wide prediction of boundary elements from ChIP-Seq data. From our results, we can conclude 1442006-13

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by 69.70.201.226 on 01/24/15. For personal use only.

P. Bednarz & B. Wilczy n  ski

that, at least in case of the model organism Drosophila melanogaster, where the functional genomics data is available through the e®orts of the modEncode project, it is now possible to accurately predict (AUC  0:9) locations of boundary elements. Another conclusion from our study is that the ChIP-Seq data on histone modi¯cations and chromatin associated protein occupancy is a superior source of features for such a classi¯er. Given that the ChIP-Seq data tracks can improve sequencebased classi¯ers and that adding sequence information does not improve the ChIPSeq based classi¯er we speculate that while sequence motifs are important for recruitement of some key insulator factors, then the formation of active boundary elements depends on some additional context re°ected in the ChIP-Seq data. Currently, we are unable to tell the exact structure of causal dependencies in this problem, but we think that we are partly using the actual context epigenetic information that is causal to boundary formation and some downstream e®ects of such event. This hypothesis is consistent with what we see in the comparison of feature importance in random forest based classi¯ers. The expected insulation factors CP190, BEAF32, and CTCF are the most important factors determining boundary formation, however, many additional variables play a role in speci¯c locations. On the sequence level, we can see that all the expected k-mers relevant for insulator factor recruitment are found to be signi¯cant (CTCC, GATC, and GAGA), however, there are many nonspeci¯c sequences that seem to be more generally connected to chromatin openness (e.g. nucleosome repelling stretches of poly A). Finally, we are able to show, that not only is our approach succesful in crossvalidated supervised learning experiments, but it is more generally applicable to di®erent datasets. We show it both by using a di®erent training set of indirectly established boundary locations from DamID study as well as pointing several examples of known insulator positions, which are not present in our positive training set, but are nonetheless correctly predicted by our approach. Given the results presented in this paper, we are con¯dent that such supervised approaches as the one we present will gain in popularity among the biological community. In our opinion such approaches can be useful in the studies on chromatin modularity and are not limited to model organisms as soon as the quality of the available Hi-C datasets for these species improves to the quality currently available for the fruit°y. Author's Contributions BW conceptualized the approach, suggested the datasets and methods to use and supervised the experiments. PB performed all the experiments and made all the ¯gures. Both authors contributed to writing and accepted the ¯nal text of the manuscript. 1442006-14

Supervised learning method for predicting chromatin boundary associated insulator elements

Acknowledgment This work was supported by the Polish National Center for Research and Development grant No. ERA-NET-NEURON/10/2013.

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by 69.70.201.226 on 01/24/15. For personal use only.

References 1. Bell AC, West AG, Felsenfeld G, Insulators and boundaries: Versatile regulatory elements in the eukaryotic genome, Science 291(5503):447–450, 2001. 2. Filion GJ, Van Bemmel JG, Braunschweig U et al., Systematic protein location mapping reveals ¯ve principal chromatin types in drosophila cells, Cell 143(2):212–224, 2010. 3. Vasanthi D, Mishra RK, Epigenetic regulation of genes during development: A conserved theme from °ies to mammals, J Genetics Genomics 35(7):413–429, 2008. 4. Kornberg RD, Chromatin structure: A repeating unit of histones and DNA, Science 184 (4139):868–871, 1974. 5. Shanker V, Bhatia S, Utility of sex chromatin predictor of genetic merit in animal production a review, World Review of Animal Production, Vol XIX, No. 1, 1903. 6. Hanson CV, Shen CK, Hearst JE, Cross-linking of DNA in situ as a probe for chromatin structure, Science 193(4247):62–64, 1976. 7. Oudet P, Gross-Bellard M, Chambon P, Electron microscopic and biochemical evidence that chromatin structure is a repeating unit, Cell 4(4):281–300, 1975. 8. Sexton T, Ya®e E, Kenigsberg E et al., Three-dimensional folding and functional organization principles of the drosophila genome, Cell 148:458–472, 2012. 9. Nagano T, Lubling Y, Stevens TJ, Schoenfelder S, Ya®e E, Dean W, Laue ED, Tanay A, Fraser P, Single-cell hi-c reveals cell-to-cell variability in chromosome structure, Nature 502(7469):59–64, 2013. 10. Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, Hu M, Liu JS, Ren B, Topological domains in mammalian genomes identi¯ed by analysis of chromatin interactions, Nature 485(7398):376–380, 2012. 11. Wilczyński B, Hvidsten TR, A computer scientist's guide to the regulatory genome, Fundamenta Informaticae 103(1):323–332, 2010. 12. Wilczyński B, Liu Y-H, Yeo ZX, Furlong EEM, Predicting spatial and temporal gene expression using an integrative model of transcription factor occupancy and chromatin state, PLoS Comput Biol 8(12):e1002798, 2012. 13. Ghavi-Helm Y, Klein FA, Pakozdi T, Ciglar L, Noordermeer D, Huber W, Furlong EEM, Enhancer loops appear stable during development and are associated with paused polymerase, Nature 512(7512):96–100, 2014. 14. Wilczyński B, Furlong EEM, Challenges for modeling global gene regulatory networks during development: Insights from Drosophila, Developmental Biol 340(2):161–169, 2010. 15. Phillips-Cremins JE, Sauria MEG, Sanyal A et al., Architectural protein subclasses shape 3d organization of genomes during lineage commitment, Cell 153(6):1281–1295, 2013. 16. Roy S, Ernst J, Kharchenko PV et al., Identi¯cation of functional elements and regulatory circuits by drosophila modencode, Science 330(6012):1787–1797, 2010. 17. Ernst J, Kellis M, Discovery and characterization of chromatin states for systematic annotation of the human genome, Nat Biotechnol 28:817–825, 2010. 18. Srinivasan A, Mishra RK, Chromatin domain boundary element search tool for drosophila, Nucleic Acids Res 40:4385–4395, 2012. 19. Wilczyński B, Dojer N, Bn¯nder: Exact and e±cient method for learning bayesian networks, Bioinformatics 25(2):286–287, 2009. 20. Breiman L, Random forests, Mach Learn 45(1):5–32, 2001. 1442006-15

21. Dojer N, Bednarz P, Podsiadło A, Wilczyński B, Bn¯nder2: Faster bayesian network learning and bayesian classi¯cation, Bioinformatics 29(16):2068–2070, 2013. 22. Erwin GD, Truty RM, Kostka D, Pollard KS, Capra JA, Integrating diverse datasets improves developmental enhancer prediction. arXiv preprint arXiv:1309.7382, 2013. 23. Podsiadło A, Wrzesień M, Paja W, Rudnicki W, Wilczyński B, Active enhancer positions can be accurately predicted from chromatin marks and collective sequence motif data, BMC Syst Biol 7(Suppl 6):S16, 2013. 24. Negre N, Brown CD, Shah PK et al., A comprehensive map of insulator elements for the drosophila genome, PLoS Genetics 6(1):e1000814, 2010. 25. Kursa MB, Rudnicki WR, Feature selection with the boruta package, 2010. 26. Belozerov VE, Majumder P, Shen P, Cai HN, A novel boundary element may facilitate independent gene regulation in the antennapedia complex of drosophila, The EMBO J 22(12):3113–3121, 2003. 27. Azpiazu N, Frasch M, Tinman and bagpipe: Two homeo box genes that determine cell fates in the dorsal mesoderm of drosophila, Genes Development 7(7b):1325–1340, 1993. 28. Wilczyński B, Furlong EEM, Dynamic CRM occupancy re°ects a temporal map of developmental progression, Molecular Syst Biol 6(1):383, 2010. 29. Chapman B, Chang J, Biopython: Python tools for computational biology, ACM SIGBIO Newslett 20(2):15–19, 2000. 30. de Hoon MJ, Imoto S, Nolan J, Miyano S, Open source clustering software, Bioinformatics 20(9):1453–1454, 2004. 31. Pedregosa F, Varoquaux G, Gramfort A et al., Scikit-learn: Machine learning in python, The J Mach Learning Res 12:2825–2830, 2011.

Pawel Bednarz received his M.Sc. degree in Computer Science from the University of Warsaw, Poland in 2012. Now he is a Ph.D. student working there with Bartek Wilczyński on computational models of gene regulation in development. /

J. Bioinform. Comput. Biol. 2014.12. Downloaded from www.worldscientific.com by 69.70.201.226 on 01/24/15. For personal use only.

P. Bednarz & B. Wilczy n  ski

Bartek Wilczy nski received his M.Sc. degree in Computer Science from the University of Warsaw in 2003 and Ph.D. in Mathematics from the Institute of Mathematics, Polish Academy of Sciences in 2009. Since 2002, when he worked as a research scholar at the Lawrence Livermore National Laboratory he is interested in computational models of gene regulation. Since then, he has approached that problem from many angles, beginning with regulatory sequence analyses in yeast, then modeling simple regulatory networks with stochastic models and now switching to inference of regulatory sequence state from chromatin modi¯cation data. He is now an assistant professor in Computer Science department at the University of Warsaw.

1442006-16

Supervised learning method for predicting chromatin boundary associated insulator elements.

In eukaryotic cells, the DNA material is densely packed inside the nucleus in the form of a DNA-protein complex structure called chromatin. Since the ...
5MB Sizes 0 Downloads 6 Views