IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 18, NO. 3, MAY 2014


A Composite Framework for the Statistical Analysis of Epidemiological DNA Methylation Data with the Infinium Human Methylation 450K BeadChip

Ioannis Valavanis, Member, IEEE, Emmanouil G. Sifakis, Member, IEEE, Panagiotis Georgiadis, Soterios Kyrtopoulos, and Aristotelis A. Chatziioannou, Member, IEEE

Abstract—High-throughput DNA methylation profiling with microarray technologies provides a wealth of data, which in turn calls for rigorous, generic analytical pipelines for efficient systems-level analysis and interpretation. In this study, we use Illumina’s Infinium Human Methylation 450K BeadChip platform in an epidemiological cohort, aiming to associate methylation patterns with breast cancer predisposition. The proposed computational framework builds on the logarithmic ratio of the methylated versus the unmethylated signal intensities, known as the M-value and well established in transcriptomic microarrays. Moreover, an intensity-based correction of the M-signal distribution is introduced to correct for batch effects and probe-specific errors in intensity measurements. This is accomplished by estimating intensity-related error measures from quality control samples included in each chip. In addition, robust statistical measures that exploit the coefficient of variation of DNA methylation measurements between control and case samples alleviate the impact of technical variation. The results presented here are compared with those derived by applying classical preprocessing and statistical selection methodologies. Overall, the superior performance of the proposed framework in terms of technical bias correction, along with its generic character, supports its suitability for various microarray technologies.

Index Terms—Bootstrap correction, DNA methylation profiling, epigenomic analysis, intensity-based normalization, microarrays, statistical selection.

I. INTRODUCTION

EPIGENETIC methylation events are heritable modifications that regulate gene expression without altering the DNA sequence itself. Their participation in the molecular regulatory mechanisms governing a wide range of biological processes [1] is ever increasing. Irregular methylation patterns have been correlated with certain cancers; specifically, hypermethylation of CpG islands at tumor suppressor genes switches off these genes. CpG islands are typically 300–3000 base pairs in length and lie within or near approximately 40% of the promoters of mammalian genes [2]. DNA methylation, the first epigenetic alteration to be observed in cancer cells, may be

Manuscript received March 30, 2013. Date of publication January 9, 2014; date of current version May 1, 2014. This work has been funded by the European Union, Grant Agreement 226756. The authors are with the Institute of Biology, Medicinal Chemistry & Biotechnology, National Hellenic Research Foundation, 11635 Athens, Greece (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; achatzi@eie.gr). Digital Object Identifier 10.1109/JBHI.2014.2298351

altered by a number of factors such as aging of tissues, nutrition, and environment [3]–[5]. In recent years, rapid progress in microarray technologies, which enable a larger number of DNA/RNA transcripts to be interrogated more efficiently and at lower cost, has opened new avenues for epigenetic monitoring [6]. Specifically, two broad microarray-based assay categories have been developed to measure DNA methylation, i.e., enrichment-based microarrays and bisulfite sequencing microarrays, with the latter mainly adopted by Illumina [7]. One of the newest microarray platforms is Illumina’s Infinium Human Methylation 450K BeadChip, which can detect CpG methylation changes in more than 480 000 cytosines distributed over the whole genome [8].

Despite these developments in microarray technologies, inherent technical problems obscure the estimation of the true biological signal by introducing measurement bias. The idiosyncratic nature of DNA methylation data imposes serious restrictions on porting popular statistical tools and methodologies developed for transcriptomic analysis, which renders well-established approaches for transcriptomic data inapplicable in their original form [7], [9]. Therefore, the selection of appropriate preprocessing and analysis methods for bisulfite sequencing microarray datasets currently entails significant research challenges [7].

Various preprocessing and analysis methods have been proposed in the literature to date. In [10], researchers eliminated almost all unwanted variability by applying a multivariate regression model and adjusting for batch effects using the Illumina BeadArray control probes. Moreover, Aryee et al. developed a more universal normalization strategy tailored to DNA methylation data, as well as an empirical Bayes percentage methylation estimator, which together yielded accurate absolute methylation estimates [11].
They used within-sample and between-sample normalization approaches based on platform-specific control features, which were used for loess regression fitting [12], [13] and subset quantile normalization [14], respectively. They proposed that the control-probe loess procedure may be applied to other two-color tiling array DNA methylation protocols, while subset quantile normalization is even more widely applicable, as it is not tied to microarray data [11]. Also, Sabbah et al. implemented “SMETHILLIUM,” a nonparametric spatial normalization method for Illumina’s Human Methylation 27K BeadChip [15]. Later, in a similar fashion to Aryee et al., Sun et al. recommended empirical Bayes


818

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 18, NO. 3, MAY 2014

correction along with normalization for effective batch effect removal [16]. Finally, Wang et al. designed and developed the computational package “IMA,” specially tailored to Illumina’s Infinium Human Methylation 450K BeadChip, in order to automate exploratory analysis in epigenetic studies [17].

The challenges related to the efficient processing of the high-throughput DNA methylation microarray data produced by the novel, high-density Illumina BeadChip [8] remain open, and no “gold-standard” method has been established to date. To tackle the issue of technical biases, we propose here a new generic framework for the preprocessing and analysis of such voluminous datasets. The framework is applied to a pool of 96 control and case samples in order to examine breast cancer manifestation retrospectively. DNA methylation is measured through the log2 ratio of the two channel intensities per probe (methylated and unmethylated). This ratio, referred to as the M-value, is widely used in transcriptomic microarray analysis. A correction of the M-signal distribution is introduced that uses intensity-related error measures. These error estimators are based on the variation of the DNA methylation measurements of the same biological samples, referred to as quality control (QC) samples. The experimental design accommodates the fine screening of DNA methylation in CpG-rich regions, which are probed in terms of their differential methylation between control and case samples. Additionally, bootstrap-based statistical estimates are computed on the distributions of the scaled coefficient of variation (CV) of the methylation measurements. The scaled CV is defined as the ratio of the CV of the control and case samples to that of the respective QCs. In this way, a robust measure of the true variation, relative to the technical one, is derived by exploiting the methylation measured in the QC samples.
Standard statistical methods for differential analysis are applied to the whole dataset, which consists of cases (subjects with breast cancer) and controls (healthy subjects). The selected subsets of candidate differentially methylated CpG sites are compared to each other. In addition, these subsets are compared with those obtained when traditional choices for data preprocessing and statistical selection are applied.

II. DATASET

High-throughput DNA methylation analysis using Illumina’s Infinium technology was first introduced with the Infinium Human Methylation 27K BeadChip [18]. The dataset studied here contains methylation data extracted using a more recent chip, i.e., Illumina’s Infinium Human Methylation 450K BeadChip, which includes 485 577 probes (482 421 CpG sites, 3091 non-CpG sites, and 65 random SNPs). The available breast cancer dataset encompassed 114 samples, organized in 12-sample chips (along with additional samples). Ninety-six (96) samples correspond to breast cancer cases and controls, with controls matched to cases in terms of age, body mass index, and pre/post-menopausal status. The remaining 18 samples are QC samples, which correspond to the same sample measured in different chips and can therefore be used for reliable estimation of the technical variation observed in the dataset. All 114 samples were organized in 17 arrays, 15 of which contained 1–2 QC samples.

The remaining two arrays contained no QCs. For all samples, blood specimens were obtained from cases and controls. At the probe level, two channels, corresponding to the methylated and the unmethylated state, are used to measure the average methylation of the corresponding CpG site. Two channel intensities are thus available for each probe: $I_{\text{Meth}}$ and $I_{\text{Un-Meth}}$, respectively. To ensure that signals not detected at a statistically significant level are excluded from further computations, a detection p-value criterion (p ≤ 0.05) is applied.

III. METHODS

In this section, the computational and statistical aspects of the analysis are described. First, the methylation measures applied in this paper are presented, along with a novel method for normalizing the corresponding distribution based on intensity measurements. Then, the statistical modules are introduced, i.e., measures of the scaled CV of methylation, statistical tests, and bootstrap-based corrections of the p-values derived from these statistics.

A. Methylation Measures

To date, two measures are used to quantify DNA methylation. The first is the Beta-value, defined as $\mathrm{Beta} = I_{\text{Meth}} / (I_{\text{Meth}} + I_{\text{Un-Meth}})$. The Beta-value ranges from 0 to 1 and has a direct biological interpretation, as it corresponds roughly to the percentage of a site that is methylated. The second is the M-value [19], widely used in gene expression microarray analysis and defined as $M = \log_2 (I_{\text{Meth}} / I_{\text{Un-Meth}})$. The M-value is statistically more valid than the Beta-value in differential and other statistical analyses, as it is approximately homoscedastic [19], and is thus used here.

B. Intensity-Based Normalization

In this paper, the M-signal distribution is normalized 1) by taking into account the average intensity level of both channels, $I = 0.5 \times \log_2 (I_{\text{Meth}} \times I_{\text{Un-Meth}})$, and 2) by using the QCs incorporated in each chip.
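The quantities defined above, together with the percentile-based within-chip step that the text goes on to describe, can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the percentile bins are formed by rank within a single sample, and the probe error is taken to be the deviation of a chip's QC M-value from the probe's average M over all QC samples. The paper does not spell out these details.

```python
from math import log2
from statistics import mean

def beta_value(i_meth, i_unmeth):
    """Beta = I_Meth / (I_Meth + I_Un-Meth), roughly the methylated fraction."""
    return i_meth / (i_meth + i_unmeth)

def m_value(i_meth, i_unmeth):
    """M = log2(I_Meth / I_Un-Meth)."""
    return log2(i_meth / i_unmeth)

def avg_intensity(i_meth, i_unmeth):
    """I = 0.5 * log2(I_Meth * I_Un-Meth), the average log2 intensity."""
    return 0.5 * log2(i_meth * i_unmeth)

def percentile_bins(intensities, n_bins=100):
    """Rank-based bin index (0 .. n_bins - 1) of each probe's intensity.

    ASSUMPTION: bins are formed by rank within one sample.
    """
    order = sorted(range(len(intensities)), key=lambda k: intensities[k])
    bins = [0] * len(intensities)
    for rank, k in enumerate(order):
        bins[k] = rank * n_bins // len(intensities)
    return bins

def bin_errors(qc_m, qc_ref_m, qc_bins, n_bins=100):
    """Mean deviation of a chip's QC sample from the QC reference, per bin.

    ASSUMPTION: the probe error is the QC M-value minus the probe's average
    M over all QC samples (qc_ref_m). Empty bins get an error of 0.
    """
    per_bin = [[] for _ in range(n_bins)]
    for m, ref, b in zip(qc_m, qc_ref_m, qc_bins):
        per_bin[b].append(m - ref)
    return [mean(devs) if devs else 0.0 for devs in per_bin]

def correct_sample(sample_m, sample_bins, errors):
    """Subtract the error of each probe's intensity bin from its M-value."""
    return [m - errors[b] for m, b in zip(sample_m, sample_bins)]

# Toy example: four probes, two intensity bins.
bins = percentile_bins([7.0, 9.5, 11.0, 13.2], n_bins=2)  # [0, 0, 1, 1]
errors = bin_errors([1.0, 1.2, 2.0, 2.4], [0.9, 1.1, 2.3, 2.7], bins, n_bins=2)
corrected = correct_sample([0.5, 0.5, 0.5, 0.5], bins, errors)
```

In the toy example the QC sample reads about 0.1 too high in the low-intensity bin and about 0.3 too low in the high-intensity bin, so the corrected M-values become roughly 0.4 and 0.8, respectively. Note also that the two measures are linked by M = log2(Beta / (1 − Beta)).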
The impact of technical bias on the signal estimates is thus mitigated. The normalization takes place in two successive steps: 1) within-chip and 2) across all probes.

1) Within-Chip Normalization: An error estimator is calculated across all intensity levels after partitioning the intensity space I into percentiles. To this end, the available QC samples (1–2 per chip) are used. A probe-level error is estimated, considering the average M of all probe values in the QCs. All probes measured at a certain intensity level are used to calculate the error at this intensity level. Probe estimates for all arrays (case, control, and QC samples) are then updated: the corrected M-value of a probe is obtained by subtracting the error calculated for the corresponding intensity level. Algorithmic steps at this stage are presented in the frame below.

For a chip containing no QC sample, the chip is assigned the I-S error estimates of another chip that contains at least one QC sample and presents the highest similarity in its intensity distributions with the chip bearing no QC. To this end, k-means clustering of all samples across chips is employed, using the Euclidean distance between vectors of intensities for $10^4$ randomly chosen probes as the similarity measure. When applying various k parameters (k = 15), samples of the same chip cluster together, and along with samples of the two chips with no QCs, since they share similar intensity distributions. A chip without QCs is thus linked to another chip with at least one QC sample, and its signal values are then updated using the I-S error estimates of that chip. Since a chip without QCs contains 12 samples, a majority voting scheme resolves the selection of the appropriate chip, based on the similarity of the incorporated samples after completion of the k-means clustering.

2) Across All Probes Normalization: At this step, a second normalization, this time per probe, is applied, exploiting the standard deviation of the M-values across all t QCs for each probe (where t corresponds to the number of QCs per probe). The M-values of each probe across all samples are then updated by subtracting the probe-based error estimate.

C. Scaled CV Measurements of DNA Methylation

With the purpose of identifying probes that are reliable candidates for differential DNA methylation between the two sample categories (cases versus controls), a scaled CV measurement [(1), (2)] is introduced for a probe b ($CV^{\text{scaled}}_b$). This represents a robust measure of the real interclass variability observed for probe b in the whole sample pool (controls ∪ cases), when compared to that observed among QC samples, which measures solely the technical variation. The greater $CV^{\text{scaled}}_b$, the greater the real differential methylation (beyond technical signal variation). Therefore, we define

$$CV^{\text{scaled}}_b = \mathrm{abs}\left[ CV^{\text{controls} \cup \text{cases}}_b \,/\, CV^{\text{QCs}}_b \right] \tag{1}$$

$$CV^{\text{samples}(1,\ldots,k)}_b = \mathrm{std}(M_b(1,\ldots,k)) \,/\, \mathrm{mean}(M_b(1,\ldots,k)). \tag{2}$$

Based on the distribution of the CVscaled values across all probes, a z-test is applied to assign a p-value to each CVscaled value. A higher CVscaled is related to a lower p-value, implying more intense interclass variability for this probe. The distributions of CVscaled measurements and corresponding p-values further steer a bootstrap-based correction method (presented in Section III-E).

D. Differential Methylation Analysis and Statistical Selection

In order to extract statistically significant differential DNA methylation patterns between controls and cases at the probe level, a paired t-test is applied to the total of 47 case-control sample pairs that are matched using clinicopathological variables (see Section II). Moreover, an unpaired t-test was also applied as a stringent measure to test the hypothesis of sample matching. The two sets of p-values computed are further processed according to the bootstrap-based correction method presented in Section III-E.

E. Correction of p-Values Based on Bootstrap

In order to protect the statistical findings against the effect of multiple hypothesis testing, a bootstrap-based correction is followed. The aim is to examine whether the obtained p-values, either from a statistical test or extracted based on CV measurements, are indeed that extreme, or whether they could represent random false selections. The p-value in the original p-value distribution is compared to those computed from a series of p-value distributions derived by bootstrapping. The procedure is repeated for a predefined number of rounds. In each round, the original p-value is compared to the p-value distribution derived from the bootstrap distribution, and the number of times the original p-value is greater than the derived p-value is counted. This number represents the corrected p-value with respect to the original one. In order to incorporate the significance of the initial statistics, we designated a Bayesian-type product term that multiplies the p-value estimate derived by bootstrapping with that of the original statistical test. The calculated value is normalized by the average estimate of this product over all probes. The algorithmic steps for this correction are presented in the frame below.

IV. RESULTS AND DISCUSSION
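Before the results are discussed, it may help to see the selection statistics of Sections III-C and III-E in executable form. The following Python sketch is a minimal illustration under stated assumptions, not the authors' implementation. The function names are hypothetical, the sample standard deviation is assumed in (2), and `bootstrap_correct` is only one possible reading of the bootstrap procedure: it resamples the p-value distribution with replacement, takes the fraction of rounds in which the original p-value exceeds the resampled one, multiplies it by the original p-value, and normalizes by the average product over all probes.

```python
import random
from math import erfc, sqrt
from statistics import mean, stdev

def cv(m_values):
    """Eq. (2): std / mean of one probe's M-values (sample std is an assumption)."""
    return stdev(m_values) / mean(m_values)

def scaled_cv(study_m, qc_m):
    """Eq. (1): |CV over controls and cases / CV over QC samples| for one probe."""
    return abs(cv(study_m) / cv(qc_m))

def upper_tail_p(scores):
    """z-test against the distribution of all scores: higher score, lower p."""
    mu, sd = mean(scores), stdev(scores)
    return [0.5 * erfc((s - mu) / (sd * sqrt(2.0))) for s in scores]

def bootstrap_correct(p_values, n_rounds=1000, seed=0):
    """Hypothetical reading of Section III-E, not the authors' algorithm.

    Per round, the p-value distribution is resampled with replacement and each
    original p-value is compared with the resampled value at its position.
    """
    rng = random.Random(seed)
    n = len(p_values)
    exceed = [0] * n
    for _ in range(n_rounds):
        resampled = rng.choices(p_values, k=n)
        for i, p in enumerate(p_values):
            if p > resampled[i]:
                exceed[i] += 1
    p_boot = [count / n_rounds for count in exceed]
    # Product term: bootstrap estimate times the original p-value.
    product = [pb * p for pb, p in zip(p_boot, p_values)]
    avg = mean(product)
    # Normalize by the average product over all probes (skip if it is zero).
    return [x / avg for x in product] if avg > 0 else product

# Toy usage with made-up numbers.
cv_scaled = scaled_cv([1.0, 2.0, 3.0], [1.9, 2.0, 2.1])   # about 10
p_cv = upper_tail_p([0.0, 1.0, 2.0])
corrected = bootstrap_correct([0.001, 0.2, 0.5, 0.9])
```

Two caveats apply. M-values can have a mean close to zero, in which case the CV in (2) becomes unstable, and the sketch does not guard against this. The corrected values are also relative scores (their average is 1 by construction) rather than probabilities, which is why the selection in the text is expressed through top-percentage thresholds.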

In this section, results are presented separately for 1) the proposed methods for intensity-based normalization, scaled CV measurements, and statistical selection of differentially methylated CpG sites, and 2) comparative analyses between the introduced and the traditional methodologies for preprocessing and statistical selection.

Footnotes to the algorithm of Section III-E:
1) When p-values are derived based on the CVscaled distribution, CVscaled measurements are also input to the algorithm.
2) When p-values are corrected based on the CVscaled, the original CVscaled distribution is bootstrapped and the comparisons in steps b.i and b.ii are performed based on CVscaled measurements (original or bootstrapped).

A. Intensity-Based Normalization, Scaled CV Measurements, and Statistical Selection of CpG Sites

Regarding the intensity-based correction of M-values, the estimated error across all intensity levels proved to be greater at lower intensity levels. This implies lower statistical power,


TABLE I NUMBER OF SELECTED PROBES PER STATISTICAL SELECTION METHOD AND THEIR INTERSECTIONS

Fig. 1. (a), (b) M -value error estimator across average intensity levels for two of the available 12-sample chips.

Fig. 2. Histogram of all CVscaled measurements ([mean, median] = [1.2582, 0.9964]).

due to increased noise influence at these levels, which calls for stronger signal-to-noise thresholds for these signals. This was a consistent finding across all 12-sample chips (two examples are illustrated in Fig. 1). As can be seen in Fig. 2, which shows the histogram of the CVscaled measurements, the impact of the normalization steps of our composite framework is such that the median of the CVscaled distribution equals 0.9964 (very close to 1). This supports the plausibility of the processing steps applied, as a CVscaled measurement of 1 implies that the variation observed between case and control samples is the same as that observed in QC samples (technical variation). Extrapolating, one can conclude that the top 50% of probes in this distribution demonstrate

higher variation than the technical one, a property that strengthens as the selection threshold is narrowed. Furthermore, if coupled with another statistical selection method (e.g., statistical selection using a paired or unpaired t-test), the CVscaled criterion can filter out probes that are unreliable in terms of signal quality, thus retaining sound candidates with respect to their actual differential DNA methylation. In this study, this strategy was adopted, resulting in selected probe subsets that were qualified by both approaches.

The two sets of p-values extracted from the conventional statistical tests (paired and unpaired t-tests) and the one from the CVscaled measurements were further corrected by bootstrap resampling ($n_{\text{boot}} = 10^5$). Regarding the statistical thresholds of both t-tests, the top 1% and top 5% of probes were adopted, using their corrected p-values. In the case of the CVscaled measurements, the CVscaled > 1 criterion as well as the top 10% of probes, ranked by the corresponding corrected p-values, were selected. In Table I, the results of the unpaired and the paired t-test selection are juxtaposed with each other and with the selection using the CVscaled criteria. The paired t-test seems to intensify the significance effect in the selection process compared to the unpaired t-test, i.e., lower p-value estimates were found for the same selection threshold (top 1% or top 5%). In this paper, the top 1% criterion in the paired t-test was combined with the CVscaled > 1 criterion, in order to derive a final robust subset of CpG sites that are both significantly differentially methylated and protected against technical variation. The two criteria together provided a subset of 2397 CpG sites (see Table I, last column) corresponding, according to the annotation of Illumina’s Infinium Human Methylation 450K BeadChip, to 1717 unique UCSC Gene IDs.

B. Comparison Between Traditional and Proposed Preprocessing and Statistical Selection Methods

Typically, to minimize extraneous variation in the measured gene expression levels, gene expression M-values from two-channel arrays are preprocessed using within-array (e.g., loess [12], [13]) and across-array (e.g., quantile [20]) normalization


TABLE II NUMBER OF SELECTED PROBES USING THE TRADITIONAL FC VERSUS THE PROPOSED SCALED CV METHOD, AND THEIR INTERSECTIONS

techniques. Then, the differentially expressed genes are selected using both fold-change and p-value cutoffs. To compare the proposed approaches for preprocessing and statistical selection with these traditional choices, we additionally applied the aforementioned procedures to our data (referred to from now on as the traditional methodology). More specifically, linear loess normalization was applied within each sample, whereas quantile normalization was applied across all samples. Statistical selection was based on a fold-change threshold of 2 and two different p-value thresholds of 0.01 and 0.05 (for both unpaired and paired t-tests).

When comparing the CVscaled distributions derived according to the traditional methodology and the one we introduce here, the statistics of the latter are larger: [mean, median] = [1.2582, 0.9964] versus [mean, median] = [1.1965, 0.9133]. This implies that the detrimental effect of technical variation is removed more effectively through our methodology.

In Table II, the results of the traditional fold-change criterion are juxtaposed with those of our scaled CV measurement criterion. In this comparison, the two criteria have been separately combined with the unpaired and the paired t-test statistical selection (bootstrap corrected), using the normalized data of the introduced methodology. It is worth noting that combining the fold change (FC) with the statistical selection significantly increases the overlap with the scaled CV plus statistical selection (100 out of 266, or 37.59%, for the unpaired 0.01 case; 96 out of 252, or 38.10%, for the paired 0.01 case; 354 out of 1002, or 35.33%, for the unpaired 0.05 case; and 329 out of 924, or 35.61%, for the paired 0.05 case) in relation to that obtained by applying the statistical selection methods alone (see Table I, the top 2 rows).
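The overlap percentages quoted above follow directly from the reported counts. As a quick arithmetic check, the short Python sketch below recomputes them; the helper name `overlap_pct` is hypothetical and the counts are copied from the text.

```python
def overlap_pct(common, selected):
    """Percentage of `selected` probes that are also picked by the other criterion."""
    return 100.0 * common / selected

# (probes in common, probes selected), as quoted in the text
fc_cases = {
    "unpaired 0.01": (100, 266),
    "paired 0.01": (96, 252),
    "unpaired 0.05": (354, 1002),
    "paired 0.05": (329, 924),
}

for name, (common, selected) in fc_cases.items():
    print(f"{name}: {overlap_pct(common, selected):.2f}%")
# unpaired 0.01: 37.59%
# paired 0.01: 38.10%
# unpaired 0.05: 35.33%
# paired 0.05: 35.61%
```

The same helper reproduces the figures for statistical selection alone: `overlap_pct(483, 4856)` is about 9.95 and `overlap_pct(478, 4856)` about 9.84.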
For the statistical selection methods alone, the observed overlap is 483 out of 4856 probes, or 9.95%, for the unpaired 0.01 case, and 478 out of 4856, or 9.84%, for the paired 0.01 case. This implies that applying the FC together with the statistical significance criterion partly alleviates the effect of noise on the selection of possible false positives. However, a high proportion of noise still remains in the candidate probes. Furthermore, in Table III, the sets of differentially methylated CpG sites derived after applying the traditional normalization procedure are juxtaposed with those derived


TABLE III NUMBER OF SELECTED PROBES USING THE PROPOSED STATISTICAL SELECTION METHOD AFTER THE TRADITIONAL PREPROCESSING METHOD VERSUS THE PROPOSED PREPROCESSING, AND THEIR INTERSECTIONS

after application of the introduced normalization (unpaired and paired t-test statistical selection with bootstrap correction and the proposed scaled CV criterion were used for the comparison). First, it is observed that the introduction of the intensity-based normalization has a significant impact on the estimation of the actual differential DNA methylation values. This substantially affects the selection procedure and results in an overlap of around 30% for the stringent (0.01) confidence limit. The overlap improves somewhat, reaching 35.5%, for the milder 0.05 level. The application of paired parametric t-tests reduces the discrepancies between the two normalization methods. Specifically, the paired t-test yields higher overlaps between the selected sets of differentially methylated probes, namely 37.03% for the stringent 0.01 level and 43.94% for the milder 0.05 one.

It should be mentioned that the intensity-based normalization, due to the very high density of the 450K Illumina chips used here, presents a very good tradeoff between 1) the need to calculate a smoothing coefficient correcting for systematic errors and 2) not overfitting the signal estimates. The fact that our intensity-based normalization method is applied in a percentile mode permits the derivation of smoothing values according to a very fine partition of the signal intensity range. At the same time, owing to the very high density, the error estimates for each percentile are derived from more than 4000 probes each, which enables precise error estimates. In addition, extensive validation of the differential methylation estimates has been carried out by pyrosequencing (unpublished results), confirming the power of the proposed signal processing methodology.

C. Application of the Proposed Framework to Another Dataset

The methodology proposed here was applied to another dataset derived with Illumina’s Infinium Human Methylation 450K BeadChip, which concerns the same disease (breast cancer) [21]. This dataset comprises 15 twin pairs (breast cancer cases and controls) but contains no QC samples. Under the assumption that the variations observed in the methylation signal distributions of the control samples share a certain degree of similarity across the majority of probes, the control samples could be used instead of QCs to derive intensity correction measures, thus extending application of the proposed normalization


strategy to a dataset with no QCs. Heyn et al. [21] used quantile normalization and, on the basis of a Wilcoxon paired test (p < 0.001), reported a set of 403 differentially methylated CpG sites. We applied the intensity-based normalization method presented here and a paired t-test followed by the bootstrap-based p-value correction method. It was found that 123 of the 403 CpG sites reported by Heyn et al. were among the top 403 CpG sites as ranked by our framework. Furthermore, 29 (9) of the top 50 (10) CpG sites ranked by our methodology were found among the 403 CpG sites reported by Heyn et al. This comparative assessment supports the applicability of the presented methodology to the analysis of DNA methylation microarray datasets of this design.

D. General Remarks—Future Work

In general, the modular methodological framework presented here is highly generic in the way it processes the population of signals. This concerns two aspects. The first is its application to the entirety of the signal population, which enables the inference of the signal measurement using robust intensity-based error estimators. The second is its modularity regarding the assessment of probe quality. This exploits a reliable measure of the variation of each signal, which, when applied to the whole distribution, removes intensity-specific bias. In this way, a normalized signal distribution across comparisons is derived, which reveals the true extent of signal variation. Reliable statistical cutoffs can therefore be defined, which are further corrected by resampling methods, in order to explore the true differential methylation. Whenever the experimental design affords QC samples (to enable the derivation of reliable technical variation measures), the methodology proposed here is straightforwardly applicable. It is also applicable to mRNA data analysis, or when no QCs are available.
Here, the methodology was applied successfully to a high-throughput DNA methylation dataset obtained with the Infinium Human Methylation 450K BeadChip, comprising breast cancer samples and controls. The dataset has been deposited in NCBI’s Gene Expression Omnibus [22] and is accessible through GEO Series accession number GSE52635 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52635). As future work, already in the implementation phase, the development of a web pipeline is envisaged to provide access to the algorithmic framework presented above. We also intend to analyze the genes related to the CpG sites finally selected here in terms of their functional content (Gene Ontology terms, KEGG pathways), aiming to promote the elucidation of the epigenomic molecular regulatory mechanisms relevant to the genesis and progression of breast cancer.

V. CONCLUSION

In this paper, a composite computational framework for the analysis of high-throughput DNA methylation data, specifically those produced by Illumina’s technological platform, is presented. In particular, an intensity-based normalization method for M-values is proposed. In addition, a scaled CV ratio term is

introduced, so as to effectively tackle the detrimental impact of technical bias when assessing real interclass variability (control versus case samples). The methods proposed here exploit the technical controls available due to the experimental design of the specific dataset. However, these methods are generic enough to accommodate other designs, even ones that dispense with QC samples. As a third step, to enhance the reliability of the statistical values derived, a bootstrap-based p-value correction algorithm is applied, either to the statistical selection results or to the scaled CV measurements. The framework is applied to a breast cancer dataset. The preliminary results seem promising, as can be surmised from the comparison of the extracted results with those obtained using traditional methods. The composite framework proposed here is generic enough to be extended to other high-throughput analysis tasks, such as those of various microarray technologies.


Ioannis Valavanis (M’05) received the Diploma degree in electrical and computer engineering from the National Technical University of Athens (N.T.U.A.), Athens, Greece, in 2003, the M.Sc. degree in bioinformatics from the University of Athens, Athens, in 2006, and the Ph.D. degree from the N.T.U.A. in 2009. He is currently with the National Hellenic Research Foundation, Institute of Biology, Medicinal Chemistry and Biotechnology, Athens, Greece, as a Postdoctoral Research Fellow in bioinformatics. He is also an Adjunct Lecturer in the Department of Informatics and Telecommunications, University of Peloponnese, Tripolis, Greece. His research interests include the development of novel artificial intelligence and data mining techniques, microarray data pre-processing and analysis, structural bioinformatics, and decision support systems. He has participated in several EU-funded and National Research initiatives and has published more than 35 journal and conference articles or book chapters.

Emmanouil G. Sifakis (M’07) received the Diploma degree in electronic and computer engineering from the Technical University of Crete, Chania, Greece, in 2004, the M.Sc. degree in biomedical engineering from the University of Patras and National Technical University of Athens (N.T.U.A.), Athens, Greece, in 2007, where he received the Ph.D. degree in 2011. He is currently with the National Hellenic Research Foundation, Institute of Biology, Medicinal Chemistry and Biotechnology, Athens, Greece, as a Postdoctoral Research Fellow in bioinformatics. His research interests include the development of novel DNA microarray data preprocessing methods, clinical bioinformatics, and biomedical signal and image processing and analysis. He has participated in several EU-funded and National Research initiatives and has published more than 15 journal and conference articles or book chapters.


Panagiotis Georgiadis received the Bachelor’s degree in chemistry from University of Athens, Athens, Greece, in 1986, and the Ph.D. degree in biochemistry from the University College London, London, U.K., in 1992. He is currently a Research Associate Professor at the Institute of Biology, Medicinal Chemistry and Biotechnology, National Hellenic Research Foundation, Athens, Greece. His research interests include the use of omics technologies in human population studies as biomarkers of environmental exposure and cancer risk, environmental carcinogenesis, and molecular epidemiology studies. He has coauthored a long series of journal and international conference papers in these fields.

Soterios Kyrtopoulos received the Bachelor’s degree in chemistry and the Ph.D. degree in enzymology, both from King’s College University, London, U.K., in 1969 and 1972, respectively. He is currently a Research Professor at the Institute of Biology, Medicinal Chemistry and Biotechnology, National Hellenic Research Foundation, Athens, Greece. He is the Head of the Laboratory of Chemical Carcinogenesis and Genetic Toxicology and the Unit of Environmental Toxicology. His research interests include the role of DNA damage and repair in the mechanism of chemical carcinogenesis, mechanism-based assessment of carcinogenic hazards and risks, and environmental health. He has coauthored a long series of journal and international conference papers in these fields.

Aristotelis A. Chatziioannou (M’10) received the Diploma degree in electrical and computer engineering and the Ph.D. degree in metabolic engineering and biomedical informatics from the National Technical University of Athens, Athens, Greece, in 1996 and 2005, respectively. He has also been active in the field of industrial consulting. From June 2004 to August 2005, he was with Alexander Fleming Biomedical Science Research Centre, as an expert in bioinformatics. Since September 2005, he has been a member of the Institute of Biology, Medicinal Chemistry and Biotechnology, National Hellenic Research Foundation, Athens, and since 2010 he has been the Principal Investigator of the Metabolic Engineering and Bioinformatics Program. His current research interests include in-silico biology, knowledge mining in life sciences, genomic analysis, molecular pathway analysis, biological ontologies, bioinformatics, biostatistics, metabolic engineering, computational biology, and biomedical image processing. He is the author or coauthor of more than 100 journal and international conference papers in these fields.
