A Bayesian Network-based Approach for Discovering Oral Cancer Candidate Biomarkers

Konstantina Kourou, Konstantinos P. Exarchos, Costas Papaloukas, and Dimitrios I. Fotiadis, Senior Member, IEEE

D.I. Fotiadis is with the Unit of Medical Technology and Intelligent Information Systems, Dept. of Materials Science and Engineering, University of Ioannina, GR 45110, Ioannina, Greece, and with the Foundation for Research and Technology-Hellas (FORTH) (corresponding author phone: +302651009006; fax: +302651008889; e-mail: [email protected]).
K. Kourou is with the Unit of Biological Applications and Technology, University of Ioannina, GR 45110, Ioannina, Greece (e-mail: [email protected]).
K.P. Exarchos is with the Unit of Medical Technology and Intelligent Information Systems, Dept. of Materials Science and Engineering, University of Ioannina, GR 45110, Ioannina, Greece (e-mail: [email protected]).
C. Papaloukas is with the Unit of Biological Applications and Technology, University of Ioannina, GR 45110, Ioannina, Greece (e-mail: [email protected]).

978-1-4244-9270-1/15/$31.00 ©2015 IEEE

Abstract—Oral cancer can arise in the head and neck region. Owing to the aggressive nature of the disease, which often leads to poor prognosis, Oral Squamous Cell Carcinoma (OSCC) ranks as the 8th most common neoplasm in humans. In the present work we formulate a gene interaction network from oral cancer genomic data using Dynamic Bayesian Networks (DBNs). Four modules were extracted after applying a clustering technique to the network. We subsequently explored them by applying topological and functional analysis methods in order to identify significant network nodes. Our analysis revealed that these important nodes may correspond to candidate biomarkers of the disease.

Index Terms—Oral cancer, gene expression data, DBNs, gene interaction networks

INTRODUCTION

Oral cancer is also referred to as head and neck cancer. It can arise in the head and neck region, i.e., in any part of the oral cavity or oropharynx. Owing to the aggressive nature of the disease, OSCC ranks as the 8th most frequent neoplasm in humans [1]. A number of causative factors of oral cancer have been listed, with tobacco and excessive alcohol consumption being the predominant risk factors for the development of the disease [1]. Moreover, the high rates of locoregional recurrence in oral cancer reveal that early identification of a disease relapse can prove very valuable for the patient's prognosis and treatment [2]. Although recent advances in clinical trials may result in the successful elimination of the disease [3], OSCC still exhibits high recurrence rates.

Current research focuses on providing more targeted treatment based on the cancer genetic map of each patient. It is generally argued that cancer arises from changes in parts of the genome, i.e., genes, and that several gene mutations can cause the development of the disease. Thus, the ability to explore the interactions of the molecules and genes involved in cancer progression can lead to improved cure rates. This approach is known as personalized cancer treatment. The development of genetic tests for identifying groups at higher and lower risk of cancer recurrence has been suggested as a means of improving treatment rates. Apart from that, the extensive use of computational and statistical methods for exploring the genetic changes within biological pathways can give rise to better knowledge regarding the process of a complex disease, such as OSCC. Additionally, the accurate extraction of predictive models from experimental data through computational techniques has proven beneficial for providing patient-specific treatment.

To this end, several studies in the literature have applied data analysis methods to microarray cancer data sets, aiming to (i) detect differentially expressed genes, and (ii) identify gene interactions within molecular pathways. Specifically, in [4-6] the authors have integrated genomic data, pathway knowledge and network analysis for obtaining genes of interest. In a similar manner, in [7] gene expression profiling is performed for identifying possible biomarkers in OSCC. In the same context, in [8, 9] the profiles of differentially expressed genes are studied with the aim of providing accurate predictions for the development or metastasis of the disease. In [10], a network-based approach is proposed for identifying genes in cancer genomic data related to disease outcomes. These prognostic markers are derived by means of reliable functional interaction networks. In [11-13], computational analysis of time series gene expression data is performed through the use of DBNs. The exploitation of time course data for learning the structure of a gene network allows for the identification of gene interactions that may prove crucial for disease outcomes. Furthermore, the integration of genomic data with network knowledge allows for the identification of biomarkers not only as individual genes but as functional hubs as well [14, 15].

In this work, we propose a framework for learning the structure of the oral cancer genetic network based on DBNs. Time series gene expression data are exploited in order to identify the most significant genes and to formulate gene interaction networks. Following a genomic analysis based on computational methods, DBNs can be used for constructing the causal relationships among the variables within the same time-slice and between consecutive time-slices. We have formulated three DBNs: (i) one for the total number of patients, (ii) one for the group of patients that had suffered a relapse and (iii) one for those that had not been diagnosed with a disease relapse. The resulting gene networks are then analyzed in terms of topological and functional parameters. The accurate prediction of gene interactions related to oral cancer can determine possible connections of disease states during different time slices. Moreover, the integration of time-course gene expression values with network knowledge allowed us to discover important nodes related to oral cancer recurrence.

MATERIALS AND METHODS

A. OSCC Dataset

Figure 1 depicts the schematic representation of the proposed methodology. In this study we consider 23 patients [NeoMark project, FP7-ICT-2007-224483, ICT enabled prediction of cancer recurrence]. After these patients had been diagnosed with OSCC and had reached complete remission, genomic data from circulating blood cells were collected for each patient at the baseline state and during scheduled visits at consecutive time intervals. The patients were then divided into two groups, namely relapsers and non-relapsers, based on whether or not a disease relapse occurred during the follow-up period. More specifically, 12 out of 23 patients had suffered a recurrence, while the remaining 11 were still disease-free during the follow-up period. Initially, the genomic data are subjected to preprocessing steps in order to remove any systematic variations [16]. After the employment of an algorithm for microarray analysis, a set of differentially expressed genes was obtained and the quality of our dataset was enhanced [16]. The retained genes (n=9), along with those that were extracted from the literature as oral cancer risk associated genes (n=28), were then fed as input to the next step of our methodology, which aims to infer their interaction network.

Figure 1. Schematic representation of the proposed methodology.

B. Methods

Dataset formulation

The initial blood genomic data file consisted of 45,015 gene expression values for each patient. During preprocessing, duplicate and control features were removed, as well as genes of low quality or with high rates of missing values. The output was a set of 33,491 genes, which were then fed as input to the Significance Analysis of Microarrays (SAM) algorithm in order to identify a limited subset of the most differentially expressed genes [17]. SAM searches for genes that differ significantly in terms of their expression during the follow-up period. The final gene list was determined by the False Discovery Rate (FDR) of the gene expression values between the two groups of patients. TABLE I lists the most significant genes as pinpointed by applying the SAM algorithm to our initial data set. TABLE II lists the genes that have been selected as oral cancer risk associated genes in the literature [18, 19]. The genes from both tables, along with their expression values, were used in the next steps of our methodology.

TABLE I. List of the most significant genes after the employment of the SAM algorithm to our initial genomic dataset.

HMCN1         RGMA          TSC1
AK023526      NOTCH2        STX6
THC2447689    THC2344152    LEPRE1

TABLE II. List of the most significant genes identified in the literature as associated with the progression of oral cancer.

LOC401010          FKBPL              SMARCC2            CIDEB
CLDN1              ANP32A             GALNT6             LOC492303
CRYAA              KIAA1033           PRKAR2A            AP1G2
PKD2               THC2280373         BF368414           ENST00000344339
BQ333643           A_24_P170365       LOC389786          A_32P133402
C21orf87           GTF2H4             C1orf144           LOC644276
NP285481           LIG3               ZNF205             C17orf71
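The SAM ranking step described in the dataset formulation above can be illustrated with a minimal sketch of the relative difference statistic of Tusher et al. [17], d = (mean_a - mean_b) / (s + s0). The gene names, expression values and the value of s0 below are hypothetical, and the permutation-based FDR estimate that SAM uses to fix the final cut-off is deliberately left out. This is not the exact pipeline used in the study.

```python
from math import sqrt
from statistics import mean, variance


def sam_relative_difference(group_a, group_b, s0=0.0):
    """Return d = (mean_a - mean_b) / (s + s0) for a single gene.

    s is the pooled standard error of the difference in means;
    s0 is a small positive constant (a free parameter in this sketch).
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    s = sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean(group_a) - mean(group_b)) / (s + s0)


# Hypothetical expression values: (group A, group B) per gene.
toy_expression = {
    "gene_A": ([5.1, 5.3, 5.0, 5.2], [3.0, 3.2, 2.9, 3.1]),
    "gene_B": ([4.0, 4.2, 3.9, 4.1], [4.1, 3.9, 4.0, 4.2]),
}

scores = {gene: sam_relative_difference(a, b, s0=0.1)
          for gene, (a, b) in toy_expression.items()}

# Genes ranked by the magnitude of their relative difference.
ranked = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)
```

In a full analysis the ranked list would be cut at a threshold chosen from the estimated FDR rather than at a fixed number of genes.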

Dynamic Bayesian Networks

After the dataset of the 37 genes for each one of the 23 patients had been formulated, it was fed as input to the next step of our analysis in order to learn the structure of the gene interaction network. It should be clarified that the input file for our analysis contains the expression values of 37 genes in two time-slices, i.e., (i) the baseline (t) and (ii) the follow-up (t+1). BNFinder2 [20] was employed to infer the structure of the networks from our experimental expression data. DBNs can be considered as temporal extensions of Bayesian Networks (BNs) [21]. They can be used for "exploring" biological networks in terms of temporal changes of nodes (genes and proteins), as well as the formation of new nodes or the removal of existing ones over time-slices. The employment of DBNs to formulate the causal relationships among variables can be defined as the process of inferring the possible interactions between genes from experimental genomic data through computational analysis. As mentioned above, we have constructed gene networks using DBNs, thus producing directed acyclic graphs (DAGs). These networks were then fed as input to Cytoscape [22] for visualization purposes, as well as for further functional and topological analysis. Regarding the gene networks inferred separately for each group of patients, we subsequently mapped their interactions and extracted the significant nodes in terms of topological analysis.

For the total gene dataset, i.e., the 37 genes with their expression values in two consecutive time-slices for the 23 oral cancer patients, a DBN was formulated in order to infer their interactions over time. Furthermore, the functional and topological analysis led to better insight into the mechanisms underlying their interactions. We further exploited the resulting information for the identification of important nodes related to oral cancer progression. After the employment of a clustering technique [10], four modules were found. Figure 2 depicts the output of the DBN approach along with the identified modules. Each module was functionally analyzed in terms of the GO biological process domain. A number of different biological processes were found in each module. Protein stabilization, placenta blood vessel development, determination of left/right symmetry, nucleotide-excision repair and cell cycle arrest were the biological processes with the highest number of associated genes in the network.
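The idea of learning a two-slice DBN structure, as described above, can be sketched as a score-based search: for every gene at t+1, choose the set of parent genes at t that maximizes a scoring function. The sketch below is an illustration under stated assumptions and is not BNFinder2's algorithm or API: expression values are assumed to have been discretized to 0/1 beforehand, the score is a simple BIC, only inter-slice edges are considered, and the gene names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations, product
from math import log


def bic_score(child, parents, samples):
    """BIC of a binary child at t+1 given binary parents at t.

    samples is a list of (state_t, state_t1) pairs, each a dict gene -> 0/1.
    """
    joint = Counter()     # counts of (parent configuration, child value)
    marginal = Counter()  # counts of parent configuration
    for before, after in samples:
        config = tuple(before[p] for p in parents)
        joint[(config, after[child])] += 1
        marginal[config] += 1
    log_lik = sum(count * log(count / marginal[config])
                  for (config, _), count in joint.items())
    n_params = 2 ** len(parents)  # one free probability per configuration
    return log_lik - 0.5 * n_params * log(len(samples))


def learn_two_slice_dbn(genes, samples, max_parents=2):
    """Return inter-slice edges (parent at t, child at t+1) with best BIC."""
    edges = []
    for child in genes:
        candidates = [parent_set
                      for size in range(max_parents + 1)
                      for parent_set in combinations(genes, size)]
        best = max(candidates, key=lambda ps: bic_score(child, ps, samples))
        edges.extend((parent, child) for parent in best)
    return edges


# Hypothetical toy data: g2 at t+1 copies g1 at t; g1 and g3 at t+1 do not
# depend on the previous slice.
genes = ["g1", "g2", "g3"]
samples = []
for rep in (0, 1):
    for values in product((0, 1), repeat=3):
        before = dict(zip(genes, values))
        after = {"g1": rep, "g2": before["g1"], "g3": rep}
        samples.append((before, after))

edges = learn_two_slice_dbn(genes, samples)
```

The exhaustive search over parent sets is only feasible for small `max_parents`; with 37 genes and 23 patients the penalty term matters a great deal, which is one reason dedicated tools are preferable in practice.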

Furthermore, the degree index was calculated for the nodes of each module in the network. The degree is a topological parameter and refers to the number of nodes that have direct connections to a given node. Nodes with the highest degree of connectivity are called hubs. In our gene interaction network the node with the highest degree in module_0 is the RGMA gene. Additionally, in module_1, module_2 and module_3 the most highly connected nodes are the LOC492303, LOC401010 and C1orf144 genes, respectively. It should be noted that in module_1 the node with the next highest degree of connectivity, after the LOC492303 gene, is HMCN1. This node, along with RGMA from module_0, was pinpointed as differentially expressed between relapsers and non-relapsers in our preprocessing analysis. Thus, we further studied their interactions with the remaining seven important genes from our dataset after the employment of DBNs. For the most significant genes of our initial genomic dataset, as pinpointed by the SAM algorithm, we formulated two DBNs, one for each of the two patient groups, i.e., relapsers and non-relapsers. Figure 3 and Figure 4 represent the gene interaction networks for the two groups of patients, respectively.
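The degree and hub definitions given above can be made concrete with a short sketch. It treats the network as undirected for counting purposes, which is an assumption, and the node names and edges are hypothetical placeholders rather than interactions from the inferred network.

```python
from collections import defaultdict


def node_degrees(edges):
    """Map each node to the number of distinct nodes directly connected to it."""
    neighbors = defaultdict(set)
    for source, target in edges:
        if source != target:  # ignore self-loops when counting neighbors
            neighbors[source].add(target)
            neighbors[target].add(source)
    return {node: len(adjacent) for node, adjacent in neighbors.items()}


def hubs(edges):
    """Return the node(s) with the highest degree, sorted by name."""
    degrees = node_degrees(edges)
    top = max(degrees.values())
    return sorted(node for node, degree in degrees.items() if degree == top)


# Hypothetical directed edges (parent, child).
toy_edges = [("n1", "n2"), ("n1", "n3"), ("n4", "n1"), ("n2", "n3")]
degrees = node_degrees(toy_edges)
top_nodes = hubs(toy_edges)
```

Applying `hubs` to the edge list of each module separately would yield one hub (or several tied hubs) per module.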

Figure 2. The resulting DBN architecture. Nodes with the same color belong to the same module.

Figure 3. The constructed DBN for the patients that had suffered a relapse during the follow-up period.

RESULTS AND DISCUSSION

Topological parameters were also calculated for each module. TABLE III depicts the topological differences among the four modules of the gene interaction network.

TABLE III. Topological parameters of each module in the gene interaction network.

Parameters                  Module 0   Module 1   Module 2   Module 3
Clustering coefficient      0.096      0.112      0.107      0.084
Connected components        1          1          1          1
Avg. number of neighbors    8.4        8.4        7.625      6.4
Number of nodes             13         13         8          5
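The parameters reported in TABLE III can be computed from an edge list as sketched below. The sketch uses the standard undirected definitions (average local clustering coefficient, number of connected components, mean number of neighbors); whether these match the exact settings used for the table is an assumption, and the toy edges are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets built from (source, target) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set())
        adjacency.setdefault(b, set())
        if a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def average_clustering(adjacency):
    """Mean over all nodes of 2 * links_between_neighbors / (k * (k - 1))."""
    total = 0.0
    for neighbors in adjacency.values():
        k = len(neighbors)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        links = sum(1 for u, v in combinations(neighbors, 2)
                    if v in adjacency[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adjacency)


def connected_components(adjacency):
    """Count connected components with an iterative depth-first search."""
    seen, count = set(), 0
    for start in adjacency:
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(adjacency[node] - seen)
    return count


def topology_summary(edges):
    adjacency = build_adjacency(edges)
    return {
        "clustering_coefficient": average_clustering(adjacency),
        "connected_components": connected_components(adjacency),
        "avg_neighbors": sum(len(n) for n in adjacency.values()) / len(adjacency),
        "nodes": len(adjacency),
    }


# Hypothetical module: a triangle (a, b, c) with one extra node d linked to c.
summary = topology_summary([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
```

For the toy module the triangle nodes a and b have a local coefficient of 1, c has 1/3 and d has 0, so the average is 7/12.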

Figure 4. The constructed DBN for the patients that had not suffered a relapse during the follow-up period.


Searching for hubs in these networks, we identified a common node which is highly connected in both networks, i.e., TSC1. Only one common interaction of this gene was found between the relapsers and non-relapsers networks, i.e., TSC1 → AK023526. Further exploitation of the different interactions of TSC1 between the two groups may provide important information related to their activation during the progression of oral cancer. Furthermore, we searched each node of the networks in the cancer gene index according to [10]. We found that the TSC1, NOTCH2 and STX6 genes are annotated as important to several cancer types. After searching for these genes in the four modules of the initial DBN, we found that NOTCH2 and STX6 belong to the same module, whereas TSC1 belongs to a different one. Moreover, we detected that NOTCH2 was present in five different biological processes of the network. Thus, this node, in accordance with biological and network knowledge, may reveal very important interactions in cancer progression.
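The comparison of a hub's interactions across the two group-specific networks amounts to set operations on edge lists, as in the sketch below. Only the shared edge TSC1 → AK023526 comes from the text above; the other partners ("placeholder_1", "placeholder_2") are hypothetical stand-ins, since the full edge lists are given only in Figures 3 and 4.

```python
def compare_interactions(edges_a, edges_b, gene):
    """Split the edges involving `gene` into common and network-specific sets."""
    in_a = {edge for edge in edges_a if gene in edge}
    in_b = {edge for edge in edges_b if gene in edge}
    return {
        "common": in_a & in_b,
        "only_a": in_a - in_b,
        "only_b": in_b - in_a,
    }


# Directed edges (parent, child); placeholder names are hypothetical.
relapser_edges = [("TSC1", "AK023526"), ("TSC1", "placeholder_1")]
non_relapser_edges = [("TSC1", "AK023526"), ("placeholder_2", "TSC1")]

tsc1 = compare_interactions(relapser_edges, non_relapser_edges, "TSC1")
```

Because edges are kept as ordered pairs, an interaction with reversed direction in the other network counts as different, which is the stricter of the two possible conventions.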

CONCLUSIONS

We presented a computational approach towards the inference of gene networks using the knowledge of time-course gene expression data. Specifically, we exploited genomic data from 23 patients with oral cancer, consisting of the most differentially expressed genes. This gene set, along with the genes that have been identified in the literature as oral cancer risk associated genes, was utilized for learning the structure of their interaction network in terms of DBNs. Furthermore, we formulated and evaluated the performance of the gene networks among relapsers and non-relapsers. The proposed work revealed optimal gene network structures which can be considered crucial for oral cancer progression. Based on our results, in future work the inference of gene interaction networks from larger datasets, expanded with literature knowledge, can lead to personalized genetic tests and improve patient treatment. Moreover, for a more realistic inference of gene networks based on our approach, the integration of further types of data, such as miRNAs and transcription factors, could also be considered for more robust results.

REFERENCES

[1] Y. Safdari, M. Khalili, S. Farajnia, M. Asgharzadeh, Y. Yazdani, and M. Sadeghi, "Recent advances in head and neck squamous cell carcinoma—A review," Clinical biochemistry, vol. 47, pp. 1195-1202, 2014.
[2] W. M. Mendenhall, J. W. Werning, and D. G. Pfister, "Treatment of head and neck cancer," in Cancer: Principles and Practice of Oncology, 9th ed., V. T. DeVita Jr., T. S. Lawrence, and S. A. Rosenberg, Eds. Philadelphia, PA: Lippincott Williams & Wilkins, 2011, pp. 729-780.
[3] A. Forastiere, R. Weber, and K. Ang, "Treatment of head and neck cancer," N Engl J Med, vol. 358, p. 1076, 2008.
[4] W. Yang, K. Yoshigoe, X. Qin, J. S. Liu, J. Y. Yang, A. Niemierko, Y. Deng, Y. Liu, A. K. Dunker, and Z. Chen, "Identification of genes and pathways involved in kidney renal clear cell carcinoma," BMC bioinformatics, vol. 15, p. S2, 2014.
[5] K. Kalantzaki, E. S. Bei, K. P. Exarchos, M. Zervakis, M. Garofalakis, and D. I. Fotiadis, "Nonparametric network design and analysis of disease genes in oral cancer progression," Biomedical and Health Informatics, IEEE Journal of, vol. 18, pp. 562-573, 2014.
[6] M. S. Cline, M. Smoot, E. Cerami, A. Kuchinsky, N. Landys, C. Workman, R. Christmas, I. Avila-Campilo, M. Creech, and B. Gross, "Integration of biological networks and gene expression data using Cytoscape," Nature protocols, vol. 2, pp. 2366-2382, 2007.
[7] C. L. Estilo, O. Pornchai, S. Talbot, N. D. Socci, D. L. Carlson, R. Ghossein, T. Williams, Y. Yonekawa, Y. Ramanathan, and J. O. Boyle, "Oral tongue cancer gene expression profiling: Identification of novel potential prognosticators by oligonucleotide microarray analysis," BMC cancer, vol. 9, p. 11, 2009.
[8] Y. Wang, J. G. Klijn, Y. Zhang, A. M. Sieuwerts, M. P. Look, F. Yang, D. Talantov, M. Timmermans, M. E. Meijer-van Gelder, and J. Yu, "Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer," The Lancet, vol. 365, pp. 671-679, 2005.
[9] L. J. van't Veer, H. Dai, M. J. Van De Vijver, Y. D. He, A. A. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, and A. T. Witteveen, "Gene expression profiling predicts clinical outcome of breast cancer," Nature, vol. 415, pp. 530-536, 2002.
[10] G. Wu and L. Stein, "A network module-based method for identifying cancer prognostic signatures," Genome Biol, vol. 13, p. R112, 2012.
[11] M. A. Mutalib, L. E. Chai, C. K. Chong, Y. W. Choon, S. Deris, R. M. Illias, and M. S. Mohamad, "Inferring Gene Networks from Gene Expression Data Using Dynamic Bayesian Network with Different Scoring Metric Approaches," in Advances in Biomedical Infrastructure 2013, ed: Springer, 2013, pp. 77-86.
[12] Y. Kim, S. Han, S. Choi, and D. Hwang, "Inference of dynamic networks using time-course data," Briefings in bioinformatics, vol. 15, pp. 212-228, 2014.
[13] Z. Bar-Joseph, "Analyzing time series gene expression data," Bioinformatics, vol. 20, pp. 2493-2503, 2004.
[14] Y. Zhang, J. Xuan, R. Clarke, and H. W. Ressom, "Module-based breast cancer classification," International journal of data mining and bioinformatics, vol. 7, pp. 284-302, 2013.
[15] Y. Zhang, J. J. Xuan, R. Clarke, and H. W. Ressom, "Module-based biomarker discovery in breast cancer," in Bioinformatics and Biomedicine (BIBM), 2010 IEEE International Conference on, 2010, pp. 352-356.
[16] K. P. Exarchos, Y. Goletsis, and D. I. Fotiadis, "A multiscale and multiparametric approach for modeling the progression of oral cancer," BMC medical informatics and decision making, vol. 12, p. 136, 2012.
[17] V. G. Tusher, R. Tibshirani, and G. Chu, "Significance analysis of microarrays applied to the ionizing radiation response," Proceedings of the National Academy of Sciences, vol. 98, pp. 5116-5121, 2001.
[18] G. C. Warner, P. P. Reis, I. Jurisica, M. Sultan, S. Arora, C. Macmillan, A. A. Makitie, R. Grénman, N. Reid, and M. Sukhai, "Molecular classification of oral cancer by cDNA microarrays identifies overexpressed genes correlated with nodal metastasis," International journal of cancer, vol. 110, pp. 857-868, 2004.
[19] P. Saintigny, L. Zhang, Y.-H. Fan, A. K. El-Naggar, V. A. Papadimitrakopoulou, L. Feng, J. J. Lee, E. S. Kim, W. K. Hong, and L. Mao, "Gene expression profiling predicts the development of oral cancer," Cancer Prevention Research, vol. 4, pp. 218-229, 2011.
[20] N. Dojer, P. Bednarz, A. Podsiadło, and B. Wilczyński, "BNFinder2: Faster Bayesian network learning and Bayesian classification," Bioinformatics, p. btt323, 2013.
[21] N. Friedman, M. Linial, I. Nachman, and D. Pe'er, "Using Bayesian networks to analyze expression data," Journal of computational biology, vol. 7, pp. 601-620, 2000.
[22] R. Saito, M. E. Smoot, K. Ono, J. Ruscheinski, P.-L. Wang, S. Lotia, A. R. Pico, G. D. Bader, and T. Ideker, "A travel guide to Cytoscape plugins," Nature methods, vol. 9, pp. 1069-1076, 2012.
