This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2014.2377517, IEEE Journal of Biomedical and Health Informatics

JBHI-00400-2014.R1


A Predictive Model for Personalized Therapeutic Interventions in Non-small Cell Lung Cancer

Nelofar Kureshi, Syed Sibte Raza Abidi, and Christian Blouin

Abstract— Non-small cell lung cancer (NSCLC) constitutes the most common type of lung cancer and is frequently diagnosed at advanced stages. Clinical studies have shown that molecular targeted therapies increase survival and improve quality of life in patients. Nevertheless, the realization of personalized therapies for NSCLC faces a number of challenges including the integration of clinical and genetic data and a lack of clinical decision support tools to assist physicians with patient selection. To address this problem, we used frequent pattern mining to establish the relationships of patient characteristics and tumor response in advanced NSCLC. Univariate analysis determined that smoking status, histology, EGFR mutation, and targeted drug were significantly associated with response to targeted therapy. We applied four classifiers to predict treatment outcome from EGFR-TKIs. Overall, the highest classification accuracy was 76.56% and the AUC was 0.76. The decision tree used a combination of EGFR mutations, histology, and smoking status to predict tumor response and the output was both easily understandable and in keeping with current knowledge. Our findings suggest that support vector machines and decision trees are a promising approach for clinical decision support in the patient selection for targeted therapy in advanced NSCLC. Index Terms— decision support, epidermal growth factor receptor, non-small cell lung cancer, personalized medicine.

Manuscript received June 27, 2014; revised October 10, 2014; accepted November 20, 2014. This work was supported in part by an NSERC Discovery Grant and a CIHR Catalyst Grant. Nelofar Kureshi is with the NICHE Research Group, Faculty of Computer Science, Dalhousie University, Halifax NS (e-mail: [email protected]). Syed Sibte Raza Abidi is with the NICHE Research Group, Faculty of Computer Science, Dalhousie University, Halifax NS (e-mail: [email protected]). Christian Blouin is with the Faculty of Computer Science and the Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax NS (e-mail: [email protected]).

I. INTRODUCTION

Scientific advances since the completion of the Human Genome Project have confirmed that the genetic composition of individual humans plays a significant role in predisposition to common diseases and in response to therapeutic interventions. The traditional medicine model has relied on best practices emerging from large population studies and dictates a one-size-fits-all approach [1]. Although synthesized evidence is essential to demonstrate the overall safety and efficacy of medical approaches, it falls short in explaining the individual variations that exist among patients. Recent advances in genome-wide association studies have revolutionized the practice of medicine, causing a shift to a patient-centered model [2] and offering tailored diagnostic and therapeutic strategies. The translation of genetic and genomic data into knowledge for patient care, covering prevention, diagnosis, prognosis, and treatment, has introduced a new paradigm for healthcare: personalized medicine.

Non-small cell lung cancer (NSCLC) is an example of a disease where personalized medicine has shown tremendous progress. NSCLC comprises 80-90% of all diagnosed lung cancers [3], and approximately two-thirds of patients are not diagnosed until the late stages of the disease, limiting the role of surgical resection as a treatment option. The discovery of epidermal growth factor receptor (EGFR), a key component in the epidermal growth factor signaling pathway, advanced our understanding of the molecular basis of lung cancer. To target this receptor, molecules that compete with ATP for binding to the tyrosine kinase domain of EGFR were developed and tested [4]. Currently, gefitinib (Iressa®, AstraZeneca) and erlotinib (Tarceva®, Roche) are the two EGFR tyrosine kinase inhibitors (EGFR-TKIs) that have been approved for use in clinical practice [5]. Patients with somatic mutations in exons 18-21 of the tyrosine kinase domain of EGFR show marked response to gefitinib and erlotinib [6], [7]. Clinical practice guidelines for NSCLC recommend EGFR mutation testing prior to treatment with EGFR-TKIs [8]. Unfortunately, most NSCLC patients are diagnosed with advanced disease, and their histologic specimens are often insufficient for routine molecular profiling. In this patient population, clinical predictors continue to be an important part of patient selection for EGFR-TKI treatment.
As such, a comprehensive analysis of multiple risk factors is critical for personalized health planning, and predictive modeling approaches that appraise all forms of potentially meaningful information are likely to produce better-informed treatment decisions. Predictive models that combine both clinical and molecular patient information have been successfully developed in many areas of oncology, including breast cancer [9], large B-cell lymphoma [10], and prostate cancer [11]. In lung cancer, mixed models that combine multifactorial features have been shown to provide superior prognostic benefit [12], [13]. A small number of composite predictive models that combine

2168-2194 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


anatomical, clinical, and molecular factors have been developed specifically for NSCLC. Lopez et al. developed a prognostic survival model for early stage NSCLC using a supervised learning classification algorithm and demonstrated that the prognostic discrimination of integrated models surpasses that of individual risk factors [14]. Additionally, Spira et al. constructed a gene expression biomarker model to predict lung cancer in smokers and then tested it in combination with clinical information; the integrated model provided superior specificity for diagnosis [15].

The objective of this study was to investigate whether a combination of factors (clinical predictors, environmental risk factors, and EGFR mutation status) can predict tumor response to EGFR-TKI therapy in patients with advanced NSCLC. We present a proof-of-concept, data-driven prediction model, developed by applying two data mining methods (association rule mining and decision trees), to assist in personalizing treatment and patient selection in advanced NSCLC. Given the scarcity of patient data available for such a study, we formulated a novel data curation methodology to derive data from multiple secondary sources. We applied association rule mining to check that the relationships between patient characteristics and tumor response patterns in the curated data agree with those reported in the literature. Decision tree classifiers were then used to develop a visually interpretable prediction model that combined clinicopathological data and EGFR mutation status to categorize patients’ responsiveness to EGFR-TKI therapy. Despite the rather small dataset, our decision-tree-based prediction model produced meaningful predictions with an accuracy of 75%.
Our work presents an approach to developing a prediction model, based on both clinical and genetic characteristics, for predicting optimal treatment of non-small cell lung cancer with EGFR-TKIs. In the context of personalized health, we have contributed as follows: (a) despite challenges in the procurement of patient data, we have shown the utility and application of secondary data for predictive modeling; and (b) we have demonstrated the application of a decision tree algorithm to generate a prediction model that health professionals can readily visualize, interpret, validate, and objectively apply to evaluate EGFR-TKI response in advanced NSCLC.

II. METHODS

The availability of a rich patient data resource is a major challenge for a study of this kind. Therefore, in this work, we explored and used patient data from secondary sources. Although potential databases were identified, closer examination revealed that very few both stored the attributes of interest to this study (age, gender, histology, EGFR mutation status, EGFR-TKI therapy, and objective response) and were publicly available. Thus, we drew upon multiple freely available data sources to construct a research dataset containing attributes for patient demographics, smoking history, NSCLC histology, EGFR mutation, EGFR-TKI treatment, and clinical response to targeted therapy. An evidence-based approach was used to determine the predictors of response to EGFR-TKIs [6].

A. Data Curation
Sources for data curation included PubMed, the Catalogue of Somatic Mutations in Cancer (COSMIC) [16], and the EGFR Mutations database (SM-EGFR-DB) [17]. MeSH terms ("Carcinoma, Non-Small-Cell Lung" AND "Receptor, Epidermal Growth Factor") were used to search for case series, case reports, and research publications from 2000 to 2012. Only articles reporting individual-level patient data on human subjects were selected. The bibliographies of these articles also pointed to relevant literature containing patient-level data. Studies reporting treatment of acquired EGFR-TKI resistance with second-generation EGFR-TKIs were excluded. This search identified 1962 papers from PubMed, 4681 unique mutated samples (each associated with a PubMed ID) from the COSMIC database, and 167 articles from SM-EGFR-DB. After screening for inclusion criteria and removal of duplicate studies, 34 articles and 14 case reports were selected for data extraction.

Our approach for manual curation and structuring of relevant data from research articles and case reports was as follows. For each of the 34 research articles selected, we reviewed the study design, methods, interventions, and outcomes. Candidate articles reported demographics, smoking status, diagnosis, EGFR mutations, and response to EGFR-TKI therapy for selected study subjects in tables or supplementary files. Pertinent data were extracted from these tables and combined. Some authors reported the same patients and rare mutations in multiple papers, and care was taken to delete duplicate entries for individual cases.
We identified 14 case reports pertaining to NSCLC and EGFR-TKI treatment response. Each of these clinical narratives presented a detailed account of patient history, investigations, treatment, and outcome. Through a careful examination of the free text of each case report, we identified the key attributes of interest that were scattered through the report. The attributes aggregated from an individual case report comprised a single data record. Data extracted from articles and case reports were pooled to create the final working dataset. Overall, 355 cases were extracted from 34 research articles and 14 case reports (Table I), where each case was represented by 7 attributes (age, gender, smoking status, diagnosis, drug, EGFR mutation, and tumor response).

Most studies reported treatment response as complete response (CR), partial response (PR), stable disease (SD), or progressive disease (PD) as defined by the Response Evaluation Criteria in Solid Tumors (RECIST) [18]. If the study did not specify the use of RECIST and response was classified using terms such as partial regression, complete regression, partial remission, complete remission, major


response or minor response, it was labeled miscellaneous response (MR). Subsequently, a binary outcome was constructed by recoding the original multiclass response into two classes: CR, PR, and MR were labeled “responder”, whereas SD and PD were labeled “nonresponder”.

The original attribute of EGFR mutation status included 70 distinct mutations, some of which are rare and thus occurred infrequently in the dataset. Several authors have studied mutations and their relationship to EGFR-TKI response according to the mutation’s physical location in the EGFR gene sequence (exons 18-21), type of mutation (point mutation, insertion, deletion, or duplication), and complexity (single or double mutation) [19], or as classical versus complex mutations [20]. Using this domain knowledge, a new attribute was constructed from the existing EGFR mutation attribute [21]. This was intended to improve the representation of the problem and to facilitate the discernment of patterns that might otherwise have been difficult to recognize.
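The recoding of the response attribute described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function name `recode_response` and the two label strings are assumptions made for the example.

```python
# Response codes as defined in the text: CR, PR, and MR count as response;
# SD and PD count as non-response.
RESPONDER_CODES = {"CR", "PR", "MR"}
NONRESPONDER_CODES = {"SD", "PD"}


def recode_response(code: str) -> str:
    """Map a multiclass response code to the binary outcome (hypothetical helper)."""
    normalized = code.strip().upper()
    if normalized in RESPONDER_CODES:
        return "responder"
    if normalized in NONRESPONDER_CODES:
        return "nonresponder"
    # Anything else is treated as a data-entry problem rather than guessed.
    raise ValueError(f"Unknown response code: {code!r}")


binary_labels = [recode_response(c) for c in ["CR", "pr", "MR", "SD", "PD"]]
```

Raising an error for unknown codes, rather than assigning a default class, keeps curation mistakes visible when records come from many different publications.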

B. Association Rule Discovery
Given that the data for this study were derived from secondary sources, we needed to objectively validate the generated study dataset before developing the prediction model. We applied association rule mining to identify inherent associations between patient characteristics and tumor responses within the dataset. Frequent associations were then compared with previously well-understood patterns and trends in order to validate the dataset. Various authors [22], [23], [24] have reported success in using frequent patterns (or association rules, expressed as IF-THEN rules) in medical databases to confirm previous biomedical knowledge and to discover novel associations between variables. We applied a constrained association rule learning method, in which the consequent was limited to tumor response, as the intent was to detect tumor response patterns. We used the association measures of confidence and lift, which indicate the reliability and relevance of an association rule, to establish the efficacy of a ‘discovered’ rule. To derive the association rules, the upper bound for minimum support was set to 100% and the lower bound to 10%. Initially, the minimum confidence was set to 90%; it was then lowered in steps of 10% until constrained rules were obtained.

C. Predictive Modeling
Predictive models of tumor response to erlotinib or gefitinib, at the individual patient level, were developed using support vector machine and decision tree classifiers. Patients were divided into responder and non-responder groups. Student’s t-test was used to compare ages between the two groups. Univariate analysis comparing clinical characteristics and mutation status was performed using Pearson’s chi-square test or Fisher’s exact test, where appropriate. Informed by the results of the univariate analysis, a feature vector was constructed with tumor response as the outcome of interest. Data were divided into a training set with 291 instances and a test set with 64 instances. Next, attribute selection was performed to determine the subset of features that were highly correlated with the class while having low intercorrelation. Finally, classification algorithms were employed to predict tumor response to targeted therapy in advanced NSCLC. Classifier performance was evaluated using accuracy, error, precision, recall, and the area under the curve (AUC). Output interpretability of the decision tree was assessed as the qualitative performance criterion. All association rule mining and classification algorithms were implemented using the Waikato Environment for Knowledge Analysis (WEKA) [25] and the R statistical software.

III. RESULTS

A. Description of Dataset
The study sample comprised 143 (40%) males and 212 (60%) females, and the mean age was 60 ± 12.03 years. Half of the patients (50%) reported never smoking, 41% were former smokers, and 10% were current smokers. Eleven histologies for NSCLC were documented; of these, adenocarcinoma (n=281, 79%), squamous cell carcinoma (n=21, 6%), bronchoalveolar carcinoma (n=21, 6%), and large cell carcinoma (n=10, 3%) were the most common.

The dataset included 70 distinct EGFR mutations spanning exons 18, 19, 20, and 21. These included point mutations, insertions, deletions, duplications, classical mutations, and complex mutations. Eight EGFR classes were constructed from the existing EGFR mutations, in addition to wild-type status. In general, a complex mutation represented point mutations from two different exons in a single patient. For example, exon 20/exon 21 complex mutations include 'exon 20 R776G + exon 21 L858R', 'exon 20 G779S + exon 21 L858R', and 'exon 20 R776H + exon 21 L858R'. T790M complex mutations included the exon 20 T790M point mutation in combination with another EGFR mutation. A classical complex mutation was defined as the co-existence of an exon 19 deletion and the exon 21 L858R point mutation. The term double mutation denoted two different point mutations on the same exon of EGFR. If a mutation could not be grouped into the aforementioned classes, it was assigned its own class; for example, the commonly occurring L858R mutation formed a class of its own.
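The grouping definitions above can be illustrated with a small sketch. It covers only the classes defined in this paragraph (wild-type, classical complex, T790M complex, complex, double, and single-mutation classes). The tuple representation of a mutation, the class label strings, and the order in which the checks are applied are assumptions made for the example, not details reported by the study.

```python
from typing import List, Tuple

# Hypothetical representation: (exon number, mutation kind, mutation name).
Mutation = Tuple[int, str, str]


def egfr_class(mutations: List[Mutation]) -> str:
    """Assign one of the EGFR classes defined in the text (partial sketch)."""
    if not mutations:
        return "wildtype"
    names = {name for _, _, name in mutations}
    has_exon19_deletion = any(
        exon == 19 and kind == "deletion" for exon, kind, _ in mutations
    )
    # Classical complex: exon 19 deletion co-existing with exon 21 L858R.
    if has_exon19_deletion and "L858R" in names:
        return "classical complex"
    # T790M complex: T790M together with another EGFR mutation.
    if "T790M" in names and len(mutations) > 1:
        return "T790M complex"
    point_names = {name for _, kind, name in mutations if kind == "point"}
    point_exons = {exon for exon, kind, _ in mutations if kind == "point"}
    if len(point_names) >= 2:
        # Two exons -> complex mutation; same exon -> double mutation.
        return "complex" if len(point_exons) >= 2 else "double"
    if len(mutations) == 1:
        # Ungrouped single mutations form their own class, e.g. L858R.
        return mutations[0][2]
    return "other"


example = egfr_class([(20, "point", "R776G"), (21, "point", "L858R")])
```

The exon 19 LREA and non-LREA deletion classes and the exon 20 insertion/deletion/duplication class that appear in Fig. 1 are not defined in this paragraph, so they are left out of the sketch.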


TABLE I
PATIENT CHARACTERISTICS

Attributes                        N      %
Gender           Male             143    40.28
                 Female           212    59.72
Smoking status   Current          34     9.58
                 Former           145    40.85
                 Never            176    49.58
Histology        AD               281    79.15
                 BAC              21     5.92
                 LC               10     2.82
                 SCC              21     5.92
                 Other            22     6.20
EGFR status      Mutated          260    73.24
                 Wildtype         95     26.76
Drug             Erlotinib        72     20.28
                 Gefitinib        226    63.66
                 Either           57     16.06
Response         Responder        210    59.15
                 Non-responder    145    40.85

Abbreviations: AD, adenocarcinoma; BAC, bronchoalveolar carcinoma; LC, large cell carcinoma; SCC, squamous cell carcinoma

TABLE II
SELECTED ASSOCIATION RULES FROM APRIORI ALGORITHM

Antecedent                                          Consequent      Lift    Confidence
Histology=Adenocarcinoma, EGFR mutation=Wildtype    Nonresponder    1.93    0.79
EGFR mutation=Wildtype                              Nonresponder    1.84    0.75
EGFR mutation=Exon 19 del E746-A750                 Responder       1.33    0.79
(clinical attributes; see note)                     Responder       1.06    0.63
(clinical attributes; see note)                     Responder       1.05    0.62
(clinical attributes; see note)                     Responder       1.01    0.60

Note: the antecedents of the last three rules survive only as fragments whose row alignment cannot be recovered: Gender=Female, Diagnosis=Adenocarcinoma, Drug=gefitinib; Age=>49.2, Gender=Female; Smoking=Never, Histology=Adenocarcinoma; Gender=Female, Smoking=Never, Histology=Adenocarcinoma.

In our study sample, 92% of patients with EGFR exon 20 insertions/deletions/duplications (n=12) and 75% of EGFR wild-type patients (n=95) were non-responders to EGFR-TKIs (Fig. 1). Furthermore, 86% of patients harboring classical complex mutations (n=14) and 92% of subjects with exon 19 non-LREA deletions (n=36) were responders to targeted therapy with erlotinib or gefitinib. L858R was observed in 68 patients; of these, 74% were responders and 26% were non-responders.

B. Association Rule Mining Results
The association rule mining algorithm discovered 30 association rules with lift values (indicating interestingness) ranging from 0.9 to 1.93. The largest itemset contained three items. A selection of six ‘interesting’ rules is shown in Table II. Twenty-seven rules reported the interaction of only clinical attributes (age, gender, smoking status, and histology). In general, females with never-smoking status who had been diagnosed with an adenocarcinoma often achieved a favorable response with TKIs. Two association rules demonstrated the relationship of only EGFR mutations with the tumor response to EGFR-TKI therapy. These rules demonstrated that the presence of sensitizing mutations increases the efficacy of the EGFR-TKI and leads to an improved response, whereas wildtype status leads to progressive disease.

The highest lift value corresponded to a rule with a combination of clinical and molecular attributes, illustrating that the integration of these characteristics is more informative of treatment response in advanced NSCLC than clinical predictors alone [26]. The diagnosis of adenocarcinoma is frequently seen in patients who achieve a response to EGFR-TKI therapy. However, when the antecedent contains the additional information of wildtype EGFR status together with adenocarcinoma histology, the predicted tumor response changes markedly and the patient is labeled as a non-responder.
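To make the rule measures concrete, the sketch below computes support, confidence, and lift for the rule "EGFR mutation=Wildtype, then Nonresponder" from counts reported in the tables (355 patients, 95 wildtype, 145 non-responders, 71 wildtype non-responders). The function name `rule_metrics` is an assumption for the example, and the formulas are the standard definitions rather than a description of WEKA's internals.

```python
def rule_metrics(n_total: int, n_antecedent: int, n_consequent: int, n_both: int):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    support = n_both / n_total                    # P(antecedent and consequent)
    confidence = n_both / n_antecedent            # P(consequent | antecedent)
    lift = confidence / (n_consequent / n_total)  # confidence relative to base rate
    return support, confidence, lift


# Counts taken from Table I and Table III.
support, confidence, lift = rule_metrics(
    n_total=355, n_antecedent=95, n_consequent=145, n_both=71
)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

The confidence rounds to the 0.75 reported in Table II. The lift computed this way is about 1.83, close to the published 1.84; the small gap may reflect rounding of intermediate values. A lift above 1 means that non-response is more frequent among wildtype patients than in the dataset as a whole.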

C. Prediction Modeling Results
Prior to developing the prediction models, we performed univariate analysis, which identified smoking status, histology, drug, and EGFR mutation status as significant predictors of tumor response to EGFR-TKI therapy (Table III). Further attribute selection reduced the predictors to smoking status, histology, and EGFR mutation status. Results of the top four performing classifiers, support vector machine (SMO), J48 (open source Java implementation of the C4.5 algorithm), Random Forest, and classification and regression tree (CART), are reported here. The classification accuracy of all algorithms was comparable (Table IV). SMO had the highest area under the curve (AUC, 0.76) and all decision trees had similar AUCs (0.72-0.73). Given the comparable results, we chose decision trees as our prediction model because they provide a visually comprehensible illustration of the decision logic. Fig. 2 illustrates a portion of the decision tree generated from the training data, where EGFR mutation status is the root node. The accuracy of each decision rule from root to leaf is displayed at the individual leaves. Within the most commonly detected mutations of exon 19, NSCLC histologies differed in their individual responses. Adenocarcinoma (AD) and bronchoalveolar carcinoma (BAC) were responsive to EGFR-TKI therapy. However, patients with squamous cell carcinoma (SCC) were non-responders. In the case of wildtype status, patients were non-responders regardless of histology, with the exception of BAC, which led to a response from targeted therapy.

TABLE III
UNIVARIATE ANALYSIS

Attribute                    Responder        Non-responder    p-value
Age                          60.18 ± 11.99    60.24 ± 12.21    0.970
Gender                                                         0.218
                 Male        79 (55.24)       64 (44.76)
                 Female      131 (61.79)      81 (38.21)
Smoking status                                                 0.002
                 Never       113 (64.20)      63 (35.80)
                 Former      87 (60)          58 (40)
                 Current     10 (29.41)       24 (70.59)
Histology                                                      0.0002
                 AD          166 (59.07)      115 (40.93)
                 BAC         19 (90.48)       2 (9.52)
                 LCC         3 (30)           7 (70)
                 SCC         6 (28.57)        15 (71.43)
                 Other       35 (81.40)       8 (18.60)
Drug                                                           0.0006
                 Erlotinib   53 (73.61)       19 (26.39)
                 Gefitinib   134 (59.29)      92 (40.71)
                 Either      23 (40.35)       34 (59.65)
EGFR status                                                    0.0001
                 Mutated     186 (71.54)      74 (28.46)
                 Wildtype    24 (25.26)       71 (74.74)

Abbreviations: AD, adenocarcinoma; BAC, bronchoalveolar carcinoma; LCC, large cell carcinoma; SCC, squamous cell carcinoma

TABLE IV
PERFORMANCE EVALUATION OF CLASSIFIERS

Classifier    Accuracy (%)    Error (%)    Precision    Recall    AUC
SMO           76.56           23.44        0.77         0.77      0.76
J48           75.00           25.00        0.76         0.75      0.73
RF            75.00           25.00        0.76         0.75      0.72
CART          73.44           26.56        0.74         0.73      0.72

Abbreviations: SMO, sequential minimal optimization, an algorithm for training a support vector classifier; J48, C4.5 classifier in WEKA; RF, random forest; CART, classification and regression tree.

IV. DISCUSSION

Sensitizing EGFR mutations are an established independent predictor of treatment outcome from TKIs and previous studies have shown that adenocarcinoma histology is associated with higher likelihood of response to gefitinib or erlotinib. Previous research has also established that the combination of clinical and genetic or genomic features has enhanced predictive ability for risk stratification in lung cancer [27], [28]. In keeping with prior investigations, our results demonstrate that the combination of smoking status, tumor tissue histology, and EGFR mutation status can be used to predict treatment outcome from EGFR-TKIs in advanced NSCLC. Our findings are also consistent with previous research that investigated the importance of combining clinical and molecular predictors of outcome to identify cost-effective patient subgroups for treatment in NSCLC. The authors from this analysis determined that patients with adenocarcinoma and never smoker status who had received one prior chemotherapy regimen had the lowest incremental cost effectiveness ratio. This demonstrates that the integration of molecular and clinical predictors has the capacity to determine treatment efficacy as well as treatment cost-effectiveness [29]. In particular, we explored the use of decision trees to generate interpretable classification rules allowing users to understand the underlying decision-making process. Our decision tree illustrates that EGFR mutation is the most powerful variable in determining patient response to TKIs, with histology being the next most important discriminating factor. Interestingly, current management guidelines for NSCLC suggest that patient selection for EGFR mutation testing should be based on tumor histology. Thus, decision trees were shown as a promising technique for analyzing clinical and genetic data for patient outcome prediction. We contend that decision trees can optimize and potentially automate medical decision-making for targeted treatment selection. 
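The interpretability argument can be illustrated by writing the branches of the partial tree described in the Results (Fig. 2) as explicit rules. This is a hand-written sketch of only those reported branches, not the trained J48 model; the function name `predict_response` and the label strings `"exon19"`, `"wildtype"`, `"AD"`, `"BAC"`, and `"SCC"` are assumptions for the example. Branches not described in the text return `None`.

```python
from typing import Optional


def predict_response(egfr_class: str, histology: str) -> Optional[str]:
    """Rules for the partial tree described in the text; root node is EGFR status."""
    if egfr_class == "exon19":
        # Exon 19 mutations: AD and BAC responded, SCC did not.
        if histology in ("AD", "BAC"):
            return "responder"
        if histology == "SCC":
            return "nonresponder"
        return None  # branch not described in the text
    if egfr_class == "wildtype":
        # Wildtype: non-responders regardless of histology, except BAC.
        return "responder" if histology == "BAC" else "nonresponder"
    return None  # other EGFR classes are not covered by this sketch


prediction = predict_response("wildtype", "AD")
```

Each root-to-leaf path reads as an IF-THEN statement that a clinician can inspect and challenge, which is the property that motivated the choice of decision trees over the slightly more accurate support vector machine.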
The model accuracy is comparable to the response rate reported by several phase III trials which have compared TKI therapy with chemotherapy for EGFR mutant patients. The response rate of mutation positive patients receiving erlotinib was 83% as reported by the OPTIMAL trial [30], whereas the EURTAC trial showed a response rate of 58% in genotypically selected patients with NSCLC [31].
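For reference, the accuracy values in Table IV follow directly from the size of the test set: with 64 test instances, 76.56% corresponds to 49 correct predictions, 75.00% to 48, and 73.44% to 47. The sketch below shows the calculation from a confusion matrix. The individual cell counts are hypothetical, chosen only so that 49 of 64 predictions are correct; the study does not report its confusion matrices, and the precision and recall in Table IV may be class-weighted averages rather than the single-class values computed here.

```python
def classification_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Accuracy, error, precision, and recall for the positive (responder) class."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error": 1.0 - accuracy,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical confusion matrix for a 64-instance test set with 49 correct.
metrics = classification_metrics(tp=32, fn=6, fp=9, tn=17)
print(f"accuracy={metrics['accuracy'] * 100:.2f}% error={metrics['error'] * 100:.2f}%")
```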


FIG. 1: DISTRIBUTION OF EGFR MUTATION WITH RESPONSE Abbreviations: CCM, classical complex mutation; CM, complex mutation; E19LREA, exon 19 LREA deletion; E19NLREA, exon 19 non-LREA deletion; E21D, exon 21 double mutation; E20IDD, exon 20 insertion/deletion/duplication; WT, wildtype; R, responder; NR, non-responder

FIG. 2: DECISION TREE GENERATED FROM LEARNING DATA (PARTIAL TREE SHOWN) Abbreviations: E19LREADEL, exon 19 LREA deletion; AD, adenocarcinoma; LC, large cell carcinoma; SCC, squamous cell carcinoma; R, responder; NR, non-responder. The number of instances reaching each leaf and the classification accuracy are noted.


Recently, a meta-analysis of six trials involving 1,021 EGFR mutation-positive patients with clinical disease stage IIIB-IV reported an overall response rate of 66.6% for EGFR-TKIs [32]. To our knowledge, this work is the first attempt to apply decision tree algorithms to a dataset that incorporates both clinical and genetic attributes to produce a clear decision tree for the treatment of non-small cell lung cancer with EGFR-TKIs.

A second line of work pursued here was the curation of the study dataset from publicly available patient data, which involved abstracting the data and then validating the information captured within the dataset. In this regard, we developed an approach to extract study-related information from various sources to create a representative sample of advanced NSCLC, complete with demographics, pathological diagnosis, and EGFR mutation data. For this study, the data were extracted manually from the literature and public data repositories, from which we were able to abstract the elements essential to describe the characteristics of NSCLC patients: age, gender, smoking status, tumor histology, EGFR mutations, therapeutic intervention, and tumor response to molecularly targeted therapy. To validate the dataset, we applied an association rule method to observe the presence of (a) well-defined patterns; (b) superfluous or irregular patterns; and (c) new and interesting patterns. Trials in unselected patients with advanced NSCLC have indicated that clinical characteristics such as female gender, a history of never smoking, and adenocarcinoma histology are all highly correlated with a favorable response to erlotinib or gefitinib [33], [34], [35]. The association rule mining algorithm (Apriori) consistently revealed frequent itemsets that have been previously described in large randomized controlled trials.
The association rules also highlighted inherent regularities in the data and the frequency with which attributes occur together. We noted that EGFR wildtype status frequently leads to progressive disease after EGFR-TKI therapy, whereas sensitizing mutations in exon 19 confer increased sensitivity [36]. High lift ratios, indicating relevance and interestingness, were found in the association rules containing EGFR mutation characteristics or a combination of clinical and genetic factors. We contend that this is the first attempt to apply association-based rule learning to explore clinical and molecular characteristics in NSCLC, and our results indicate the aptness of frequent pattern mining methods both to identify new associations and to validate the presence of known associations within the study dataset.

Limitations of this work include the potential omission of rare patterns and the low discriminative capacity of the decision model. As described previously, the current research dataset was created from diverse sources, each with a slightly different experimental approach, DNA sequencing methodology, and assessment of tumor response. The limited sample size (n=355) included both common and rare mutations. Pattern mining techniques applied to validate the data discovered many of the well-established clinical and molecular associations. However, traditional pattern mining algorithms are designed to discover interesting patterns in potentially large datasets. Rare features, including several insertions, duplications, deletions, and point mutations, had low frequency counts and would not have occurred in frequent itemsets. The highest predictive accuracy of the data-driven decision support model reached 76.56%, suggesting that a more extensive training set may improve the overall performance of this binary classification. Additional clinical features such as performance status, ethnicity, and line of treatment, as well as additional genetic mutations, may help further explain tumor response to erlotinib or gefitinib.

Although an overwhelming amount of clinical and genomic data is being captured and collected, the data are not being analyzed in a manner that produces actionable information [1]. To date, this represents a lost opportunity for improving personalized healthcare. The current research demonstrates a data-driven decision modeling approach for predicting tumor response to targeted therapies in advanced-stage NSCLC. This proof of concept can be extended and further validated using alternative data sources. Longitudinal electronic health records contain comprehensive patient data, through which learning models can decipher the complex interactions between clinical and molecular characteristics that predict tumor response. Alternatively, data warehouses that are not tied to specific institutes or organizations offer increased scope and utility for mining healthcare data. The predictive ability and reliability of decision models rely heavily on the quality and volume of training data; this underscores the importance of data aggregation initiatives from both public and private provider organizations.
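The limitation regarding rare mutations can be made concrete with a small sketch. The mutation labels, the counts, and the 10% threshold below are hypothetical choices for illustration, not values from the study. The sketch shows how a minimum-support filter, the first step of frequent itemset mining, discards an item whose frequency falls below the threshold, so that no rule involving it can be generated later.

```python
from collections import Counter
from typing import Dict, List


def frequent_items(values: List[str], min_support: float) -> Dict[str, float]:
    """Return each item whose relative frequency meets the minimum support."""
    counts = Counter(values)
    total = len(values)
    return {
        item: count / total
        for item, count in counts.items()
        if count / total >= min_support
    }


# Hypothetical mutation labels for 20 patients (illustrative only).
MUTATIONS: List[str] = (
    ["wildtype"] * 10
    + ["exon19_del"] * 6
    + ["L858R"] * 3
    + ["exon20_ins"] * 1  # rare mutation: support 1/20 = 0.05
)

# Assumed threshold of 10%; the rare insertion falls below it and is dropped.
kept = frequent_items(MUTATIONS, min_support=0.10)

if __name__ == "__main__":
    for item, supp in sorted(kept.items(), key=lambda kv: -kv[1]):
        print(f"{item}: support={supp:.2f}")
```

Lowering the threshold retains rarer items at the cost of many more candidate itemsets, which is one reason a larger dataset is the more reliable remedy.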

V. CONCLUSION

The success of personalized medicine will depend on the accurate identification of patients who can benefit from targeted therapies [37]. This work builds on the successful predictive modeling research in NSCLC previously established by several researchers [14], [15], [38]. In this paper, we demonstrate the development of a decision support model for patient selection in NSCLC using real-world patient data. With rapid advances in biomedicine, we expect that ongoing research efforts will identify new genetic mutations and protein expression signatures involved in NSCLC. Our decision tree model provides a framework in which new forms of phenotypic information and genomic data can be evaluated. This type of data-driven decision support has the potential to rapidly translate research findings into clinical practice and to help healthcare providers plan and deliver individualized treatment.

REFERENCES

[1] C. C. Bennett, T. W. Doub, and R. Selove, "EHRs connect research and practice: Where predictive modeling, artificial intelligence, and clinical decision support intersect," Health Policy and Technology, vol. 1, no. 2, pp. 105-114, Jun. 2012.
[2] L. Chouchane, R. Mamtani, A. Dallol, and J. I. Sheikh, "Personalized medicine: a patient-centered paradigm," J Transl Med, vol. 9, p. 206, 201.



[3] S. Navada, P. Lai, A. G. Schwartz, and G. P. Kalemkerian, "Temporal trends in small cell lung cancer: analysis of the national Surveillance Epidemiology and End-Results (SEER) database," J Clin Oncol, vol. 24, no. 18 suppl, p. 7082, 2006.
[4] R. S. Herbst, M. Fukuoka, and J. Baselga, "Gefitinib—a novel targeted approach to treating cancer," Nature Reviews Cancer, vol. 4, no. 12, pp. 956-965, Dec. 2004.
[5] Z. Zhang, A. L. Stiegler, T. J. Boggon, S. Kobayashi, and B. Halmos, "EGFR-mutated lung cancer: a paradigm of molecular oncology," Oncotarget, vol. 1, no. 7, pp. 497-514, Nov. 2010.
[6] J. G. Paez, P. A. Janne, J. C. Lee, S. Tracy, H. Greulich, S. Gabriel, et al., "EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy," Science, vol. 304, no. 5676, pp. 1497-500, Jun. 2004.
[7] T. J. Lynch, D. W. Bell, R. Sordella, S. Gurubhagavatula, R. A. Okimoto, B. W. Brannigan, P. L. Harris, S. M. Haserlat, J. G. Supko, F. G. Haluska, D. N. Louis, D. C. Christiani, J. Settleman, and D. A. Haber, "Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib," N Engl J Med, vol. 350, no. 21, pp. 2129-39, May 2004.
[8] National Comprehensive Cancer Network (NCCN), "NCCN Clinical Practice Guidelines in Oncology (NCCN Guidelines). Non-small cell lung cancer version 2.2013," [Online]. Available: http://www.nccn.org/professionals/physician_gls/pdf/nscl.pdf. [Accessed: June 3, 2013].
[9] J. Pittman, E. Huang, H. Dressman, C. F. Horng, S. H. Cheng, M. H. Tsou, C. M. Chen, A. Bild, E. S. Iversen, A. T. Huang, J. R. Nevins, and M. West, "Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes," Proc Natl Acad Sci U S A, vol. 101, no. 22, pp. 8431-6, Jun. 2004.
[10] L. X. Li, "Survival prediction of diffuse large-B-cell lymphoma based on both clinical and gene expression information," Bioinformatics, vol. 22, no. 4, pp. 466-471, Feb. 2006.
[11] A. J. Stephenson, A. Smith, M. W. Kattan, J. Satagopan, V. E. Reuter, P. T. Scardino, and W. L. Gerald, "Integration of gene expression profiling and clinical variables to predict prostate carcinoma recurrence after radical prostatectomy," Cancer, vol. 104, no. 2, pp. 290-298, Jul. 2005.
[12] S. K. Lau, P. C. Boutros, M. Pintilie, F. H. Blackhall, C. Q. Zhu, D. Strumpf, M. R. Johnston, G. Darling, S. Keshavjee, T. K. Waddell, N. Liu, D. Lau, L. Z. Penn, F. A. Shepherd, I. Jurisica, S. D. Der, and M. S. Tsao, "Three-gene prognostic classifier for early-stage non-small-cell lung cancer," Journal of Clinical Oncology, vol. 25, no. 35, pp. 5562-5569, Dec. 2007.
[13] A. Potti, S. Mukherjee, R. Petersen, H. K. Dressman, A. Bild, J. Koontz, R. Kratzke, M. A. Watson, M. Kelley, G. S. Ginsburg, M. West, D. H. Harpole, and J. R. Nevins, "A genomic strategy to refine prognosis in early-stage non-small-cell lung cancer," New England Journal of Medicine, vol. 355, no. 6, pp. 570-580, Aug. 2006.
[14] A. Lopez-Encuentra, F. Lopez-Rios, E. Conde, R. Garcia-Lujan, A. Suarez-Gauthier, N. Manes, G. Renedo, J. L. Duque-Medina, E. Garcia-Lagarto, R. Rami-Porta, G. Gonzalez-Pont, J. Astudillo-Pombo, J. L. Mate-Sanz, J. Freixinet, T. Romero-Saavedra, M. Sanchez-Cespedes, A. Gomez de la Camara, and the Bronchogenic Carcinoma Cooperative Group of the Spanish Society of Pneumology and Thoracic Surgery, "Composite anatomical-clinical-molecular prognostic model in non-small cell lung cancer," Eur Respir J, vol. 37, no. 1, pp. 136-42, Jan. 2011.
[15] A. Spira, J. E. Beane, V. Shah, K. Steiling, G. Liu, F. Schembri, S. Gilman, Y. M. Dumas, P. Calner, P. Sebastiani, S. Sridhar, J. Beamis, C. Lamb, T. Anderson, N. Gerry, J. Keane, M. E. Lenburg, and J. S. Brody, "Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer," Nat Med, vol. 13, no. 3, pp. 361-6, Mar. 2007.
[16] Wellcome Trust Sanger Institute, "Catalogue of Somatic Mutations in Cancer (COSMIC)," [Online]. Available: http://cancer.sanger.ac.uk/cancergenome/projects/cosmic. [Accessed: June 2, 2013].
[17] "Somatic Mutations in Epidermal Growth Factor Receptor DataBase (SMEGFR-DB)," EGFR Mutations Database, 2008. [Online]. Available: http://somaticmutations-egfr.org. [Accessed: June 2, 2013].
[18] E. A. Eisenhauer, P. Therasse, J. Bogaerts, L. H. Schwartz, D. Sargent, R. Ford, J. Dancey, S. Arbuck, S. Gwyther, M. Mooney, L. Rubinstein, L. Shankar, L. Dodd, R. Kaplan, D. Lacombe, and J. Verweij, "New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1)," Eur J Cancer, vol. 45, no. 2, pp. 228-47, Jan. 2009.

[19] I. Y. Tam, E. L. Leung, V. P. Tin, D. T. Chua, A. D. Sihoe, L. C. Cheng, L. P. Chung, and M. P. Wong, "Double EGFR mutants containing rare EGFR mutant types show reduced in vitro response to gefitinib compared with common activating missense mutations," Mol Cancer Ther, vol. 8, no. 8, pp. 2142-51, Aug. 2009.
[20] S. G. Wu, Y. L. Chang, Y. C. Hsu, J. Y. Wu, C. H. Yang, C. J. Yu, M. F. Tsai, J. Y. Shih, and P. C. Yang, "Good response to gefitinib in lung adenocarcinoma of complex epidermal growth factor receptor (EGFR) mutations with the classical mutation pattern," Oncologist, vol. 13, no. 12, pp. 1276-84, Dec. 2008.
[21] L. H. Liu and H. Motoda, Feature Extraction, Construction and Selection: A Data Mining Perspective. Springer, 1998.
[22] A. Agrawal and A. Choudhary, "Identifying HotSpots in Lung Cancer Data Using Association Rule Mining," in 2011 IEEE 11th Int. Conf. Data Min. Work., 2011, pp. 995–1002.
[23] C. Ordonez, N. Ezquerra, and C. A. Santana, "Constraining and summarizing association rules in medical data," Knowledge and Information Systems, vol. 9, no. 3, pp. 1-2, 2006.
[24] L. Elfangary and W. A. Atteya, "Mining Medical Databases Using Proposed Incremental Association Rules Algorithm (PIA)," in Digital Society, Second International Conference on the, Feb. 10-15, 2008, Sainte Luce, pp. 88-92.
[25] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2005.
[26] M. S. Tsao, A. Sakurada, J. C. Cutz, C. Q. Zhu, S. Kamel-Reid, J. Squire, I. Lorimer, T. Zhang, N. Liu, M. Daneshmand, P. Marrano, G. D. Santos, A. Lagarde, F. Richardson, L. Seymour, M. Whitehead, K. Y. Ding, J. Pater, and F. A. Shepherd, "Erlotinib in lung cancer - Molecular and clinical predictors of outcome," New England Journal of Medicine, vol. 353, no. 2, pp. 133-144, Jul. 14, 2005.
[27] E. S. Lee, D. S. Son, S. H. Kim, J. Lee, J. Jo, J. Han, H. Kim, H. J. Lee, H. Y. Choi, Y. Jung, M. Park, Y. S. Lim, K. Kim, Y. M. Shim, B. C. Kim, K. Lee, N. Huh, C. Ko, K. Park, J. W. Lee, Y. S. Choi, and J. Kim, "Prediction of Recurrence-Free Survival in Postoperative Non-Small Cell Lung Cancer Patients by Using an Integrated Model of Clinical Information and Gene Expression," Clinical Cancer Research, vol. 14, no. 22, pp. 7397-7404, Nov. 2008.
[28] K. Shedden, J. M. G. Taylor, S. A. Enkemann, M. S. Tsao, T. J. Yeatman, W. L. Gerald, S. Eschrich, I. Jurisica, T. J. Giordano, D. E. Misek, A. C. Chang, C. Q. Zhu, D. Strumpf, S. Hanash, F. A. Shepherd, K. Ding, L. Seymour, K. Naoki, N. Pennell, B. Weir, R. Verhaak, C. Ladd-Acosta, T. Golub, M. Gruidl, A. Sharma, J. Szoke, M. Zakowski, V. Rusch, M. Kris, A. Viale, N. Motoi, W. Travis, B. Conley, V. E. Seshan, M. Meyerson, R. Kuick, K. K. Dobbin, T. Lively, J. W. Jacobson, D. G. Beer, and the Director's Challenge Consortium for the Molecular Classification of Lung Adenocarcinoma, "Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study," Nature Medicine, vol. 14, no. 8, pp. 822-827, Aug. 2008.
[29] P. A. Bradbury, "Impact of clinical and molecular predictors of benefit from erlotinib in advanced non-small cell lung cancer on cost-effectiveness," Journal of Clinical Oncology, vol. 26, p. 6531, 2008.
[30] C. Zhou, Y. L. Wu, G. Chen, J. Feng, X. Q. Liu, C. Wang, S. Zhang, J. Wang, S. Zhou, S. Ren, S. Lu, L. Zhang, C. Hu, C. Hu, Y. Luo, L. Chen, M. Ye, J. Huang, X. Zhi, Y. Zhang, Q. Xiu, J. Ma, and C. You, "Erlotinib versus chemotherapy as first-line treatment for patients with advanced EGFR mutation-positive non-small-cell lung cancer (OPTIMAL, CTONG-0802): a multicentre, open-label, randomised, phase 3 study," Lancet Oncol., vol. 12, no. 8, pp. 735–742, 2011.
[31] R. Rosell, E. Carcereny, R. Gervais, A. Vergnenegre, B. Massuti, E. Felip, R. Palmero, R. Garcia-Gomez, C. Pallares, J. M. Sanchez, R. Porta, M. Cobo, P. Garrido, F. Longo, T. Moran, A. Insa, F. De Marinis, R. Corre, I. Bover, A. Illiano, E. Dansin, J. de Castro, M. Milella, N. Reguart, G. Altavilla, U. Jimenez, M. Provencio, M. A. Moreno, J. Terrasa, J. Muñoz-Langa, J. Valdivia, D. Isla, M. Domine, O. Molinier, J. Mazieres, N. Baize, R. Garcia-Campelo, G. Robinet, D. Rodriguez-Abreu, G. Lopez-Vivanco, V. Gebbia, L. Ferrera-Delgado, P. Bombaron, R. Bernabe, A. Bearz, A. Artal, E. Cortesi, C. Rolfo, M. Sanchez-Ronco, A. Drozdowskyj, C. Queralt, I. de Aguirre, J. L. Ramirez, J. J. Sanchez, M. A. Molina, M. Taron, and L. Paz-Ares, "Erlotinib versus standard chemotherapy as first-line treatment for European patients with advanced EGFR mutation-positive non-small-cell lung cancer (EURTAC): A multicentre, open-label, randomised phase 3 trial," Lancet Oncol., vol. 13, pp. 239–246, 2012.



[32] G. Gao, S. Ren, A. Li, J. Xu, Q. Xu, C. Su, J. Guo, Q. Deng, and C. Zhou, "Epidermal growth factor receptor-tyrosine kinase inhibitor therapy is effective as first-line treatment of advanced non-small-cell lung cancer with mutated EGFR: A meta-analysis from six phase III randomized controlled trials," Int. J. Cancer, vol. 131, no. 5, pp. E822–E829, 2012.
[33] M. Fukuoka, S. Yano, G. Giaccone, T. Tamura, K. Nakagawa, J. Y. Douillard, Y. Nishiwaki, J. Vansteenkiste, S. Kudoh, D. Rischin, R. Eek, T. Horai, K. Noda, I. Takata, E. Smit, S. Averbuch, A. Macleod, A. Feyereislova, R. P. Dong, and J. Baselga, "Multi-institutional randomized phase II trial of gefitinib for previously treated patients with advanced non-small-cell lung cancer (The IDEAL 1 Trial) [corrected]," J Clin Oncol, vol. 21, no. 12, pp. 2237-46, Jun. 15, 2003.
[34] V. A. Miller, M. G. Kris, N. Shah, J. Patel, C. Azzoli, J. Gomez, et al., "Bronchioloalveolar pathologic subtype and smoking history predict sensitivity to gefitinib in advanced non–small-cell lung cancer," Journal of Clinical Oncology, vol. 22, no. 6, pp. 1103-1109, 2004.
[35] F. A. Shepherd, J. Rodrigues Pereira, T. Ciuleanu, E. H. Tan, V. Hirsh, S. Thongprasert, D. Campos, S. Maoleekoonpiroj, M. Smylie, R. Martins, M. van Kooten, M. Dediu, B. Findlay, D. Tu, D. Johnston, A. Bezjak, G. Clark, P. Santabarbara, L. Seymour, and the National Cancer Institute of Canada Clinical Trials Group, "Erlotinib in previously treated non-small-cell lung cancer," N Engl J Med, vol. 353, no. 2, pp. 123-32, Jul. 14, 2005.
[36] Y. L. Wu, C. Zhou, Y. Cheng, S. Lu, G. Y. Chen, C. ang, H. H. Yan, S. Ren, Y. Liu, and J. J. Yang, "Erlotinib as second-line treatment in patients with advanced non-small-cell lung cancer and asymptomatic brain metastases: a phase II study (CTONG-0803)," Ann Oncol, vol. 24, no. 4, pp. 993-9, Apr. 2013.
[37] M. A. Hamburg and F. S. Collins, "The path to personalized medicine," New England Journal of Medicine, vol. 363, no. 4, pp. 301-304, 2010.
[38] J. Beane, P. Sebastiani, T. H. Whitfield, K. Steiling, Y. M. Dumas, M. E. Lenburg, and A. Spira, "A prediction model for lung cancer diagnosis that integrates genomic and clinical features," Cancer Prev Res (Phila), vol. 1, no. 1, pp. 56-64, Jun. 2008.

