Comparing the Value of Mammographic Features and Genetic Variants in Breast Cancer Risk Prediction

Yirong Wu, PhD1, Jie Liu, MS1, David Page, PhD1, Peggy Peissig, PhD2, Catherine McCarty, PhD3, Adedayo A. Onitilo, MD, MSCR, FACP2,4,5, and Elizabeth S. Burnside, MD, MPH, MS1

1 University of Wisconsin, Madison, WI, USA; 2 Marshfield Clinic Research Foundation, Marshfield, WI, USA; 3 Essentia Institute of Rural Health, Duluth, MN, USA; 4 Department of Hematology/Oncology, Marshfield Clinic Weston Center, Weston, WI, USA; 5 School of Population Health, University of Queensland, Brisbane, Australia

Abstract
The goal of this study was to compare the value of mammographic features and genetic variants for breast cancer risk prediction with Bayesian reasoning and information theory. We conducted a retrospective case-control study, collecting mammographic findings and high-frequency/low-penetrance genetic variants from an existing personalized medicine data repository. We trained and tested Bayesian networks for mammographic findings and genetic variants, respectively. We found that mammographic findings had a higher discriminative ability than genetic variants for improving breast cancer risk prediction in terms of the area under the ROC curve. We compared the value of each mammographic feature and genetic variant for breast cancer risk prediction in terms of mutual information, with and without consideration of interactions among those risk factors. We also identified the interactions between mammographic features and genetic variants in an attempt to prioritize mammographic features and genetic variants to efficiently predict the risk of breast cancer.

Introduction
Technological advances in genome-wide association studies (GWAS) and successes with cost reduction in genome sequencing have engendered optimism that we have entered a new age of precision medicine [1], in which the risk of breast cancer can be predicted on the basis of a person's genetic variants. More recently, however, this optimism has been tempered by disappointment and caution [2, 3]. It is now widely agreed that phenotypic data, in concert with genetic variants, will likely be necessary for advancing personalized breast cancer risk prediction. The availability of imaging findings acquired from mammography provides the opportunity to combine mammographic features and genetic variants to improve risk prediction [4, 5]; however, the comparative value that mammographic features and genetic variants provide for advancing risk prediction is unknown.

The potential to quantify the value of, and to prioritize, phenotypic data (mammographic features) and genotypic data (germline genetic variants) for clinical decision-making presents an exciting opportunity to explore feature ranking algorithms for optimizing breast cancer risk prediction. Mutual information analysis has been widely used to rank variables by quantifying the information that each variable provides for estimating the outcomes of interest [6, 7]. Prior studies have explored mutual information for genome-wide data analysis to discover associations between single nucleotide polymorphisms (SNPs) and disease; however, few have considered interactions between risky SNPs [8]. Many believe that epistatic interactions of SNPs are important in determining susceptibility to breast cancer as well as disease mechanism [2, 9]. Hence, recent studies propose to utilize mutual information analysis for joint analysis of multiple SNPs that are potentially associated with breast disease [10].
Coincident with those studies, mutual information analysis has also been used to identify diagnostically important mammographic features. Most of these studies selected only the top-ranked features without considering interactions among features [11]. Recently, investigators have used multidimensional mutual information analysis to rank mammographic features for feature selection while accounting for interactions among them [12]. Overall, mutual information analysis has been successfully utilized to rank either SNPs or mammographic features for breast cancer risk estimation. However, to our knowledge, few studies have attempted to select the most important risk factors from a combination of mammographic features and genetic variants, and fewer have investigated interactions between mammographic features and genetic variants. In this study, we aim to compare the values of mammographic features and SNPs for selecting the most informative risk factors in the diagnosis of breast cancer. We use mutual information analysis and Bayesian reasoning, considering interactions among risk factors, to select the most valuable mammographic features and SNPs in breast cancer risk estimation.

Materials and Methods
The Marshfield Clinic Institutional Review Board approved the use of Marshfield Clinic's Personalized Medicine Research Project (PMRP) cohort in the research.

Subjects
The PMRP cohort, details of which have been previously published [13], was used in this study. To summarize, Marshfield Clinic patients residing in one of 19 ZIP code areas surrounding Marshfield, WI, and aged 18 years or older were invited to participate in PMRP. Written informed consent was obtained from each participant along with a blood sample from which DNA, plasma and serum were extracted and stored. Permission was given by each participant to link their electronic health record information to biological samples for use in the research. We used PMRP to select subjects with an available DNA sample, a mammogram, and a breast biopsy within 12 months after the mammogram from the Marshfield Clinic Data Warehouse. We employed a retrospective case-control study design. Cases were defined as women having a confirmed diagnosis of breast cancer obtained from the cancer registry, which includes either invasive breast cancer (ductal and lobular) or ductal carcinoma in situ. Controls were determined through the electronic medical records (and absence from the cancer registry) as never having had a breast cancer diagnosis. To construct case and control cohorts that were similar in age distribution, we employed an age matching strategy to ensure that the age of the matched control was within 5 years of the case.

Mammography Features
The American College of Radiology developed the Breast Imaging Reporting and Data System (BI-RADS) lexicon to standardize mammographic findings and recommendations [14]. The BI-RADS lexicon consists of a number of mammographic features, including the characteristics of masses and micro-calcifications, breast composition and other associated findings, which can be organized in a hierarchy (Figure 1).
In Marshfield Clinic's electronic health record, mammographic findings including breast composition were described in the BI-RADS lexicon and embedded in free-text clinical reports, from which we extracted 46 mammographic features using a parser [15].

Genetic Variants
Our study focused on high-frequency/low-penetrance genetic variants that affect breast cancer risk, as opposed to low-frequency genetic variants with high penetrance (BRCA1 and BRCA2) or intermediate penetrance (CHEK-2). In clinical practice, individuals with BRCA1 and BRCA2 mutations, who demonstrate a high risk of breast cancer, are managed with more intensive screening and have options for chemoprevention. Our study was designed for normal-risk individuals, for whom recommendations on screening and chemoprevention options are less clear. We included 22 genetic variants that have been identified by recent large-scale genome-wide association studies [4] (Table 1). The SNPs used in the Gail model [16, 17] and in the study by Wacholder et al. [18] were included in our study. When we built the models with the genetic variants, we coded each genetic variant as whether the subject carries the minor allele, rather than the specific genotype the subject carries.
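The carrier coding described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, the two-letter genotype strings, and the example subject are assumptions made for this sketch; only the SNP IDs and their minor alleles are taken from Table 1.

```python
def carries_minor_allele(genotype: str, minor_allele: str) -> int:
    """Return 1 if at least one allele in the genotype is the minor allele, else 0.

    The two-letter genotype string (e.g. "TC") is an assumed input format.
    """
    return int(minor_allele.upper() in genotype.upper())


# Minor alleles for two SNPs, as listed in Table 1.
minor_alleles = {"rs11249433": "C", "rs2981582": "T"}

# Hypothetical genotype calls for one subject (illustrative values only).
subject = {"rs11249433": "TC", "rs2981582": "CC"}

# Binary carrier status: heterozygous and homozygous carriers are both coded 1.
coded = {
    snp: carries_minor_allele(subject[snp], allele)
    for snp, allele in minor_alleles.items()
}
print(coded)  # {'rs11249433': 1, 'rs2981582': 0}
```

Under this coding each SNP becomes a binary variable, so heterozygous and homozygous carriers are not distinguished.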

[Figure 1: hierarchy diagram of mammographic features (image not reproduced). Mass: density (high, equal, low, fat); margins (circumscribed, microlobulated, obscured, indistinct, spiculated); shape (round, oval, lobular, irregular); size. Calcifications: morphology (coarse/popcorn, milk of calcium, rod-like, eggshell/rim, dystrophic, lucent-centered, skin, round, punctate, amorphous, pleomorphic, fine linear, vascular, suture); distribution (clustered, linear, segmental, regional, scattered). Associated findings: skin thickening, skin retraction, nipple retraction, trabecular thickening, skin lesion, axillary adenopathy. Special cases: lymph node, focal asymmetry, tubular density, asymmetric tissue. Other nodes: architectural distortion, breast composition, palpable*.]

* represents predictive features not included in BI-RADS

Figure 1. Mammographic features adopted from BI-RADS lexicon

Table 1. SNPs evaluated for breast cancer risk

SNP ID       Chromosome   Minor allele      SNP ID       Chromosome   Minor allele
rs11249433   1            C                 rs2046210    6            T
rs4666451    2            A                 rs13281615   8            G
rs13387042   2            G                 rs2981582    10           T
rs1045485    2            C                 rs3817198    11           C
rs17468277   2            T                 rs2107425    11           T
rs4973768    3            T                 rs6220       12           G
rs10941679   5            G                 rs999737     14           T
rs981782     5            G                 rs3803662    16           T
rs30099      5            T                 rs8051542    16           T
rs889312     5            C                 rs12443621   16           G
rs2180341    6            G                 rs6504950    17           A

Mutual Information
Originating from Shannon's information theory [7], the mutual information (MI) of a variable v1 with respect to another variable v2 is defined as the amount by which the uncertainty of v1 is decreased by the knowledge that v2 provides. The initial uncertainty of v1 is quantified by the entropy H(v1). The average uncertainty of v1 given knowledge of v2 is the conditional entropy H(v1|v2). The difference between the initial entropy and the conditional entropy therefore represents the MI of v1 with respect to v2. MI is defined as follows:

\[ \mathrm{MI}(v_1; v_2) = H(v_1) - H(v_1 \mid v_2) = \sum_{v_1} \sum_{v_2} p(v_1, v_2) \log \frac{p(v_1, v_2)}{p(v_1)\, p(v_2)} \]

where p(v1) and p(v2) are the marginal probabilities of v1 and v2, and p(v1, v2) is their joint probability. In what follows, we use MI(x1; x2) to denote the information value that one risk factor x1 (either a mammographic feature or a SNP) provides for estimating another risk factor x2, which quantifies the interaction between them. We use single-dimensional mutual information SMI(x; y) to denote the information that one risk factor x provides for estimating the outcome y (breast cancer):

\[ \mathrm{SMI}(x; y) = H(x) - H(x \mid y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \]

where x is one of the risk factors and y is the outcome. SMI does not take into account interaction effects among risk factors.
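As a rough illustration of the definitions above, the sketch below estimates MI from paired samples by plugging empirical frequencies into the double sum, then ranks a few toy risk factors by SMI with the outcome. The function name, the toy data, and the use of base-2 logarithms (bits) are assumptions made for this example; the study does not state a log base here.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Empirical mutual information (in bits) of two equal-length discrete sequences."""
    n = len(a)
    joint = Counter(zip(a, b))   # counts of each (a, b) pair
    count_a = Counter(a)         # marginal counts of a
    count_b = Counter(b)         # marginal counts of b
    mi = 0.0
    for (va, vb), count in joint.items():
        p_ab = count / n
        p_a = count_a[va] / n
        p_b = count_b[vb] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Toy binary outcome (1 = case, 0 = control) and three hypothetical risk factors.
outcome = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "feature_a": [1, 1, 1, 1, 0, 0, 0, 0],  # identical to the outcome
    "feature_b": [1, 1, 0, 0, 1, 1, 0, 0],  # independent of the outcome
    "feature_c": [1, 1, 1, 0, 1, 0, 0, 0],  # partially informative
}

# SMI of each risk factor with the outcome, ignoring interactions among factors.
smi = {name: mutual_information(values, outcome) for name, values in features.items()}
ranking = sorted(smi, key=smi.get, reverse=True)
print(ranking)  # ['feature_a', 'feature_c', 'feature_b']
```

The same function applied to two risk factors, for example `mutual_information(features["feature_a"], features["feature_c"])`, gives the pairwise MI(x1; x2) used to quantify their interaction.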

We use multidimensional mutual information MMI(x; y) to denote the information that one risk factor x provides for estimating the outcome y when interaction with other risk factors is considered. We assess MMI by an algorithm that minimizes Redundancy among risk factors while Maximizing Relevance to the outcome (mRMR) [19-22]. "Redundancy" is related to the MI of risk factors with each other, and "relevance" is defined as the SMI of risk factors with the outcome. Specifically, in this study, to rank the most important risk factors with the mRMR algorithm, we choose the risk factor with the highest SMI as the most important one. We then select subsequent risk factors sequentially, such that each one simultaneously maximizes its SMI and minimizes the MI between itself and the already selected risk factors. That is, we choose the next most important risk factor x_i, i = 2, 3, 4, ..., that maximizes

\[ \mathrm{SMI}(x_i; y) - \frac{1}{i-1} \sum_{j<i} \mathrm{MI}(x_i; x_j) \]

[...] 0.01), which is in concert with previous studies [4]. This sporadic association may be caused by several factors. First, GWAS has analyzed hundreds of thousands of SNPs to determine whether they are associated with breast cancer. Our study has used a very limited set of SNPs. Including a larger set of SNPs is crucial for the success of radiogenomic analysis. Second, our study considers association between one SNP and one mammographic feature only. More complex interactions (mammographic feature with mammographic feature, and SNP with SNP) will likely be a fruitful area of future research. Third, other imaging modalities such as MRI have proven to be more sensitive than mammography [31] or convey functional information (rather than anatomic data). It would be interesting to introduce imaging features collected from MRI into our future radiogenomic studies.

Limitations and Future Work
There are several limitations to our study.
First, the sample size is small compared with large-scale genome-wide association studies, due to the inherent difficulty of collecting a rich multi-modality dataset. Second, our study focused on the predictive accuracy associated with risk factors but did not consider the benefits and costs related to the decision. We plan to extend our study in this direction soon, since cost-effectiveness analysis allows physicians and policymakers to compare the health gains that different choices of the most important mammographic features and genetic variants can achieve. Third, our study used Bayesian networks to assess the ranking results of mutual information analysis. A possible line of future research is to employ other prediction algorithms such as logistic regression, artificial neural networks, or support vector machines to validate our results. Finally, we used AUC to measure the performance of our Bayesian networks. The AUC evaluates the performance of a prediction method over the full range of possible threshold levels. In practice, however, only a limited range or a single optimal threshold level may be of interest clinically. We plan to extend our study by determining the optimal threshold level to measure the predictive performance of our Bayesian networks in terms of sensitivity and specificity.
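The distinction drawn above between AUC and a single operating point can be illustrated with a small sketch. The scores, labels, threshold of 0.5, and function names are hypothetical and chosen only for illustration; they are not outputs of the study's Bayesian networks.

```python
def auc(scores, labels):
    """AUC as the probability that a random case outscores a random control.

    Ties count as half a win (the Mann-Whitney formulation of the AUC).
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predicted risks for four cases (1) and four controls (0).
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

overall = auc(scores, labels)                             # threshold-free summary
sens, spec = sensitivity_specificity(scores, labels, 0.5) # one operating point
print(overall, sens, spec)  # 0.875 0.75 0.75
```

The AUC summarizes ranking quality across all thresholds, whereas sensitivity and specificity describe the single threshold that would actually be applied in a clinical setting.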

Conclusion
Our study represents one of the first explorations of breast cancer risk prediction using genetic polymorphisms in combination with mammographic features. We find that genetic risk factors improve risk prediction to a statistically significant degree, which raises the possibility that stratification based on these risk factors may provide an opportunity to personalize care in clinical practice. This work confirms that disease prediction that is narrowly focused on one data type (e.g., genomics) may miss opportunities for improved performance offered by the incorporation of phenotypic data (e.g., mammographic features). We demonstrate that genetic risk factors can be combined and tuned with clinical imaging findings for better predictive performance. Moreover, by considering interactions among risk factors, MMI outperforms SMI in determining the smallest set of informative risk factors. In applications where the addition of risk factors incurs additional time or monetary cost, MMI may help reduce the cost of diagnostic testing. Encouraged by these promising results, we plan to further explore genotype/phenotype associations to shed light on disease processes that may, in the future, improve diagnosis and treatment.

Acknowledgements
The authors gratefully acknowledge the support of the Wisconsin Genomics Initiative, NCI grant R01CA127379-01 and its ARRA supplement 3R01CA127379-03S1, NIGMS grant R01GM097618-01, NLM grant R01LM011028-01, NIEHS grant 5R01ES017400-03, the UW Institute for Clinical and Translational Research (ICTR) and the UW Carbone Cancer Center.

References
1. Collins FS, Green ED, Guttmacher AE, Guyer MS. A vision for the future of genomics research. Nature. 2003;422.
2. Devilee P, Rookus MA. A tiny step closer to personalized risk prediction for breast cancer. N Engl J Med. 2010 Mar 18;362(11):1043-5.
3. Pharoah PD, Antoniou AC, Easton DF, Ponder BA. Polygenes, risk prediction, and targeted prevention of breast cancer. N Engl J Med. 2008 Jun 26;358(26):2796-803.
4. Liu J, Page D, Nassif H, Shavlik J, Peissig P, McCarty C, Onitilo AA, Burnside ES. Genetic variants improve breast cancer risk prediction on mammograms. Proceedings of the American Medical Informatics Association Symposium (AMIA); 2013; Washington, DC.
5. Liu J, Page D, Peissig P, McCarty C, Onitilo AA, Trentham Dietz A, Burnside ES. New genetic variants improve personalized breast cancer diagnosis. AMIA Summit on Translational Bioinformatics (AMIATBI); 2014; San Francisco, CA.
6. Benish W. Mutual information as an index of diagnostic test performance. Methods of Information in Medicine. 2003;42(3):260-4.
7. Shannon C, Weaver W. The Mathematical Theory of Communication. Urbana, IL: University of Illinois Press; 1949.
8. Yuan X, Zhang J, Wang Y. Mutual information and linkage disequilibrium based SNP association study by grouping case-control. Genes & Genomics. 2011;33:65-73.
9. Briollais L, Wang Y, Rajendram I, Onay V, Shi E, Knight J, Ozcelik H. Methodological issues in detecting gene-gene interactions in breast cancer susceptibility: a population-based study in Ontario. BMC Med. 2007;5:22.
10. Anunciacao O, Vinga S, Oliveira AL. Using information interaction to discover epistatic effects in complex disease. PLoS One. 2013;8(10):1-11.
11. Wu Y, Alagoz O, Ayvaci M, Munoz del Rio A, Vanness DV, Woods R, Burnside ES. A comprehensive methodology for determining the most informative mammographic features. J Digital Imaging. 2013;26(5):941-7.
12. Wu Y, Vanness DV, Burnside ES. Using multidimensional mutual information to prioritize mammographic features for breast cancer diagnosis. Proceedings of the American Medical Informatics Association Symposium (AMIA); 2013; Washington, DC.
13. McCarty CA, Wilke RA, Giampietro PF, Wesbrook SD, Caldwell MD. Marshfield Clinic Personalized Medicine Research Project (PMRP): design, methods and recruitment for a large population-based biobank. Personalized Med. 2005;2(1):49-79.
14. Breast Imaging Reporting And Data System (BI-RADS®). 4th ed. Reston VA: American College of Radiology; 2003.
15. Nassif H, Woods R, Burnside ES, Ayvaci M, Shavlik J, Page D. Information extraction for clinical data mining: a mammography case study. Proc IEEE Int Conf Data Min; 2009; Miami, Florida.
16. Gail MH. Discriminatory accuracy from single-nucleotide polymorphisms in models to predict breast cancer risk. J Natl Cancer Inst. 2008 Jul 16;100(14):1037-41.
17. Gail MH. Value of adding single-nucleotide polymorphism genotypes to a breast cancer risk model. J Natl Cancer Inst. 2009 Jul 1;101(13):959-63.
18. Wacholder S, Hartge P, Prentice R, Garcia-Closas M, Feigelson HS, Diver WR, Thun MJ, Cox DG, Hankinson SE, Kraft P, Rosner B, Berg CD, Brinton LA, Lissowska J, Sherman ME, Chlebowski R, Kooperberg C, Jackson RD, Buckman DW, Hui P, Pfeiffer R, Jacobs KB, Thomas GD, Hoover RN, Gail MH, Chanock SJ, Hunter DJ. Performance of common genetic variants in breast-cancer risk models. N Engl J Med. 2010 Mar 18;362(11):986-93.
19. Balagani K, Phoha V. On the feature selection criterion based on an approximation of multidimensional mutual information. IEEE Trans Pattern Analysis and Machine Intelligence. 2010;32(7):1342-3.
20. Battiti R. Using mutual information for selecting features in supervised neural net learning. IEEE Trans Neural Networks. 1994;5(4):537-50.
21. Ding C, Peng H, editors. Minimum redundancy feature selection from microarray gene expression data. Proc Second IEEE Computational Systems Bioinformatics; 2003.
22. Peng H, Long F, Ding C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Analysis and Machine Intelligence. 2005;27(8):1226-38.
23. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten I. The Weka data mining software: an update. SIGKDD Explorations. 2009;11(1).
24. Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers. Machine Learning. 1997;29:131-63.
25. DeLong E, DeLong D, Clarke-Pearson D. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837-45.
26. Mullin R. A shaky new age: researchers see a rocky path from genomics research to truly personalized medicines. Chemical & Engineering News. 2014:18-23.
27. Wang X, Zhang L, Chen Z, Ma Y, Zhao Y, Rewuti A, Zhang F, Fu D, Han Y. Association between 5p12 Genomic Markers and Breast Cancer Susceptibility: Evidence from 19 Case-Control Studies. PLoS ONE. 2013;8(9).
28. Gu C, Zhou L, Yu J. Quantitative assessment of 2q35-rs13387042 polymorphism and hormone receptor status with breast cancer risk. PLoS One. 2013;8(7).
29. He X, Yao G, Li F, Li M, Yang X. Risk-association of five SNPs in TOX3/LOC643714 with breast cancer in southern China. Int J Molecular Sciences. 2014;15:2130-41.
30. Rutman AM, Kuo MD. Radiogenomics: creating a link between molecular diagnostics and diagnostic imaging. Eur J Radiology. 2009;70:232-41.
31. Yamamoto S, Maki DD, Korn RL, Kuo MD. Radiogenomic analysis of breast cancer using MRI: a preliminary study to define the landscape. AJR Am J Roentgenol. 2012;199:654-63.
32. Kuo MD, Jamshidi N. Behind the numbers: decoding molecular phenotypes with radiogenomics-guiding principles and technical considerations. Radiology. 2014;270:320-5.
