Accepted Manuscript

Patient Electronic Health Records as a Means to Approach Genetic Research in Gastroenterology

Ashwin N. Ananthakrishnan, MD, MPH, David Lieberman, MD

PII: S0016-5085(15)00822-7
DOI: 10.1053/j.gastro.2015.06.005
Reference: YGAST 59836

To appear in: Gastroenterology
Accepted Date: 1 June 2015

Please cite this article as: Ananthakrishnan AN, Lieberman D, Patient Electronic Health Records as a Means to Approach Genetic Research in Gastroenterology, Gastroenterology (2015), doi: 10.1053/j.gastro.2015.06.005.

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. All studies published in Gastroenterology are embargoed until 3PM ET of the day they are published as corrected proofs on-line. Studies cannot be publicized as accepted manuscripts or uncorrected proofs.


Patient Electronic Health Records as a Means to Approach Genetic Research in Gastroenterology

Ashwin N Ananthakrishnan, MD, MPH 1,2
David Lieberman, MD 3

1 Division of Gastroenterology, Massachusetts General Hospital, Boston, MA
2 Harvard Medical School, Boston, MA
3 Division of Gastroenterology and Hepatology, Oregon Health and Science University, Portland, OR

Short title: Patient EHR for genetic research

Word count: 1,693

Sources of Funding: A.N.A. is supported by funding from the US National Institutes of Health (K23 DK097142).

Financial conflicts of interest: Ananthakrishnan has received research grants from Cubist and Amgen, and has served on scientific advisory boards for Abbvie and Cubist.

Corresponding Author:
Ashwin N Ananthakrishnan, MD, MPH
Massachusetts General Hospital Crohn’s and Colitis Center
165 Cambridge Street, 9th Floor
Boston, MA 02114
Phone: 617-724-9953
Fax: 617-726-3080
Email: [email protected]


Abstract

Electronic health records (EHR) are being increasingly utilized and form a unique source of extensive data gathered during routine clinical care. Through use of codified and free text concepts identified using clinical informatics tools, disease labels can be assigned with a high degree of accuracy. Analysis linking such EHR-assigned disease labels to a biospecimen repository has demonstrated that genetic associations identified in prospective cohorts can be replicated with adequate statistical power, and novel phenotypic associations identified. In addition, genetic discovery research can be performed utilizing clinical, laboratory, and procedure data obtained during care. Challenges with such research include the need to tackle variability in the quality and quantity of EHR data and the importance of maintaining patient privacy and data security. With appropriate safeguards, this novel and emerging field of research offers considerable promise and potential to further scientific research in gastroenterology efficiently, cost-effectively, and with engagement of patients and communities.

Keywords: electronic health records; genetics; informatics; natural language processing


Electronic health records (EHR) are being increasingly adopted in the United States. Only 9% of hospitals had an EHR in 2008, growing more than five-fold to 59% in 2013 [1]. Thus far, such data have been used primarily for care of the individual patient. However, they can also be a powerful resource to advance science. Using examples, we discuss the feasibility, benefits, and challenges of using EHR data for genetic research related to gastroenterology.

Feasibility and Approach

Standardized genetic platforms and sequencing technologies, imputation, and streamlined analytic pipelines have facilitated pooling of genetic data across populations. However, such efforts have relied on carefully curated cohorts in which research teams manually identify eligible patients by reviewing individual charts, a process that is resource intensive and requires significant personnel support. In contrast, EHR-based disease cohorts can be assembled efficiently at a fraction of the effort. Can an EHR-defined cohort retain the level of accuracy required for genetic studies? The common practice of using administratively generated disease codes to define disease labels is susceptible to variability and often has low accuracy. However, the wealth of free text data present in the EHR can serve to increase confidence in the assignment of disease labels.

We present an example in which this clinical informatics approach was used successfully to define a cohort of over 11,000 patients with inflammatory bowel disease (IBD). Among all patients with at least 1 billing code for Crohn’s disease (CD) or ulcerative colitis (UC), chart review revealed a positive predictive value of only 60%, with frequent misclassification. Extraction of codified data ascertaining disease complications, together with narrative free text data comprising the number of mentions of individual disease names (“Crohn’s disease”) or disease-related terms in clinical notes (“abdominal pain”, “diarrhea”), radiology reports (“ileal wall thickening”), endoscopy reports (“ileitis”, “aphthous ulcer”), and pathology reports (“crypt abscess”), allowed development of a machine learning classification algorithm that achieved a positive predictive value of 97% [2]. The addition of free text data to codified information not only improved the accuracy of identifying cases, but also increased the number of patients who could be classified as having disease. Moreover, this approach allowed identification of disease phenotypes such as primary sclerosing cholangitis, whose ascertainment is limited by the lack of specific diagnostic codes or the frequent use of codes for competing diagnoses (for example, cholelithiasis) [3]; determination of disease activity in relapsing and remitting disorders [4]; and identification of response to treatment.

Natural language processing (NLP) software is now sophisticated enough to distinguish positive mentions (“has diarrhea”) from negative ones (“does not have diarrhea”), assign specific contexts to phrases (“abdominal pain” versus “joint pain”), separate personal from family history (“family history of colon cancer”), and search within specific components of a note (such as the indication for a procedure) [5]. Despite the inherent variability in structure and content of EHR data and differences in the quality of provider documentation across institutions, disease-defining algorithms created at one institution are portable to other institutions using distinct EHRs and retain their accuracy, which is key for multi-institutional consortia such as the Electronic Medical Records and Genomics (eMERGE) Network [6, 7].

When linked to genetic data, diseases defined by such EHR algorithms demonstrate effects similar to those from previously reported prospective cohorts, as shown by Ritchie et al. [8] and Kurreeman et al. [9]. Such analyses can also take advantage of routinely collected clinical or laboratory parameters, examining the association between genetics and these characteristics and allowing insights into disease pathophysiology. In one study, we compared IBD patients recruited into a prospective registry with those identified by clinical informatics tools applied to the EHR. Not only did patients in both groups have a similar genetic burden or distribution of individual risk alleles (suggesting comparability of the populations identified using the two approaches), but previously demonstrated genotype-phenotype associations could also be replicated. For example, homozygosity at the NOD2 locus showed a similar association with complicated CD whether defined traditionally by a physician-assigned Montreal classification phenotype (odds ratio [OR] 1.69, 95% CI 1.04 – 2.74) or by an automated phenotype defined using clinical informatics tools (> 1 billing code or narrative mention of fistulizing disease) (OR 1.72, 95% CI 1.05 – 2.84) [10]. Similar narrative text mining could be used to define response to biologic treatment using the number of mentions of diarrhea or other symptoms in the year following initiation of therapy.

Thus, in spite of the ‘non-research grade’ data captured by the EHR, generation of disease labels and their linkage to genetic data for meaningful discovery and replication research is not only feasible, but also accurate, efficient, and cost-effective. One can take advantage of clinical information routinely obtained during patient care, integrated with genotype data, to examine the genetic contribution to disease complications and response to (or side effects from) treatments.
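The two summary measures used in this section, positive predictive value and the odds ratio with its 95% confidence interval, can be illustrated with a minimal sketch. All counts below are hypothetical and are not taken from the cited studies. The Wald interval on the log-odds scale is one common way to obtain such intervals and is assumed here for illustration; the source does not state which method was used.

```python
import math


def positive_predictive_value(true_positives, false_positives):
    """Fraction of algorithm-labelled cases confirmed on chart review."""
    return true_positives / (true_positives + false_positives)


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: carriers with the phenotype      b: carriers without it
    c: non-carriers with the phenotype  d: non-carriers without it
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for a Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical chart review: 97 confirmed cases among 100 labelled by an algorithm.
ppv = positive_predictive_value(true_positives=97, false_positives=3)

# Hypothetical genotype-by-phenotype table (illustrative counts only).
or_, ci_low, ci_high = odds_ratio_wald_ci(a=20, b=80, c=10, d=90)
print(f"PPV = {ppv:.2f}; OR = {or_:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

Running the same function on a physician-assigned phenotype and on an automated phenotype, as in the NOD2 comparison above, would show whether the two definitions yield comparable effect estimates.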

Advantages

Traditional registries have focused on single diseases; EHR-based approaches allow similar clinical informatics methods to be applied to define many different diseases, streamlining the process and avoiding redundancy in effort and personnel [2, 11, 12].

A second benefit is that EHR-based approaches allow exploratory research into the various phenotypic manifestations of genetic polymorphisms using phenome-wide scan (PheWAS) approaches [13]. For example, Cronin et al. were able not only to replicate known associations between obesity, type 2 diabetes, and the fat mass and obesity associated gene (FTO), but also to identify novel hypothesis-generating associations with fibrocystic breast disease, non-alcoholic fatty liver disease, gram-positive bacterial infections, and chronic periodontitis [14]. Such analyses are not possible in cohorts where information is often gathered only on the disease of interest.

A third benefit is that EHR-based approaches allow examination of associations with parameters obtained routinely during clinical care but not usually measured in prospective cohorts. Such analyses have been as diverse as demonstrating the influence of genotype on ACPA positivity in rheumatoid arthritis [9], vitamin D levels in inflammatory bowel disease [10], erythrocyte sedimentation rate [15], and cardiac conduction abnormalities [16].

EHR-based research would also allow examination of clinical questions that are unlikely to be addressed in a clinical trial because of the large numbers required or the ethics of assigning patients to a standard care arm. Examples include comparative effectiveness studies using observational data obtained during routine care; narrative concepts identified by NLP (for example, the frequency of terms such as diarrhea, pain, or fatigue indicating disease activity) can be used to define subjective, dynamic disease states such as non-response to biologic therapy in IBD.
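Counting symptom terms in narrative notes while ignoring negated mentions can be sketched as follows. This is a deliberately simplified, rule-based illustration and not the NLP system used in the cited studies. The cue list, symptom list, and four-word look-back window are assumptions chosen for the example, and real clinical NLP tools apply far richer rules.

```python
import re

# Hypothetical, intentionally small lists used only for illustration.
NEGATION_CUES = ("no", "not", "denies", "without", "negative for")
SYMPTOM_TERMS = ("diarrhea", "abdominal pain", "fatigue")


def count_affirmed_mentions(note, terms=SYMPTOM_TERMS, window=4):
    """Count mentions of each term not preceded by a negation cue.

    A mention is treated as negated when a cue appears within `window`
    words before it in the same sentence.
    """
    counts = {term: 0 for term in terms}
    # Split on simple sentence delimiters so cues do not carry across sentences.
    for sentence in re.split(r"[.;\n]+", note.lower()):
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, sentence):
                preceding = " ".join(sentence[:match.start()].split()[-window:])
                negated = any(
                    re.search(r"\b" + re.escape(cue) + r"\b", preceding)
                    for cue in NEGATION_CUES
                )
                if not negated:
                    counts[term] += 1
    return counts


example_note = (
    "Patient has diarrhea and fatigue. Does not have abdominal pain. "
    "Denies diarrhea today; reports abdominal pain after meals."
)
print(count_affirmed_mentions(example_note))
```

Summing such counts over the notes written in a defined period, for example the year after a biologic is started, gives one possible numeric input for defining a dynamic state such as non-response.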

Barriers

In spite of the considerable potential offered by EHR-based research approaches, several barriers exist.

(i) Provider Barriers: Some genetic research, for example in hereditary colorectal cancer, depends on a complete family history. As many as 40% of patients younger than age 50 years had never been asked about their family history, and nearly 50% of those with a strong family history did not know that they needed screening at a younger age [17, 18]. Family history, when obtained, may not be regularly updated as new relatives develop hereditary conditions. Structured data forms for obtaining relevant family history may need to be integrated into EHRs to enable data capture.

(ii) System Barriers: As patients transition from one health care system to another, there may not be a smooth transition of data. Health care records from prior systems may be scanned as text or image fields, be inaccessible as data fields, and not be captured by NLP. Seamless transition of data between different EHRs is essential to minimize such data gaps. When such gaps exist, the provider may often rely on hearsay evidence from the patient without confirmation, resulting in data which may not be accurate.

(iii) EHR Barriers: EHRs may contain incorrect information, and these errors may be perpetuated by a cut/paste culture. Suspected but unconfirmed diagnoses may be included in problem lists, and prescribed medications may be counted even though the patient never filled the prescription at the pharmacy. Data may be incomplete because medical events occurring outside of the health care system may not be captured. Additionally, biologically important parameters such as date of diagnosis, severity, and extent of disease may be poorly noted in the EHR. Dynamically changing disease states such as activity and response to treatment remain challenging to capture in the EHR. Pathology reports may lack key information or not accurately capture changes in definition over time, as with serrated polyps.

(iv) Data Security Barriers: Patient privacy and data protection are major concerns for EHR-linked research. These concerns also hinder integration of EHR data with genetic data, which may potentially facilitate identification of individuals, particularly those with rare phenotypes. It is essential to ensure that de-identification and security of data continue to receive top priority, particularly as various cloud-based approaches may be essential to manage this ‘big data’.
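One small building block of de-identification, replacing a record number with a keyed pseudonym and dropping direct identifiers before linkage, might look like the sketch below. The field names (`mrn`, `name`, `address`, `phone`) and the 16-character pseudonym length are hypothetical. This step alone does not amount to full de-identification: it does not address free text, dates, rare phenotypes, or the re-identification risk carried by genotype data itself.

```python
import hashlib
import hmac


def pseudonymize_id(patient_id, secret_key):
    """Map an identifier to a stable pseudonym using HMAC-SHA256.

    The same identifier and key always give the same pseudonym, so records
    can still be linked, but the mapping cannot be recomputed without the key.
    """
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def pseudonymize_record(record, secret_key, id_field="mrn",
                        drop_fields=("name", "address", "phone")):
    """Return a copy of the record without direct identifiers.

    Field names are illustrative assumptions, not a standard schema.
    """
    cleaned = {
        key: value
        for key, value in record.items()
        if key != id_field and key not in drop_fields
    }
    cleaned["pseudo_id"] = pseudonymize_id(str(record[id_field]), secret_key)
    return cleaned


# Hypothetical record and key; a real key would be generated and stored securely.
key = b"replace-with-a-securely-stored-secret"
record = {"mrn": "0012345", "name": "Jane Doe", "phone": "555-0100",
          "diagnosis": "Crohn's disease"}
print(pseudonymize_record(record, key))
```

Keeping the key within the originating institution is one possible design choice: clinical and genotype records can then be joined on `pseudo_id` without the research dataset carrying the original record number.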


Future Directions

In summary, use of patient-derived EHR data to facilitate genetic research has hitherto been underutilized in the field of gastroenterology but offers enormous promise and potential. One can readily envision this approach being applicable across a wide swath of diseases relevant to gastroenterology, including colorectal polyps, gastrointestinal cancers, celiac disease, eosinophilic esophagitis, microscopic colitis, Barrett’s esophagus, and liver disease. All these diseases have in common the varying (and often poor) accuracy of existing diagnoses based on administrative coding, but they can be readily identified in the EHR using data (serology, pathology, endoscopy) that are a routine part of clinical care and can be mined using clinical informatics tools (Figure 1). Linkage of such disease registries to biobanked genotyped samples, with appropriate data protection and de-identification, can be enormously valuable in advancing scientific discovery.

Finally, EHRs should be subject to quality assurance metrics, just as physicians are. Experts in informatics should draft a manifesto of quality measures for the EHR. At the very least, these measures should include the standardization of reporting methods and attributes, and the ability to receive structured data from outside sources. This latter plug-in capability would not only enhance the quality of care for the individual patient, but also enable genetic research in that patient’s condition.

The past decade has seen tremendous progress in making genetic research efficient and cost-effective on a very large scale. It is necessary to match this progress with corresponding advances in our application of clinical informatics and data processing tools to efficiently and accurately distil the enormous amounts of data being generated daily as part of clinical care. This integration will allow us to further discovery research.


REFERENCES

1. Charles D, King J, Patel V, et al. Adoption of Electronic Health Record Systems among U.S. Non-federal Acute Care Hospitals: 2008-2012. ONC Data Brief, no 9. Washington, DC: Office of the National Coordinator for Health Information Technology, March 2013.
2. Ananthakrishnan AN, Cai T, Savova G, et al. Improving case definition of Crohn's disease and ulcerative colitis in electronic medical records using natural language processing: a novel informatics approach. Inflamm Bowel Dis 2013;19:1411-20.
3. Ananthakrishnan AN, Cagan A, Gainer VS, et al. Mortality and extraintestinal cancers in patients with primary sclerosing cholangitis and inflammatory bowel disease. J Crohns Colitis 2014;8:956-63.
4. Lin C, Karlson EW, Canhao H, et al. Automatic prediction of rheumatoid arthritis disease activity from the electronic medical records. PLoS One 2013;8:e69932.
5. Hou JK, Chang M, Nguyen T, et al. Automated identification of surveillance colonoscopy in inflammatory bowel disease using natural language processing. Dig Dis Sci 2013;58:936-41.
6. Carroll RJ, Thompson WK, Eyler AE, et al. Portability of an algorithm to identify rheumatoid arthritis in electronic health records. J Am Med Inform Assoc 2012;19:e162-9.
7. Kho AN, Pacheco JA, Peissig PL, et al. Electronic medical records for genetic research: results of the eMERGE consortium. Sci Transl Med 2011;3:79re1.
8. Ritchie MD, Denny JC, Crawford DC, et al. Robust replication of genotype-phenotype associations across multiple diseases in an electronic medical record. Am J Hum Genet 2010;86:560-72.
9. Kurreeman F, Liao K, Chibnik L, et al. Genetic basis of autoantibody positive and negative rheumatoid arthritis risk in a multi-ethnic cohort derived from electronic health records. Am J Hum Genet 2011;88:57-69.
10. Ananthakrishnan AN, Cagan A, Cai T, et al. Common genetic variants influence circulating vitamin D levels in inflammatory bowel diseases. Inflamm Bowel Dis 2015 (in press).
11. Liao KP, Cai T, Gainer V, et al. Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care Res (Hoboken) 2010;62:1120-7.
12. Xia Z, Secor E, Chibnik LB, et al. Modeling disease severity in multiple sclerosis using electronic health records. PLoS One 2013;8:e78927.
13. Denny JC, Ritchie MD, Basford MA, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics 2010;26:1205-10.
14. Cronin RM, Field JR, Bradford Y, et al. Phenome-wide association studies demonstrating pleiotropy of genetic variants within FTO with and without adjustment for body mass index. Front Genet 2014;5:250.
15. Kullo IJ, Ding K, Shameer K, et al. Complement receptor 1 gene variants are associated with erythrocyte sedimentation rate. Am J Hum Genet 2011;89:131-8.
16. Ritchie MD, Denny JC, Zuvich RL, et al. Genome- and phenome-wide analyses of cardiac conduction identifies markers of arrhythmia risk. Circulation 2013;127:1377-85.
17. Ait Ouakrim D, Lockett T, Boussioutas A, et al. Screening participation for people at increased risk of colorectal cancer due to family history: a systematic review and meta-analysis. Fam Cancer 2013;12:459-72.
18. Fletcher RH, Lobb R, Bauer MR, et al. Screening patients with a family history of colorectal cancer. J Gen Intern Med 2007;22:508-13.


Figure 1: Algorithm for use of electronic health record data for genetic research
