Automatic construction of a large-scale and accurate drug-side-effect association knowledge base from biomedical literature.

Accepted Manuscript

Authors: Rong Xu, QuanQiu Wang
PII: S1532-0464(14)00138-5
DOI: http://dx.doi.org/10.1016/j.jbi.2014.05.013
Reference: YJBIN 2185
To appear in: Journal of Biomedical Informatics
Received Date: 9 February 2014
Accepted Date: 30 May 2014

Please cite this article as: Xu, R., Wang, Q., Automatic construction of a large-scale and accurate drug-side-effect association knowledge base from biomedical literature, Journal of Biomedical Informatics (2014), doi: http://dx.doi.org/10.1016/j.jbi.2014.05.013


Rong Xu (a,∗), QuanQiu Wang (b,∗)

(a) Medical Informatics Program, Center for Clinical Investigation, Case Western Reserve University, Cleveland, OH 44106
(b) ThinTek, LLC, Palo Alto, CA 94306

Abstract

Systems approaches to studying drug-side-effect (drug-SE) associations are emerging as an active research area for drug target discovery, drug repositioning, and drug toxicity prediction. However, currently available drug-SE association databases are far from complete. Herein, in an effort to increase the data completeness of current drug-SE relationship resources, we present an automatic learning approach to accurately extract drug-SE pairs from the vast amount of published biomedical literature, a rich knowledge source of side effect information for commercial, experimental, and even failed drugs. For the text corpus, we used 119,085,682 MEDLINE sentences and their parse trees. We used known drug-SE associations derived from US Food and Drug Administration (FDA) drug labels as prior knowledge to find relevant sentences and parse trees. We extracted syntactic patterns associated with drug-SE pairs from the resulting set of parse trees. We developed pattern-ranking algorithms to prioritize drug-SE-specific patterns. We then selected a set of patterns with both high precision and high recall in order to extract drug-SE pairs from the entire MEDLINE. In total, we extracted 38,871 drug-SE pairs from MEDLINE using the learned patterns, the majority of which have not been captured in FDA drug labels to date. On average, our knowledge-driven pattern-learning approach in extracting drug-SE pairs from MEDLINE achieved a precision of 0.833, a recall of 0.407, and an F1 of 0.545. We compared our approach to a support vector machine (SVM)-based machine learning approach and a co-occurrence statistics-based approach. We show that the pattern-learning approach is largely complementary to the SVM- and co-occurrence-based approaches, with significantly higher precision and F1 but lower recall. We demonstrated by correlation analysis that the extracted drug side effects correlate positively with drug targets, metabolism, and indications.

Keywords: text mining, drug side effect, drug discovery, drug repositioning, drug toxicity prediction

∗ Corresponding authors. Email addresses: [email protected] (Rong Xu), [email protected] (QuanQiu Wang)

Preprint submitted to Journal of Biomedical Informatics, June 6, 2014

1. Introduction

It has been increasingly recognized that similar side effects of seemingly unrelated drugs can be caused by their common off-targets and that drugs with similar side effects are likely to share molecular targets [4]. Therefore, systems approaches to studying the phenotypic relationships among drugs and integrating these drug phenotypic data with drug-related genetic, genomic, proteomic, and chemical data can facilitate rapid drug target discovery and drug repositioning.


Computational approaches to predicting drug targets have often been based on chemical similarity measures and docking strategies [15, 38, 28, 37]. Similarly, many computational strategies for drug repositioning have been explored [14]. The majority of these approaches leverage known drug properties such as chemical similarity [15], molecular activity similarity [20], molecular docking [17], and gene expression profile similarity [13, 24]. In a seminal paper, Campillos et al. [4] used phenotypic side-effect similarities among drugs to infer whether two drugs shared a target. The researchers showed that similar side effects of seemingly unrelated drugs can be caused by their common off-targets and that drugs with similar side effects are likely to share molecular targets. By systematically exploiting this correlation, the researchers were able to predict new targets for drugs, which led to a patented repositioning of an FDA-approved drug, aprepitant. However, their analysis was limited to drug-SE relationships derived solely from the FDA drug labels for FDA-approved drugs. Later in this study, we will show that much of the drug-SE association knowledge in the biomedical literature has not yet been captured in FDA drug labels. For example, only 23.8% (30 out of 126) of the side effects reported in MEDLINE for the drug irinotecan are in FDA drug labels.

Systems approaches to studying the phenotypic relationships among drugs can also lead to early prediction of unknown drug toxicity. Existing computational approaches to predicting drug toxicities are mainly based on known drug targets [9, 27] or chemical structures [22, 21]. The unifying assumption is that drugs with similar in vitro protein-binding profiles or chemical structures tend to exhibit similar side effects. Cami et al. have recently developed a novel phenotype-driven approach for predicting drug toxicities


by integrating the network structure formed by known drug-SE relationships with information on specific drugs and side effects to predict likely unknown drug side effects [3]. Atias et al. reported a novel approach to predicting drug side effects by taking into account other drugs and their side effects [2]. Similar to the study by Campillos et al., these two studies used drug-SE associations derived solely from FDA drug labels.

The availability of a comprehensive, accurate, and machine-understandable drug-side-effect (SE) relationship knowledge base is critical for computation-based drug target and drug toxicity predictions. Current drug phenotype-driven systems approaches rely exclusively on drug-SE associations extracted from FDA drug labels. However, there exists a large amount of drug-SE relationship knowledge in other data sources such as the FDA's spontaneous post-marketing drug safety reporting system (FAERS), patient electronic health records (EHRs), and the large body of published biomedical literature. While the FDA drug labels, FAERS, and EHRs mainly contain side effect information for FDA-approved drugs, the vast amount of published biomedical literature contains side effect information for drugs at all clinical stages, including investigational, approved, and failed drugs. Currently, more than 22 million biomedical abstracts (or records) are available in MEDLINE, making it a rich information source of drug-SE associations.

While many biomedical relationship extraction tasks have focused on extracting relationships between drugs, proteins, or genes [1, 29, 5, 30], extracting drug-SE relationships from MEDLINE has been less explored. The key issue in extracting drug-SE pairs from MEDLINE is to differentiate drug-SE (‘CAUSE’) relationships from drug-disease (‘TREAT’) relationships and from pure co-occurrences. Recently, Shetty et al. prototyped a process for applying information mining to discover major drug-SE associations from MEDLINE by developing a statistical document classifier to remove irrelevant articles [23]. Gurulingappa et al. trained and tested a supervised machine learning classifier to classify drug-condition pairs in a set of 2,972 manually annotated case reports [11]. Both studies focused on a limited set of drugs, side effects, or specific article types. It is unclear how these approaches can be scaled up to the whole of MEDLINE to build a comprehensive drug-SE relationship knowledge base.

Recently, we developed an approach to boosting drug safety signal detection from FAERS using evidence from MEDLINE [32]. We also developed an automatic approach to extract anticancer drug-specific side effects from MEDLINE by developing specific filtering and ranking schemes [31]. However, those filtering and ranking methods specifically targeted anticancer drug-SE relationships and cannot be generalized to other types of drug-SE relationship extraction tasks. In one of our recent studies on building a disease phenotype knowledge base, we developed a pattern-based learning approach to extract disease-manifestation phenotypic relationships from biomedical literature [33]. In that study, we leveraged external knowledge of disease-manifestation relationships from biomedical ontologies as prior knowledge to extract many disease-manifestation pairs from MEDLINE. In this study, we retarget the knowledge-driven pattern-learning approach to accurately extract a large number of drug-SE pairs from MEDLINE.

Our study is based on the observation that researchers often use certain predominant patterns to describe drug-SE-specific associations. For example,


in the sentence “UGT1A1*28 genotype and irinotecan-induced neutropenia: dose matters” (PMID: 17728214), a typical pattern “DRUG-induced SE” is used. In this study, we first automatically learned drug-SE-specific patterns using known drug-SE pairs derived from FDA drug labels. We then used these learned patterns to extract many additional drug-SE pairs from MEDLINE.

Since our ultimate goal is to develop systems approaches to exploit drug phenotype relationships for drug target discovery, drug toxicity prediction, and drug repositioning, accuracy and scalability of the algorithm are critical. Pattern-based relationship extraction approaches have the advantage of being both highly accurate and efficient. Many pattern- or parsing-based biomedical named entity recognition and relationship extraction approaches have recently been developed [7, 10, 16, 34, 35, 36]. Our approach differs from these approaches in that we leveraged a large amount of prior knowledge to find patterns with both high precision and high recall, and in that we applied the patterns to the whole of MEDLINE to construct a large-scale drug-SE relationship knowledge base.

2. Data and Methods

The entire experimental process consists of the following steps: (1) Build a local MEDLINE search engine; (2) Obtain known drug-SE pairs as prior knowledge; (3) Extract drug-SE pairs from MEDLINE; (4) Evaluate; and (5) Analyze the correlation between drug side effects and drug target genes, drug metabolism genes, and drug indications.


2.1. Build a local MEDLINE search engine

We downloaded a total of 21,354,075 MEDLINE abstracts (119,085,682 sentences) published between 1965 and 2011 from the U.S. National Library of Medicine (http://mbr.nlm.nih.gov/Download/index.shtml). Each sentence was syntactically parsed with the Stanford Parser [18] using the Amazon cloud computing service (a total of 3,500 instance-hours on High-CPU Extra Large instances). We used the publicly available information retrieval library Lucene (http://lucene.apache.org) to create a local MEDLINE search engine with indices built on both the sentences and their corresponding parse trees.

2.2. Extract known drug-SE pairs as prior knowledge

We downloaded a total of 100,049 known drug-SE pairs from SIDER (Side Effect Resource), a public, machine-readable side effect resource that was automatically constructed from FDA package inserts [19]. These pairs represented a total of 996 FDA-approved drugs and 4,199 adverse event terms.

2.3. Build clean SE lexicons

An accurate and comprehensive SE lexicon is critical for the task of drug-SE relationship extraction. We built a clean disease (or SE) lexicon by combining and cleaning all disease terms in UMLS and in the Human Disease Ontology (http://bioportal.bioontology.org/ontologies/1009). This clean lexicon has been used in our recent study of extracting disease-manifestation relationships from MEDLINE [33]. We also


created an SE lexicon based on the Medical Dictionary for Regulatory Activities (MedDRA, http://www.meddramsso.com), which has been used in one of our recent studies in building a cancer-drug-specific side effect knowledge base [32].

2.4. Build a drug lexicon

The drug lexicon was downloaded from DrugBank [26] and consisted of 6,516 drugs, including both FDA-approved drugs and experimental drugs. We used drug names from DrugBank instead of RxNORM or other sources because DrugBank contains many experimental drugs in addition to FDA-approved drugs.

2.5. Drug-SE relationship extraction

The knowledge-driven drug-SE relationship extraction system is depicted in Figure 1 and consists of: (1) Pattern Extraction, wherein syntactic patterns associated with known drug-SE pairs are extracted; (2) Pattern Ranking and Selection, wherein extracted patterns are ranked and drug-SE-specific patterns are selected from the top-ranked patterns; (3) Pair Extraction, wherein additional pairs associated with the selected patterns are extracted from MEDLINE; and (4) Pair Ranking, wherein newly extracted drug-SE pairs are ranked. The ranking score of each pair was based on the ranking scores of its associated patterns and its frequency in MEDLINE.

Figure 1: The drug-SE relationship extraction approach

2.5.1. Pattern Extraction

In order to find drug-SE-specific syntactic patterns, we used known drug-SE pairs as search queries to the local search engine. Sentences and their corresponding parse trees containing any known drug-SE pairs were retrieved. The pattern “NP1 pattern NP2”, where the noun phrase pair NP1-NP2 is a known drug-SE pair, was extracted from the retrieved parse trees. The pattern is “DRUG pattern SE” if a drug entity precedes an SE entity or “SE pattern DRUG” if the opposite is true. For example, from the sentence “UGT1A1*28 genotype and irinotecan-induced neutropenia: dose matters” (PMID: 17728214), we extracted the pattern “DRUG-induced SE” between irinotecan and neutropenia. From the sentence “Bradycardia induced by irinotecan: a case report” (PMID 09861240), the pattern “SE induced by DRUG” was extracted.

The extra requirement that both the drug and SE terms must be noun phrases in the retrieved sentences is to avoid extracting partial or incorrect drug-SE pairs. For example, without the NP restriction, the partial drug-SE pair “acetazolamide-necrosis” will be extracted from the sentence “The high alkalinity of the injection vehicle of certain parenteral solutions of acetazolamide produces necrosis of the skin upon

sc injection” (PMID 18809) since the input SE lexicon does not include the more complete SE term ‘necrosis of the skin.’ By requiring that both drug and SE terms must be noun phrases, no pair will be extracted from this sentence. This restriction helps to ensure high precision.

2.5.2. Pattern Ranking and Selection

Not all patterns associated with known drug-SE pairs are necessarily specific in describing drug-SE semantic relationships. In order to find patterns that are associated with many known drug-SE pairs (high recall) and are also specific in describing drug-SE relationships (high precision), we first ranked the extracted patterns by the number of their associated known drug-SE pairs. This ranking scheme assigned high ranks to patterns that were associated with many known pairs (high recall). However, not all patterns with high recall were drug-SE-specific patterns. For example, non-specific patterns such as “DRUG and SE” can be associated with many known drug-SE pairs as well as non-drug-SE pairs. From the top-ranked patterns, we manually selected drug-SE-specific syntactic patterns such as “DRUG-induced SE” and “DRUG SE”. The pattern ranking and manual selection processes ensured the high recall (top-ranked) and high precision (manually selected) of the selected patterns. Because the pattern ranking algorithm placed many drug-SE-specific patterns near the top of the list, the manual selection process required minimal human time to scan the top-ranked patterns (less than 15 minutes).
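The pattern extraction and ranking steps described above can be illustrated with a minimal sketch. It is a deliberate simplification and not the authors' implementation: it matches surface strings with regular expressions instead of querying parse trees, it does not enforce the noun-phrase restriction, and the function names `extract_patterns` and `rank_patterns` as well as the third example sentence are hypothetical.

```python
import re
from collections import defaultdict


def extract_patterns(sentences, known_pairs):
    """Collect the text linking a known drug and SE, keyed by pattern string.

    Assumption: known_pairs holds lowercase (drug, side_effect) tuples.
    Simplification: plain string matching, no parse trees or NP check.
    """
    pattern_to_pairs = defaultdict(set)
    for sentence in sentences:
        text = sentence.lower()
        for drug, se in known_pairs:
            d, s = re.escape(drug), re.escape(se)
            # Drug mentioned before the side effect: "DRUG pattern SE".
            m = re.search(rf"\b{d}\b(.{{1,40}}?)\b{s}\b", text)
            if m:
                pattern_to_pairs[f"DRUG{m.group(1)}SE"].add((drug, se))
            # Side effect mentioned before the drug: "SE pattern DRUG".
            m = re.search(rf"\b{s}\b(.{{1,40}}?)\b{d}\b", text)
            if m:
                pattern_to_pairs[f"SE{m.group(1)}DRUG"].add((drug, se))
    return pattern_to_pairs


def rank_patterns(pattern_to_pairs):
    """Rank patterns by the number of distinct known pairs they connect."""
    counts = [(p, len(pairs)) for p, pairs in pattern_to_pairs.items()]
    return sorted(counts, key=lambda item: (-item[1], item[0]))


sentences = [
    "UGT1A1*28 genotype and irinotecan-induced neutropenia: dose matters",
    "Bradycardia induced by irinotecan: a case report",
    # Hypothetical sentence, added only to give one pattern two pairs.
    "Cisplatin-induced nausea was reported in two patients",
]
known_pairs = [
    ("irinotecan", "neutropenia"),
    ("irinotecan", "bradycardia"),
    ("cisplatin", "nausea"),
]
ranked = rank_patterns(extract_patterns(sentences, known_pairs))
for pattern, n_pairs in ranked:
    print(pattern, n_pairs)
```

In this toy run, “DRUG-induced SE” ranks first because it links two known pairs. The manual selection step would then remove non-specific patterns such as “DRUG and SE” from the top of such a list.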


2.5.3. Pair Extraction

The selected drug-SE-specific syntactic patterns were used as search queries to the local MEDLINE search engine. Both sentences and parse trees that contained these patterns were retrieved. We extracted drug-SE pairs from the returned parse trees using the pattern format “NP1 pattern NP2,” wherein NP1 and NP2 were both noun phrases and the pattern was one of the selected patterns. In addition, NP1 and NP2 were required to be drug or SE terms from the input lexicons. Drug-SE pairs that appeared in MEDLINE but not in SIDER were considered additional drug-SE pairs discovered by the algorithm.

2.5.4. Pair Ranking

We ranked the extracted drug-SE pairs based on their associated pattern scores and on their co-occurrence frequencies. A reliable drug-SE pair is assumed to be associated with reliable patterns many times. The ranking score of a pair, RS(R), is defined as follows:

$$RS(R) = \sum_{i=0}^{n} \log(RS(P_i)) \cdot \mathrm{count}(P_i, R) \qquad (1)$$

Here RS(P_i) is the score of an associated pattern P_i, defined as the number of known drug-SE pairs from FDA drug labels that are associated with the pattern, and count(P_i, R) is the number of times that the pair is associated with the pattern in the entire MEDLINE corpus.

2.6. Evaluation

The main goal of this study is to extract additional pairs from MEDLINE that have not been captured in FDA drug labels. In order to evaluate a MEDLINE-based drug-SE relationship extraction algorithm, we manually created three MEDLINE-specific evaluation datasets. The first evaluation set consists of irinotecan-related side effects that have been reported in MEDLINE. The other two sets consist of drug-SE pairs reported in MEDLINE for “neutropenia” and “thrombocytopenia,” respectively. Irinotecan is a commonly used chemotherapy drug. Neutropenia and thrombocytopenia are hematologic side effects that are commonly associated with chemotherapy drugs and many other drugs.

To create an evaluation set of irinotecan-SE pairs, we first retrieved a total of 10,044 sentences containing the term “irinotecan.” Three curators manually extracted side effects in which irinotecan was implicated from the retrieved sentences. Note that the curators did not use any SE lexicons. Only the pairs agreed upon by all three curators were used. The resultant evaluation set consisted of 126 irinotecan-SE pairs. Similarly, we created an evaluation dataset of 236 drug-neutropenia pairs from 4,040 MEDLINE sentences that contain ‘neutropenia’ and an evaluation dataset of 319 drug-thrombocytopenia pairs from 16,618 sentences that contain ‘thrombocytopenia.’ The three evaluation datasets were created independently of the method we developed, and the evaluators were not aware of the patterns we used. However, due to the intensive manual curation effort required (more than 40 hours for each of the three curators), we only created three evaluation datasets (one drug, two SEs), which may not be representative of other drugs or SEs.

We used the learned patterns to extract drug-SE pairs from the same sets of sentences that contain irinotecan, neutropenia, and thrombocytopenia. We evaluated the precision, recall, and F1 using the manually created evaluation


datasets. Precision is the number of correct pairs divided by the number of all returned pairs, and recall is the number of correct pairs divided by the number of pairs that should have been returned. The F1 score is the harmonic mean of precision and recall and is defined as:

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (2)$$
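As a small worked illustration of these definitions, the following sketch computes precision, recall, and F1 for a set of extracted pairs against a gold-standard set. The function names and the toy pairs are hypothetical; only the final line uses figures reported in this paper (the combined irinotecan-SE precision and recall).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, as in Eq. (2)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(extracted, gold):
    """Score a collection of extracted pairs against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy data (hypothetical): 3 extracted pairs, 2 of them in a 4-pair gold set.
gold = {("drug_a", "nausea"), ("drug_a", "rash"),
        ("drug_b", "nausea"), ("drug_c", "fatigue")}
extracted = {("drug_a", "nausea"), ("drug_b", "nausea"), ("drug_c", "rash")}
p, r, f = precision_recall_f1(extracted, gold)
print(round(p, 3), round(r, 3), round(f, 3))

# Consistency check on reported values: precision 0.702 and recall 0.282.
print(round(f1_score(0.702, 0.282), 3))
```

The second print statement recomputes the F1 from the reported precision and recall of the combined irinotecan-SE patterns, which is a quick way to check table entries for arithmetic consistency.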

2.7. Comparison to the support vector machine (SVM), a supervised machine learning approach

Support vector machines are supervised machine learning models [6] and are commonly used for biomedical relationship extraction tasks, including drug-SE relationship extraction [11]. The SVM-based approach comprises two steps: sentence classification and drug-SE relationship extraction. We first classified MEDLINE sentences into drug-SE-related and unrelated sentences. For example, the sentence classifier's goal was to differentiate the SE-related sentence “Associations between UGT1A1*6 or UGT1A1*6/*28 polymorphisms and irinotecan-induced neutropenia in Asian cancer patients” from the non-SE-related sentence “Prospective study of paclitaxel and irinotecan for elderly patients with unresectable non-small cell lung cancer.” We then extracted drug-SE co-occurrence pairs from the positively classified sentences.

For sentence classification, we trained a two-class SVM-based sentence classifier using the implementation in Weka [12]. The SVM-based sentence classifier used a polynomial kernel, bag-of-words features, TF-IDF weighting, stemming, and stopword removal. Bag-of-words features were used since the appearance of one specific word such as ‘toxicity’ is often enough to determine whether a sentence is drug-SE-related. The positive training data

for the sentence classifier consisted of a total of 350,375 sentences, each of which contained at least one known drug-SE pair from SIDER. The negative training data comprised an equal number of sentences that were randomly selected from the rest of the MEDLINE sentences, with the assumption that the majority of MEDLINE articles are not related to drug side effect reporting. Ten-fold cross-validation was used in training the sentence classifier. We then used the trained SVM sentence classifier to classify the sentences that were used in the evaluation, namely the sentences containing the term ‘irinotecan’ and at least one SE, sentences containing the term ‘neutropenia’ and at least one drug, and sentences containing the term ‘thrombocytopenia’ and at least one drug. The accuracy of the sentence classifier in classifying these sentences into SE-related and SE-unrelated was 0.846, based on manual evaluation of 1,000 randomly selected sentences.

We extracted drug-SE co-occurrence pairs from the positively classified sentences and compared them to pairs extracted from unclassified sentences. Precision, recall, and F1 scores were used to evaluate the performance of drug-SE relationship extraction and were compared among three methods: the pattern-based approach, SVM (classified sentences), and co-occurrence (unclassified sentences). A one-tailed Z statistic was used to test significance.

2.8. Analyze the correlation between drug side effects and drug targets, metabolism and indications

2.8.1. Correlation with drug target genes

We extracted a total of 10,478 drug-gene pairs from DrugBank [26], representing 3,454 drugs and 1,816 genes. For drug-drug pairs that share SEs at different cutoffs, we calculated the average number of shared gene targets.
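The cutoff-based averaging described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `average_shared_targets`, the dictionary layout, and the toy drugs and genes are all hypothetical, and drugs with no gene annotation are assumed to contribute zero shared genes.

```python
from itertools import combinations


def average_shared_targets(drug_ses, drug_genes, cutoff):
    """Average number of shared genes over drug-drug pairs sharing >= cutoff SEs.

    drug_ses:   dict mapping drug name -> set of side effects
    drug_genes: dict mapping drug name -> set of target genes
    Assumption: a drug missing from drug_genes has an empty gene set.
    """
    shared_gene_counts = []
    for a, b in combinations(sorted(drug_ses), 2):
        if len(drug_ses[a] & drug_ses[b]) >= cutoff:
            genes_a = drug_genes.get(a, set())
            genes_b = drug_genes.get(b, set())
            shared_gene_counts.append(len(genes_a & genes_b))
    if not shared_gene_counts:
        return 0.0
    return sum(shared_gene_counts) / len(shared_gene_counts)


# Toy data (hypothetical drugs, side effects, and genes).
drug_ses = {
    "drug_a": {"nausea", "rash", "fatigue"},
    "drug_b": {"nausea", "rash"},
    "drug_c": {"headache"},
}
drug_genes = {
    "drug_a": {"GENE1", "GENE2"},
    "drug_b": {"GENE1", "GENE2"},
    "drug_c": {"GENE3"},
}
for cutoff in (0, 1, 2):
    print(cutoff, round(average_shared_targets(drug_ses, drug_genes, cutoff), 3))
```

A cutoff of 0 averages over all drug-drug combinations, which serves as the baseline against which higher cutoffs are compared. The same function applies to metabolism genes or disease indications by swapping the second dictionary.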

2.8.2. Correlation with drug metabolizing genes

Similar to the correlation analysis above, we analyzed the shared metabolism mechanisms underlying drug-drug pairs with overlapping SEs. We obtained from PharmGKB [25] a total of 4,399 drug-gene pairs with subtypes of pharmacokinetics (PK) or pharmacodynamics (PD), representing 637 drugs and 859 genes. For drug-drug pairs that share SEs at different cutoffs, we calculated the corresponding average number of shared metabolism genes.

2.8.3. Correlation with drug indications

We extracted a total of 52,000 drug-disease pairs from ClinicalTrials.gov (http://www.clinicaltrials.gov/), a registry of federally and privately supported clinical trials conducted in the United States and around the world. The drug-disease pairs contain 2,035 drugs and 9,591 diseases. For drug-drug pairs that share SEs at different cutoffs, we calculated the corresponding average number of shared disease indications.

2.8.4. Case studies of drug-drug relationships based on shared SEs and shared drug indications

We selected two drugs, fluorouracil and lovastatin, to examine drug-drug relationships based on shared SEs and shared drug indications. We first found all the drugs that shared any side effects with them based on the drug-SE pairs extracted from MEDLINE, and all the drugs that shared any disease indications with them based on the drug-disease pairs from ClinicalTrials.gov. We then ranked the similar drugs by the Jaccard similarity coefficient. The Jaccard coefficient measures the similarity between two drugs and is defined as the size of the intersection divided by the size of the union of the two drugs' side effects or disease indications:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (3)$$
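The ranking by Jaccard coefficient can be sketched as below. The function names and the toy drugs are hypothetical and do not reproduce the fluorouracil or lovastatin case studies. Following the description above, drugs that share nothing with the query drug are excluded from the ranking.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections, as in Eq. (3)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_similar_drugs(query, drug_features):
    """Rank drugs sharing at least one feature with `query` by Jaccard score.

    drug_features maps a drug name to its set of side effects (or indications).
    """
    query_features = drug_features[query]
    scored = [
        (other, jaccard(query_features, features))
        for other, features in drug_features.items()
        if other != query and query_features & features
    ]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy data (hypothetical drugs and side effects).
drug_features = {
    "drug_x": {"nausea", "diarrhea", "neutropenia"},
    "drug_y": {"nausea", "diarrhea"},
    "drug_z": {"myopathy"},
    "drug_w": {"nausea", "rash", "neutropenia", "fatigue"},
}
for name, score in rank_similar_drugs("drug_x", drug_features):
    print(name, round(score, 3))
```

Here drug_y ranks above drug_w because two of the three features in its union with drug_x are shared, whereas drug_w shares two out of five.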

3. Results

3.1. Patterns associated with known drug-SE pairs

Using known drug-SE pairs from the SIDER database, we extracted a total of 113,036 patterns in the format of “DRUG pattern SE” and 72,713 patterns in the format of “SE pattern DRUG”. Of these extracted patterns, 98% were associated with only one known drug-SE pair, but one pair can be associated with multiple patterns. We ranked the patterns based on the number of their associated known drug-SE pairs. The top 10 ranked patterns, along with the numbers of associated known drug-SE pairs, are listed in Table 1. As shown in the table, many of the top-ranked patterns are indeed typical drug-SE-specific patterns. In addition, these patterns are representative and associated with many known drug-SE pairs (since they were ranked based on the number of their associated known drug-SE pairs). For instance, the top pattern “DRUG-induced SE” was associated with 1,743 known drug-SE pairs. Other top-ranked drug-SE-specific patterns included “DRUG-associated with SE,” “DRUG induces SE,” “SE associated with DRUG,” “SE due to DRUG,” and “SE produced by DRUG,” among others.

Even though the patterns were extracted using known drug-SE pairs, many top patterns were not specific for drug-SE relationships, such as “DRUG and SE,” “DRUG in SE,” “SE with DRUG,” and “SE and DRUG”. In addition, some top patterns were in fact treatment-specific, such as “DRUG for the treatment of DISEASE.” This may be due to the fact that the SIDER

DRUG pattern SE                   Pairs    SE pattern DRUG               Pairs
--------------------------------  -----    ----------------------------  -----
“DRUG-induced SE”                 1,743    “SE with DRUG”                781
“DRUG and SE”                     545      “SE associated with DRUG”     786
“DRUG SE”                         523      “SE induced by DRUG”          645
“DRUG induced SE”                 390      “SE after DRUG”               450
“DRUG-associated SE”              352      “SE due to DRUG”              427
“DRUG in SE”                      335      “SE, DRUG”                    416
“DRUG on SE”                      325      “SE and DRUG”                 409
“DRUG, SE”                        220      “SE caused by DRUG”           340
“DRUG for SE”                     207      “SE during DRUG”              339
“DRUG in the treatment of SE”     184      “SE following DRUG”           296

Table 1: Top ten frequent patterns and the numbers of associated known drug-SE pairs from FDA drug labels. Drug side effect-specific patterns are highlighted.

dataset was constructed using text mining approaches to extract drug-SE co-occurrence pairs from the “Adverse Events” sections of FDA drug labels, which often contain both drug-treatment (patient disease characteristics) and drug-SE information. For accurate extraction of drug-SE pairs from FDA drug labels, additional manual curation will be necessary.

3.2. Extraction of many additional drug-SE pairs using selected drug-SE-specific patterns

From the top 100 ranked patterns, we manually selected 19 drug-SE-specific patterns in the format of “DRUG pattern SE,” including “DRUG-induced SE,” “DRUG SE,” “DRUG-associated SE,” and “DRUG-related SE.” We also selected 35 patterns in the format of “SE pattern DRUG,” including “SE

induced by DRUG,” “SE associated with DRUG,” and “SE due to DRUG.” Using these selected patterns, we extracted additional drug-SE pairs from MEDLINE sentences. For instance, using the pattern “DRUG-induced SE,” we extracted 5,684 distinct drug-SE pairs from MEDLINE. Among these pairs, only 1,743 were included in SIDER, meaning that an additional 3,941 pairs were discovered with this single pattern. For each of these selected patterns, we extracted significantly more pairs from MEDLINE sentences than were captured in SIDER. We extracted a total of 15,290 distinct drug-SE pairs from MEDLINE sentences using the 19 selected patterns in the format of “DRUG pattern SE,” representing a 504% increase from their associated 2,531 known drug-SE pairs in SIDER (Figure 2). Similarly, we extracted a total of 17,707 drug-SE pairs from MEDLINE sentences using the 35 selected patterns in the format of “SE pattern DRUG,” representing a 507% increase compared to their associated 2,915 drug-SE pairs in SIDER (Figure 3). In total, we extracted 38,817 drug-SE pairs from MEDLINE, the majority of which were not captured in the SIDER database.

3.3. Precision, recall and F1 measures of extracted drug-SE pairs

We evaluated the precision, recall, and F1 of the extracted drug-SE pairs using three manually created evaluation datasets: Irinotecan-SE (126 pairs), Drug-Neutropenia (236 pairs), and Drug-Thrombocytopenia (319 pairs). As shown in Table 2, using the “DRUG pattern SE” patterns (19 patterns), we extracted a total of 24 irinotecan-SE pairs, with a precision of 0.875 and a recall of 0.179. Using the “SE pattern DRUG” patterns (35 patterns), we extracted a total of 35 irinotecan-SE pairs, representing a precision

Figure 2: Number of additional drug-SE pairs extracted from MEDLINE for each selected pattern (“DRUG pattern SE”).


Figure 3: Number of additional drug-SE pairs extracted from MEDLINE for each selected pattern (“SE pattern DRUG”).


Gold standard            Pattern               Precision   Recall   F1
-----------------------  --------------------  ---------   ------   -----
Irinotecan-SE            “DRUG pattern SE”     0.875       0.179    0.298
                         “SE pattern DRUG”     0.686       0.205    0.316
                         Combined              0.702       0.282    0.402
Drug-Neutropenia         “DRUG pattern SE”     0.882       0.326    0.476
                         “SE pattern DRUG”     0.850       0.296    0.439
                         Combined              0.855       0.461    0.599
Drug-Thrombocytopenia    “DRUG pattern SE”     0.970       0.309    0.468
                         “SE pattern DRUG”     0.941       0.357    0.517
                         Combined              0.943       0.479    0.635

Table 2: Precision, recall and F1 of the extracted drug-SE pairs. Test datasets: Irinotecan-SE (126 pairs), Drug-Neutropenia (236 pairs), and Drug-Thrombocytopenia (319 pairs).

of 0.686 and a recall of 0.205. By combining the irinotecan-SE pairs extracted using all selected patterns (54 patterns), we obtained a total of 47 pairs, with a precision of 0.702, a recall of 0.282, and an F1 of 0.402. For the other two evaluation datasets, we obtained a precision of 0.855, a recall of 0.461, and an F1 of 0.599 for drug-neutropenia pair extraction, and a precision of 0.943, a recall of 0.479, and an F1 of 0.635 for drug-thrombocytopenia pair extraction.

In summary, the drug-SE pairs extracted using the selected patterns had high precision, but the recall was low by comparison. In this study, we only used a handful of selected patterns, which cannot possibly capture all the drug-SE pairs in MEDLINE. In order to further increase the recall, we will need to extract richer sets of patterns or complement our current method with non-pattern-based approaches.


Even though we extracted many drug-SE pairs using the 54 selected drug-SE-specific patterns and the manually created clean SE lexicons, the precision was still not perfect and false positives were introduced. For example, the drug-SE pair “warfarin-thrombocytopenia” was erroneously extracted using one of the top patterns (“SE while on DRUG”) from the sentence “One patient developed epistaxis in the setting of thrombocytopenia while on warfarin therapy.” (PMID 20107495). In this sentence, thrombocytopenia is an indication, not a side effect, of warfarin. In this case, dependency tree analysis, manual curation, or an automatic method augmented with prior knowledge (known drug-disease treatment pairs) would be necessary in order to ensure extraction of valid pairs or exclusion of erroneous ones.

3.4. The pattern-based approach has better precision and overall F1 but lower recall than the SVM and co-occurrence-based approaches

We compared the precision, recall, and F1 scores of the pattern-based, SVM, and pure co-occurrence approaches using the irinotecan-SE evaluation dataset at five frequency cutoffs: ≥1 (126 pairs), ≥2 (63 pairs), ≥5 (30 pairs), ≥10 (15 pairs), and ≥20 (10 pairs). Comparison at different frequency cutoffs gave us a more comprehensive picture of how each method worked for both rare and frequent pairs. The overall F1 scores of the pattern-based approach were significantly higher than those of either the SVM or the pure co-occurrence-based approach at all five cutoffs (Table 3). For instance, at a frequency cutoff of 5, the F1 score of the pattern-based approach was 0.603, which is significantly higher than 0.363 for the SVM and 0.355 for the co-occurrence-based approach. The recalls and F1 scores of the pattern-based approach increased significantly as the pair frequency increased, while the

Frequency Cutoff Method

≥1

≥2

≥5

≥10

≥20

Precision

Recall

F1

Pattern-based

0.702

0.282

0.402

SVM

0.251

0.967

0.398

Cooc

0.238

0.989

0.383

Pattern-based

0.702

0.384

0.496

SVM

0.271

0.873

0.414

Cooc

0.262

0.889

0.404

Pattern-based

0.702

0.528

0.603

SVM

0.235

0.800

0.363

Cooc

0.225

0.833

0.355

Pattern-based

0.702

0.765

0.732

SVM

0.231

1.000

0.375

Cooc

0.208

1.000

0.345

Pattern-based

0.702

0.833

0.762

SVM

0.179

0.700

0.286

Cooc

0.167

0.800

0.276

Table 3: Comparison of the pattern-based, machine-learning and pure co-occurrence approaches at five frequency cutoffs: ≥1 (126 pairs), ≥2 (63 pairs), ≥5 (30 pairs), ≥10 (15 pairs), and ≥20 (10 pairs).

precisions remained the same. Similar trends were observed for the other two evaluation datasets (Drug-Neutropenia and Drug-Thrombocytopenia) (data not shown). These results indicate that the pattern-based approach is more effective in extracting frequent drug-SE pairs than pairs that only appear in MEDLINE rarely.
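To make the link between lexical patterns and pair frequency concrete, the following is a minimal sketch of pattern-based pair extraction followed by a frequency cutoff. It is a simplification under stated assumptions: the study matched syntactic patterns against parse trees, whereas this sketch uses a plain regular expression; the two lexicons are tiny hypothetical stand-ins; frequency is taken to be the number of sentences yielding a pair; and all function names are invented for illustration. The first sentence is the one quoted in the false-positive example (PMID 20107495), and the other two are invented.

```python
import re
from collections import Counter

# Hypothetical mini-lexicons; the study used much larger, curated lexicons.
DRUGS = {"warfarin", "irinotecan"}
SIDE_EFFECTS = {"thrombocytopenia", "neutropenia", "diarrhea"}


def compile_pattern(template):
    """Compile a template such as 'SE while on DRUG' into a regular expression."""
    se_alt = "|".join(re.escape(term) for term in sorted(SIDE_EFFECTS))
    drug_alt = "|".join(re.escape(term) for term in sorted(DRUGS))
    parts = []
    for token in template.split():
        if token == "SE":
            parts.append(f"(?P<se>{se_alt})")
        elif token == "DRUG":
            parts.append(f"(?P<drug>{drug_alt})")
        else:
            parts.append(re.escape(token))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


def count_pairs(sentences, templates):
    """Count the sentences in which any template yields each (drug, SE) pair."""
    patterns = [compile_pattern(t) for t in templates]
    counts = Counter()
    for sentence in sentences:
        found = set()  # count each pair at most once per sentence
        for pattern in patterns:
            for match in pattern.finditer(sentence):
                found.add((match.group("drug").lower(), match.group("se").lower()))
        counts.update(found)
    return counts


def pairs_at_cutoff(counts, cutoff):
    """Return the pairs extracted from at least `cutoff` sentences."""
    return {pair for pair, n in counts.items() if n >= cutoff}


sentences = [
    # Quoted in the text (PMID 20107495).
    "One patient developed epistaxis in the setting of thrombocytopenia "
    "while on warfarin therapy.",
    # Invented sentences for illustration only.
    "Two patients developed diarrhea while on irinotecan.",
    "Severe diarrhea while on irinotecan required dose reduction.",
]

counts = count_pairs(sentences, ["SE while on DRUG"])
print(pairs_at_cutoff(counts, 1))
print(pairs_at_cutoff(counts, 2))
```

In this toy corpus the pattern also returns the warfarin-thrombocytopenia pair that the text identifies as a false positive, which is why a purely lexical match would still need one of the safeguards mentioned there, such as a filter on known drug-disease treatment pairs.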


3.5. Drug side effects positively correlate with drug target genes, drug-metabolizing genes, and disease indications

To show the potential of this drug-SE association knowledge base in drug target discovery and early drug toxicity prediction, we analyzed the correlations between drug-SE pairs and drug-associated gene targets and metabolizing genes. As shown in Figure 4, the number of shared gene targets increases as the number of shared SEs increases (Pearson correlation coefficient of 0.97). For example, the average number of shared gene targets over all drug-drug combinations (cutoff ≥0) is 0.332. The number increases to 0.565 for drug-drug pairs sharing one or more SEs and to 2.824 for drug-drug pairs with at least 30 overlapping SEs.

Drug metabolism genes are important factors in drug toxicity prediction and personalized medicine (personalized drug efficacy and toxicity prediction). We examined the correlation between drug metabolism genes and drug side effects. As shown in Figure 5, there is also a strong positive correlation between shared drug side effects and shared metabolism genes (Pearson correlation coefficient of 0.98). For example, the average number of shared metabolism genes over all drug-drug combinations is 0.521. The number increases to 0.738 for drug-drug pairs sharing one or more SEs and to 2.665 for drug-drug pairs with at least 30 overlapping SEs.

To show the potential of this drug-SE association knowledge base in drug repositioning, we also analyzed the correlations between drug-SE pairs and drug indications. We observed a strong positive correlation between drug side effects and drug indications (Pearson correlation coefficient of 0.98). The average number of shared disease indications over all drug-drug combinations is 0.987. The number increases significantly to 1.783 for pairs sharing one or more SEs. As the number of shared SEs increases, the number of shared disease indications increases significantly (Figure 6). At cutoff ≥30, the average number of shared disease indications is 9.923. This strong correlation between drug side effects and drug indications indicates that we can use this drug-SE association dataset in drug repositioning tasks.

In summary, drug-drug pairs with overlapping side effects tend to share gene targets, metabolism genes and disease indications. Even though these positive correlations are expected, the result demonstrates that we can use this dataset in subsequent phenotype-driven drug target discovery, drug toxicity prediction, and drug repositioning.

3.6. Examples of drug-drug relationships based on shared SEs and shared disease indications

We selected two drugs as examples: fluorouracil, an antimetabolite used in the treatment of colorectal cancer and pancreatic cancer, and lovastatin, a statin used in combination with diet, weight loss, and exercise to lower cholesterol and reduce the risk of cardiovascular disease. Table 4 shows the 10 most similar drugs for each of the two. For fluorouracil, the majority of the top-ranked drugs based on both side effect profile similarity and indication profile similarity are also cancer drugs, such as capecitabine and irinotecan. Among the top-ranked drugs for lovastatin based on side effect similarity, it is striking that many are statins, such as simvastatin (top 1), atorvastatin (top 2) and rosuvastatin (top 4). The top-ranked drugs based on disease indication similarity include not only

Figure 4: Correlation between drug side effects (38,871 drug-SE pairs) and drug target genes (10,478 drug-gene pairs from DrugBank).


Figure 5: Correlation between side effects (38,871 drug-SE pairs) and drug metabolism genes (4,399 drug-gene pairs from PharmGKB).


Figure 6: Correlation between drug side effects (38,871 drug-SE pairs) and drug indications (52,000 drug-disease pairs from ClinicalTrials.gov).
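The correlation analysis summarized in Figures 4-6 can be sketched as follows. This is one plausible reading rather than the authors' documented procedure: for each cutoff k, average the number of shared annotations (gene targets, metabolism genes, or indications) over all drug-drug pairs that share at least k side effects, then compute the Pearson correlation between the cutoffs and those averages. The drug identifiers, side effects, and genes below are hypothetical toy data, and the function names are invented for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_shared_by_cutoff(side_effects, annotations, cutoffs):
    """For each cutoff k, average the shared annotations over drug pairs sharing >= k SEs."""
    drugs = sorted(set(side_effects) & set(annotations))
    averages = {}
    for k in cutoffs:
        shared = [
            len(annotations[a] & annotations[b])
            for a, b in combinations(drugs, 2)
            if len(side_effects[a] & side_effects[b]) >= k
        ]
        if shared:  # skip cutoffs that no drug pair reaches
            averages[k] = sum(shared) / len(shared)
    return averages


# Hypothetical toy data: drug -> set of side effects, drug -> set of target genes.
side_effects = {
    "drug_a": {"se1", "se2", "se3"},
    "drug_b": {"se1", "se2", "se3"},
    "drug_c": {"se1", "se2"},
    "drug_d": {"se4"},
}
targets = {
    "drug_a": {"gene1", "gene2"},
    "drug_b": {"gene1", "gene2"},
    "drug_c": {"gene1"},
    "drug_d": {"gene3"},
}

averages = mean_shared_by_cutoff(side_effects, targets, cutoffs=[0, 1, 2, 3])
r = pearson(list(averages.keys()), list(averages.values()))
print(averages)
print(f"Pearson r = {r:.3f}")
```

Cutoff 0 corresponds to "all drug-drug combinations" in the text. Because each cutoff selects a subset of the pairs selected by the previous one, the averages are computed over nested, shrinking sets of drug pairs.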


| Rank | Fluorouracil: SE similarity | Fluorouracil: indication similarity | Lovastatin: SE similarity | Lovastatin: indication similarity |
|---|---|---|---|---|
| 1 | capecitabine | leucovorin | simvastatin | colestipol |
| 2 | doxorubicin | oxaliplatin | atorvastatin | niacin |
| 3 | carboplatin | cisplatin | bezafibrate | dipyridamole |
| 4 | paclitaxel | irinotecan | rosuvastatin | vitamin b6 |
| 5 | vinblastine | docetaxel | gemfibrozil | chlorthalidone |
| 6 | docetaxel | cetuximab | levosimendan | pitavastatin |
| 7 | irinotecan | capecitabine | genistein | pyridoxine |
| 8 | metronidazole | bevacizumab | ajmaline | atenolol |
| 9 | ergonovine | gemcitabine | flutamide | ezetimibe |
| 10 | linezolid | pemetrexed | metformin | amlodipine |

Table 4: Top 10 drugs similar to fluorouracil or to lovastatin.
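A ranking like the one in Table 4 can be produced by scoring every other drug against a query drug's profile and sorting. The sketch below uses Jaccard similarity over sets of side effects (or indications). The choice of Jaccard is an assumption made for illustration, since the similarity measure is not restated in this section, and the drug names and profiles are hypothetical placeholders rather than data from the knowledge base.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_similar(query, profiles, k=10):
    """Rank all other drugs by profile similarity to `query`; ties break alphabetically."""
    scores = [
        (other, jaccard(profiles[query], profile))
        for other, profile in profiles.items()
        if other != query
    ]
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores[:k]


# Hypothetical side effect profiles (drug -> set of side effects).
profiles = {
    "drug_a": {"nausea", "myalgia", "rash"},
    "drug_b": {"nausea", "myalgia"},
    "drug_c": {"rash", "cough", "fever", "headache"},
    "drug_d": {"insomnia"},
}

for name, score in top_similar("drug_a", profiles, k=2):
    print(f"{name}: {score:.3f}")
```

The same function applies unchanged to indication profiles by passing a mapping from drugs to sets of disease indications.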

statin drugs such as pitavastatin but also other drugs used in the treatment of hyperlipidemia and cardiovascular diseases, such as colestipol (reducing cholesterol) and dipyridamole (inhibiting thrombus formation). This may be because cardiovascular diseases are highly complex and many different types of drugs, including statins, are used in their treatment, prevention and management. Therefore, drugs that share indications with lovastatin are not necessarily statins.

4. Discussion

We have developed a pattern-based relationship extraction method and accurately extracted a large number of drug-SE relationships from MEDLINE, the majority of which were not captured in existing databases. Our approach significantly outperformed the state-of-the-art machine learning approach while entailing minimal manual annotation. We also demonstrated that the extracted drug side effects correlate positively with drug target genes, drug-metabolizing genes and drug indications.

Many aspects of the current method can be improved. First, this pattern-based method was limited to extracting drug-SE pairs from individual sentences. Though important pairs often appear within single sentences, some drug-SE pairs may only be related at the abstract level, across sentences. In addition, many drug-SE pairs may only be described in full-text articles, not in the title or abstract. Second, pattern-based relationship extraction approaches in general have high precisions. However, the recalls depend upon the size of the underlying text corpus, the frequencies of the pairs in the corpus, and the usage of the patterns in the text. The main goal of our study is to accurately extract many additional drug-SE pairs from MEDLINE, so we only selected a few specific patterns with high recalls to guarantee both high precision and relatively high recall. Our method will miss pairs that are not associated with these selected patterns. Further improving the recall will require either manual examination of many more patterns or use of complementary non-pattern-based approaches. Third, we only performed primitive correlation studies between drug side effects and drug-associated target genes, metabolism genes and disease indications. The natural next step of our study will be to investigate whether supplementing the drug-SE association datasets used in the two seminal studies of phenotype-driven network approaches for drug target discovery [4] and drug toxicity prediction [3] with the drug-SE pairs we extracted from biomedical literature will lead to different network topologies with increased data completeness.

5. Conclusions

Computational approaches to systematically studying drug-SE phenotype associations have great potential in drug target discovery, drug repositioning, and drug toxicity prediction. However, the bottleneck in such systems approaches is the lack of an accurate, comprehensive, and machine-understandable knowledge base of drug-SE associations. In this study, we presented a large-scale relationship extraction approach to accurately extract a large number of drug-SE pairs from MEDLINE. Our approach significantly outperformed the state-of-the-art machine learning approach and achieved an average precision of 0.833, a recall of 0.407, and an F1 of 0.545. We have shown that the drug-SE pairs extracted from MEDLINE correlate positively with drug target genes, drug-metabolizing genes and drug indications. This unique drug-SE relationship dataset that we created from the MEDLINE text corpus, when combined with drug-SE associations from other sources and with other ‘omics’ data, can have profound implications in developing phenotype-driven network-based approaches to identifying new drug targets, repositioning existing drugs for new diseases, and predicting unexpected drug toxicities.

Acknowledgement

Xu and Wang have jointly conceived the idea, designed and implemented the algorithms and prepared the manuscript. All authors read and approved the final manuscript.

Funding

RX is funded by CWRU/Cleveland Clinic CTSA Grant (UL1 RR024989) and the Training grant in Computational Genomic Epidemiology of Cancer (CoGEC). QW is funded by ThinTek LLC.

References

[1] Ananiadou, S., Pyysalo, S., Tsujii, J. I., and Kell, D. B. (2010). Event extraction for systems biology by text mining the literature. Trends in Biotechnology, 28(7), 381-390.
[2] Atias, N., and Sharan, R. (2011). An algorithmic framework for predicting side-effects of drugs. J Comput Biol, 18, 207-218.
[3] Cami, A., Arnold, A., Manzi, S., and Reis, B. (2011). Predicting adverse drug events using pharmacological network models. Sci Transl Med, 3(114), 114ra27.
[4] Campillos, M., Kuhn, M., Gavin, A. C., Jensen, L. J., and Bork, P. (2008). Drug target identification using side-effect similarity. Science, 321(5886), 263-266.
[5] Cohen, K. B., and Hunter, L. E. (2013). Text mining for translational bioinformatics. PLoS Computational Biology, 9(4), e1003044.
[6] Cortes, C., and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.

[7] Coulet, A., Shah, N. H., Garten, Y., Musen, M., and Altman, R. B. (2010). Using text to build semantic networks for pharmacogenomics. Journal of Biomedical Informatics, 43(6), 1009-1019.
[8] Derry, S., Loke, Y. K., and Aronson, J. K. (2001). Incomplete evidence: the inadequacy of databases in tracing published adverse drug reactions in clinical trials. BMC Medical Research Methodology, 1(1), 7.
[9] Fliri, A. F., Loging, W. T., Thadeio, P. F., and Volkmann, R. A. (2005). Analysis of drug-induced effect patterns to link structure and side effects of medicines. Nature Chemical Biology, 1(7), 389-397.
[10] Fundel, K., Küffner, R., and Zimmer, R. (2007). RelEx - Relation extraction using dependency parse trees. Bioinformatics, 23(3), 365-371.
[11] Gurulingappa, H., Mateen-Rajput, A., and Toldo, L. (2012). Extraction of potential adverse drug events from medical case reports. Journal of Biomedical Semantics, 3(1), 15.
[12] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1), 10-18.
[13] Hu, G., and Agarwal, P. (2009). Human disease-drug network based on genomic expression profiles. PLoS One, 4(8), e6536.
[14] Hurle, M. R., Yang, L., Xie, Q., Rajpal, D. K., Sanseau, P., and Agarwal, P. (2013). Computational drug repositioning: From data to therapeutics. Clinical Pharmacology and Therapeutics.
[15] Keiser, M. J., Setola, V., Irwin, J. J., Laggner, C., Abbas, A. I., Hufeisen, S. J., ... and Roth, B. L. (2009). Predicting new molecular targets for known drugs. Nature, 462(7270), 175-181.
[16] Kilicoglu, H., and Bergler, S. (2009, June). Syntactic dependency based heuristics for biological event extraction. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task (pp. 119-127). Association for Computational Linguistics.
[17] Kinnings, S. L., Liu, N., Buchmeier, N., Tonge, P. J., Xie, L., and Bourne, P. E. (2009). Drug discovery using chemical systems biology: repositioning the safe medicine Comtan to treat multi-drug and extensively drug resistant tuberculosis. PLoS Computational Biology, 5(7), e1000423.
[18] Klein, D., and Manning, C. D. (2003, July). Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1 (pp. 423-430). Association for Computational Linguistics.
[19] Kuhn, M., Campillos, M., Letunic, I., Jensen, L. J., and Bork, P. (2010). A side effect resource to capture phenotypic effects of drugs. Molecular Systems Biology, 6(1).
[20] Lamb, J., Crawford, E. D., Peck, D., Modell, J. W., Blat, I. C., Wrobel, M. J., ... and Golub, T. R. (2006). The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science, 313(5795), 1929.
[21] Pauwels, E., Stoven, V., and Yamanishi, Y. (2011). Predicting drug side-effect profiles: a chemical fragment-based approach. BMC Bioinformatics, 12(1), 169.
[22] Pouliot, Y., Chiang, A. P., and Butte, A. J. (2011). Predicting adverse drug reactions using publicly available PubChem BioAssay data. Clinical Pharmacology and Therapeutics, 90(1), 90-99.
[23] Shetty, K. D., and Dalal, S. R. (2011). Using information mining of the medical literature to improve drug safety. Journal of the American Medical Informatics Association, 18(5), 668-674.
[24] Sirota, M., Dudley, J. T., Kim, J., Chiang, A. P., Morgan, A. A., Sweet-Cordero, A., ... and Butte, A. J. (2011). Discovery and preclinical validation of drug indications using compendia of public gene expression data. Sci Transl Med, 3(96), 96ra77.
[25] Whirl-Carrillo, M., McDonagh, E. M., Hebert, J. M., Gong, L., Sangkuhl, K., Thorn, C. F., Altman, R. B., and Klein, T. E. (2012). Pharmacogenomics knowledge for personalized medicine. Clinical Pharmacology and Therapeutics, 92(4), 414-417.
[26] Wishart, D. S., Knox, C., Guo, A. C., Shrivastava, S., Hassanali, M., Stothard, P., ... and Woolsey, J. (2006). DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Research, 34(suppl 1), D668-D672.
[27] Xie, L., Li, J., Xie, L., and Bourne, P. E. (2009). Drug discovery using chemical systems biology: identification of the protein-ligand binding network to explain the side effects of CETP inhibitors. PLoS Computational Biology, 5(5), e1000387.
[28] Xie, L., Evangelidis, T., Xie, L., and Bourne, P. E. (2011). Drug discovery using chemical systems biology: weak inhibition of multiple kinases may contribute to the anti-cancer effect of nelfinavir. PLoS Computational Biology, 7(4), e1002037.
[29] Xu, R., and Wang, Q. (2012). A knowledge-driven conditional approach to extract pharmacogenomics specific drug-gene relationships from free text. Journal of Biomedical Informatics, 45(5), 827-834.
[30] Xu, R., and Wang, Q. (2013). A semi-supervised approach to extract pharmacogenomics-specific drug-gene pairs from biomedical literature for personalized medicine. Journal of Biomedical Informatics, 46(4), 585-593.
[31] Xu, R., and Wang, Q. (2013). Automatic construction and integrated analysis of a cancer drug side effect knowledge base. Journal of the American Medical Informatics Association, doi:10.1136/amiajnl-2012-001584.
[32] Xu, R., and Wang, Q. (2014). Automatic signal prioritizing and filtering approaches in detecting post-marketing cardiovascular events associated with targeted cancer drugs from the FDA Adverse Event Reporting System (FAERS). Journal of Biomedical Informatics (in press).
[33] Xu, R., Li, L., and Wang, Q. (2013). Towards building a disease-phenotype relationship knowledge base: large scale extraction of disease-manifestation relationship from literature. Bioinformatics, doi:10.1093/bioinformatics/btt359.
[34] Xu, R. et al. (2008). Unsupervised method for automatic construction of a disease dictionary from a large free text collection. In AMIA Annual Symposium Proceedings (Vol. 2008, p. 820). American Medical Informatics Association.
[35] Xu, R. et al. (2009). Investigation of unsupervised pattern learning techniques for bootstrap construction of a medical treatment lexicon. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing (pp. 63-70). Association for Computational Linguistics.
[36] Xu, R. et al. (2009). Unsupervised method for extracting machine understandable medical knowledge from a large free text collection. In AMIA Annual Symposium Proceedings (Vol. 2009, p. 709). American Medical Informatics Association.
[37] Yang, L., Wang, K., Chen, J., Jegga, A. G., Luo, H., Shi, L., ... and He, L. (2011). Exploring off-targets and off-systems for adverse drug reactions via chemical-protein interactome - clozapine-induced agranulocytosis as a case study. PLoS Computational Biology, 7(3), e1002016.
[38] Yildirim, M. A., Goh, K. I., Cusick, M. E., Barabási, A. L., and Vidal, M. (2007). Drug-target network. Nature Biotechnology, 25(10), 1119.

Highlights

- Systematic studies of drug-side-effect associations can facilitate drug discovery.
- Current databases of drug-side-effect associations are largely incomplete.
- We developed an automatic approach to extract drug-SE pairs from MEDLINE.
- We built a large-scale and accurate drug-SE association knowledge base.
- The drug-SE pairs correlate positively with drug targets, metabolism genes and indications.
