
A semi-supervised learning framework for biomedical event extraction based on hidden topics

Deyu Zhou*, Dayou Zhong

School of Computer Science and Engineering, Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, Nanjing, Jiangsu Province 210096, China

Article history: Received 21 July 2014; received in revised form 4 January 2015; accepted 25 March 2015

Keywords: Semi-supervised learning; Biomedical event extraction; Latent Dirichlet allocation; K nearest neighbor

Abstract

Objectives: Scientists have devoted decades of effort to understanding the interactions between proteins or RNA production. This information could strengthen current knowledge of drug reactions and of the development of certain diseases. Nevertheless, the life science literature, one of the most important sources of this information, lacks explicit structure and is therefore difficult for computer-based systems to access. Biomedical event extraction, which automatically acquires knowledge of molecular events from research articles, has consequently attracted community-wide efforts in recent years. Most approaches are based on statistical models and require large-scale annotated corpora to estimate model parameters precisely, yet such corpora are usually difficult to obtain in practice. Employing un-annotated data through semi-supervised learning is therefore a feasible solution for biomedical event extraction and has attracted increasing interest.

Methods and material: In this paper, a semi-supervised learning framework based on hidden topics for biomedical event extraction is presented. In this framework, sentences in the un-annotated corpus are carefully and automatically assigned event annotations based on their distances to sentences in the annotated corpus. More specifically, not only the structures of the sentences but also the hidden topics embedded in the sentences are used to describe the distance. The sentences and the newly assigned event annotations, together with the annotated corpus, are then employed for training.

Results: Experiments were conducted on the multi-level event extraction corpus, a gold standard corpus. Experimental results show that the proposed framework achieves an improvement of more than 2.2% in F-score on biomedical event extraction compared with the state-of-the-art approach.

Conclusion: The results suggest that, by incorporating un-annotated data, the proposed framework indeed improves the performance of the state-of-the-art event extraction system, and that the similarity between sentences can be described precisely by the hidden topics and structures of the sentences.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

In the molecular biology domain, it is crucial to obtain detailed views of the behavior of bio-molecules. Their behavior is often described in the form of their interplay in molecular events reported in texts, and descriptions of molecular events are spread throughout the life science literature. Nevertheless, because it lacks explicit structure, the life science literature is difficult for computer-based systems to access. Biomedical event extraction, which automatically acquires information about molecular events from texts, has therefore attracted community-wide efforts recently. Several evaluation tasks, such as the BioNLP'09 [1], BioNLP'11 [2] and BioNLP'13 [3]

∗ Corresponding author. Tel.: +86 25 52090861. E-mail addresses: [email protected] (D. Zhou), [email protected] (D. Zhong).

shared tasks, have been held in recent years to allow researchers to develop and compare their approaches to biomedical event extraction. In general, a biomedical event, as a detailed description of an interaction between biomolecules, is represented as an event trigger with one or more arguments. For example, "the excessive production of ROS..." describes two events: one is the synthesis event, and the other is the positive regulation event, which is signaled by the word "excessive" and takes the synthesis event as its argument. In a typical biomedical event annotation, these two events are represented as:

E1 (event type:synthesis:production, theme:ROS)
E2 (event type:positive regulation:excessive, theme:E1)

Biomedical event extraction aims to extract such events from the literature and to reformat the extracted information into structures such as the two annotations above.
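To make the nested structure concrete, the sketch below shows one possible in-memory representation of the two annotations above. It is only an illustration: the class and field names are our assumptions, not part of any corpus format used in this paper.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Hypothetical container for a biomedical event; names are illustrative only.
@dataclass
class Event:
    event_id: str                                  # e.g. "E1"
    event_type: str                                # e.g. "synthesis"
    trigger: str                                   # trigger word in the sentence
    themes: List[Union[str, "Event"]] = field(default_factory=list)  # terms or nested events

# The two annotations for "the excessive production of ROS".
e1 = Event("E1", "synthesis", "production", themes=["ROS"])
e2 = Event("E2", "positive regulation", "excessive", themes=[e1])   # nested: takes E1 as its argument
```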



By extracting detailed behaviors of biomolecules, biomedical event extraction can be used to support the development of biology-related databases.

To extract events from texts, most biomedical event extraction systems are based on a pipeline procedure [4–7]. It usually consists of three modules: biomedical term identification, event trigger identification and event extraction. Approaches to the event extraction module can be further categorized into two types: rule (or pattern) based and machine learning based. Rule based methods require manual effort to construct suitable patterns, whereas machine learning based methods can learn event patterns automatically without manual intervention. Moreover, as seen in recent BioNLP shared task competitions, the best performance has usually been achieved by machine learning based systems. For example, in the BioNLP'09 shared task, the UTurku system [8], which is based on multiple support vector machines (SVMs), achieved an F-score of 51.95% on the event extraction task, the best result among all participants, and more than half of the participating systems in the BioNLP'13 shared task employed machine learning algorithms [9]. However, the best results achieved in these shared tasks, around 55% in F-score, are still relatively low [9]. Two main reasons might contribute to this. One is the large variety of biomedical events: the nested form shown in the example above is quite common, since biomedical events are frequently connected by causal relationships and their occurrences are closely inter-connected. The other is the limited size of the training data, which is not enough to estimate the large number of parameters of statistical models. Therefore, employing un-annotated data for biomedical event extraction based on semi-supervised learning is a feasible solution and has attracted increasing interest.

To the best of our knowledge, [10] is the first attempt at semi-supervised learning for biomedical event extraction. Sentences containing protein names were collected from the medical literature analysis and retrieval system online (MEDLINE) [11] as unlabeled data. With the help of massive amounts of unlabeled data, the correlation between two sorts of original features in the labeled data is calculated, and new features are derived using the feature coupling generalization strategy. When applied to the BioNLP'11 shared task, the method reached an F-score of 54.17%, an improvement of 0.87% over the baseline approach. Instead of deriving new features, MacKinlay et al. [12] employed a self-training procedure in which, besides the manually constructed corpus, external data were employed through the integration of two different sources. When applied to the BioNLP'13 shared task, the approach proved effective even when the training data were noisy.

In this paper, we present a semi-supervised learning framework for biomedical event extraction based on hidden topics. In this framework, sentences in the un-annotated corpus are carefully and automatically assigned event annotations. The assignment is based on the distance between the sentences in the annotated corpus and those in the un-annotated corpus. More specifically, not only the structures of the sentences but also the hidden topics embedded in the sentences are employed to describe the distance. The rationale is that sentences with similar structures are likely to share similar event annotations, and sentences with similar word co-occurrences (hidden topics) are likely to refer to similar event annotations. Experimental results on the gold standard corpus show that the proposed framework achieves an improvement of more than 2.2% in F-score on event extraction compared with the state-of-the-art approach, demonstrating its effectiveness. The contributions are summarized as follows.

• A semi-supervised learning framework based on hidden topics for biomedical event extraction is presented. In this framework, the distance between two sentences is calculated based on not

only the sentences' structures but also the hidden topics embedded in the sentences. The hidden topics are inferred automatically using the latent Dirichlet allocation (LDA) model [13].
• An improvement of more than 2.2% in F-score for biomedical event extraction is achieved by the proposed framework compared with the state-of-the-art approach.

The rest of the paper is organized as follows. Section 2 discusses the general pipeline procedure for biomedical event extraction and explains how the task can be cast as a classification problem. The proposed semi-supervised learning framework is then fully described in Section 3, followed by the experimental results in Section 4. Finally, Section 5 concludes the paper.

2. Biomedical event extraction

As mentioned above, each biomedical event consists of a trigger and one or more arguments, and for most events biomedical terms act as event arguments. A biomedical event extraction system therefore generally consists of three modules: biomedical term identification (identification of candidate arguments), event trigger identification and event extraction (linking of triggers and arguments). The system usually processes one sentence at a time, since it was observed that more than 96% of all events in the annotated corpus lie fully within a single sentence; this simplification greatly eases the problem without causing a significant decrease in performance. Given a sentence, the way the system works is illustrated in Fig. 1. Firstly, biomedical terms in the sentence are identified. After that, event triggers in the sentence are identified and classified into pre-defined categories. Finally, the system extracts the events consisting of triggers and participant candidates by examining, for each trigger, whether it has relations with the candidate participants. For events with more than one argument, multiple relations are combined, as shown in Fig. 1. As high performance has already been achieved in the biomedical term identification step [14], most biomedical event extraction systems focus on the latter two modules. How the classification approach is employed in these two modules is discussed in detail in the following sections.

2.1. Problem definition

Given a sentence S = w1, w2, . . ., wn, event trigger identification [15] in the biomedical domain can be seen as the task of classifying words into certain event types. A classification function is defined as



$$f(\phi(w_i), c) = \begin{cases} +1, & \text{if } w_i \text{ is a trigger word with type } c, \\ -1, & \text{otherwise,} \end{cases} \qquad (1)$$

where φ(wi) is the set of features related to wi, which can be extracted from the sentence S, and the function f(·) decides whether the word wi is a trigger word with type c or not. Machine learning based approaches differ in the choice of the classification model f(·) and the feature set φ(wi). In [8], multi-class SVMs were utilized for event trigger identification, with various types of features such as features derived from words, named entities and dependency parse results. In [15], to improve the performance of trigger word identification, biomedical knowledge from a large text corpus built from MEDLINE is learned and embedded into word features using neural language modeling; the embedded features are further combined with well-designed syntactic and semantic context features using multiple kernel learning for classifier training.
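As a concrete illustration of this formulation (not the authors' implementation), the sketch below trains a one-vs-rest linear SVM over bag-of-feature representations of candidate trigger words. The feature strings, the toy data and the use of class weighting are simplified assumptions in the spirit of Sections 2.2 and 4.1.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Toy training data: phi(w_i) as a set of feature strings, label = event type or "None".
# The feature names stand in for the lexical/syntactic features of Section 2.2.
X_train = [
    {"word=production", "pos=NN", "in_trigger_dict=1"},
    {"word=excessive", "pos=JJ", "in_trigger_dict=1"},
    {"word=ROS", "pos=NN", "in_trigger_dict=0"},
]
y_train = ["synthesis", "positive_regulation", "None"]

vec = DictVectorizer()
X = vec.fit_transform([{feat: 1 for feat in feats} for feats in X_train])

# One-vs-rest linear SVM; "balanced" class weights up-weight the rare positive classes,
# analogous to the boosting of positive examples mentioned in Section 4.1.
clf = LinearSVC(class_weight="balanced")
clf.fit(X, y_train)

test = vec.transform([{"word=binding": 1, "pos=NN": 1, "in_trigger_dict=1": 1}])
print(clf.predict(test))
```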


Fig. 1. An example of how the event extraction system works.

Based on the number of arguments involved in an event, biomedical events can be further divided into two categories: simple events and complex events. A simple event contains only one argument and does not take other events as arguments. Given a sentence S = w1 w2 ... t1 ... wj ... e1 ... t2 ... e2 ... tm ... el ... wn, where e1, e2, . . ., el are event trigger words and t1, t2, . . ., tm are biomedical terms identified in the previous modules, it is straightforward to cast simple event extraction as a classification problem. The classification function f(φ(ei, tj)) is constructed to output 1 when ei and tj form a biomedical event, and 0 otherwise. The input to the function f is φ(ei, tj), the set of features extracted from S related to ei and tj. Based on f, simple events can be extracted from the candidate pairs ⟨t1, e1⟩, ⟨t1, e2⟩, . . ., ⟨t1, el⟩, . . ., ⟨tm, el⟩. Complex events are events consisting of more than one argument or taking another event as an argument. For complex events taking another event as an argument, classification can be applied in the same way as for simple event extraction. For events consisting of more than one argument, a feasible way is to combine the binary relations (the relations between the event trigger and single arguments) to form the complex event. An example of how simple events and complex events are extracted can be found in Fig. 1. The details of how the task of event extraction is cast as a classification problem can be found in [16].

2.2. Features for classification

To achieve good classification performance, it is crucial to design a proper feature set. All the features employed in our framework are extracted following [16]. The syntactic and semantic features employed in the framework are generated from the outputs of GDep (a dependency parser) [17] and the Enju parser (a syntactic parser) [18]. An example sentence together with its corresponding GDep and Enju parsing results is illustrated in Fig. 2. The features representing a word (for event trigger identification) or a pair of words (for event extraction) are described as follows:

• Lexical and syntactic features of the word itself. Features such as whether the current word has a capital letter, whether it is at the

beginning of the sentence, whether it contains a number, whether it contains a symbol, whether it is in a trigger word dictionary, whether it is in a protein base form, its part-of-speech (POS) tag, and its character n-grams (n = 2, 3, 4) are extracted.
• Local context features of the word or the pair of words. For the sequence of three words before or after the current word, n-grams (n = 1, 2, 3, 4) are employed. For example, for the word "retarget" in the sentence shown in Fig. 2, the word sequence "is sufficient to retarget protein from the" is used to generate the relevant n-grams; its trigrams are "is sufficient to", "sufficient to retarget", "to retarget protein", "retarget protein from" and "protein from the". Each word is represented by its base form, its POS tag and its relative position (before or after) with respect to the word. For a pair of words, the sequence of three words before the first word in the pair and three words after the last word in the pair is extracted, and n-grams (n = 1, 2, 3, 4) are employed. For example, for the pair ⟨retarget, protein⟩ in the sentence shown in Fig. 2, the word sequence "is sufficient to retarget protein from the cytoplasm" is used to generate the relevant n-grams. Again, each word is represented by its base form, its POS tag and its relative position (before, in or after) with respect to the pair.
• Local dependency features of the word. The two-depth path starting from the current word in the dependency tree generated by the GDep parser is identified, and features are extracted from this path, such as n-grams (n = 2) of dependencies, n-grams (n = 2, 3) of words represented by their base forms and POS tags, and n-grams (n = 2, 3, 4) of dependencies and words. For word tokens without two-depth paths, such as the root node or the direct children of the root node, these features are omitted. N-grams (n = 2) of dependencies are represented as dependency1−dependency2; similarly, n-grams (n = 2, 3) of words or n-grams (n = 2, 3, 4) of dependencies and words are represented as word1−word2−word3 or word1−dependency1−word2, and so on. For example, for the word "retarget" in the sentence given in Fig. 2, its two-depth path


Fig. 2. A sentence ("The binding of I kappa B/MAD-3 to NF-kappa B p65 is sufficient to retarget NF-kappa B p65 from the nucleus to the cytoplasm", with protein names replaced by "protein") and its corresponding GDep and Enju parsing results.

Table 1. An example of several sentences sharing the same event annotation.

Event annotation: Binding: trigger word, theme: biomedical term, theme: biomedical term
Sentences:
• ...molecules mediating the interaction of cancer and endothelial cells in tumor angiogenesis were investigated...
• ...high-affinity binding of somatostatin-14, somatostatin-28 and octreotide...
• ...mediated the binding of a soluble NRP1 dimer to cells expressing KDR...
• ...binding of the cleaved Tie1 45 kDa endodomain to Tie2...

“retarget → AMOD → sufficient → PRD → is” is shown in bold in Fig. 2, and its dependency bigram (n = 2) is “AMOD PRD”.
• Shortest path features of the pair of words. The shortest path, a directed path between the pair of words, is retrieved from the dependency parse generated by the GDep parser. The vertex walks, edge walks, n-grams (n = 2, 3, 4) of dependencies, n-grams (n = 2, 3, 4) of words represented as base forms plus POS tags, and the length of the path are extracted as path features. For example, for the pair ⟨retarget, protein⟩ in the sentence given in Fig. 2, the shortest path is “retarget ← OBJ ← protein”; the path length of 1, the edge walk “retarget ← OBJ ← protein” and the vertex walk “OBJ” can be extracted. The rationale for employing the shortest path is that an event trigger word and its closest proteins are much more likely to be involved in a biomedical event together.
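To make the local context features more concrete, the following sketch (our illustration, not the authors' code) generates the word n-grams described above; the tokenization and the omission of base forms and POS tags are simplifying assumptions.

```python
from typing import List, Tuple

def context_ngrams(tokens: List[str], index: int, window: int = 3,
                   max_n: int = 4) -> List[Tuple[str, ...]]:
    """N-grams (n = 1..max_n) over the window of `window` tokens before and
    after the token at `index`, including the token itself."""
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    span = tokens[lo:hi]
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(tuple(span[i:i + n]) for i in range(len(span) - n + 1))
    return grams

# Example from Fig. 2: the word "retarget" in the simplified (protein-substituted) sentence.
sent = ("The binding of protein to protein is sufficient to retarget "
        "protein from the nucleus to the cytoplasm").split()
trigrams = [g for g in context_ngrams(sent, sent.index("retarget")) if len(g) == 3]
print(trigrams)  # ('is', 'sufficient', 'to'), ('sufficient', 'to', 'retarget'), ...
```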

3. Semi-supervised learning framework

By investigating the event annotations in the annotated corpus, we found that several sentences share the same event annotation; an example is presented in Table 1. On further observation, we found that some sentences with the same event annotation have similar sentence structures, and that those sentences also share similar word co-occurrences or word frequencies. Based on this observation, we speculate that, by carefully defining the similarity between sentences, sentences in an un-annotated corpus can be employed to improve the performance of a biomedical event extraction system.

Assume two sets E_L = {⟨s_1, a_1⟩, ⟨s_2, a_2⟩, · · ·, ⟨s_|L|, a_|L|⟩} and E_U = {S_{|L|+1}, S_{|L|+2}, · · ·, S_{|L|+|U|}}, with S_i being a sentence and a_i its corresponding event annotation; we want to build a statistical model based on E = E_L ∪ E_U. Firstly, we present a probabilistic framework for describing the nature of sentences and their event annotations [19]. Assuming that (1) the data are produced by |G| probability models, where |G| is the number of distinct event annotations in the labeled set E_L, and (2) there is a one-to-one correspondence between probability components and classes, and considering each individual event annotation as a class, the likelihood of a sentence S_i is P(S_i|Θ) = P(a_i = g_j|Θ) P(S_i|a_i = g_j, Θ), where g_j is the event annotation of the sentence S_i and Θ is the complete set of parameters of the statistical model. If we rewrite the class labels of all the sentences as a matrix of binary indicator variables Z, with z_i = ⟨z_{i1}, · · ·, z_{i|G|}⟩, z_{ij} = 1 if a_i = g_j and z_{ij} = 0 otherwise, then P(S_i|Θ) = Σ_{j=1}^{|G|} z_{ij} P(g_j|Θ) P(S_i|g_j, Θ), where z_i is known for the sentences in E_L and unknown for the sentences in E_U. Therefore, the probability of all the data is

$$P(E|\Theta, Z) = \prod_{S_i \in E} \sum_{j=1}^{|G|} z_{ij}\, P(g_j|\Theta)\, P(S_i|g_j, \Theta). \qquad (2)$$

The complete log likelihood of the parameters, $l_g(E|\Theta, Z)$, can be expressed as

$$l_g(E|\Theta, Z) = \sum_{S_i \in E} \sum_{j=1}^{|G|} z_{ij} \log\left[P(g_j|\Theta)\, P(S_i|g_j, \Theta)\right]. \qquad (3)$$
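One way to read how this likelihood motivates the annotation-assignment step that follows is sketched below. This restatement is ours, not an equation from the paper: it simply makes explicit that maximizing Eq. (3) over the unknown indicators amounts to picking, for each un-annotated sentence, its most probable annotation class, which the framework approximates with the distance-based nearest-neighbour rule of Sections 3.1 and 3.2.

```latex
% Our hedged reading, not taken from the paper: for a sentence S_i in E_U,
% maximizing Eq. (3) over the unknown indicators z_i gives
\hat{z}_{ij} =
  \begin{cases}
    1, & j = \arg\max_{j'} \; P(g_{j'} \mid \Theta)\, P(S_i \mid g_{j'}, \Theta), \\
    0, & \text{otherwise,}
  \end{cases}
% and the KNN classifier over the distance D(S_i, S_j) of Eq. (4) acts as a
% proxy for the class-conditional term P(S_i \mid g_{j'}, \Theta).
```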


Fig. 3. The architecture of the proposed framework for biomedical event extraction.

To maximize P(E|Θ), a classifier based on the distance measure between the sentences in E_U and those in E_L is constructed to automatically assign event annotations to the sentences in E_U. The proposed semi-supervised framework for biomedical event extraction therefore works as follows, as illustrated in Fig. 3. Firstly, all trigger words in the training data are collected to build a trigger dictionary. Biomedical publications from MEDLINE are crawled and filtered to form the un-annotated corpus by removing the sentences that do not contain any trigger word in the dictionary. The sentences in the un-annotated corpus are then assigned event annotations based on a k nearest neighbor (KNN) classifier; the k nearest neighbors are selected based on the distance measure discussed in Section 3.1, and how event annotations are automatically assigned using the KNN classifier is presented in Section 3.2. The training data and the newly added data form the new training data, on which SVM classifiers are constructed for biomedical event extraction using the features described in Section 2.2.

3.1. Distance measure

As mentioned above, it was observed that sentences with the same event annotation tend to have similar sentence structures and similar word co-occurrences or word frequencies. By describing similar word co-occurrences as hidden topics using topic modeling and representing sentence structures as parse trees, the distance between two sentences S_i, S_j is defined as

$$D(S_i, S_j) = \alpha D_s(S_i, S_j) + \beta D_t(S_i, S_j), \qquad (4)$$

where D_s(S_i, S_j) is the distance between the two sentences' parse trees, D_t(S_i, S_j) is the distance between the two sentences' hidden topics, and α, β are coefficients describing the importance of the two factors.

The tree kernel algorithm [20] is employed to calculate D_s(S_i, S_j). We first employ the Stanford parser [21] to generate the parse tree T for each sentence in E_U and E_L. Then, D_s(S_i, S_j) is calculated in the following way:

• Tree representation. Assume that n is the number of all tree fragments in E_L and E_U; T is represented as an n-dimensional vector ⟨h_1(T), h_2(T), . . ., h_n(T)⟩, where h_i(T) is the number of occurrences of the i-th tree fragment in tree T. For example, assume that there are 100 tree fragments t_1, t_2, . . ., t_100, and that t_1 occurs 2 times, t_2 occurs 3 times and t_100 occurs 1 time in T; then T is represented as ⟨2, 3, 0, . . ., 0, 1⟩.
• Distance calculation. To make sure that the structure distance between two sentences with similar tree fragments is short, the distance D_s(S_i, S_j) is defined as

$$D_s(S_i, S_j) = 1 - \frac{\sum_{l=1}^{n} h_l(T_i)\, h_l(T_j)}{\sqrt{\sum_{l=1}^{n} h_l(T_i)^2 \times \sum_{l=1}^{n} h_l(T_j)^2}}. \qquad (5)$$

For example, if two sentences' parse trees are represented as ⟨2, 3, 0, 1, 0, 0⟩ and ⟨1, 4, 0, 0, 1, 1⟩, the structure distance between the two sentences is calculated as 1 − (2 · 1 + 3 · 4)/√((2² + 3² + 1²) × (1² + 4² + 1² + 1²)) ≈ 0.14.

The term Σ_{l=1}^{n} h_l(T_i) h_l(T_j) can be further calculated as

$$\sum_{l=1}^{n} h_l(T_i)\, h_l(T_j) = \sum_{n_i \in N_i} \sum_{n_j \in N_j} \sum_{l=1}^{n} I_l(n_i)\, I_l(n_j) = \sum_{n_i \in N_i} \sum_{n_j \in N_j} C(n_i, n_j), \qquad (6)$$

where N_i and N_j are the sets of nodes in trees T_i and T_j respectively,

$$I_l(n_i) = \begin{cases} 1, & \text{if sub-tree } l \text{ is rooted at node } n_i, \\ 0, & \text{otherwise,} \end{cases} \qquad (7)$$

and C(n_i, n_j) = Σ_{l=1}^{n} I_l(n_i) I_l(n_j). C(n_i, n_j) can be calculated more efficiently using some calculation rules.

To calculate D_t(S_i, S_j), the LDA algorithm is employed. LDA is a generative graphical model originally proposed for topic discovery [13]. Assuming that each document is represented as an unordered collection of words, without considering grammar or word order, the LDA model can be employed to group words into similar topics in an unsupervised way. Each topic in the LDA model is represented as a probability distribution over words, and documents are represented as random mixtures over latent topics. In our framework, the sentences in E_L and E_U are considered as documents, which are mixtures over latent topics. The generative process of LDA is shown below.

• Draw the topic distribution θ_{S_i} ∼ Dirichlet(α) for each sentence S_i.
• Draw the word distribution φ_k ∼ Dirichlet(β) for each topic k.
• For each word position j in sentence S_i,
  – choose a topic z_{i,j} ∼ Multinomial(θ_{S_i}),
  – choose a word w_{i,j} ∼ Multinomial(φ_{z_{i,j}}).

For each sentence in E_L and E_U, the LDA model outputs a topic vector ⟨z_1, z_2, . . ., z_t⟩, where t is the number of topics. D_t(S_i, S_j) is defined as √(Σ_{k=1}^{t} (z_{k,S_i} − z_{k,S_j})²). For example, if two sentences' topic vectors are ⟨0.1, 0.4, 0.6, 0.7⟩ and ⟨0.7, 0.2, 0.1, 0.3⟩, the distance D_t between the two sentences is calculated as √(0.6² + 0.2² + 0.5² + 0.4²) = 0.9.
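The following sketch (an illustration under our assumptions, not the authors' implementation) computes D_s, D_t and the combined distance D of Eq. (4) from pre-computed tree-fragment count vectors and topic vectors, reproducing the two worked examples above. Obtaining those vectors from the Stanford parser and the LDA model is left out.

```python
import math
from typing import Sequence

def structure_distance(h_i: Sequence[float], h_j: Sequence[float]) -> float:
    """D_s of Eq. (5): one minus the cosine similarity of tree-fragment count vectors."""
    dot = sum(a * b for a, b in zip(h_i, h_j))
    norm = math.sqrt(sum(a * a for a in h_i) * sum(b * b for b in h_j))
    return 1.0 - dot / norm

def topic_distance(z_i: Sequence[float], z_j: Sequence[float]) -> float:
    """D_t: Euclidean distance between LDA topic vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z_i, z_j)))

def combined_distance(h_i, h_j, z_i, z_j, alpha: float = 0.4, beta: float = 0.6) -> float:
    """D of Eq. (4); alpha and beta default to the values reported in Section 4.1."""
    return alpha * structure_distance(h_i, h_j) + beta * topic_distance(z_i, z_j)

# The worked examples from Section 3.1.
print(round(structure_distance([2, 3, 0, 1, 0, 0], [1, 4, 0, 0, 1, 1]), 2))  # 0.14
print(round(topic_distance([0.1, 0.4, 0.6, 0.7], [0.7, 0.2, 0.1, 0.3]), 2))  # 0.9
```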

3.2. KNN-based classifier

We applied the KNN algorithm [22] to perform classification. The training data consist of N pairs (S1, A1), (S2, A2), . . ., (SN, AN), where Si denotes a sentence and Ai denotes its corresponding event annotation. Given a query sentence S, the KNN algorithm first returns the k sentences Si, i = 1, . . ., k, in E_L closest in distance to S, and then classifies S among the k neighbors. In our implementation, the distance between two sentences is derived as discussed above. Also, instead of majority voting, a set of rules is defined to classify a sentence among


its k neighbors, as shown in Algorithm 1. The rationale is that (1) there are many event annotations (classes) and only a small amount of training data is available for each class, so majority voting would require a large amount of training data for each event type to give reliable results; and (2) even if the sentence S has the same event annotation as one of its neighbors, some further processing needs to be conducted, such as mapping the trigger words and arguments in the event annotation onto words in S. For example, given the sentence "...molecules mediating the interaction of cancer and endothelial cells in tumor angiogenesis...", the event annotation of one of its neighbors, "...high-affinity binding of somatostatin-14, somatostatin-28 and octreotide...", is E1 (event type:binding:binding, theme:somatostatin-14, theme:somatostatin-28), which needs to be changed to E1 (event type:binding:interaction, theme:cancer, theme:endothelial cells) before being assigned to it.

Algorithm 1. The procedure of automatically assigning an event annotation to the selected sentence.
Input: A sentence S; its k neighbors {S1, S2, · · ·, Sk}, sorted in descending order based on their respective distances to S, with their corresponding event annotations Ai (i = 1, . . ., k); and the threshold c = 2.
Output: S's event annotation AS.
1: Calculate the number of event triggers ce and the number of biomedical terms ct in S;
2: AS = NULL;
3: for each i ∈ [1, k] do
4:   Calculate the number of event triggers cei and the number of biomedical terms cti in Si;
5:   if cei == ce and |cti − ct| ≤ c then
6:     A′i = Ai;
7:     for each event trigger ej in Ai do
8:       Identify the biomedical term tm in Si which is linked to ej;
9:       Find the biomedical term t′m in S corresponding to tm by greedy searching;
10:      Replace tm with t′m in A′i;
11:    end for
12:    AS = A′i;
13:  end if
14:  if AS ≠ NULL then
15:    break;
16:  end if
17: end for
18: return AS;
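A compact sketch of this assignment step (our illustration; the distance computation, the trigger/term counting and the greedy term mapping are simplified assumptions delegated to the caller) might look as follows:

```python
from typing import Callable, List, Optional, Tuple

# A labeled neighbor: (sentence, its event annotation, its trigger count, its term count).
Neighbor = Tuple[str, dict, int, int]

def assign_annotation(sentence: str,
                      n_triggers: int,
                      n_terms: int,
                      neighbors: List[Neighbor],
                      map_terms: Callable[[dict, str], dict],
                      c: int = 2) -> Optional[dict]:
    """Assign an event annotation to `sentence` from its sorted k neighbors,
    in the spirit of Algorithm 1."""
    for _, annotation, n_trig_i, n_term_i in neighbors:
        # Accept a neighbor only if trigger counts match and term counts are close.
        if n_trig_i == n_triggers and abs(n_term_i - n_terms) <= c:
            # Re-map the neighbor's trigger words and arguments onto words of `sentence`
            # (greedy searching in the paper; here supplied by the caller).
            return map_terms(annotation, sentence)
    return None  # no suitable neighbor: the sentence is not added to the training data
```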

Table 2. The statistics of the MLEE corpus.

Data        Train   Test   Total
Document    175     87     262
Sentence    1728    880    2608
Event       4471    2206   6677

Table 3. The comparison of the performance of event trigger identification.

Method     Recall (%)   Precision (%)   F-score (%)
Baseline   81.69        70.79           75.84
Proposed   82.26        72.17           76.89

Table 4. The comparison of the performance of event extraction.

Method     Recall (%)   Precision (%)   F-score (%)
Baseline   49.56        62.28           55.20
Proposed   59.16        55.76           57.41

4. Experiments

In this section, we present experiments to evaluate the effectiveness of the semi-supervised learning framework. We first describe how the annotated and un-annotated corpora E_L and E_U are constructed. Then, the experimental results obtained on event trigger identification and event extraction are discussed in comparison with the state-of-the-art approach. Finally, further analysis is presented by varying the number of added sentences and the number of hidden topics.

4.1. Experimental setup

We used the multi-level event extraction (MLEE) corpus [16] as the training data E_L for our experiments. The MLEE corpus encompasses all levels of biological organization, from the molecular to the whole organism. It is generated from 262 PubMed abstracts on angiogenesis, which involves a tissue/organ-level process closely associated with cancer and other organism-level pathologies. Texts in this domain represent a good test case for event extraction across multiple levels of biological organization. The annotation follows the guidelines formalized in the BioNLP 2009 Shared Task on event extraction. The overall statistics of the data are given in Table 2. The events are divided into four groups, "anatomical", "molecular", "general" and "planned", which are further divided into several event categories. It is worth noting that we used a combination of the training and development datasets of the MLEE corpus for training, and the test set for testing.

We constructed an un-annotated corpus from MEDLINE because of its broad coverage of topics in the biomedical domain. Abstracts of biomedical literature published in 2011 were retrieved to build the corpus. More than 9 million sentences were preprocessed with steps such as lowercasing and stemming, and the sentences not containing any trigger word in the constructed trigger dictionary were filtered out. Altogether, 5143 sentences were kept to form the un-annotated corpus E_U.

In our experiments, the number of topics in the LDA model was varied over {50, 100, 200, 500} using the Stanford topic modeling toolbox (http://nlp.stanford.edu/downloads/tmt/tmt-0.4/). The optimal topic number was chosen using the perplexity measure on a 10% held-out set from our E_L and E_U corpora; the final results are reported with the topic number set to 100, and how the performance changes with the number of topics is presented in detail in Section 4.3. We empirically set α and β to 0.4 and 0.6, respectively. In our framework, one-versus-rest SVMs are used for trigger word classification. To alleviate the unbalanced classification problem, we boosted the positive examples by placing more weight on them during training. SVMs are also used for event extraction. The baseline was constructed following the approach proposed in [16], which achieved state-of-the-art performance on trigger word identification and event extraction. Both the baseline approach and our framework employed the features extracted from the syntactic and semantic parsing results, as described in Section 2.2.

4.2. Results

We conducted experiments on the MLEE corpus and compared our framework with the baseline approach. For the event trigger identification task, Table 3 lists the recall, precision, and F-score obtained on the test set of the MLEE corpus. The overall improvement in F-score is 1.05%.



Table 5. The comparison of the proposed framework with different definitions of distance.

Task                     Distance measure               Recall (%)   Precision (%)   F-score (%)
Event extraction         Using sentences' structures    58.90        55.37           57.08
Event extraction         Using hidden topics            54.65        59.39           56.92
Event extraction         Using both                     59.16        55.76           57.41
Trigger identification   Using sentences' structures    81.19        72.45           76.57
Trigger identification   Using hidden topics            83.09        70.93           76.53
Trigger identification   Using both                     82.26        72.17           76.89

The improvement gained by incorporating the un-annotated data is somewhat modest. However, the result indicates that the proposed semi-supervised learning framework works even though the baseline approach already achieves very good performance; we can further infer that the framework remains effective even when the room for improvement is limited.

For the event extraction task, Table 4 lists the recall, precision, and F-score obtained on the test set of the MLEE corpus. The overall event extraction performance, 57.4% in F-score, is a very promising result given the challenging aspects of the task, most obviously that it involves more than 10 distinct event types. For comparison, using the proposed framework based on hidden topics and sentence structures, the recall of the event extraction system is improved substantially, by nearly 10%, and the overall improvement in F-score is around 2.2%. To further investigate how the improvement is achieved, we analyzed the experimental results of the baseline approach and the proposed framework. We found that the events identified correctly by the baseline approach are still extracted correctly by the proposed framework in 95.9% of cases, and that, of the events not extracted correctly by the baseline, 6.6% are correctly extracted by our framework.

4.3. Further analysis

To investigate the effectiveness of the distance measure designed in Section 3.1, experiments were conducted with modified distance measures. Table 5 shows the results of the proposed framework for event extraction and trigger word identification under the different distance measures. It can be observed that the best performance on both event extraction and event trigger identification is achieved using the distance measure based on both sentence structures and hidden topics, which confirms our speculation that both are crucial for describing the similarity between two sentences. It can also be observed from Table 5 that sentence structures are slightly more important than hidden topics in describing sentence similarity, which explains why researchers usually employ sentence structure to calculate sentence similarity when resources are limited.

To explore the effectiveness of automatically assigning event annotations to sentences in E_U, we compared the performance of the framework with different numbers of un-annotated sentences added to E_L. Fig. 4 shows the event extraction performance versus the number of un-annotated sentences added to E_L. It can be observed that the F-score increases gradually as un-annotated sentences from E_U are added. When 1500 sentences are added to E_L, the F-score reaches 57.4%, the best performance; after that, the performance decreases slightly. This can be explained as follows: at the beginning, the parameters of the statistical model are estimated more accurately with more training data, and the performance later decreases as more and more noisy data are added to E_L.

The number of hidden topics t is set in advance. If t is too small, sentences with different topics might share the same topic vector; on the contrary, if t is too large, training the topic model becomes expensive. To explore whether and how the number of hidden topics affects the event extraction performance of the proposed framework, experiments were conducted under four different topic numbers.

Fig. 4. Performance of the proposed framework for event extraction versus the number of un-annotated sentences added into the training set.

Fig. 5. Performance of the proposed framework for event extraction (F-score, precision and recall) versus the number of hidden topics employed in the distance measure.

Fig. 5 shows the event extraction performance versus the number of hidden topics employed in the LDA model. It can be observed that the best performance is achieved when the number of hidden topics is set to 100, which further confirms that too small a number of hidden topics is not suitable for describing the topics in the sentences.

5. Conclusion and future work

In this paper, we have presented a semi-supervised learning framework for biomedical event extraction. In this framework, sentences in the un-annotated corpus are carefully assigned event annotations based on their distances to the sentences in the annotated corpus. In particular, not only the structures of the sentences but also the hidden topics embedded in the sentences are employed to describe the distance. The sentences and


newly assigned event annotations, together with the annotated corpus, are employed for training. Experimental results on the gold standard corpus show that the proposed framework achieves an improvement of more than 2.2% in F-score on biomedical event extraction compared with the state-of-the-art approach, demonstrating its effectiveness. In future work we will investigate the combination of semi-supervised learning with active learning to further improve the performance of the event extraction system.

Acknowledgments

The authors thank the anonymous reviewers for their insightful comments. This research was supported by the National Science Foundation of China (61103077), the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry, and the Cultivation Program for Young Faculties of Southeast University.

References

[1] Kim J-D, Ohta T, Pyysalo S, Kano Y, Tsujii J. Overview of BioNLP'09 shared task on event extraction. In: Tsujii J, editor. Proceedings of the workshop on BioNLP. Stroudsburg, PA, USA: Association for Computational Linguistics; 2009. p. 1–9.
[2] Kim J-D, Nguyen N, Wang Y, Tsujii J, Takagi T, Yonezawa A. The GENIA event and protein coreference tasks of the BioNLP shared task 2011. BMC Bioinform 2012;13(Suppl 11):S1.
[3] Nédellec C, Bossy R, Kim J-D, Kim J-J, Ohta T, Pyysalo S, et al. Overview of BioNLP shared task 2013. In: Nédellec C, Bossy R, Kim J-D, Kim J-J, Ohta T, Pyysalo S, Zweigenbaum P, editors. Proceedings of the BioNLP shared task 2013 workshop. Stroudsburg, PA, USA: Association for Computational Linguistics; August 2013. p. 1–7.
[4] Pyysalo S, Ohta T, Cho HC, Sullivan D, Mao C, Sobral B, et al. Towards event extraction from full texts on infectious diseases. In: Matsumoto Y, Mihalcea R, editors. Proceedings of the 2010 workshop on biomedical natural language processing. Stroudsburg, PA, USA: Association for Computational Linguistics; 2010. p. 132–40.
[5] Bui Q-C, Sloot PMA. Extracting biological events from text using simple syntactic patterns. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies. Stroudsburg, PA, USA: Association for Computational Linguistics; 2011. p. 143.
[6] Le Minh Q, Truong SN, Bao QH. A pattern approach for biomedical event annotation. In: Tsujii J, Kim J-D, Pyysalo S, editors. Proceedings of BioNLP shared task 2011 workshop. Stroudsburg, PA, USA: Association for Computational Linguistics; June 2011. p. 149–50.

[7] Zhou D, He Y. Biomedical events extraction using the hidden vector state model. Artif Intell Med 2011;53(3):205–13.
[8] Bjorne J, Heimonen J, Ginter F, Airola A, Pahikkla T, Salakoski T. Extracting complex biological events with rich graph-based feature sets. In: Tsujii J, editor. Proceedings of the workshop on BioNLP. Stroudsburg, PA, USA: Association for Computational Linguistics; 2009. p. 10–8.
[9] Kim J-D, Wang Y, Yasunori Y. The GENIA event extraction shared task, 2013 edition – overview. In: Nédellec C, Bossy R, Kim J-D, Kim J-J, Ohta T, Pyysalo S, Zweigenbaum P, editors. Proceedings of the BioNLP shared task 2013 workshop. Stroudsburg, PA, USA: Association for Computational Linguistics; August 2013. p. 8–15.
[10] Wang J, Xu Q, Lin H, Yang Z, Li Y. Semi-supervised method for biomedical event extraction. Proteome Sci 2013;11(Suppl 1):S17.
[11] MEDLINE. http://www.ncbi.nlm.nih.gov/pubmed [accessed 25.11.14].
[12] MacKinlay A, Martinez D, Yepes AJ, Liu H, John Wilbur W, Verspoor K. Extracting biomedical events and modifications using subgraph matching with noisy training data. In: Nédellec C, Bossy R, Kim J-D, Kim J-J, Ohta T, Pyysalo S, Zweigenbaum P, editors. Proceedings of the workshop on BioNLP. Stroudsburg, PA, USA: Association for Computational Linguistics; 2013. p. 35–44.
[13] Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res 2003;3:993–1022.
[14] Zweigenbaum P, Demner-Fushman D, Cohen KB. Frontiers of biomedical text mining: current progress. Brief Bioinform 2007;8:358–75.
[15] Zhou D, Zhong D, He Y. Event trigger identification for biomedical events extraction using domain knowledge. Bioinformatics 2014;30:1587–94.
[16] Pyysalo S, Ohta T, Miwa M, Cho H-C, Tsujii J, Ananiadou S. Event extraction across multiple levels of biological organization. Bioinformatics 2012;28(18):i575–81.
[17] Sagae K, Tsujii J. Dependency parsing and domain adaptation with LR models and parser ensembles. In: Eisner J, editor. Proceedings of the 2007 joint conferences on empirical methods in natural language processing and computational natural language learning, vol. 1. Stroudsburg, PA, USA: Association for Computational Linguistics; 2007. p. 1044–50.
[18] Miyao Y, Tsujii J. Feature forest models for probabilistic HPSG parsing. Comput Linguist 2008;34(1):35–80.
[19] Zhou D, He Y, Kwoh C-K. Semi-supervised learning of the hidden vector state model for extracting protein–protein interactions. Artif Intell Med 2007;41:209–22.
[20] Collins M, Duffy N. Convolution kernels for natural language. In: Dietterich TG, Becker S, Ghahramani Z, editors. Advances in neural information processing systems, vol. 14. MIT Press; 2001. p. 625–32.
[21] de Marneffe M, MacCartney B, Manning C. Generating typed dependency parses from phrase structure parses. In: Calzolari N, Choukri K, Gangemi A, Maegaard B, Mariani J, Odijk J, Tapias D, editors. Proceedings of the fifth international conference on language resources and evaluation (LREC-2006). Genoa, Italy: European Language Resources Association (ELRA); 2006. ACL Anthology Identifier: L06-1260.
[22] Altman NS. An introduction to kernel and nearest-neighbor nonparametric regression. Am Stat 1992;46(3):175–85.

